ConvertTextExtractor
com.exalead.indexing.analysis.v10.ConvertTextExtractor
This processor performs text content extraction for all MIME types (more than 300 file formats are currently handled). See the "Supported Formats" technical note for more information. Text, HTML, and built-in data types must be processed by the 'NativeTextExtractor' rather than by this processor, so make sure that a 'NativeTextExtractor' comes before the ConvertTextExtractor in your pipeline.
Parent elements:
com.exalead.indexing.analysis.v10.AnalysisPipeline (as AnalysisPipeline)
com.exalead.indexing.analysis.v10.DocumentProcessorGroup (as DocumentProcessorGroup)
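As a minimal sketch of the ordering requirement described above, the fragment below places a NativeTextExtractor ahead of the ConvertTextExtractor inside an AnalysisPipeline. The element names are assumed to match the short class names listed on this page. Namespace declarations and any other attributes or processors that a real pipeline needs are omitted, so treat this as an illustration rather than a complete configuration.

```xml
<!-- Sketch only: element nesting is assumed, namespaces and other processors are omitted. -->
<AnalysisPipeline>
  <!-- Text, HTML, and built-in data types are handled here first. -->
  <NativeTextExtractor/>
  <!-- Remaining MIME types are handled by the converter. -->
  <ConvertTextExtractor name="convert-text"/>
</AnalysisPipeline>
```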
Attributes:
Name
Type
Default value
Description
name
string
Name of this processor. The name of a processor is used only for tracing and debugging purposes.
dataModelState
string
Indicates whether this document processor is managed by a data model. Possible values: null, "auto", "customized", "error".
If null, this document processor is not related to a data model.
If "auto", this document processor is auto-generated by a data model.
If "customized", this document processor was auto-generated by a data model and then customized.
If "error", there is a conflict between this document processor and the data model.
dataModelClass
string
If dataModelState is either "auto" or "customized", you will find here the name of the DataModelClass that generated this DocumentProcessor.
dataModelProperty
string
If dataModelState is either "auto" or "customized", you will find here the name of the DataModelProperty that generated this DocumentProcessor.
disabled
boolean
Disables this DocumentProcessor.
looseTextDetection
boolean
True
Loosens text detection so that more files are detected as text, including suspicious ones (files that are not *.txt or *.html) ("true", "false")
forceContent
boolean
Forces the content to be accepted, even if the MIME type does not seem to be a known or supported MIME type.
minInputSizeKB
long
-1
Minimum document size accepted, in kilobytes.
maxInputSizeKB
long
-1
Maximum document size accepted, in kilobytes.
maxRecursionDepth
int
-1
Maximum recursion depth.
maxRecursionDocuments
int
-1
Maximum number of documents that can be converted in one directory level.
maxRecursionDocumentsTotal
int
-1
Maximum number of documents that can be converted over all levels.
strictSizeCheck
boolean
Strict size validation mode (even for partial reads).
retryIO
string
Uses regular I/O when mmap fails. ("true", "false")
filter
string
List of native filter identifiers to use. The list is comma-separated (,), and each filter identifier can be followed by optional argument(s) separated by semicolons (;). If a filter identifier is prefixed by '!', the corresponding filter is explicitly excluded. The special filter identifier '*' stands for "all other filters". First match wins: "*,!doc" is identical to "*", because '*' already matches the doc filter before "!doc" is reached. For example, filter="!jpeg,*" accepts all filters except the jpeg filter.
timeoutMs
long
-1
Conversion timeout value, in milliseconds. If the conversion process takes longer, the remote side attempts to abort the conversion process.
priority
string
Worker thread(s) priority to be used for the processing ("lowest", "very low", "low", "normal", "high", "very high")
embedded
string
Includes embedded images ("true", "false", "optional")
attachments
string
Includes embedded attachments ("true", "false", "optional")
styles
string
Attempts to extract more text styles for HTML conversion ("true", "false", "optional")
forceConversion
boolean
Attempts to generate an empty document upon conversion error (may be ignored)
startPage
long
-1
Starts conversion from this page number (page number starts at 1). This parameter is only taken into account for image processing and may be ignored.
maxPages
long
-1
Maximum number of pages to process for XML conversion (may be ignored).
maxOutputSizeKB
long
-1
Maximum output size on the remote side, in kilobytes. If the generated output exceeds this value, the document may be truncated or invalid.
allowUnicode32
boolean
Allows the use of 32-bit Unicode code points.
allowDocumentChars
boolean
Allows the use of Unicode private range characters (E0XX) for separators (keyword, sentence, paragraph separators, ...)
outsideIn
string
This feature is no longer supported. ("true", "false", "optional")
outsideInFallback
string
This feature is no longer supported. ("true", "false", "optional")
outsideInOnly
string
This feature is no longer supported. ("true", "false", "optional")
outsideInForPreview
string
This feature is no longer supported. ("true", "false", "optional")
outsideInSimpleXHTMLFallback
string
This feature is no longer supported. ("true", "false", "optional")
ocr
string
Converts using OCR ("true", "false", "optional")
ocrFallback
string
Falls back to OCR if heuristics deem it necessary ("true", "false", "optional")
ocrDetect
string
Detects documents requiring OCR (and rejects them) ("true", "false")
ocrQuality
string
OCR quality ("fast", "normal", "best")
ocrLang
string
OCR language(s) ("en" for English, "en;fr" for English and French, etc.)
ocrTimeoutMs
long
-1
OCR conversion timeout value, in milliseconds. If the OCR process takes longer, the remote side attempts to abort the conversion process. This value overrides the timeoutMs value if the processing involves an OCR operation.
ocrMaxPages
int
-1
Maximum number of pages to process for OCR.
ocrPriority
string
Worker thread(s) priority to be used for the OCR processing ("lowest", "very low", "low", "normal", "high", "very high")
httpProxyUrl
string
Optional HTTP proxy URL. The URL can embed credentials if required.
disablePlugins
boolean
Disables external plugins.
overrideAddresses
string
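To show how several of the attributes above might be combined, here is a hedged sketch of a single ConvertTextExtractor declaration. The attribute names come from the table; the element serialization and every value shown are illustrative assumptions, not recommended settings.

```xml
<!-- Sketch only: all values are illustrative, not recommended defaults. -->
<!-- maxInputSizeKB and maxRecursionDepth bound the input that is accepted. -->
<!-- filter="!jpeg,*" excludes the jpeg filter and accepts all others (first match wins). -->
<!-- ocrTimeoutMs replaces timeoutMs when the conversion involves OCR. -->
<ConvertTextExtractor
    name="convert-text"
    maxInputSizeKB="51200"
    maxRecursionDepth="3"
    timeoutMs="60000"
    filter="!jpeg,*"
    ocrFallback="true"
    ocrLang="en;fr"
    ocrQuality="normal"
    ocrTimeoutMs="300000"
    ocrMaxPages="20"/>
```

Because the first matching entry wins, the order of the filter list matters: writing "*,!jpeg" instead would let '*' match the jpeg filter before the exclusion is reached, in the same way that "*,!doc" is identical to "*".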
Nested elements:
Name
Type
Description
fromDataModel
com.exalead.indexing.analysis.v10.DocumentProcessor
If dataModelState is "customized", you will find here the original document processor generated by the data model. Use this to easily revert to "auto" state from "customized".
AcceptCondition
com.exalead.indexing.analysis.v10.AcceptCondition
Expresses the enablement condition of this DocumentProcessor.
KeyValue
exa.bee.KeyValue*