AnalysisPipeline
com.exalead.indexing.analysis.v10.AnalysisPipeline
A document analysis pipeline. Each pipeline has an associated accept condition, which is tested against each input document. If a document matches the condition, it is processed by this pipeline. If not, the condition of the next pipeline in the list of pipelines defined in a DocumentAnalysis object is tested. A document refused by all pipelines is neither processed nor indexed.
Pipeline processing consists of several stages:
Document Processing Stage - performed by a list of DocumentProcessor elements, which process each Document sequentially. Document processors manipulate the 'DocumentParts' (binary data pushed through the PAPI) and the 'DocumentChunks' (textual data obtained from PAPI meta, from the processing of a DocumentPart, or from the processing of pre-existing DocumentChunks). Each DocumentChunk has textual content, a ContextName, a language, and a score, and may belong to a DocumentPart. A DocumentChunk that belongs to no DocumentPart is called a root DocumentChunk.
Semantic Processing Stage - performed by a list of SemanticProcessor elements, which process each DocumentChunk of each Document sequentially (except those for which semantic processing is disabled in the mapping). Semantic processing segments text into 'tokens' and then processes the text as a flow of tokens. SemanticAnnotations are produced on each token.
Mapping - maps DocumentChunks and SemanticAnnotations to index fields.
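The accept-condition fallthrough described above can be pictured with a configuration fragment. This is a minimal sketch rather than a verified configuration: the pipeline names are hypothetical, namespace declarations are omitted, the contents of AcceptCondition are left out because they are not described on this page, and the sketch assumes that pipelines are tested in the order in which they are declared.

```xml
<!-- Illustrative sketch only: names are hypothetical, namespaces omitted -->
<AnalysisConfig>
  <!-- Tested first: processes the documents that match its accept condition -->
  <AnalysisPipeline name="first_pipeline">
    <AcceptCondition>
      <!-- condition definition omitted -->
    </AcceptCondition>
  </AnalysisPipeline>
  <!-- Tested only for documents refused by first_pipeline -->
  <AnalysisPipeline name="second_pipeline">
    <AcceptCondition>
      <!-- condition definition omitted -->
    </AcceptCondition>
  </AnalysisPipeline>
  <!-- A document refused by both pipelines is neither processed nor indexed -->
</AnalysisConfig>
```

Under these assumptions, a catch-all accept condition on the last pipeline would ensure that no document is silently dropped.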
Parent elements:
com.exalead.indexing.analysis.v10.AnalysisConfig (as AnalysisConfig)
Attributes:
| Name | Type | Default value | Description |
| --- | --- | --- | --- |
| name | string | | |
| errorAction | string | continue | Specifies the action to take when a document error occurs during processing. "discard": Discards the document from the job. If the document was already in the index, it is not removed. "delete": Discards the document from the job and deletes it from the index. "continue": Keeps processing the document. The document will probably be incomplete in the index. |
| reportDocumentErrors | boolean | True | Reports the document errors in the global reporting store, for further analysis. |
| globalLogDocumentErrors | boolean | | Logs errors and exceptions reported by the processors in the global log (without stack trace). |
| autoBlacklistDocuments | boolean | True | Tries to automatically blacklist documents that trigger serious failures. This option helps prevent failure loops, that is, documents that repeatedly trigger the same analysis process failure. |
| tokenizationConfig | string | | Reference to the TokenizationConfig object to use for tokenization during the Semantic Processing Stage. |
| autoconfigureFromDataModel | boolean | True | |
| documentProcessorsProfiling | boolean | | Logs the CPU time spent in each document processor and in the main indexing phase. The total time spent in each processor is dumped in the analyzer log at the end of the job. |
| semanticPipeTimeout | int | | CPU-time limit for the processing of a text chunk by the semantic pipe, in seconds. |
| slowDocumentWarningTimeUS | long | 5000000 | If processing a document takes longer than this time, a message is printed in the analyzer log. A value of 0 disables the warning. |
| semanticProcessorsProfiling | boolean | | Logs the CPU time spent in each semantic processor. The total time spent in each processor is dumped in the analyzer log at the end of the job. Warning: This feature strongly impacts performance; enable it only if required. |
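As an illustration of how these attributes might be set together, here is a hedged sketch of a pipeline declaration. The pipeline name and all values are hypothetical, namespace declarations are omitted, booleans are assumed to be written in lowercase, and the US suffix of slowDocumentWarningTimeUS is assumed to denote microseconds (so the value below would mean 10 seconds).

```xml
<!-- Illustrative sketch only: values are hypothetical, namespaces omitted -->
<AnalysisPipeline name="example_pipeline"
                  errorAction="discard"
                  reportDocumentErrors="true"
                  autoBlacklistDocuments="true"
                  slowDocumentWarningTimeUS="10000000"
                  semanticProcessorsProfiling="false">
  <!-- nested elements omitted -->
</AnalysisPipeline>
```

With errorAction set to "discard", a document that fails during processing is dropped from the job, while a version of it already present in the index is left in place. Profiling is left disabled here because of its stated performance cost.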
Nested elements:
| Name | Type | Description |
| --- | --- | --- |
| AcceptCondition | com.exalead.indexing.analysis.v10.AcceptCondition | |
| DocumentProcessor | com.exalead.indexing.analysis.v10.DocumentProcessor* | |
| FilteringConfiguration | com.exalead.indexing.analysis.v10.FilteringConfiguration | |
| LanguageConfiguration | com.exalead.indexing.analysis.v10.LanguageConfiguration* | |
| MappingConfiguration | com.exalead.indexing.analysis.v10.MappingConfiguration | |
| SemanticProcessor | com.exalead.indexing.analysis.v10.SemanticProcessor* | |