AnalysisPipeline

• A document analysis pipeline. Each pipeline has an associated accept condition. This condition is tested for each input document. If a document matches the condition, it is processed by this pipeline. If not, the condition is tested for the next pipeline in the list of pipelines defined in a DocumentAnalysis object. A document refused by all pipelines is neither processed nor indexed. Pipeline processing is made of several stages:

◦ Document Processing Stage - performed by a list of DocumentProcessor objects, which process each Document sequentially. Document Processors manipulate the 'DocumentParts' (binary data pushed through the PAPI) and the 'DocumentChunks' (textual data obtained from PAPI meta, from the processing of a DocumentPart, or from the processing of pre-existing DocumentChunks). Each DocumentChunk has textual content, a ContextName, a language, and a score, and may belong to a DocumentPart. A DocumentChunk that belongs to no DocumentPart is called a root DocumentChunk.
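The pipeline selection rule described at the top of this section (test each pipeline's accept condition in list order, use the first match, skip the document if none matches) can be sketched as follows. This is only an illustration of the rule: the `Pipeline` class, the `select_pipeline` function, and the dictionary-based documents are hypothetical stand-ins, not part of the product API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


# Hypothetical stand-in for an AnalysisPipeline: only the name and the
# accept condition are modelled here.
@dataclass
class Pipeline:
    name: str
    accept: Callable[[dict], bool]  # the pipeline's accept condition


def select_pipeline(document: dict, pipelines: Sequence[Pipeline]) -> Optional[Pipeline]:
    """Return the first pipeline whose accept condition matches, else None."""
    for pipeline in pipelines:
        if pipeline.accept(document):
            return pipeline
    # Refused by every pipeline: the document is neither processed nor indexed.
    return None


# Order matters: the first matching pipeline wins.
pipelines = [
    Pipeline("pdf_only", lambda doc: doc.get("mime") == "application/pdf"),
    Pipeline("any_text", lambda doc: doc.get("mime", "").startswith("text/")),
]
```

With this sketch, a document whose `mime` value matches no condition makes `select_pipeline` return `None`, which corresponds to a document that is skipped entirely.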

Name	Type	Default value	Description
name	string
errorAction	string	continue	Specifies the action to take if a document error occurs during processing: • "discard": Discards the document from the job. If the document was already in the index, it is not removed. • "delete": Discards the document from the job and deletes it from the index. • "continue": Keeps processing the document. The document will probably be incomplete in the index.
reportDocumentErrors	boolean	True	Reports the document errors in the global reporting store, for further analysis.
globalLogDocumentErrors	boolean		Logs errors and exceptions reported by the processors in the global log (without stack trace).
autoBlacklistDocuments	boolean	True	Tries to automatically blacklist documents that trigger serious failures. This option helps prevent failure loops, that is, situations where the same documents trigger the same analysis process failures again and again.
tokenizationConfig	string		Reference to the TokenizationConfig object to use for tokenization during Semantic Processing Stage.
autoconfigureFromDataModel	boolean	True
documentProcessorsProfiling	boolean		Logs the CPU time spent for each document processor and for the main indexing phase. The total time spent for each processor is dumped in the analyzer log at the end of the job.
semanticPipeTimeout	int		CPU-time limit for the processing of a text chunk by the semantic pipe, in seconds.
slowDocumentWarningTimeUS	long	5000000	If the processing of a document takes longer than this time (in microseconds, as the US suffix suggests), a message is printed in the analyzer log. A value of 0 disables the warning feature.
semanticProcessorsProfiling	boolean		Logs the CPU time spent for each semantic processor. The total time spent for each processor is dumped in the analyzer log at the end of the job. Warning: This feature strongly impacts performance. Enable it only if required.
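The three errorAction values differ only in what happens to the index entry of a failing document. The sketch below models that difference under simplifying assumptions: the index is a plain dictionary, and `ErrorAction` and `apply_error_action` are hypothetical names used for illustration, not product classes.

```python
from enum import Enum


class ErrorAction(Enum):
    """The three values documented for the errorAction attribute."""
    DISCARD = "discard"
    DELETE = "delete"
    CONTINUE = "continue"


def apply_error_action(action: ErrorAction, doc_id: str, index: dict, partial: dict) -> None:
    """Model the effect of a document error on a toy index (a dict).

    `partial` is whatever the pipeline managed to produce before the error.
    """
    if action is ErrorAction.DISCARD:
        # Dropped from the job; an already indexed version is left untouched.
        return
    if action is ErrorAction.DELETE:
        # Dropped from the job and removed from the index if present.
        index.pop(doc_id, None)
        return
    # CONTINUE: processing goes on, so the (probably incomplete) result is indexed.
    index[doc_id] = partial
```

In this model, "discard" is the only action that guarantees a previously indexed version of the document stays as it was.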

Name	Type	Description
AcceptCondition	com.exalead.indexing.analysis.v10.AcceptCondition
DocumentProcessor	com.exalead.indexing.analysis.v10.DocumentProcessor*
FilteringConfiguration	com.exalead.indexing.analysis.v10.FilteringConfiguration
LanguageConfiguration	com.exalead.indexing.analysis.v10.LanguageConfiguration*
MappingConfiguration	com.exalead.indexing.analysis.v10.MappingConfiguration
SemanticProcessor	com.exalead.indexing.analysis.v10.SemanticProcessor*