Name | Type | Default value | Description |
---|---|---|---|
name | string | | Name of this processor. The name is used only for tracing and debugging purposes. |
dataModelState | string | | Indicates whether this document processor is managed by a data model. Possible values: null, "auto", "customized", "error". • If null, this document processor is not related to a data model. • If "auto", this document processor was auto-generated by a data model. • If "customized", this document processor was auto-generated by a data model and then customized. • If "error", there is a conflict between this document processor and the data model. |
dataModelClass | string | | If dataModelState is either "auto" or "customized", this contains the name of the DataModelClass that generated this DocumentProcessor. |
dataModelProperty | string | | If dataModelState is either "auto" or "customized", this contains the name of the DataModelProperty that generated this DocumentProcessor. |
disabled | boolean | | Disables the DocumentProcessor. |
relevantChunkContext | string | relevantcontent | Relevant text chunks are copied to this context. |
newContextName | string | relevantcontent | Deprecated, use 'relevantChunkContext'. |
irrelevantChunkContext | string | excludedcontent | Irrelevant text chunks are copied to this context. |
retrieveFieldContext | string | htmlcontent | Original text chunks are moved to this context. |
irrelevantChunkAnnotation | string | | If set, the HTMLRelevantContentExtractor will annotate each irrelevant chunk with an annotation. |
minScore | int | 15 | Internally, the HTMLRelevantContentExtractor assigns a score to each chunk of its input. Use 'minScore' to keep only chunks with a score greater than this value. |
minParagraphWords | int | 7 | The minimum number of words a <p> chunk must have to be considered a paragraph and be boosted. |
minTitleWords | int | 3 | The minimum number of words a title must have to be boosted. |
linkAllowedInTitle | boolean | True | By default, links contained in a page title incur a score penalty (malus); this can be disabled. |
paragraphBoost | int | 10 | Each time a paragraph is detected, the score is increased by this value. |
maxWordInLinkRatio | int | 2 | The maximum allowed ratio of words contained in links in a chunk of text. |
titleBoost | int | 5 | Each time a title is detected, the score is increased by this value. |
classBoost | int | 10 | Each time a CSS class included in 'idsAndClassesToKeep' is detected, the score is increased by this value. |
keepOnlyBestChunk | boolean | | If true, the 'relevantcontent' context consists only of the main article of the page. |
skipBlockquotes | boolean | | If true, HTML blockquote tags are skipped. |
skipPre | boolean | | If true, HTML pre tags are skipped. |
keepImages | boolean | | If true, the HTML image annotations are kept in the new context. |

Name | Type | Description |
---|---|---|
fromDataModel | com.exalead.indexing.analysis.v10.DocumentProcessor | If dataModelState is "customized", you will find here the original document processor generated by the data model. Use this to easily revert to "auto" state from "customized". @IgnoreForValueConstructor |
idsAndClassesToIgnore | exa.bee.StringValue* | The list of CSS classes and HTML ids to ignore. |
idsAndClassesToKeep | exa.bee.StringValue* | The list of CSS classes and HTML ids to boost. |
annotationsToCopy | exa.bee.StringValue* | The list of annotations to keep in the new context. |
AcceptCondition | com.exalead.indexing.analysis.v10.AcceptCondition | Expresses the enablement condition of this DocumentProcessor. |
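
The scoring attributes above (minScore, minParagraphWords, minTitleWords, paragraphBoost, titleBoost, classBoost, idsAndClassesToKeep) can be illustrated with a minimal Python sketch. This is not the extractor's actual implementation, which this reference does not describe. The `Settings` and `Chunk` types, the function names, and the set of tags treated as titles are assumptions made for illustration. Only the default values and the "greater than minScore" rule come from the tables. Link penalties (linkAllowedInTitle, maxWordInLinkRatio) are not modelled.

```python
from dataclasses import dataclass

# Tags treated as titles in this sketch (an assumption, not from the reference).
TITLE_TAGS = frozenset({"h1", "h2", "h3", "h4", "h5", "h6"})


@dataclass(frozen=True)
class Settings:
    """Defaults copied from the attribute table; the Python names are ours."""
    min_score: int = 15
    min_paragraph_words: int = 7
    min_title_words: int = 3
    paragraph_boost: int = 10
    title_boost: int = 5
    class_boost: int = 10
    ids_and_classes_to_keep: frozenset = frozenset()


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk: (tag, text) elements plus the chunk's CSS classes."""
    elements: tuple
    css_classes: tuple = ()


def score_chunk(chunk, settings):
    """Sum the boosts for each detected paragraph, title, and kept CSS class."""
    score = 0
    for tag, text in chunk.elements:
        words = len(text.split())
        if tag == "p" and words >= settings.min_paragraph_words:
            score += settings.paragraph_boost
        elif tag in TITLE_TAGS and words >= settings.min_title_words:
            score += settings.title_boost
    for css_class in chunk.css_classes:
        if css_class in settings.ids_and_classes_to_keep:
            score += settings.class_boost
    return score


def split_chunks(chunks, settings):
    """Keep chunks scoring strictly above min_score; the rest are irrelevant."""
    relevant, irrelevant = [], []
    for chunk in chunks:
        is_relevant = score_chunk(chunk, settings) > settings.min_score
        (relevant if is_relevant else irrelevant).append(chunk)
    return relevant, irrelevant


settings = Settings(ids_and_classes_to_keep=frozenset({"article-body"}))
article = Chunk(
    elements=(
        ("h1", "Release notes for version two"),
        ("p", "This release fixes several issues in the HTML parser module."),
    ),
    css_classes=("article-body",),
)
menu = Chunk(elements=(("p", "Home About Contact"),))
relevant, irrelevant = split_chunks([article, menu], settings)
```

In this sketch the article chunk collects one title boost, one paragraph boost, and one class boost, which puts it above the default minScore of 15. The menu chunk has too few words to count as a paragraph, so it receives no boost and lands in the irrelevant list.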