HTMLRelevantContentExtractor
com.exalead.indexing.analysis.v10.HTMLRelevantContentExtractor
The HTMLRelevantContentExtractor extracts the most relevant parts of an HTML document. Generally, the relevant part of an HTML document is the article in the middle of the page. The header, the footer, and the menus are often identical on all pages and should not be indexed. The extraction can be tuned using the attributes described below.
Parent elements:
com.exalead.indexing.analysis.v10.AnalysisPipeline (as AnalysisPipeline)
com.exalead.indexing.analysis.v10.DocumentProcessorGroup (as DocumentProcessorGroup)
Attributes:
| Name | Type | Default value | Description |
| --- | --- | --- | --- |
| name | string | | Name of this processor. The name of a processor is used only for tracing and debugging purposes. |
| dataModelState | string | | Indicates whether this document processor is managed by a data model. Allowed values: null, auto, customized, error. If null, this document processor is not related to a data model. If "auto", this document processor is auto-generated by a data model. If "customized", this document processor was auto-generated by a data model and then customized. If "error", there is a conflict between this document processor and the data model. |
| dataModelClass | string | | If dataModelState is either "auto" or "customized", the name of the DataModelClass that generated this DocumentProcessor. |
| dataModelProperty | string | | If dataModelState is either "auto" or "customized", the name of the DataModelProperty that generated this DocumentProcessor. |
| disabled | boolean | | Disables this DocumentProcessor. |
| relevantChunkContext | string | relevantcontent | Relevant text chunks are copied into this context. |
| newContextName | string | relevantcontent | Deprecated; use 'relevantChunkContext' instead. |
| irrelevantChunkContext | string | excludedcontent | Irrelevant text chunks are copied into this context. |
| retrieveFieldContext | string | htmlcontent | Original text chunks are moved into this context. |
| irrelevantChunkAnnotation | string | | If set, the HTMLRelevantContentExtractor annotates each irrelevant chunk with an annotation. |
| minScore | int | 15 | Internally, the HTMLRelevantContentExtractor assigns a score to each chunk of its input. Use 'minScore' to keep only chunks whose score is greater than this value. |
| minParagraphWords | int | 7 | The minimum number of words a <p> chunk must contain to be considered a paragraph and be boosted. |
| minTitleWords | int | 3 | The minimum number of words a title must contain to be boosted. |
| linkAllowedInTitle | boolean | True | By default, links contained in a page title incur a score penalty; this can be disabled. |
| paragraphBoost | int | 10 | Each time a paragraph is detected, the score is increased by this value. |
| maxWordInLinkRatio | int | 2 | The maximum allowed ratio of words contained in links in a chunk of text. |
| titleBoost | int | 5 | Each time a title is detected, the score is increased by this value. |
| classBoost | int | 10 | Each time a CSS class listed in 'idsAndClassesToKeep' is detected, the score is increased by this value. |
| keepOnlyBestChunk | boolean | | If true, the 'relevantcontent' context consists only of the main article of the page. |
| skipBlockquotes | boolean | | Whether to skip HTML blockquote tags. |
| skipPre | boolean | | Whether to skip HTML pre tags. |
| keepImages | boolean | | If true, the HTML image annotations are kept in the new context. |
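
As an illustration of how these attributes fit together, a processor declaration might look like the sketch below. It is not a verified configuration: the element name is assumed to match the class name, namespace declarations are omitted, the processor name "relevantContent" is hypothetical, and most values simply repeat the documented defaults. Only keepOnlyBestChunk is set to a non-documented value, chosen for the example.

```xml
<!-- Sketch only: element name assumed from the class name; namespaces omitted. -->
<!-- The name "relevantContent" is hypothetical and is used only for tracing. -->
<HTMLRelevantContentExtractor
    name="relevantContent"
    relevantChunkContext="relevantcontent"
    irrelevantChunkContext="excludedcontent"
    retrieveFieldContext="htmlcontent"
    minScore="15"
    minParagraphWords="7"
    minTitleWords="3"
    paragraphBoost="10"
    titleBoost="5"
    classBoost="10"
    keepOnlyBestChunk="true"/>
```

Since only chunks scoring above minScore are kept, raising minScore keeps fewer chunks, while raising paragraphBoost, titleBoost, or classBoost increases the score of chunks that contain paragraphs, titles, or boosted CSS classes.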
Nested elements:
| Name | Type | Description |
| --- | --- | --- |
| fromDataModel | com.exalead.indexing.analysis.v10.DocumentProcessor | If dataModelState is "customized", the original document processor generated by the data model. Use this to easily revert from the "customized" state to the "auto" state. |
| idsAndClassesToIgnore | exa.bee.StringValue* | The list of CSS classes and HTML ids to ignore. |
| idsAndClassesToKeep | exa.bee.StringValue* | The list of CSS classes and HTML ids to boost. |
| annotationsToCopy | exa.bee.StringValue* | The list of annotations to keep in the new context. |
| AcceptCondition | com.exalead.indexing.analysis.v10.AcceptCondition | Expresses the condition under which this DocumentProcessor is enabled. |
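
The list-valued nested elements can be sketched as follows. This is an assumption-laden illustration rather than a confirmed syntax: the `value` attribute on StringValue, the omitted namespaces, and every id, class, and annotation name shown ("sidebar", "footer", "article-body", "title") are hypothetical placeholders.

```xml
<!-- Sketch only: the StringValue "value" attribute is an assumption, -->
<!-- and all ids, classes, and annotation names are hypothetical. -->
<HTMLRelevantContentExtractor name="relevantContent" classBoost="10">
  <!-- Chunks carrying these CSS classes or HTML ids are ignored. -->
  <idsAndClassesToIgnore>
    <StringValue value="sidebar"/>
    <StringValue value="footer"/>
  </idsAndClassesToIgnore>
  <!-- Chunks carrying these CSS classes or HTML ids are boosted by classBoost. -->
  <idsAndClassesToKeep>
    <StringValue value="article-body"/>
  </idsAndClassesToKeep>
  <!-- Annotations listed here are kept in the new context. -->
  <annotationsToCopy>
    <StringValue value="title"/>
  </annotationsToCopy>
</HTMLRelevantContentExtractor>
```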