HTMLRelevantContentExtractor

Name	Type	Default value	Description
name	string		Name of this processor. The name of a processor is used only for tracing and debugging purposes.
dataModelState	string		Indicates whether this document processor is managed by a data model. Allowed values: null, "auto", "customized", "error". • If null, this document processor is not related to a data model. • If "auto", this document processor is auto-generated by a data model. • If "customized", this document processor was auto-generated by a data model and then customized. • If "error", there is a conflict between this document processor and the data model.
dataModelClass	string		If dataModelState is either "auto" or "customized", the name of the DataModelClass that generated this DocumentProcessor.
dataModelProperty	string		If dataModelState is either "auto" or "customized", the name of the DataModelProperty that generated this DocumentProcessor.
disabled	boolean		If true, disables this DocumentProcessor.
relevantChunkContext	string	relevantcontent	Relevant text chunks are copied into this context.
newContextName	string	relevantcontent	Deprecated, use 'relevantChunkContext'.
irrelevantChunkContext	string	excludedcontent	Irrelevant text chunks are copied into this context.
retrieveFieldContext	string	htmlcontent	Original text chunks are moved into this context.
irrelevantChunkAnnotation	string		If set, the HTMLRelevantContentExtractor adds an annotation with this name to each irrelevant chunk.
minScore	int	15	Internally, the HTMLRelevantContentExtractor assigns a score to each chunk of its input. Only chunks with a score greater than 'minScore' are kept.
minParagraphWords	int	7	The minimum number of words a <p> chunk must have to be considered as a paragraph and be boosted.
minTitleWords	int	3	The minimum number of words a title must have to be boosted.
linkAllowedInTitle	boolean	True	By default, links contained in a page title incur a score penalty; this behavior can be disabled.
paragraphBoost	int	10	Each time a paragraph is detected, the score is increased by this value.
maxWordInLinkRatio	int	2	The maximum allowed ratio of words contained in links in a chunk of text.
titleBoost	int	5	Each time a title is detected, the score is increased by this value.
classBoost	int	10	Each time a CSS class listed in 'idsAndClassesToKeep' is detected, the score is increased by this value.
keepOnlyBestChunk	boolean		If true, the 'relevantcontent' context consists only of the main article of the page.
skipBlockquotes	boolean		If true, HTML blockquote tags are skipped.
skipPre	boolean		If true, HTML pre tags are skipped.
keepImages	boolean		If true, the HTML image annotations will be kept in the new context.

Name	Type	Description
fromDataModel	com.exalead.indexing.analysis.v10.DocumentProcessor	If dataModelState is "customized", the original document processor generated by the data model. Use this to easily revert from the "customized" state to "auto".
idsAndClassesToIgnore	exa.bee.StringValue*	The list of CSS classes and HTML ids to ignore.
idsAndClassesToKeep	exa.bee.StringValue*	The list of CSS classes and HTML ids to boost.
annotationsToCopy	exa.bee.StringValue*	The list of annotations to keep in the new context.
AcceptCondition	com.exalead.indexing.analysis.v10.AcceptCondition	The condition under which this DocumentProcessor is enabled.
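
The scoring parameters above (minScore, minParagraphWords, minTitleWords, paragraphBoost, titleBoost, classBoost) may be easier to reason about with a toy model. The sketch below is not the processor's actual algorithm, which is internal and not described here; it only assumes that boosts are additive, that word-count minimums are inclusive, and that a chunk is relevant when its score is strictly greater than minScore. The names Chunk, score_chunk, and split_chunks are hypothetical, and link penalties (linkAllowedInTitle, maxWordInLinkRatio) are left out.

```python
from dataclasses import dataclass, field

# Default values taken from the property table above.
MIN_SCORE = 15
MIN_PARAGRAPH_WORDS = 7
MIN_TITLE_WORDS = 3
PARAGRAPH_BOOST = 10
TITLE_BOOST = 5
CLASS_BOOST = 10

TITLE_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


@dataclass
class Chunk:
    """Hypothetical stand-in for one chunk of HTML input."""
    tag: str                       # e.g. "p", "h1", "div"
    text: str
    css_classes: list = field(default_factory=list)


def score_chunk(chunk, ids_and_classes_to_keep=()):
    """Toy additive score; an assumption, not the real internal scoring."""
    keep = set(ids_and_classes_to_keep)
    words = len(chunk.text.split())
    score = 0
    # A <p> chunk with enough words counts as a paragraph and is boosted.
    if chunk.tag == "p" and words >= MIN_PARAGRAPH_WORDS:
        score += PARAGRAPH_BOOST
    # A title with enough words is boosted.
    if chunk.tag in TITLE_TAGS and words >= MIN_TITLE_WORDS:
        score += TITLE_BOOST
    # Each CSS class listed in idsAndClassesToKeep adds classBoost.
    score += CLASS_BOOST * sum(1 for c in chunk.css_classes if c in keep)
    return score


def split_chunks(chunks, ids_and_classes_to_keep=()):
    """Split chunks into (relevant, irrelevant) using the minScore threshold."""
    relevant, irrelevant = [], []
    for chunk in chunks:
        if score_chunk(chunk, ids_and_classes_to_keep) > MIN_SCORE:
            relevant.append(chunk)
        else:
            irrelevant.append(chunk)
    return relevant, irrelevant
```

Under these assumptions, a long paragraph alone scores 10 and falls below the default minScore of 15, while the same paragraph carrying a class listed in idsAndClassesToKeep scores 20 and is kept. This illustrates why tuning idsAndClassesToKeep together with minScore can change which chunks end up in the relevant context.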