HTMLRelevantContentExtractor (NETVIBES Exalead CloudView Clients SDK)

java.lang.Object
- com.exalead.indexing.analysis.v10.DocumentProcessor
- - com.exalead.indexing.analysis.v10.HTMLRelevantContentExtractor

All Implemented Interfaces:

com.exalead.util.Checkable, java.io.Serializable
```
public class HTMLRelevantContentExtractor
extends DocumentProcessor
implements com.exalead.util.Checkable, java.io.Serializable
```
The HTMLRelevantContentExtractor extracts the most relevant parts of an HTML document.
Generally, the relevant part of an HTML document is the article in the middle of the page. The header, the footer, and the menus are often the same on all pages and should not be indexed.
The extraction can be tuned using different attributes.
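As an illustration of how repeated page furniture (headers, footers, menus) could be excluded by element id or CSS class, here is a minimal sketch. It is not the SDK's implementation: the class `IdClassFilterSketch`, the method `matches`, the exact-name matching rule, and the sample ignore list are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in; not part of the CloudView SDK.
class IdClassFilterSketch {

    /** Returns true when the element's id, or any of its CSS classes, is in the given set. */
    static boolean matches(String id, String classAttribute, Set<String> names) {
        if (id != null && names.contains(id)) {
            return true;
        }
        if (classAttribute == null) {
            return false;
        }
        // An HTML class attribute is a whitespace-separated list of class names.
        for (String cssClass : classAttribute.trim().split("\\s+")) {
            if (names.contains(cssClass)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed ignore list for the example; these are not SDK defaults.
        Set<String> toIgnore = new HashSet<>(Arrays.asList("header", "footer", "nav"));
        System.out.println(matches("footer", null, toIgnore));       // true
        System.out.println(matches(null, "menu nav", toIgnore));     // true
        System.out.println(matches("content", "article", toIgnore)); // false
    }
}
```

The same check could serve for a keep list; how the real processor matches names (exact, prefix, or pattern) is not stated on this page.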

See Also:

Serialized Form

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`HTMLRelevantContentExtractor.AnnotationsToCopy`
`static class`	`HTMLRelevantContentExtractor.IdsAndClassesToIgnore`
`static class`	`HTMLRelevantContentExtractor.IdsAndClassesToKeep`

Nested classes/interfaces inherited from class com.exalead.indexing.analysis.v10.DocumentProcessor
DocumentProcessor.FromDataModel, DocumentProcessor.Transformer<T>

Field Summary

Fields
Modifier and Type	Field and Description
`protected HTMLRelevantContentExtractor.AnnotationsToCopy`	`annotationsToCopy`
`int`	`classBoost`
`static int`	`DEFAULT_CLASS_BOOST`
`static java.lang.String`	`DEFAULT_IRRELEVANT_CHUNK_CONTEXT`
`static boolean`	`DEFAULT_KEEP_IMAGES`
`static boolean`	`DEFAULT_KEEP_ONLY_BEST_CHUNK`
`static boolean`	`DEFAULT_LINK_ALLOWED_IN_TITLE`
`static int`	`DEFAULT_MAX_WORD_IN_LINK_RATIO`
`static int`	`DEFAULT_MIN_PARAGRAPH_WORDS`
`static int`	`DEFAULT_MIN_SCORE`
`static int`	`DEFAULT_MIN_TITLE_WORDS`
`static java.lang.String`	`DEFAULT_NEW_CONTEXT_NAME`
`static int`	`DEFAULT_PARAGRAPH_BOOST`
`static java.lang.String`	`DEFAULT_RELEVANT_CHUNK_CONTEXT`
`static java.lang.String`	`DEFAULT_RETRIEVE_FIELD_CONTEXT`
`static boolean`	`DEFAULT_SKIP_BLOCKQUOTES`
`static boolean`	`DEFAULT_SKIP_PRE`
`static int`	`DEFAULT_TITLE_BOOST`
`protected HTMLRelevantContentExtractor.IdsAndClassesToIgnore`	`idsAndClassesToIgnore`
`protected HTMLRelevantContentExtractor.IdsAndClassesToKeep`	`idsAndClassesToKeep`
`java.lang.String`	`irrelevantChunkAnnotation`
`java.lang.String`	`irrelevantChunkContext`
`boolean`	`keepImages`
`boolean`	`keepOnlyBestChunk`
`boolean`	`linkAllowedInTitle`
`int`	`maxWordInLinkRatio`
`int`	`minParagraphWords`
`int`	`minScore`
`int`	`minTitleWords`
`java.lang.String`	`newContextName` Deprecated.
`int`	`paragraphBoost`
`java.lang.String`	`relevantChunkContext`
`java.lang.String`	`retrieveFieldContext`
`boolean`	`skipBlockquotes`
`boolean`	`skipPre`
`int`	`titleBoost`

Fields inherited from class com.exalead.indexing.analysis.v10.DocumentProcessor
acceptCondition, dataModelClass, dataModelProperty, dataModelState, DEFAULT_DISABLED, disabled, fromDataModel, name

Constructor Summary

Constructors
Constructor and Description
`HTMLRelevantContentExtractor()`
`HTMLRelevantContentExtractor(HTMLRelevantContentExtractor o)` Copy constructor

Method Summary

Modifier and Type	Method and Description
`<T> T`	`accept(DocumentProcessor.Transformer<T> transformer, T[] t)`
`void`	`check(boolean deep, java.lang.String errorContext)` Checks this HTMLRelevantContentExtractor.
`static HTMLRelevantContentExtractor`	`fromString(java.lang.String s)` Builds an HTMLRelevantContentExtractor from its string representation.
`HTMLRelevantContentExtractor.AnnotationsToCopy`	`getAnnotationsToCopy()`
`int`	`getClassBoost()` Each time a CSS class listed in 'idsAndClassesToKeep' is detected, the score is increased by this value.
`HTMLRelevantContentExtractor.IdsAndClassesToIgnore`	`getIdsAndClassesToIgnore()`
`HTMLRelevantContentExtractor.IdsAndClassesToKeep`	`getIdsAndClassesToKeep()`
`java.lang.String`	`getIrrelevantChunkAnnotation()` If set, the HTMLRelevantContentExtractor will annotate each irrelevant chunk with an annotation.
`java.lang.String`	`getIrrelevantChunkContext()` Irrelevant text chunks are copied to this context.
`int`	`getMaxWordInLinkRatio()` The maximum allowed ratio of words contained in links in a chunk of text.
`int`	`getMinParagraphWords()` The minimum number of words a <p> chunk must have to be considered as a paragraph and be boosted.
`int`	`getMinScore()` Internally, the HTMLRelevantContentExtractor assigns a score to each chunk of its input.
`int`	`getMinTitleWords()` The minimum number of words a title must have to be boosted.
`java.lang.String`	`getNewContextName()` Deprecated.
`int`	`getParagraphBoost()` Each time a paragraph is detected, the score is increased by this value.
`java.lang.String`	`getRelevantChunkContext()` Relevant text chunks are copied to this context.
`java.lang.String`	`getRetrieveFieldContext()` Original text chunks are moved to this context.
`int`	`getTitleBoost()` Each time a title is detected, the score is increased by this value.
`boolean`	`isKeepImages()` If true, the HTML image annotations will be kept in the new context.
`boolean`	`isKeepOnlyBestChunk()` If true, 'relevantcontent' consists only of the main article of the page.
`boolean`	`isLinkAllowedInTitle()` By default, links contained in a page title incur a score penalty; this option disables it.
`boolean`	`isSkipBlockquotes()` Whether to skip HTML blockquote tags.
`boolean`	`isSkipPre()` Whether to skip HTML pre tags.
`HTMLRelevantContentExtractor`	`makeCopy()` Creates and returns a deep copy of this HTMLRelevantContentExtractor.
`static HTMLRelevantContentExtractor`	`readFrom(java.io.InputStream is)` Reads an HTMLRelevantContentExtractor from an XML fragment.
`void`	`setAnnotationsToCopy(HTMLRelevantContentExtractor.AnnotationsToCopy __value)`
`void`	`setClassBoost(int classBoost)` Each time a CSS class listed in 'idsAndClassesToKeep' is detected, the score is increased by this value.
`void`	`setIdsAndClassesToIgnore(HTMLRelevantContentExtractor.IdsAndClassesToIgnore __value)`
`void`	`setIdsAndClassesToKeep(HTMLRelevantContentExtractor.IdsAndClassesToKeep __value)`
`void`	`setIrrelevantChunkAnnotation(java.lang.String irrelevantChunkAnnotation)` If set, the HTMLRelevantContentExtractor will annotate each irrelevant chunk with an annotation.
`void`	`setIrrelevantChunkContext(java.lang.String irrelevantChunkContext)` Irrelevant text chunks are copied to this context.
`void`	`setKeepImages(boolean keepImages)` If true, the HTML image annotations will be kept in the new context.
`void`	`setKeepOnlyBestChunk(boolean keepOnlyBestChunk)` If true, 'relevantcontent' consists only of the main article of the page.
`void`	`setLinkAllowedInTitle(boolean linkAllowedInTitle)` By default, links contained in a page title incur a score penalty; this option disables it.
`void`	`setMaxWordInLinkRatio(int maxWordInLinkRatio)` The maximum allowed ratio of words contained in links in a chunk of text.
`void`	`setMinParagraphWords(int minParagraphWords)` The minimum number of words a <p> chunk must have to be considered as a paragraph and be boosted.
`void`	`setMinScore(int minScore)` Internally, the HTMLRelevantContentExtractor assigns a score to each chunk of its input.
`void`	`setMinTitleWords(int minTitleWords)` The minimum number of words a title must have to be boosted.
`void`	`setNewContextName(java.lang.String newContextName)` Deprecated.
`void`	`setParagraphBoost(int paragraphBoost)` Each time a paragraph is detected, the score is increased by this value.
`void`	`setRelevantChunkContext(java.lang.String relevantChunkContext)` Relevant text chunks are copied to this context.
`void`	`setRetrieveFieldContext(java.lang.String retrieveFieldContext)` Original text chunks are moved to this context.
`void`	`setSkipBlockquotes(boolean skipBlockquotes)` Whether to skip HTML blockquote tags.
`void`	`setSkipPre(boolean skipPre)` Whether to skip HTML pre tags.
`void`	`setTitleBoost(int titleBoost)` Each time a title is detected, the score is increased by this value.
`java.lang.String`	`toString()` String representation of this HTMLRelevantContentExtractor.
`HTMLRelevantContentExtractor`	`withAcceptCondition(AcceptCondition acceptCondition)`
`HTMLRelevantContentExtractor`	`withAnnotationsToCopy(java.util.Collection<StringValue> __values)`
`HTMLRelevantContentExtractor`	`withAnnotationsToCopy(HTMLRelevantContentExtractor.AnnotationsToCopy __value)`
`HTMLRelevantContentExtractor`	`withAnnotationsToCopy(StringValue... __values)`
`HTMLRelevantContentExtractor`	`withClassBoost(int classBoost)`
`HTMLRelevantContentExtractor`	`withClassBoost(java.lang.Integer classBoost)`
`HTMLRelevantContentExtractor`	`withDataModelClass(java.lang.String dataModelClass)`
`HTMLRelevantContentExtractor`	`withDataModelProperty(java.lang.String dataModelProperty)`
`HTMLRelevantContentExtractor`	`withDataModelState(java.lang.String dataModelState)`
`HTMLRelevantContentExtractor`	`withDisabled(boolean disabled)`
`HTMLRelevantContentExtractor`	`withDisabled(java.lang.Boolean disabled)`
`HTMLRelevantContentExtractor`	`withFromDataModel(DocumentProcessor fromDataModel)`
`HTMLRelevantContentExtractor`	`withIdsAndClassesToIgnore(java.util.Collection<StringValue> __values)`
`HTMLRelevantContentExtractor`	`withIdsAndClassesToIgnore(HTMLRelevantContentExtractor.IdsAndClassesToIgnore __value)`
`HTMLRelevantContentExtractor`	`withIdsAndClassesToIgnore(StringValue... __values)`
`HTMLRelevantContentExtractor`	`withIdsAndClassesToKeep(java.util.Collection<StringValue> __values)`
`HTMLRelevantContentExtractor`	`withIdsAndClassesToKeep(HTMLRelevantContentExtractor.IdsAndClassesToKeep __value)`
`HTMLRelevantContentExtractor`	`withIdsAndClassesToKeep(StringValue... __values)`
`HTMLRelevantContentExtractor`	`withIrrelevantChunkAnnotation(java.lang.String irrelevantChunkAnnotation)`
`HTMLRelevantContentExtractor`	`withIrrelevantChunkContext(java.lang.String irrelevantChunkContext)`
`HTMLRelevantContentExtractor`	`withKeepImages(boolean keepImages)`
`HTMLRelevantContentExtractor`	`withKeepImages(java.lang.Boolean keepImages)`
`HTMLRelevantContentExtractor`	`withKeepOnlyBestChunk(boolean keepOnlyBestChunk)`
`HTMLRelevantContentExtractor`	`withKeepOnlyBestChunk(java.lang.Boolean keepOnlyBestChunk)`
`HTMLRelevantContentExtractor`	`withLinkAllowedInTitle(boolean linkAllowedInTitle)`
`HTMLRelevantContentExtractor`	`withLinkAllowedInTitle(java.lang.Boolean linkAllowedInTitle)`
`HTMLRelevantContentExtractor`	`withMaxWordInLinkRatio(int maxWordInLinkRatio)`
`HTMLRelevantContentExtractor`	`withMaxWordInLinkRatio(java.lang.Integer maxWordInLinkRatio)`
`HTMLRelevantContentExtractor`	`withMinParagraphWords(int minParagraphWords)`
`HTMLRelevantContentExtractor`	`withMinParagraphWords(java.lang.Integer minParagraphWords)`
`HTMLRelevantContentExtractor`	`withMinScore(int minScore)`
`HTMLRelevantContentExtractor`	`withMinScore(java.lang.Integer minScore)`
`HTMLRelevantContentExtractor`	`withMinTitleWords(int minTitleWords)`
`HTMLRelevantContentExtractor`	`withMinTitleWords(java.lang.Integer minTitleWords)`
`HTMLRelevantContentExtractor`	`withName(java.lang.String name)`
`HTMLRelevantContentExtractor`	`withNewContextName(java.lang.String newContextName)` Deprecated.
`HTMLRelevantContentExtractor`	`withParagraphBoost(int paragraphBoost)`
`HTMLRelevantContentExtractor`	`withParagraphBoost(java.lang.Integer paragraphBoost)`
`HTMLRelevantContentExtractor`	`withRelevantChunkContext(java.lang.String relevantChunkContext)`
`HTMLRelevantContentExtractor`	`withRetrieveFieldContext(java.lang.String retrieveFieldContext)`
`HTMLRelevantContentExtractor`	`withSkipBlockquotes(boolean skipBlockquotes)`
`HTMLRelevantContentExtractor`	`withSkipBlockquotes(java.lang.Boolean skipBlockquotes)`
`HTMLRelevantContentExtractor`	`withSkipPre(boolean skipPre)`
`HTMLRelevantContentExtractor`	`withSkipPre(java.lang.Boolean skipPre)`
`HTMLRelevantContentExtractor`	`withTitleBoost(int titleBoost)`
`HTMLRelevantContentExtractor`	`withTitleBoost(java.lang.Integer titleBoost)`
`void`	`writeTo(java.io.OutputStream os)` Writes this HTMLRelevantContentExtractor as an XML fragment.

Methods inherited from class com.exalead.indexing.analysis.v10.DocumentProcessor
getAcceptCondition, getDataModelClass, getDataModelProperty, getDataModelState, getFromDataModel, getName, isDisabled, setAcceptCondition, setDataModelClass, setDataModelProperty, setDataModelState, setDisabled, setFromDataModel, setName

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Field Detail

relevantChunkContext

public java.lang.String relevantChunkContext

DEFAULT_RELEVANT_CHUNK_CONTEXT

public static final java.lang.String DEFAULT_RELEVANT_CHUNK_CONTEXT

See Also: Constant Field Values

newContextName

@Deprecated
public java.lang.String newContextName

Deprecated.

DEFAULT_NEW_CONTEXT_NAME

public static final java.lang.String DEFAULT_NEW_CONTEXT_NAME

See Also: Constant Field Values

irrelevantChunkContext

public java.lang.String irrelevantChunkContext

DEFAULT_IRRELEVANT_CHUNK_CONTEXT

public static final java.lang.String DEFAULT_IRRELEVANT_CHUNK_CONTEXT

See Also: Constant Field Values

retrieveFieldContext

public java.lang.String retrieveFieldContext

DEFAULT_RETRIEVE_FIELD_CONTEXT

public static final java.lang.String DEFAULT_RETRIEVE_FIELD_CONTEXT

See Also: Constant Field Values

irrelevantChunkAnnotation

public java.lang.String irrelevantChunkAnnotation

minScore
```
public int minScore
```

DEFAULT_MIN_SCORE
```
public static final int DEFAULT_MIN_SCORE
```
See Also:

Constant Field Values

minParagraphWords
```
public int minParagraphWords
```

DEFAULT_MIN_PARAGRAPH_WORDS
```
public static final int DEFAULT_MIN_PARAGRAPH_WORDS
```
See Also:

Constant Field Values

minTitleWords
```
public int minTitleWords
```

DEFAULT_MIN_TITLE_WORDS
```
public static final int DEFAULT_MIN_TITLE_WORDS
```
See Also:

Constant Field Values

linkAllowedInTitle
```
public boolean linkAllowedInTitle
```

DEFAULT_LINK_ALLOWED_IN_TITLE
```
public static final boolean DEFAULT_LINK_ALLOWED_IN_TITLE
```
See Also:

Constant Field Values

paragraphBoost
```
public int paragraphBoost
```

DEFAULT_PARAGRAPH_BOOST
```
public static final int DEFAULT_PARAGRAPH_BOOST
```
See Also:

Constant Field Values

maxWordInLinkRatio
```
public int maxWordInLinkRatio
```

DEFAULT_MAX_WORD_IN_LINK_RATIO
```
public static final int DEFAULT_MAX_WORD_IN_LINK_RATIO
```
See Also:

Constant Field Values

titleBoost
```
public int titleBoost
```

DEFAULT_TITLE_BOOST
```
public static final int DEFAULT_TITLE_BOOST
```
See Also:

Constant Field Values

classBoost
```
public int classBoost
```

DEFAULT_CLASS_BOOST
```
public static final int DEFAULT_CLASS_BOOST
```
See Also:

Constant Field Values

idsAndClassesToIgnore

protected HTMLRelevantContentExtractor.IdsAndClassesToIgnore idsAndClassesToIgnore

idsAndClassesToKeep

protected HTMLRelevantContentExtractor.IdsAndClassesToKeep idsAndClassesToKeep

keepOnlyBestChunk
```
public boolean keepOnlyBestChunk
```

DEFAULT_KEEP_ONLY_BEST_CHUNK
```
public static final boolean DEFAULT_KEEP_ONLY_BEST_CHUNK
```
See Also:

Constant Field Values

skipBlockquotes
```
public boolean skipBlockquotes
```

DEFAULT_SKIP_BLOCKQUOTES
```
public static final boolean DEFAULT_SKIP_BLOCKQUOTES
```
See Also:

Constant Field Values

skipPre
```
public boolean skipPre
```

DEFAULT_SKIP_PRE
```
public static final boolean DEFAULT_SKIP_PRE
```
See Also:

Constant Field Values

keepImages
```
public boolean keepImages
```

DEFAULT_KEEP_IMAGES
```
public static final boolean DEFAULT_KEEP_IMAGES
```
See Also:

Constant Field Values

annotationsToCopy

protected HTMLRelevantContentExtractor.AnnotationsToCopy annotationsToCopy

Constructor Detail

HTMLRelevantContentExtractor
```
public HTMLRelevantContentExtractor()
```

HTMLRelevantContentExtractor

public HTMLRelevantContentExtractor(HTMLRelevantContentExtractor o)

Copy constructor

Method Detail

withAcceptCondition

public HTMLRelevantContentExtractor withAcceptCondition(AcceptCondition acceptCondition)

Overrides: withAcceptCondition in class DocumentProcessor

withName

public HTMLRelevantContentExtractor withName(java.lang.String name)

Overrides: withName in class DocumentProcessor

withDataModelState

public HTMLRelevantContentExtractor withDataModelState(java.lang.String dataModelState)

Overrides: withDataModelState in class DocumentProcessor

withFromDataModel

public HTMLRelevantContentExtractor withFromDataModel(DocumentProcessor fromDataModel)

withDataModelClass

public HTMLRelevantContentExtractor withDataModelClass(java.lang.String dataModelClass)

Overrides: withDataModelClass in class DocumentProcessor

withDataModelProperty

public HTMLRelevantContentExtractor withDataModelProperty(java.lang.String dataModelProperty)

Overrides: withDataModelProperty in class DocumentProcessor

withDisabled

public HTMLRelevantContentExtractor withDisabled(boolean disabled)

Overrides: withDisabled in class DocumentProcessor

withDisabled

public HTMLRelevantContentExtractor withDisabled(java.lang.Boolean disabled)

Overrides: withDisabled in class DocumentProcessor
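The `with*` overrides listed here return HTMLRelevantContentExtractor rather than DocumentProcessor, which allows calls to be chained without casts. The following is a minimal sketch of that covariant-return pattern. `BaseProcessor` and `Extractor` are hypothetical stand-ins, not the SDK classes, and the assumption that each method returns `this` is not confirmed by this page.

```java
// Hypothetical stand-ins illustrating covariant fluent overrides; not the SDK types.
class FluentOverrideSketch {

    static class BaseProcessor {
        String name;
        boolean disabled;

        BaseProcessor withName(String name) {
            this.name = name;
            return this;
        }

        BaseProcessor withDisabled(boolean disabled) {
            this.disabled = disabled;
            return this;
        }
    }

    static class Extractor extends BaseProcessor {
        int minScore;

        // Narrowing the return type keeps subclass-only methods reachable in a chain.
        @Override
        Extractor withName(String name) {
            super.withName(name);
            return this;
        }

        @Override
        Extractor withDisabled(boolean disabled) {
            super.withDisabled(disabled);
            return this;
        }

        Extractor withMinScore(int minScore) {
            this.minScore = minScore;
            return this;
        }
    }

    public static void main(String[] args) {
        // withMinScore is callable after withName only because withName returns Extractor.
        Extractor e = new Extractor().withName("relevant-content").withDisabled(false).withMinScore(5);
        System.out.println(e.name + " " + e.disabled + " " + e.minScore);
    }
}
```

Without the overrides, `withName(...)` would return the base type and the chain would stop compiling at the first subclass-specific call.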

setRelevantChunkContext

public void setRelevantChunkContext(java.lang.String relevantChunkContext)

Relevant text chunks are copied to this context.

getRelevantChunkContext
```
public java.lang.String getRelevantChunkContext()
```
Relevant text chunks are copied to this context.

withRelevantChunkContext

public HTMLRelevantContentExtractor withRelevantChunkContext(java.lang.String relevantChunkContext)

setNewContextName

@Deprecated
public void setNewContextName(java.lang.String newContextName)

Deprecated. Use 'relevantChunkContext' instead.

getNewContextName
```
@Deprecated
public java.lang.String getNewContextName()
```
Deprecated. Use 'relevantChunkContext' instead.

withNewContextName

@Deprecated
public HTMLRelevantContentExtractor withNewContextName(java.lang.String newContextName)

Deprecated.

setIrrelevantChunkContext

public void setIrrelevantChunkContext(java.lang.String irrelevantChunkContext)

Irrelevant text chunks are copied to this context.

getIrrelevantChunkContext
```
public java.lang.String getIrrelevantChunkContext()
```
Irrelevant text chunks are copied to this context.
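To make the relevant/irrelevant context split concrete, the sketch below routes already-classified chunks into two named contexts. It is only an illustration: `ContextRoutingSketch`, `route`, and the map-based representation are assumptions, and the context name "irrelevantcontent" is a made-up example value, not a documented default.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in; the SDK's document model is not shown on this page.
class ContextRoutingSketch {

    /** Copies each chunk into the relevant or the irrelevant context according to its flag. */
    static Map<String, List<String>> route(Map<String, Boolean> chunkRelevance,
                                           String relevantChunkContext,
                                           String irrelevantChunkContext) {
        Map<String, List<String>> contexts = new LinkedHashMap<>();
        contexts.put(relevantChunkContext, new ArrayList<>());
        contexts.put(irrelevantChunkContext, new ArrayList<>());
        for (Map.Entry<String, Boolean> entry : chunkRelevance.entrySet()) {
            String target = entry.getValue() ? relevantChunkContext : irrelevantChunkContext;
            contexts.get(target).add(entry.getKey());
        }
        return contexts;
    }

    public static void main(String[] args) {
        Map<String, Boolean> chunks = new LinkedHashMap<>();
        chunks.put("Site menu", false);
        chunks.put("Article body", true);
        // Context names below are example values only.
        System.out.println(route(chunks, "relevantcontent", "irrelevantcontent"));
    }
}
```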

withIrrelevantChunkContext

public HTMLRelevantContentExtractor withIrrelevantChunkContext(java.lang.String irrelevantChunkContext)

setRetrieveFieldContext

public void setRetrieveFieldContext(java.lang.String retrieveFieldContext)

Original text chunks are moved to this context.

getRetrieveFieldContext
```
public java.lang.String getRetrieveFieldContext()
```
Original text chunks are moved to this context.

withRetrieveFieldContext

public HTMLRelevantContentExtractor withRetrieveFieldContext(java.lang.String retrieveFieldContext)

setIrrelevantChunkAnnotation
```
public void setIrrelevantChunkAnnotation(java.lang.String irrelevantChunkAnnotation)
```
If set, the HTMLRelevantContentExtractor will annotate each irrelevant chunk with an annotation.

getIrrelevantChunkAnnotation
```
public java.lang.String getIrrelevantChunkAnnotation()
```
If set, the HTMLRelevantContentExtractor will annotate each irrelevant chunk with an annotation.

withIrrelevantChunkAnnotation

public HTMLRelevantContentExtractor withIrrelevantChunkAnnotation(java.lang.String irrelevantChunkAnnotation)

setMinScore
```
public void setMinScore(int minScore)
```
Internally, the HTMLRelevantContentExtractor assigns a score to each chunk of its input. Use 'minScore' to keep only chunks whose score is greater than this value.

getMinScore
```
public int getMinScore()
```
Internally, the HTMLRelevantContentExtractor assigns a score to each chunk of its input. Use 'minScore' to keep only chunks whose score is greater than this value.
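The threshold described for 'minScore' can be pictured as a strict greater-than filter over scored chunks. This is a sketch under assumptions: `MinScoreFilterSketch` and `ScoredChunk` are hypothetical names, and the scores are invented for the example rather than produced by the SDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in; not the SDK's internal chunk representation.
class MinScoreFilterSketch {

    static final class ScoredChunk {
        final String text;
        final int score;

        ScoredChunk(String text, int score) {
            this.text = text;
            this.score = score;
        }
    }

    /** Keeps only chunks whose score is strictly greater than minScore. */
    static List<ScoredChunk> keepAboveMinScore(List<ScoredChunk> chunks, int minScore) {
        List<ScoredChunk> kept = new ArrayList<>();
        for (ScoredChunk chunk : chunks) {
            if (chunk.score > minScore) {
                kept.add(chunk);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<ScoredChunk> chunks = Arrays.asList(
                new ScoredChunk("nav links", 2),
                new ScoredChunk("article body", 40),
                new ScoredChunk("footer", 5));
        // With minScore = 5, the footer (score 5) is dropped because the test is strict.
        for (ScoredChunk kept : keepAboveMinScore(chunks, 5)) {
            System.out.println(kept.text); // prints "article body"
        }
    }
}
```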

withMinScore

public HTMLRelevantContentExtractor withMinScore(int minScore)

withMinScore

public HTMLRelevantContentExtractor withMinScore(java.lang.Integer minScore)

setMinParagraphWords
```
public void setMinParagraphWords(int minParagraphWords)
```
The minimum number of words a <p> chunk must have to be considered as a paragraph and be boosted.

getMinParagraphWords
```
public int getMinParagraphWords()
```
The minimum number of words a <p> chunk must have to be considered as a paragraph and be boosted.
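A word-count threshold such as 'minParagraphWords' (or 'minTitleWords') could be checked as sketched below. The whitespace-based word counting is an assumption for illustration; the SDK most likely relies on its own tokenizer, and `ParagraphWordCountSketch` is a hypothetical name.

```java
// Hypothetical stand-in; word counting here is plain whitespace splitting.
class ParagraphWordCountSketch {

    /** Counts whitespace-separated words; blank text has zero words. */
    static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** A <p> chunk qualifies when it has at least minParagraphWords words. */
    static boolean qualifiesAsParagraph(String paragraphText, int minParagraphWords) {
        return countWords(paragraphText) >= minParagraphWords;
    }

    public static void main(String[] args) {
        // The threshold 5 is an example value, not the SDK default.
        System.out.println(qualifiesAsParagraph("Read more", 5));               // false
        System.out.println(qualifiesAsParagraph("one two three four five", 5)); // true
    }
}
```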

withMinParagraphWords

public HTMLRelevantContentExtractor withMinParagraphWords(int minParagraphWords)

withMinParagraphWords

public HTMLRelevantContentExtractor withMinParagraphWords(java.lang.Integer minParagraphWords)

setMinTitleWords
```
public void setMinTitleWords(int minTitleWords)
```
The minimum number of words a title must have to be boosted.

getMinTitleWords
```
public int getMinTitleWords()
```
The minimum number of words a title must have to be boosted.

withMinTitleWords

public HTMLRelevantContentExtractor withMinTitleWords(int minTitleWords)

withMinTitleWords

public HTMLRelevantContentExtractor withMinTitleWords(java.lang.Integer minTitleWords)

setLinkAllowedInTitle
```
public void setLinkAllowedInTitle(boolean linkAllowedInTitle)
```
By default, links contained in a page title incur a score penalty; this option disables it.

isLinkAllowedInTitle
```
public boolean isLinkAllowedInTitle()
```
By default, links contained in a page title incur a score penalty; this option disables it.

withLinkAllowedInTitle

public HTMLRelevantContentExtractor withLinkAllowedInTitle(boolean linkAllowedInTitle)

withLinkAllowedInTitle

public HTMLRelevantContentExtractor withLinkAllowedInTitle(java.lang.Boolean linkAllowedInTitle)

setParagraphBoost
```
public void setParagraphBoost(int paragraphBoost)
```
Each time a paragraph is detected, the score is increased by this value.

getParagraphBoost
```
public int getParagraphBoost()
```
Each time a paragraph is detected, the score is increased by this value.

withParagraphBoost

public HTMLRelevantContentExtractor withParagraphBoost(int paragraphBoost)

withParagraphBoost

public HTMLRelevantContentExtractor withParagraphBoost(java.lang.Integer paragraphBoost)

setMaxWordInLinkRatio
```
public void setMaxWordInLinkRatio(int maxWordInLinkRatio)
```
The maximum allowed ratio of words contained in links in a chunk of text.

getMaxWordInLinkRatio
```
public int getMaxWordInLinkRatio()
```
The maximum allowed ratio of words contained in links in a chunk of text.
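The link-word ratio can be illustrated as follows. Because 'maxWordInLinkRatio' is an int, this sketch assumes the ratio is expressed as a whole-number percentage; the page does not state the unit, nor what happens to a chunk that exceeds the limit. `LinkRatioSketch` and its methods are hypothetical.

```java
// Hypothetical stand-in; the percentage unit is an assumption, not documented here.
class LinkRatioSketch {

    /** Share of a chunk's words that sit inside links, as a floored integer percentage. */
    static int linkWordPercent(int wordsInLinks, int totalWords) {
        if (totalWords <= 0) {
            return 0;
        }
        return 100 * wordsInLinks / totalWords;
    }

    /** True when the chunk has proportionally more link words than allowed. */
    static boolean exceedsLimit(int wordsInLinks, int totalWords, int maxWordInLinkRatio) {
        return linkWordPercent(wordsInLinks, totalWords) > maxWordInLinkRatio;
    }

    public static void main(String[] args) {
        // A menu-like chunk: 18 of 20 words are link text.
        System.out.println(linkWordPercent(18, 20));   // 90
        System.out.println(exceedsLimit(18, 20, 50));  // true
        // An article-like chunk: 3 of 100 words are link text.
        System.out.println(exceedsLimit(3, 100, 50));  // false
    }
}
```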

withMaxWordInLinkRatio

public HTMLRelevantContentExtractor withMaxWordInLinkRatio(int maxWordInLinkRatio)

withMaxWordInLinkRatio

public HTMLRelevantContentExtractor withMaxWordInLinkRatio(java.lang.Integer maxWordInLinkRatio)

setTitleBoost
```
public void setTitleBoost(int titleBoost)
```
Each time a title is detected, the score is increased by this value.

getTitleBoost
```
public int getTitleBoost()
```
Each time a title is detected, the score is increased by this value.

withTitleBoost

public HTMLRelevantContentExtractor withTitleBoost(int titleBoost)

withTitleBoost

public HTMLRelevantContentExtractor withTitleBoost(java.lang.Integer titleBoost)

setClassBoost
```
public void setClassBoost(int classBoost)
```
Each time a CSS class listed in 'idsAndClassesToKeep' is detected, the score is increased by this value.

getClassBoost
```
public int getClassBoost()
```
Each time a CSS class listed in 'idsAndClassesToKeep' is detected, the score is increased by this value.
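Read together, 'paragraphBoost', 'titleBoost', and 'classBoost' suggest an additive score per chunk. The sketch below shows that reading only; the real scoring may include other terms (for example the title-link penalty), and the boost values used are invented for the example, not SDK defaults. `BoostScoreSketch` is a hypothetical name.

```java
// Hypothetical stand-in showing a purely additive reading of the three boosts.
class BoostScoreSketch {

    /** Adds one boost per detected paragraph, title, and kept CSS class. */
    static int score(int paragraphs, int titles, int keptClassHits,
                     int paragraphBoost, int titleBoost, int classBoost) {
        return paragraphs * paragraphBoost
                + titles * titleBoost
                + keptClassHits * classBoost;
    }

    public static void main(String[] args) {
        // 3 paragraphs, 1 title, 2 kept-class hits with example boosts 10, 20, 5.
        System.out.println(score(3, 1, 2, 10, 20, 5)); // 60
    }
}
```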

withClassBoost

public HTMLRelevantContentExtractor withClassBoost(int classBoost)

withClassBoost

public HTMLRelevantContentExtractor withClassBoost(java.lang.Integer classBoost)

getIdsAndClassesToIgnore

public HTMLRelevantContentExtractor.IdsAndClassesToIgnore getIdsAndClassesToIgnore()

setIdsAndClassesToIgnore

public void setIdsAndClassesToIgnore(HTMLRelevantContentExtractor.IdsAndClassesToIgnore __value)

withIdsAndClassesToIgnore

public HTMLRelevantContentExtractor withIdsAndClassesToIgnore(StringValue... __values)

withIdsAndClassesToIgnore

public HTMLRelevantContentExtractor withIdsAndClassesToIgnore(java.util.Collection<StringValue> __values)

withIdsAndClassesToIgnore

public HTMLRelevantContentExtractor withIdsAndClassesToIgnore(HTMLRelevantContentExtractor.IdsAndClassesToIgnore __value)

getIdsAndClassesToKeep

public HTMLRelevantContentExtractor.IdsAndClassesToKeep getIdsAndClassesToKeep()

setIdsAndClassesToKeep

public void setIdsAndClassesToKeep(HTMLRelevantContentExtractor.IdsAndClassesToKeep __value)

withIdsAndClassesToKeep

public HTMLRelevantContentExtractor withIdsAndClassesToKeep(StringValue... __values)

withIdsAndClassesToKeep

public HTMLRelevantContentExtractor withIdsAndClassesToKeep(java.util.Collection<StringValue> __values)

withIdsAndClassesToKeep

public HTMLRelevantContentExtractor withIdsAndClassesToKeep(HTMLRelevantContentExtractor.IdsAndClassesToKeep __value)

setKeepOnlyBestChunk
```
public void setKeepOnlyBestChunk(boolean keepOnlyBestChunk)
```
If true, 'relevantcontent' consists only of the main article of the page.

isKeepOnlyBestChunk
```
public boolean isKeepOnlyBestChunk()
```
If true, 'relevantcontent' consists only of the main article of the page.
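One way to picture 'keepOnlyBestChunk' is as selecting the single highest-scoring chunk instead of every retained chunk. That the "main article" equals the top-scoring chunk is an assumption suggested by the option name; `BestChunkSketch` and the sample scores are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in; chunk names map to example scores.
class BestChunkSketch {

    /** Returns all chunks, or only the highest-scoring one when keepOnlyBestChunk is true. */
    static List<String> select(Map<String, Integer> scoredChunks, boolean keepOnlyBestChunk) {
        if (!keepOnlyBestChunk || scoredChunks.isEmpty()) {
            return new ArrayList<>(scoredChunks.keySet());
        }
        String best = null;
        int bestScore = Integer.MIN_VALUE;
        for (Map.Entry<String, Integer> entry : scoredChunks.entrySet()) {
            if (entry.getValue() > bestScore) {
                best = entry.getKey();
                bestScore = entry.getValue();
            }
        }
        return Collections.singletonList(best);
    }

    public static void main(String[] args) {
        Map<String, Integer> chunks = new LinkedHashMap<>();
        chunks.put("intro", 15);
        chunks.put("article", 60);
        chunks.put("comments", 25);
        System.out.println(select(chunks, true));  // [article]
        System.out.println(select(chunks, false)); // [intro, article, comments]
    }
}
```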

withKeepOnlyBestChunk

public HTMLRelevantContentExtractor withKeepOnlyBestChunk(boolean keepOnlyBestChunk)

withKeepOnlyBestChunk

public HTMLRelevantContentExtractor withKeepOnlyBestChunk(java.lang.Boolean keepOnlyBestChunk)

setSkipBlockquotes

public void setSkipBlockquotes(boolean skipBlockquotes)

Whether to skip HTML blockquote tags.

isSkipBlockquotes
```
public boolean isSkipBlockquotes()
```
Whether to skip HTML blockquote tags.

withSkipBlockquotes

public HTMLRelevantContentExtractor withSkipBlockquotes(boolean skipBlockquotes)

withSkipBlockquotes

public HTMLRelevantContentExtractor withSkipBlockquotes(java.lang.Boolean skipBlockquotes)

setSkipPre

public void setSkipPre(boolean skipPre)

Whether to skip HTML pre tags.

isSkipPre
```
public boolean isSkipPre()
```
Whether to skip HTML pre tags.

withSkipPre

public HTMLRelevantContentExtractor withSkipPre(boolean skipPre)

withSkipPre

public HTMLRelevantContentExtractor withSkipPre(java.lang.Boolean skipPre)

setKeepImages
```
public void setKeepImages(boolean keepImages)
```
If true, the HTML image annotations will be kept in the new context.

isKeepImages
```
public boolean isKeepImages()
```
If true, the HTML image annotations will be kept in the new context.

withKeepImages

public HTMLRelevantContentExtractor withKeepImages(boolean keepImages)

withKeepImages

public HTMLRelevantContentExtractor withKeepImages(java.lang.Boolean keepImages)

getAnnotationsToCopy

public HTMLRelevantContentExtractor.AnnotationsToCopy getAnnotationsToCopy()

setAnnotationsToCopy

public void setAnnotationsToCopy(HTMLRelevantContentExtractor.AnnotationsToCopy __value)

withAnnotationsToCopy

public HTMLRelevantContentExtractor withAnnotationsToCopy(StringValue... __values)

withAnnotationsToCopy

public HTMLRelevantContentExtractor withAnnotationsToCopy(java.util.Collection<StringValue> __values)

withAnnotationsToCopy

public HTMLRelevantContentExtractor withAnnotationsToCopy(HTMLRelevantContentExtractor.AnnotationsToCopy __value)

makeCopy
```
public HTMLRelevantContentExtractor makeCopy()
```
Creates and returns a deep copy of this HTMLRelevantContentExtractor.

Overrides:

makeCopy in class DocumentProcessor
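What "deep copy" means in practice: nested collections are duplicated, so changing the copy leaves the original untouched. The sketch uses a hypothetical `Config` class with two fields named after this page's properties; whether the SDK's makeCopy delegates to the copy constructor is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in illustrating deep-copy semantics; not the SDK class.
class DeepCopySketch {

    static final class Config {
        int minScore;
        List<String> idsAndClassesToKeep = new ArrayList<>();

        Config() {
        }

        /** Copy constructor: duplicates the list instead of sharing the reference. */
        Config(Config other) {
            this.minScore = other.minScore;
            this.idsAndClassesToKeep = new ArrayList<>(other.idsAndClassesToKeep);
        }

        Config makeCopy() {
            return new Config(this);
        }
    }

    public static void main(String[] args) {
        Config original = new Config();
        original.idsAndClassesToKeep.add("article");

        Config copy = original.makeCopy();
        copy.idsAndClassesToKeep.add("main");

        // The original still holds one entry; only the copy was modified.
        System.out.println(original.idsAndClassesToKeep); // [article]
        System.out.println(copy.idsAndClassesToKeep);     // [article, main]
    }
}
```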

readFrom

public static HTMLRelevantContentExtractor readFrom(java.io.InputStream is)
                                             throws javax.xml.bind.JAXBException

Reads an HTMLRelevantContentExtractor from an XML fragment.

Throws: javax.xml.bind.JAXBException

writeTo
```
public void writeTo(java.io.OutputStream os)
             throws javax.xml.bind.JAXBException,
                    java.io.IOException
```
Writes this HTMLRelevantContentExtractor as an XML fragment.

Overrides:

writeTo in class DocumentProcessor

Throws:

javax.xml.bind.JAXBException

java.io.IOException

fromString

public static HTMLRelevantContentExtractor fromString(java.lang.String s)
                                               throws javax.xml.bind.JAXBException,
                                                      java.io.UnsupportedEncodingException

Builds an HTMLRelevantContentExtractor from its string representation.

Throws: javax.xml.bind.JAXBException, java.io.UnsupportedEncodingException

toString
```
public java.lang.String toString()
```
String representation of this HTMLRelevantContentExtractor.

Overrides:

toString in class DocumentProcessor

check
```
public void check(boolean deep,
                  java.lang.String errorContext)
           throws com.exalead.util.TypedException
```
Checks this HTMLRelevantContentExtractor.

Specified by:

check in interface com.exalead.util.Checkable

Overrides:

check in class DocumentProcessor

Throws:

com.exalead.util.TypedException

accept

public <T> T accept(DocumentProcessor.Transformer<T> transformer,
                    T[] t)
             throws com.exalead.util.TypedException

Specified by: accept in class DocumentProcessor
Throws: com.exalead.util.TypedException
