NativeTextExtractor
com.exalead.indexing.analysis.v10.NativeTextExtractor
Extraction is performed for the following data types:
text/plain for text files.
text/html for HTML files.
application/x-exalead-document for the CloudView 4.6 document format (com.exalead.document).
application/x-exalead-ndoc for the CloudView 5 internal document format, binary.
application/x-exalead-ndoc-v10+xml for the CloudView internal document format, XML.
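As an illustration of the list above, the following Python sketch checks whether a content type is one of the listed data types. The helper name and the normalization rules (ignoring parameters such as a charset, comparing case-insensitively) are assumptions made for this example and are not taken from CloudView.

```python
# Illustrative helper, not part of CloudView: all names here are hypothetical.
SUPPORTED_MIME_TYPES = {
    "text/plain": "Text files",
    "text/html": "HTML files",
    "application/x-exalead-document": "CloudView 4.6 document format",
    "application/x-exalead-ndoc": "CloudView 5 internal document format, binary",
    "application/x-exalead-ndoc-v10+xml": "CloudView internal document format, XML",
}


def is_natively_extractable(content_type: str) -> bool:
    """Return True if the MIME type is one listed for NativeTextExtractor.

    Assumption: parameters such as "; charset=utf-8" are ignored and the
    comparison is case-insensitive. The matching rules actually applied
    by the product are not described on this page.
    """
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in SUPPORTED_MIME_TYPES
```

A caller could use such a check to decide whether a document needs a different extractor before it reaches this processor.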
Parent elements:
com.exalead.indexing.analysis.v10.AnalysisPipeline (as AnalysisPipeline)
com.exalead.indexing.analysis.v10.DocumentProcessorGroup (as DocumentProcessorGroup)
Attributes:
Name (type, default value if any): description
name (string): Name of this processor, used only for tracing and debugging.
dataModelState (string): Indicates whether this document processor is managed by a data model. Allowed values: null, auto, customized, error.
If null, this document processor is not related to a data model.
If "auto", this document processor is auto-generated by a data model.
If "customized", this document processor was auto-generated by a data model and then customized.
If "error", there is a conflict between this document processor and the data model.
dataModelClass (string): If dataModelState is either "auto" or "customized", this attribute holds the name of the DataModelClass that generated this DocumentProcessor.
dataModelProperty (string): If dataModelState is either "auto" or "customized", this attribute holds the name of the DataModelProperty that generated this DocumentProcessor.
disabled (boolean): Disables this DocumentProcessor.
annotateHTML (boolean): Adds the following style and structure annotations to DocumentChunks (for HTML files only):
html:p for DocumentChunks generated from <p>
html:row for DocumentChunks generated from <tr>
html:column for DocumentChunks generated from <td> or <th>
html:table for DocumentChunks generated from <table>
html:h1 for DocumentChunks generated from <h1>
html:h2 for DocumentChunks generated from <h2>
html:h3 for DocumentChunks generated from <h3>
html:h4 for DocumentChunks generated from <h4>
html:h5 for DocumentChunks generated from <h5>
html:h6 for DocumentChunks generated from <h6>
html:link for DocumentChunks generated from <a>, <iframe> or <frame>
    html:link:rel if the link has a "rel" attribute
    html:link:name if the link has a "name" attribute
html:list for DocumentChunks generated from <ul>, <ol> or <dl>
html:item for DocumentChunks generated from <li>
html:bold for DocumentChunks generated from <b> or <strong>
html:italic for DocumentChunks generated from <i> or <em>
html:underline for DocumentChunks generated from <u>
html:strike for DocumentChunks generated from <s> or <strike>
html:pre for DocumentChunks generated from <pre>
html:invisible for DocumentChunks containing invisible text (display: none, white on white)
html:class for DocumentChunks inside an element with a CSS class
html:id for DocumentChunks inside an element with a CSS id
html:img:src for DocumentChunks created from an <img>
It also creates specific HTML DocumentChunks with the following contexts:
html:lang when parsing a <html> containing the "lang" attribute
html:xml:lang when parsing a <html> containing the "xml:lang" attribute
html:title when parsing a <title>
html:title:other when parsing a second <title>
html:base:href when parsing a <base>
html:link when parsing a <link> containing the "src" attribute and annotated by:
    html:link:rel if the link has a "rel" attribute
    html:link:type if the link has a "type" attribute
html:http-equiv:NAME when parsing an http-equiv <meta>
html:meta:NAME when parsing a <meta> named "NAME"
skipInvisibleHTMLText (boolean): Skips invisible text, for example white text on a white background (for HTML files only).
extractJs (boolean): Attempts to parse JavaScript and extract links from it.
extractHTMLTables (boolean): Adds annotations on <table>, <tr>, <td> and <th> elements.
extractHTMLStyles (boolean): Adds annotations on style attributes.
extractHTMLForms (boolean): Adds annotations on <form> and <select> elements.
maxHTMLAnnotationDepth (int, default 20): Prevents new annotations from being created beyond maxHTMLAnnotationDepth levels of HTML nesting.
disableAutomaticHTMLDTDFix (boolean): Disables automatic DTD fix on HTML documents.
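To make the annotateHTML and maxHTMLAnnotationDepth descriptions more concrete, here is a toy Python model built on the standard html.parser module. It is not CloudView's parser. The class and function names are hypothetical, and two behaviors are assumptions made for this sketch: a text chunk receives the annotations of all its enclosing elements, and depth is counted as the number of enclosing elements. Only the tag-to-annotation pairs come from the list above.

```python
from html.parser import HTMLParser

# Tag-to-annotation pairs copied from the annotateHTML description.
TAG_ANNOTATIONS = {
    "p": "html:p", "table": "html:table", "tr": "html:row",
    "td": "html:column", "th": "html:column",
    "h1": "html:h1", "h2": "html:h2", "h3": "html:h3",
    "h4": "html:h4", "h5": "html:h5", "h6": "html:h6",
    "a": "html:link", "iframe": "html:link", "frame": "html:link",
    "ul": "html:list", "ol": "html:list", "dl": "html:list",
    "li": "html:item",
    "b": "html:bold", "strong": "html:bold",
    "i": "html:italic", "em": "html:italic",
    "u": "html:underline", "s": "html:strike", "strike": "html:strike",
    "pre": "html:pre",
}

# Elements without a closing tag; they never enclose text.
VOID_TAGS = {"area", "base", "br", "col", "embed", "frame", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class AnnotationSketch(HTMLParser):
    """Collects (text, annotations) pairs. A toy model for illustration."""

    def __init__(self, max_depth=20):
        super().__init__()
        self.max_depth = max_depth
        self.open_tags = []  # stack of currently open tag names
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Assumption: elements nested deeper than max_depth add no annotation.
        enclosing = self.open_tags[: self.max_depth]
        names = [TAG_ANNOTATIONS[t] for t in enclosing if t in TAG_ANNOTATIONS]
        self.chunks.append((text, names))


def annotate(html, max_depth=20):
    """Return the (text, annotations) pairs found in an HTML string."""
    parser = AnnotationSketch(max_depth)
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Calling annotate on a table cell that contains bold text with max_depth=2 keeps only the annotations of the two outermost elements in this model, which is the kind of cut-off the maxHTMLAnnotationDepth attribute describes.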
Nested elements:
Name (type): description
fromDataModel (com.exalead.indexing.analysis.v10.DocumentProcessor): If dataModelState is "customized", this element holds the original document processor generated by the data model. Use it to revert easily from the "customized" state to "auto".
AcceptCondition (com.exalead.indexing.analysis.v10.AcceptCondition): Expresses the condition under which this DocumentProcessor is enabled.
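The attributes documented on this page can be assembled into an element programmatically. The Python sketch below uses the standard xml.etree.ElementTree module and a hypothetical helper, build_native_text_extractor. It assumes an unqualified NativeTextExtractor tag and "true"/"false" for boolean values. The exact serialization, including any XML namespace, should be checked against a configuration file from an actual installation. Nested elements are left out of this sketch.

```python
import xml.etree.ElementTree as ET

# Attribute names taken from the Attributes section of this page.
KNOWN_ATTRIBUTES = {
    "dataModelState", "dataModelClass", "dataModelProperty", "disabled",
    "annotateHTML", "skipInvisibleHTMLText", "extractJs",
    "extractHTMLTables", "extractHTMLStyles", "extractHTMLForms",
    "maxHTMLAnnotationDepth", "disableAutomaticHTMLDTDFix",
}


def build_native_text_extractor(name, **options):
    """Return a NativeTextExtractor element carrying the given attributes.

    Assumption: booleans are written as "true"/"false" and the tag has
    no namespace prefix. Unknown attribute names are rejected so that
    typos are caught early.
    """
    element = ET.Element("NativeTextExtractor", {"name": name})
    for key, value in options.items():
        if key not in KNOWN_ATTRIBUTES:
            raise ValueError(f"unknown attribute: {key}")
        if isinstance(value, bool):
            value = "true" if value else "false"
        element.set(key, str(value))
    return element


# Example: an extractor that annotates HTML and skips invisible text.
example = build_native_text_extractor(
    "html-text",
    annotateHTML=True,
    skipInvisibleHTMLText=True,
    maxHTMLAnnotationDepth=20,
)
print(ET.tostring(example, encoding="unicode"))
```

Rejecting unknown names is a design choice of this sketch. It keeps misspelled attributes from being written silently into a configuration.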