Text Operations
 
Concatenate Values
Content Cleanup
Language Detector
Language Setter
Print Values
Replace Regexp
Split Values
String Hash
String Transform
Concatenate Values
Concatenates the textual content of the DocumentChunks whose context names match those specified in Input from, and joins them with the 'join' string.
A single DocumentChunk with the 'outputContext' context name is created as output.
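As an illustration only, the behavior described above can be sketched in Python. The representation of a chunk as a (context name, text) pair and the function name concatenate_values are assumptions made for this sketch; they are not part of the product API.

```python
def concatenate_values(chunks, input_from, join, output_context):
    """Join the text of every chunk whose context name is listed in input_from.

    chunks is a list of (context_name, text) pairs, a hypothetical stand-in
    for DocumentChunks.
    """
    texts = [text for context, text in chunks if context in input_from]
    # A single output chunk carries the joined text under the output context name.
    return (output_context, join.join(texts))


chunks = [("firstname", "Ada"), ("lastname", "Lovelace"), ("age", "36")]
print(concatenate_values(chunks, {"firstname", "lastname"}, " ", "fullname"))
# ('fullname', 'Ada Lovelace')
```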
Content Cleanup
Analyzes each DocumentChunk and removes white space. White space characters are those defined as such by the Unicode specification.
This includes ' ', '\r', and '\n'.
Input: All DocumentChunks associated with the context names specified as input.
Output: Same as input.
Language Detector
Language detection uses the text of all DocumentChunks associated with the context names specified as input, for which a language has not already been detected or specified.
The whole text of all these DocumentChunks is taken into account by a statistical algorithm that detects the language. This language is then set as the language for all specified chunks.
For example, the language attribute of a DocumentChunk is used by semantic processing.
The language is represented by its ISO 639-1 code, for example fr or en.
Language Setter
Sets the specified language as the language of all the DocumentChunks associated with the context names specified as input.
For example, the language attribute of a DocumentChunk is used by semantic processing.
The language is represented by its ISO 639-1 code, for example fr or en.
Print Values
Prints textual content of DocumentChunks according to a formatting string.
This string contains variables in one of the following three formats:
$(name), the name of a context: output is the textual content of this context.
$/name:regexp/, the name of a context whose chunks must match the regexp: output is the piece of text that has matched.
$/name:regexp:format/, the name of a context whose chunks must match the regexp: output is defined by a sed-like format referencing the regexp subexpressions.
Important: In the regexp and format parts, use a backslash to escape colons and slashes.
For example: $(firstname) $(lastname) : $/age:[0-9]+/ $/date:([0-9]{2})([0-9]{2})([0-9]{4}):day=\\1 month=\\2 year=\\3/
Important: The context used in this method cannot be produced by another processor. It must come from the connector.
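To make the three variable formats more concrete, here is a minimal Python sketch of how such a formatting string could be expanded. It is not the processor's implementation. The function print_values, the dictionary of contexts, and the parsing regular expression are all hypothetical, and the sketch assumes that subexpression references are written with a single backslash (\1), as in sed. Your configuration may require a different level of escaping.

```python
import re

# Hypothetical parser for the three variable forms described above.
VARIABLE = re.compile(
    r"\$\((?P<plain>\w+)\)"
    r"|\$/(?P<name>\w+):(?P<regexp>(?:\\.|[^\\/:])*)(?::(?P<format>(?:\\.|[^\\/])*))?/"
)


def unescape(part):
    # Turn \: and \/ back into literal colons and slashes; keep other escapes.
    return re.sub(r"\\([:/])", r"\1", part)


def print_values(template, contexts):
    """Expand template using contexts, a dict of context name -> text."""

    def expand(var):
        if var.group("plain"):
            return contexts.get(var.group("plain"), "")
        text = contexts.get(var.group("name"), "")
        match = re.search(unescape(var.group("regexp")), text)
        if match is None:
            return ""
        if var.group("format") is None:
            # $/name:regexp/ outputs the piece of text that matched.
            return match.group(0)
        # $/name:regexp:format/ resolves sed-like references: \0 is the
        # whole match, \1 to \9 are the subexpressions.
        return re.sub(
            r"\\(\d)",
            lambda ref: match.group(int(ref.group(1))) or "",
            unescape(var.group("format")),
        )

    return VARIABLE.sub(expand, template)


contexts = {"firstname": "Ada", "lastname": "Lovelace", "age": "36", "date": "10121815"}
template = (
    r"$(firstname) $(lastname) : $/age:[0-9]+/ "
    r"$/date:([0-9]{2})([0-9]{2})([0-9]{4}):day=\1 month=\2 year=\3/"
)
print(print_values(template, contexts))
# Ada Lovelace : 36 day=10 month=12 year=1815
```

In this sketch, a variable whose context is missing or whose regexp does not match expands to an empty string. That choice is an assumption, since the behavior for non-matching chunks is not described here.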
Replace Regexp
Substitutes substrings in the content of all DocumentChunks that have the ContextName 'inputContext', using:
Pattern as the regular expression that matches the substring to replace
Replacement value as the substitute text, which may use a sed-like output format with references \0 through \9 to the captured subexpressions.
A new DocumentChunk is created with the substitutions.
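The following Python sketch shows one way such a substitution could work. It is an assumption-laden illustration, not the processor's code: chunks are modeled as (context name, text) pairs, the function replace_regexp and its output_context parameter are hypothetical, and references are assumed to use a single backslash, with \0 standing for the whole match.

```python
import re


def replace_regexp(chunks, input_context, pattern, replacement, output_context):
    """Return new (context, text) chunks with the substitutions applied.

    output_context is a hypothetical parameter naming the created chunks.
    """
    regex = re.compile(pattern)

    def expand(match):
        # Resolve sed-like references \0..\9 against the current match.
        return re.sub(
            r"\\(\d)",
            lambda ref: match.group(int(ref.group(1))) or "",
            replacement,
        )

    return [
        (output_context, regex.sub(expand, text))
        for context, text in chunks
        if context == input_context
    ]


chunks = [("date", "10121815"), ("title", "unchanged")]
print(replace_regexp(chunks, "date", r"(\d{2})(\d{2})(\d{4})", r"\3-\2-\1", "isodate"))
# [('isodate', '1815-12-10')]
```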
Split Values
Splits the textual content of all DocumentChunks that have the context name defined in Input from, using the specified separator as a regular expression.
A new DocumentChunk is created for each segment, with 'outputContext' as the ContextName.
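As a rough sketch of this behavior in Python, assuming chunks are modeled as (context name, text) pairs and using a hypothetical function name split_values:

```python
import re


def split_values(chunks, input_from, separator, output_context):
    """Split matching chunks on a separator regexp, one new chunk per segment."""
    result = []
    for context, text in chunks:
        if context == input_from:
            # re.split treats the separator as a regular expression.
            for segment in re.split(separator, text):
                result.append((output_context, segment))
    return result


chunks = [("tags", "red, green;blue")]
print(split_values(chunks, "tags", r"\s*[,;]\s*", "tag"))
# [('tag', 'red'), ('tag', 'green'), ('tag', 'blue')]
```

Note that this sketch keeps empty segments, for example when the text starts with a separator. Whether the processor keeps or drops them is not stated here.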
String Hash
Computes a signed 32-bit hash of the textual input value.
For example, you can use this value in a field used for grouping.
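The hash function used by the processor is not specified here. The Python sketch below only illustrates what a signed 32-bit hash of a string looks like, using CRC-32 as a stand-in algorithm. Both the choice of CRC-32 and the function name signed_hash32 are assumptions, so the values will not match those produced by the processor.

```python
import zlib


def signed_hash32(text):
    """Hash text to a signed 32-bit integer (CRC-32 used as a stand-in)."""
    unsigned = zlib.crc32(text.encode("utf-8"))  # 0 .. 2**32 - 1
    # Reinterpret the unsigned value as a two's-complement signed integer.
    return unsigned - 2**32 if unsigned >= 2**31 else unsigned


value = signed_hash32("Ada Lovelace")
print(-2**31 <= value < 2**31)  # True
```

Identical input strings always give the same value, which is what makes such a hash usable as a grouping key.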
String Transform
Applies textual transformations on chunks from several contexts:
trims blanks at the beginning and at the end of chunks
reduces sequences of blanks to only one
changes text to uppercase/lowercase/normalized/capitalized
The outputs replace the inputs.
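A minimal Python sketch of these transformations follows. The function name string_transform and its case parameter are hypothetical. The 'normalized' option is left out because its exact meaning is not defined here, and 'capitalized' is assumed to mean an uppercase first letter followed by lowercase text.

```python
import re


def string_transform(text, case=None):
    """Trim, collapse runs of blanks, then optionally change the case."""
    text = text.strip()               # trim blanks at both ends
    text = re.sub(r"\s+", " ", text)  # reduce sequences of blanks to one
    if case == "upper":
        return text.upper()
    if case == "lower":
        return text.lower()
    if case == "capitalized":
        return text.capitalize()
    return text


print(string_transform("  hello   big\tworld "))           # hello big world
print(string_transform("  hello   big\tworld ", "upper"))  # HELLO BIG WORLD
```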