Text Operations

Language detection uses the text of all DocumentChunks associated with the context names specified as input, for which no language has already been detected or specified.

The whole text of all these DocumentChunks is taken into account by a statistical algorithm that detects the language. This language is then set as the language for all specified chunks.

For example, the language attribute of a DocumentChunk is used by semantic processing.

The language is represented by its ISO 639-1 code, for example fr or en.
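The pooling behavior described above can be sketched as follows. This is only an illustration: the Chunk class, its field names, the detect_and_set function, and the toy stopword-based detector are all assumptions made for the example, and the actual statistical algorithm is not documented here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    # Hypothetical stand-in for a DocumentChunk; field names are assumptions.
    context: str
    text: str
    language: Optional[str] = None


# Toy stopword lists; a real detector would rely on a trained statistical model.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "fr": {"le", "la", "et", "est", "de", "les"},
}


def detect_language(text: str) -> Optional[str]:
    """Return the ISO 639-1 code whose stopwords occur most often, or None."""
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def detect_and_set(chunks, contexts):
    """Pool the text of eligible chunks, detect once, and tag all of them."""
    # Only chunks in the requested contexts that have no language yet are used.
    targets = [c for c in chunks if c.context in contexts and c.language is None]
    pooled = " ".join(c.text for c in targets)
    lang = detect_language(pooled)
    if lang is not None:
        for c in targets:
            c.language = lang
    return lang
```

Note that chunks whose language is already set are neither read nor overwritten, which matches the description above.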

Language Setter

The language is set as the language for all the DocumentChunks associated with the context names specified as input.

For example, the language attribute of a DocumentChunk is used by semantic processing.

The language is represented by its ISO 639-1 code, for example fr or en.

Print Values

Prints textual content of DocumentChunks according to a formatting string.

This string contains variables in one of the 3 following formats:

• $(name), the name of a context: output is the textual content of this context.

• $/name:regexp/, the name of a context whose chunks must match the regexp: output is the piece of text that has matched.

• $/name:regexp:format/, the name of a context whose chunks must match the regexp: output is defined by a sed-like format referencing the regexp subexpressions.

Important: In the regexp and format parts, use a backslash to escape colons and slashes.

For example: $(firstname) $(lastname) : $/age:[0-9]+/ $/date:([0-9]{2})([0-9]{2})([0-9]{4}):day=\1 month=\2 year=\3/

Important: The context used in this method cannot be produced by another processor. It must come from the connector.
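To make the three variable formats concrete, here is a minimal Python sketch of how such a formatting string could be expanded. It is not the processor's implementation: the render and split_unescaped helpers are hypothetical, context values are passed as a plain dictionary, and returning an empty string when a regexp does not match is an assumption.

```python
import re

# Matches $(name) or $/name:body/ where body may contain escaped characters.
TOKEN = re.compile(r"\$\((\w+)\)|\$/(\w+):((?:\\.|[^/\\])*)/")


def split_unescaped(body):
    """Split body on unescaped colons; turn \\: and \\/ into literal : and /."""
    parts, cur, i = [], [], 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            # Keep other escapes (such as \1 or \d) untouched for the regexp engine.
            cur.append(nxt if nxt in ":/" else ch + nxt)
            i += 2
        elif ch == ":":
            parts.append("".join(cur))
            cur = []
            i += 1
        else:
            cur.append(ch)
            i += 1
    parts.append("".join(cur))
    return parts


def render(fmt_string, values):
    """Expand every variable in fmt_string using the context values given."""
    def expand(m):
        if m.group(1) is not None:  # $(name)
            return values.get(m.group(1), "")
        parts = split_unescaped(m.group(3))
        found = re.search(parts[0], values.get(m.group(2), ""))
        if found is None:  # assumption: no match produces no output
            return ""
        if len(parts) == 1:  # $/name:regexp/
            return found.group(0)
        return found.expand(":".join(parts[1:]))  # $/name:regexp:format/
    return TOKEN.sub(expand, fmt_string)
```

With the values firstname=Ada, lastname=Lovelace, age="age 36 years" and date=10121815, the example formatting string above would produce "Ada Lovelace : 36 day=10 month=12 year=1815" under these assumptions.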

Replace Regexp

Substitutes substrings in the content of all DocumentChunks having the ContextName 'inputContext', using:

• Pattern as the matching substring regular expression

• and Replacement value, which may have a sed-like output format using references to capture groups \0 through \9.

A new DocumentChunk is created with the substitutions.
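The sed-like substitution can be illustrated with Python's re module. This is a sketch under assumptions: sed_replace is a hypothetical helper, and the processor's regular expression dialect may differ from Python's. Python's own replacement syntax does not treat \0 as the whole match, so the references are resolved by hand here.

```python
import re

# A single-digit back-reference such as \0 or \3 in the replacement value.
REF = re.compile(r"\\([0-9])")


def sed_replace(pattern, replacement, text):
    """Return a new string where each match of pattern is rewritten.

    \\0 in the replacement stands for the whole match and \\1 to \\9 for the
    capture groups. The input text is left unchanged, mirroring the creation
    of a new chunk.
    """
    def build(match):
        # Groups that did not take part in the match are rendered as "".
        return REF.sub(lambda r: match.group(int(r.group(1))) or "", replacement)
    return re.sub(pattern, build, text)
```

For instance, the pattern (\d{4})-(\d{2})-(\d{2}) with the replacement \3/\2/\1 rewrites "Released 2024-03-15." as "Released 15/03/2024.".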

Split Values

Splits the textual content of all DocumentChunks that have the context name defined in Input, using the specified separator as a regular expression.

A new DocumentChunk is created for each segment, with 'outputContext' as the ContextName.
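A minimal sketch of this splitting behavior, assuming chunks are modeled as (context name, text) pairs and that split_values is a hypothetical helper name. Whether the processor keeps empty segments is not stated; this sketch keeps whatever re.split returns.

```python
import re


def split_values(chunks, input_context, output_context, separator):
    """Split matching chunks on a separator regexp and emit one chunk per segment."""
    pattern = re.compile(separator)
    out = []
    for context, text in chunks:
        if context == input_context:
            # Each segment becomes a new chunk under the output context name.
            out.extend((output_context, seg) for seg in pattern.split(text))
    return out
```

For example, splitting the chunk ("tags", "red, green;blue") with the separator \s*[,;]\s* yields three chunks: red, green, and blue.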

String Hash

Computes a signed 32-bit hash of the textual input value.

For example, you can use this value in a field used for grouping.
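The hash function used by the processor is not specified here. As an illustration only, the sketch below uses CRC-32 from Python's zlib module as a stand-in and reinterprets the unsigned result as a signed 32-bit integer; signed_hash32 is a hypothetical name and its values will not necessarily match the processor's output.

```python
import zlib


def signed_hash32(text):
    """Return a deterministic hash of text in the range -2**31 .. 2**31 - 1."""
    # CRC-32 is a stand-in: zlib.crc32 returns an unsigned value 0 .. 2**32 - 1.
    unsigned = zlib.crc32(text.encode("utf-8"))
    # Reinterpret the top bit as a sign bit (two's complement).
    return unsigned - 2**32 if unsigned >= 2**31 else unsigned
```

Because equal strings always produce equal values, such a hash can serve as a compact grouping key, at the cost of occasional collisions between different strings.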

String Transform

Applies textual transformations on chunks from several contexts:

• trims blanks at the beginning and at the end of chunks

• reduces sequences of blanks to only one

• changes text to uppercase/lowercase/normalized/capitalized

Outputs replace inputs.
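The transformations listed above can be sketched in Python as follows. The transform function and its option names are hypothetical, and the meaning of "normalized" is not defined in this section; the sketch assumes it means lowercasing and removing accents.

```python
import re
import unicodedata


def transform(text, trim=True, collapse=True, case=None):
    """Apply the selected textual transformations and return the new text."""
    if trim:
        text = text.strip()
    if collapse:
        # Reduce any run of whitespace to a single space.
        text = re.sub(r"\s+", " ", text)
    if case == "upper":
        text = text.upper()
    elif case == "lower":
        text = text.lower()
    elif case == "capitalized":
        text = text.capitalize()
    elif case == "normalized":
        # Assumption: normalization means lowercase without accents.
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()
    return text
```

Since outputs replace inputs, a caller would assign the returned value back to the chunk it came from.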