Writing Custom Tokenizers and Semantic Processors

Programmer : CloudView Programmer : The Semantic Factory SDK : Writing Custom Tokenizers and Semantic Processors

Exalead CloudView is delivered with a vast number of semantic processors that can alter documents in analysis pipelines. You can perform most analysis tasks by assembling these processors. However, for advanced and custom operations, it may be more convenient to write custom semantic processors.

You can:

• Replace the analysis pipeline tokenizer with a custom one written in Java.

• Add a custom semantic processor at the end of the analysis pipeline.

Both are implemented as custom document processors, so make sure that you are acquainted with the proper way to develop and deploy them on a Exalead CloudView instance.

For information on how to build and deploy with the Eclipse plugin, see "Develop and deploy components using the Eclipse plugin".

About Tokens and Annotations

Write a Java Custom Tokenizer

Write a Java Custom Semantic Processor

About Tokens and Annotations

Tokenizers and semantic processors work on a stream of annotated tokens:

• Tokenizers produce them from the input text,

• And processors add and remove them according to the algorithm they implement.

Create Tokens

Basically, a token is the aggregation of:

• A form (a piece of the input text),

• A type (alphabetical, punctuation, separator, etc.),

• And an array of annotations.

When developing a tokenizer, you have to create tokens and to define several of the following fields. The framework defines some of them automatically.

package com.exalead.mot.v10;

public class AnnotatedToken
{
/// The token value (form)
public String token;

/// The token kind (type)
public int kind;

/// The language code (defined by com.exalead.lang.Language)
public int lang;

/// The position in the original text of this token (in terms of characters)
public int offset;

/// The list of annotations attached to this token
public Annotation[] annotations;

/// Returns the XML representation of this token and its annotations
public String toString();

/// Returns the annotations attached to this token having the given tag
/// If none, returns the empty list.
public List<Annotation> getAnnotationsWithTag(String tag);

/// Returns the annotations having any of the given tags.
/// If none, returns the empty list.
public List<Annotation> getAnnotationsWithTags(Collection<String> tags);

/// Returns the annotations having the given tag and display form.
/// If none, returns the empty list.
public List<Annotation> getAnnotationsWithTagAndDisplay(String tag, String display);

/// Token kinds and their default interpretation
public final static int TOKEN_UNKNOWN = 0; /// unknown
public final static int TOKEN_SEP_IGNORE = 1; /// separator [[:ctrl:]]
public final static int TOKEN_SEP_SPACE = 2; /// space [[:space:]]
public final static int TOKEN_SEP_SENTENCE = 4; /// sentence
public final static int TOKEN_SEP_PARAGRAPH = 8; /// paragraph (\n\n)
public final static int TOKEN_SEP_QUOTE = 16; /// quote ["']
public final static int TOKEN_SEP_DASH = 32; /// dash [-]
public final static int TOKEN_SEP_PUNCT = 64; /// punct [[:punct:]]
public final static int TOKEN_NUMBER = 128; /// number [0-9]+
public final static int TOKEN_ALPHANUM = 256; /// alphanum [a-zA-Z0-9]+
public final static int TOKEN_ALPHA = 512; /// alpha [a-zA-Z]+
}

Create Annotations

Annotations are pairs of key/value strings (tag and display form) of a certain length, expressed in tokens.

In addition, an arbitrary integer value between [0, 100] is associated (trust level). Its semantics is left to the implementors of algorithms.

Here is the definition of com.exalead.mot.v10.Annotation:

package com.exalead.mot.v10;

public class Annotation
{
/// The tag (key)
public String tag;

/// The display form (value)
public String displayForm;

/// Length of this annotation
public int nbTokens;

/// The trust level
public int trustLevel;

/// XML representation of this annotation
public String toString();
}

Write a Java Custom Tokenizer

A Java Custom Tokenizer is useful for processing the text with an external analyzer or for implementing a specific behavior. The JavaCustomTokenizer allows you to write your own code for splitting the input and possibly adding annotations to the produced tokens.

These tokens then follow their way in the indexing chain as usual (see Sample Tokenizer ).

Write a Java Custom Tokenizer

If you derive your MyTokenizer class from com.exalead.pdoc.analysis.JavaCustomTokenizer, you have to implement at least the following:

@PropertyLabel(value = "JavaCustomTokenizer")
@CVComponentConfigClass(configClass = MyTokenizerConfig.class)
@CVComponentDescription(value = "My tokenizer in Java")

public class MyTokenizer extends com.exalead.pdoc.analysis.JavaCustomTokenizer
{
/**
* CustomDocumentProcessor requires a constructor accepting a custom configuration
* This constructor must call JavaCustomTokenizer constructor
*/
public MyTokenizer(MyTokenizerConfig config);

/**
* Called when a new document is about to get processed.
*/ public void newDocument();

/**
* Called when there is no more input to process in the current document.
* This is the last chance to attach annotations to the document if needed.
*/ public void endDocument();

/** * Called when a new input chunk is to be processed.
* The processor must:
* - produce tokens from the text using newToken() and newAnnotation()
* - send the said tokens to the semantic pipe with pushToken()
*
* @param text the chunk text
* @param language the chunk language
* @param context the chunk context
* @throws InvalidTokenException
* @see newToken(), newAnnotation(), pushToken()
* @post Concatenation of the token forms must be strictly equal to the input string
*/
public void processChunk(String text, int language, String context) throws Exception;

/**
* Called at initialization to retrieve the annotation tags that are planned to be produced during
tokenization.
* @return the list of all annotation tags needed or null if none
*/
public String[] declareAnnotations();
}

The JavaCustomTokenizer provides a handful of helper methods:

package com.exalead.pdoc.analysis;

public abstract class JavaCustomTokenizer extends CustomDocumentProcessor
{
...
/**
* Allocate a new token of the provided form.
* The token is either created or recycled from a previous use.
* The token type and language are deduced from the form and the chunk input language
(they can be overridden though).
* @param form the new token form
* @return a fresh or recycled token
* @pre form is not null
* @pre form is not empty
* @post token kind is set using default typing
*/
protected AnnotatedToken newToken(String form) throws InvalidTokenException;
/**
* Allocate a new annotation with the provided tag, value and length.
* The annotation is either created or recycled from a previous use.
* @param tag the new annotation tag
* @param displayForm the new annotation value
* @param nbTokens the new annotation length
* @return a fresh or recycled annotation
* @pre tag must have been declared in declareAnnotation()
* @pre displayForm is not null
* @pre nbTokens > 0
*/
protected Annotation newAnnotation(String tag, String displayForm, int nbTokens) throws
InvalidAnnotationException;
/**
* Send a token to the output stream.
*
* - validity of the token is checked
* - the token is added to the output buffer
* - if needed, the output buffer is flushed
* - the token is recycled
* In all cases, the token and its annotations are no longer usable after the call.
* @param token A token allocated through a call to newToken()
* @pre token is not null
* @pre token form is not null nor empty
* @pre token type is defined
* @see newToken(), newAnnotation()
*/
protected void pushToken(AnnotatedToken token) throws InvalidTokenException;
/**
* Attach an annotation to the currently processed document after checking its validity.
* @param annotation the annotation to attach
* @pre annotation must have been allocated with newAnnotation()
* @see newAnnotation()
*/
protected void addDocumentAnnotation(Annotation annotation) throws InvalidAnnotationException;
...
}

Caveats for Tokenizers

• When creating new tokens, you have to specify their form (attribute token) and possibly their annotation set. The kind, language, and offset are automatically defined. You may want to override the kind if the default typing does not suit your needs. Be careful with the token kind, as it has a huge impact on the way a token is indexed (or not). If a token is malformed, newToken() or pushToken() throw InvalidTokenException.

• When creating new annotations, you have to specify their tag, display form, number of tokens, and possibly their trust level. If an annotation is malformed, newAnnotation() or addDocumentAnnotation() throw an InvalidAnnotationException.

• Do not allocate annotated tokens and annotations by yourself, always use newToken() and newAnnotation() as they are pooled and recycled once pushed to the pipeline, to save RAM.

• Since tokens and annotations are recycled, they are not usable anymore once pushed to the pipeline. Request a new token/annotation through newToken()/newAnnotation() if required.

• Avoid allocating too many tokens and annotations before pushing them to the pipeline. Ideally, to guarantee optimal RAM consumption, push a token before allocating a new one.

• Your code is executed in a multi-threaded environment. Each thread has its own tokenizer so that your code does not have to be thread safe. However, threads share static objects, and this may lead to issues in case of concurrent modifications.

Sample Tokenizer

This sample demonstrates how to write a custom tokenizer that:

• Splits the input text into alphabetical, numerical, punctuation, or blank tokens.

• Annotates with "Capitalized" each token that starts with an upper-case letter.

• Annotates with "Number" each token made of digits.

• Pushes the produced tokens to the semantic pipeline.

package com.exalead.customcode.analysis;

/**
* This processor can be configured with:
* <CustomDocumentProcessor classId="com.exalead.customcode.analysis.CustomTokenizer">
* <KeyValue key="Meta" value="mymeta" />
* </CustomDocumentProcessor>
*/
@CVComponentConfigClass(configClass = com.exalead.customcode.analysis.CustomTokenizer.
CustomTokenizerConfig.class)
@CVComponentDescription(value = "My tokenizer in Java")

public class CustomTokenizer extends JavaCustomTokenizer
{
public static class CustomTokenizerConfig implements CVComponentConfig
{
private String meta;
public String getMeta() {
return meta;
}
@IsMandatory(true)
public void setMeta(String meta) {
this.meta = meta;
}
}
public CustomTokenizer(CustomTokenizerConfig config) throws Exception {
super(config);
}
@Override
public void newDocument() {
logger.debug("I'm about to start tokenizing a new document!");
}
@Override
public void endDocument() {
logger.debug("Done with this document!");
}
@Override
public void processChunk(String text, int language, String context) throws Exception {
logger.debug("Tokenizing [" + text + "] in context [" + context + "] with language [" + Language.name
(language) + "]");
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
AnnotatedToken token = newToken(text.substring(matcher.start(), matcher.end()));
if (token.token.matches("^\\p{Lu}.*$")) {
Annotation[] a = { newAnnotation("Capitalized", token.token.toLowerCase(), 1) };
token.annotations = a;
} else if (token.token.matches("^[0-9]+$")){
Annotation[] a = { newAnnotation("Number", token.token.toLowerCase(), 1) };
token.annotations = a;
}
pushToken(token);
}
}
@Override
public String[] declareAnnotations() {
return new String[] { "Capitalized", "Number" };
}

private static Pattern pattern = Pattern.compile("\\p{L}+|\\p{N}+|\\p{Z}+|\\p{P}|[^\\p{L}\\p{N}\\p{Z}\\
p{P}]+");
private Logger logger = Logger.getLogger("mycustomtokenizer");
}

Write a Java Custom Semantic Processor

The Java custom semantic processor allows you to plug your code as the last semantic processor in the pipeline.

You use as input a flow of annotated tokens from the pipeline, have an opportunity to add or remove any annotation, and then send the tokens back to the indexing chain.

Note: For more information, see "About Semantic Processors" in the Exalead CloudView Configuration Guide .

Write a Java Custom Semantic Processor

The difference with the Java Custom Tokenizer is in the input:

• The tokenizer receives a text chunk to process.

• While for the Java Custom Semantic Processor, you have to get the tokens from the pipeline (see Sample Semantic Processor).

Derive your MySemanticProcessor class from com.exalead.pdoc.analysis.JavaCustomSemanticProcessor and implement:

@PropertyLabel(value = "JavaCustomSemanticProcessor")
@CVComponentConfigClass(configClass = MySemanticProcessorConfig.class)
@CVComponentDescription(value = "My semantic processor in Java")

public class MySemanticProcessor extends com.exalead.pdoc.analysis.JavaCustomSemanticProcessor
{
public MySemanticProcessor(MySemanticProcessorConfig config) throws Exception;

/**
* Called when a new document is about to get processed.
*/
public void newDocument();

/**
* Called when there is no more input to process in the current document.
* This is the last chance to attach annotations to the document if needed.
*/
public void endDocument();

/**
* Called at initialization to retrieve the annotation tags that are planned to be used during processing.
* Only declared annotations will be accessible on tokens retrieved with getNextToken().
* @return the list of all annotation tags needed or null if none
*/
public String[] declareAnnotations();

/**
* Called when a new input chunk is to be processed.
* The processor must pump tokens from pipe using getNextToken()
* and return them once processed to the pipe with pushToken().
*
* @param text the chunk text
* @param language the chunk language
* @param context the chunk context
* @see getNextToken(), newAnnotation(), pushToken()
*/
public void processChunk(String chunk, int language, String context) throws Exception;
}

The JavaCustomSemanticProcessor provides you with helpers too:

package com.exalead.pdoc.analysis;

public abstract class JavaCustomSemanticProcessor extends CustomDocumentProcessor
{
...
/**
* Pump the next token from the input stream.
* @return the next token from the pipe or null if end of input is reached
*/
protected final AnnotatedToken getNextToken();
/**
* Allocate a new annotation with the provided tag, value and length.
* The annotation is either created or recycled from a previous use.
*
* @param tag the new annotation tag
* @param value the new annotation value
* @param nbTokens the new annotation length
* @return a fresh or recycled annotation
* @pre tag must have been declared in declareAnnotation()
* @pre value is not null
* @pre nbTokens > 0
*/
protected final Annotation newAnnotation(String tag, String displayForm, int nbTokens) throws
InvalidAnnotationException;
/**
* Send a token to the output stream.
*
* - validity of the token is checked
* - the token is added to the output buffer
* - if needed, the output buffer is flushed
* - the token is recycled
*
* In all cases, the token and its annotations are not usable anymore after the call.
*
* @param token A token allocated through a call to newToken()
* @pre token is not null
* @pre token form is not null nor empty
* @pre token type is defined
* @see newToken(), newAnnotation()
*/
protected final void pushToken(AnnotatedToken token) throws InvalidTokenException;
/**
* Attach an annotation to the currently processed document
*
* @param annotation the annotation to attach
* @pre annotation must have been allocated with newAnnotation()
* @see newAnnotation()
*/
protected final void addDocumentAnnotation(Annotation annotation) throws InvalidAnnotationException;
...
}

Caveats for Semantic Processors

• When creating new annotations, you have to define their tag, display form, number of tokens and possibly their trust level. If the annotation is malformed annotation, newAnnotation() or addDocumentAnnotation() throw an InvalidAnnotationException.

• To remove an annotation from a token, assign it a null value in the annotations[] array.

• For the custom tokenizer, do not allocate annotations by yourself but always use newAnnotation() as to save RAM, annotations are pooled and recycled once pushed to the pipeline.

• Since tokens and annotations are recycled, they are not usable anymore once pushed to the pipeline. The only way to get a new token is through getNextToken(). Always allocate annotations through newAllocation().

• Keep as few tokens in memory as possible before pushing them back to the pipeline. Ideally, a token must be pushed before getting the next one from the pipeline.

• Your code is executed in a multi-threaded environment, where each thread has its own processor, so that your code does not have to be thread-safe. However, threads share static objects, which can possibly lead to trouble in case of concurrent modifications.

Sample Semantic Processor

This sample demonstrates how to write a custom semantic processor that:

• Gets tokens from the semantic pipeline

• Looks for a first name that is part of its dictionary

• Checks if the following word starts with a capitalized letter

• Add an annotation people if there is not an annotation NE.people already.

• Pushes tokens back to the pipeline

package com.exalead.customcode.analysis;

/**
* This processor can be configured with:
* <CustomDocumentProcessor classId="com.exalead.customcode.analysis.CustomSemanticProcessor">
* <KeyValue key="Meta" value="mymeta" />
* </CustomDocumentProcessor>
*/

@CVComponentConfigClass(configClass = com.exalead.customcode.analysis.CustomSemanticProcessor.
CustomSemanticProcessorConfig.class)
@CVComponentDescription(value = "My semantic processor in Java")

public class CustomSemanticProcessor extends JavaCustomSemanticProcessor
{
public static class CustomSemanticProcessorConfig implements CVComponentConfig
{
private String meta;
public String getMeta() {
return meta;
}
@IsMandatory(true)
public void setMeta(String meta) {
this.meta = meta;
}
}
public CustomSemanticProcessor(CustomSemanticProcessorConfig config) throws Exception {
super(config);
for(String s : firstNames) {
names.add(s);
}
}

@Override
public void newDocument() {
logger.debug("I'm about to start processing a new document!");
}

@Override
public void endDocument() {
logger.debug("Done with this document!");
}

@Override
public String[] declareAnnotations() {
return new String[] { "people" };
}

@Override
public void processChunk(String chunk, int language, String context) throws Exception {
for (AnnotatedToken token = getNextToken(); token != null; token = getNextToken()) {
if (token.kind == AnnotatedToken.TOKEN_ALPHA && names.contains(token.token) &&
token.getAnnotationsWithTag("NE.people").isEmpty()) {
AnnotatedToken next = getNextToken();
if (next != null) {
if (next.kind == AnnotatedToken.TOKEN_SEP_SPACE) {
AnnotatedToken next2 = getNextToken();
if (next2 != null) {
if (next2.kind == AnnotatedToken.TOKEN_ALPHA && next2.token.matches
("\\p{Lu}.+")) {
Annotation annotation = newAnnotation("people", token.token
+ " " + next2.token, 3);
token.annotations = (Annotation[]) ArrayUtils.add
(token.annotations, annotation);
}
pushToken(token);
pushToken(next);
pushToken(next2);
continue;
}
}
pushToken(token);
pushToken(next);
continue;
}
}
pushToken(token);
}
}

private Logger logger = Logger.getLogger("custom-semantic-processor");
private static String[] firstNames = { "John", "Bill", "Steve", "Robert", "George", "William",
"Frank" };
private HashSet<String> names = new HashSet<String>();
}