 
Getting Started with the Semantic Factory SDK
 
This section is a tutorial describing how to create your own semantic factory through typical examples.
Create a Simple MOTPipe
Perform Language Detection and Word Lemmatization
Explore the Configuration File
Extract Named Entities
Use Case
Create a Simple MOTPipe
You can start by creating a MOTPipe that tokenizes input text and displays the result to the screen.
Important: A pipe is not thread safe. It is better to initialize one pipe for each thread, and to share resources.
You create a pipe in Java by using the LinguisticFactory.buildPipe(MOTConfig config) method, or one of the following:
LinguisticFactory.buildPipe(String motConfigPath)
LinguisticFactory.buildPipe(MOTConfig config, int version)
LinguisticFactory.buildPipe(String linguisticConfig, String tokenizationConfigName, List<SemanticProcessor> processors)
LinguisticFactory.buildPipe(String linguisticConfig, String tokenizationConfigName, List<SemanticProcessor> processors, int version)
LinguisticFactory.buildPipe(String linguisticConfig, String tokenizationConfigName)
LinguisticFactory.buildPipe(List<Tokenizer> tokenizers, NormalizerConfig norm, List<SemanticProcessor> processors)
LinguisticFactory.buildPipe(List<Tokenizer> tokenizers, NormalizerConfig norm, List<SemanticProcessor> processors, int version)
Use the method signatures with an additional ResourcesContext argument to share resources between multiple pipes, for example, LinguisticFactory.buildPipe(ResourcesContext ctx, String motConfigPath).
Once created, you must initialize the pipe using its init() method to load its resources. To free the resources when the pipe is no longer used, call the release() method.
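The following is a minimal lifecycle sketch, assuming the pipe is built from a configuration file path (the path is a placeholder) and that, as in the full sample below, buildPipe() and init() may throw exceptions:
MOTPipe pipe = null;
try {
    // Build the pipe from an XML configuration file (placeholder path).
    pipe = LinguisticFactory.buildPipe("/path/to/mot/config.xml");
    pipe.init();       // loads the resources declared in the configuration
    // ... call newDocument()/process()/endDocument() here ...
} catch (Exception e) {
    System.err.println("Could not load pipe: " + e.getMessage());
} finally {
    if (pipe != null) {
        pipe.release();  // frees the resources once the pipe is no longer used
    }
}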
Create a MOTPipe in Java
Below is a code snippet that builds and initializes a pipe performing simple tokenization of text:
List<Tokenizer> tokenizers = new ArrayList<Tokenizer>();
tokenizers.add(new StandardTokenizer());
MOTPipe pipe = LinguisticFactory.buildPipe(tokenizers, new NormalizerConfig(),
new ArrayList<SemanticProcessor>());
System.out.println("Loading pipe...");
pipe.init();
System.out.println("Pipe loaded.");
Create a MOTPipe with an XML Configuration File
Creating a MOTPipe programmatically can become a tedious task as the number of processors increases. It is more convenient to rely on an XML configuration file. For example:
<ling:MOTConfig xmlns="exa:com.exalead.linguistic.v10">
<!-- Tokenizers -->
<ling:StandardTokenizer />
<!-- Normalizer -->
<ling:NormalizerConfig />

<!-- Semantic Processors -->
</ling:MOTConfig>
Then, you can load the MOTPipe as follows:
MOTPipe pipe = LinguisticFactory.buildPipe(pathToXMLFile);
Process the Input Text
To process the input text, declare the processing of a new document using the newDocument() method. Once processing is complete, call endDocument(). These two calls can trigger specific tasks in processors, for example, adding annotations directly at the document level.
The following snippet processes the provided string content. The language is unspecified (Language.XX), as only tokenization is done.
pipe.newDocument();
AnnotatedToken[] tokens = pipe.process(content, Language.XX);
print(tokens);
pipe.endDocument();
Code Sample
The following class uses the preceding code and displays the analysis result on the standard output stream.
package com.exalead.mot.tutorial;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import com.exalead.lang.Language;
import com.exalead.linguistic.LinguisticFactory;
import com.exalead.linguistic.v10.NormalizerConfig;
import com.exalead.linguistic.v10.StandardTokenizer;
import com.exalead.linguistic.v10.Tokenizer;
import com.exalead.mot.core.MOTPipe;
import com.exalead.mot.v10.AnnotatedToken;
import com.exalead.mot.v10.Annotation;

public class Step1Tokenization {
public static void main(String[] argv) throws IOException {
if (argv.length != 1) {
System.err.println("usage: java com.exalead.mot.tutorial.Step1Tokenization /path/to/mot/config.xml");
System.exit(1);
}
Step1Tokenization step1 = new Step1Tokenization();
if (!step1.init(argv[0])) {
return;
}

BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
try {
while (true) {
System.out.print("Enter a sentence to parse: ");
step1.process(stdin.readLine());
}
} catch (IOException e) {
step1._pipe.release();
}
}
private MOTPipe _pipe;

public Step1Tokenization() {
}

public boolean init(String motConfigPath) {
try {
_pipe = LinguisticFactory.buildPipe(motConfigPath);
System.out.println("Loading pipe...");
_pipe.init();
System.out.println("Pipe loaded.");
} catch (Exception e) {
e.printStackTrace();
System.err.println("Could not load pipe: " + e.getMessage());
return false;
}
return true;
}

private void print(AnnotatedToken[] tokens) {
for (int i = 0; i < tokens.length; ++i) {
AnnotatedToken tok = tokens[i];
System.out.println(" Token[" + tok.token + "] kind[" + AnnotatedToken.nameOfKind(tok.kind) + "]
lng[" + Language.name(tok.lang) + "] offset[" + tok.offset + "]");
print(tok.annotations);
}
}
private void print(Annotation[] annotations) {
for (int j = 0; j < annotations.length; ++j) {
Annotation ann = annotations[j];
System.out.println(" Annotation[" + ann.displayForm + "] tag[" + ann.tag + "] nbTokens
[" + ann.nbTokens + "]");
}
}

public void process(String content) {
_pipe.newDocument();
AnnotatedToken[] tokens = _pipe.process(content, Language.XX);
print(tokens);
_pipe.endDocument();
}
}
When you run it, you obtain an output similar to the following.
Loading pipe...
Pipe loaded.
Enter a sentence to parse: hello world
Token[hello] kind[ALPHA] lng[xx] offset[0]
Annotation[hello] tag[LOWERCASE] nbTokens[1]
Annotation[hello] tag[NORMALIZE] nbTokens[1]
Token[ ] kind[SEP_SPACE] lng[xx] offset[5]
Token[world] kind[ALPHA] lng[xx] offset[6]
Annotation[world] tag[LOWERCASE] nbTokens[1]
Annotation[world] tag[NORMALIZE] nbTokens[1]
Perform Language Detection and Word Lemmatization
Language Detection with Java
The MOTPipe.process() method accepts a language as a parameter. The tokenizer then uses this language for all the tokens it creates.
Let us redo the previous example using Language.EN instead of Language.XX. Note how the lng attribute of the tokens has changed from lng[xx] to lng[en].
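A sketch of the corresponding call, reusing the processing snippet from the previous step with only the language constant changed:
pipe.newDocument();
// Force English instead of leaving the language unspecified (Language.XX).
AnnotatedToken[] tokens = pipe.process(content, Language.EN);
print(tokens);
pipe.endDocument();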
Processors may support all languages, a single language, or a subset of languages. A lemmatizer, for example, only handles tokens in the language of its resource.
Recommendation: Perform a detection when the language of the content is unknown, or when it may contain several languages. To do so, use the Language Detector processor.
The following snippet illustrates the creation of such a processor and its declaration in a pipe.
List<Tokenizer> tokenizers = new ArrayList<Tokenizer>();
tokenizers.add(new StandardTokenizer());
LanguageDetector languageDetector = new LanguageDetector();
languageDetector.setName("languageDetector");
List<SemanticProcessor> processors = new ArrayList<SemanticProcessor>();
processors.add(languageDetector);
pipe = LinguisticFactory.buildPipe(tokenizers, new NormalizerConfig(), processors);
With this initialization, running the program detects the language of tokens. The processor replaces the language used in the call to MOTPipe.process() with the one it detects.
Note: A sufficient number of words is required to correctly perform the detection. In a long text containing different languages, the processor gives each token its most probable language.
Language Detection with XML Configuration File
The following configuration file illustrates the creation of the processor and its declaration in a pipe.
<ling:MOTConfig xmlns="exa:com.exalead.linguistic.v10">
<!-- Tokenizers -->
<ling:StandardTokenizer/>

<!-- Normalizer -->
<ling:NormalizerConfig />

<!-- Semantic Processors -->
<ling:LanguageDetector name="languageDetector" />
</ling:MOTConfig>
Retrieve Document Annotations
Both tokens and documents can receive annotations. You can retrieve document annotations with a call to MOTPipe.getDocumentAnnotations() after the call to MOTPipe.endDocument() (processors usually use this trigger to add their annotations to the document).
The Language Detector processor adds an annotation for each language detected in the document. The following snippet retrieves these annotations:
pipe.newDocument();
pipe.newField("field");
AnnotatedToken[] tokens = pipe.process(content, Language.XX);
print(tokens);
pipe.endDocument();
Annotation[] annotations = pipe.getDocumentAnnotations();
System.out.println(" Document");print(annotations);
You obtain the following result:
Enter a sentence to parse: have a good day
Token[have] kind[ALPHA] lng[en] offset[0]
Annotation[have] tag[LOWERCASE] nbTokens[1]
Annotation[have] tag[NORMALIZE] nbTokens[1]
Token[ ] kind[SEP_SPACE] lng[en] offset[4]
Token[a] kind[ALPHA] lng[en] offset[5]
Annotation[a] tag[LOWERCASE] nbTokens[1]
Annotation[a] tag[NORMALIZE] nbTokens[1]
Token[ ] kind[SEP_SPACE] lng[en] offset[6]
Token[good] kind[ALPHA] lng[en] offset[7]
Annotation[good] tag[LOWERCASE] nbTokens[1]
Annotation[good] tag[NORMALIZE] nbTokens[1]
Token[ ] kind[SEP_SPACE] lng[en] offset[11]
Token[day] kind[ALPHA] lng[en] offset[12]
Annotation[day] tag[LOWERCASE] nbTokens[1]
Annotation[day] tag[NORMALIZE] nbTokens[1]
Document
Annotation[en] tag[language] nbTokens[0]
Lemmatization with Java
Lemmatization is a common linguistic task that consists of identifying the lemma of each word using a language dictionary.
To enable lemmatization in both English and French, we need to add two dedicated processors. To do so, complete the pipe creation shown in Language Detection with Java with the following snippet:
Lemmatizer englishLemmatizer = new Lemmatizer();
englishLemmatizer.setName("englishLemmatizer");
englishLemmatizer.setLanguage("en");
processors.add(englishLemmatizer);
Lemmatizer frenchLemmatizer = new Lemmatizer();
frenchLemmatizer.setName("frenchLemmatizer");
frenchLemmatizer.setLanguage("fr");
processors.add(frenchLemmatizer);
The lemmatizers then enrich the tokens generated. The result looks like the output below:
Token[books] kind[ALPHA] lng[en] offset[65]
Annotation[books] tag[LOWERCASE] nbTokens[1]
Annotation[books] tag[NORMALIZE] nbTokens[1]
Annotation['book', 'book', 'n', 'p', 'book', 'book', 'noun'] tag[lemmainformation] nbTokens[1]
Here, a new annotation is added for the token books. It corresponds to an array containing the following information: [lowercase masculine singular, normalized masculine singular, gender, number, lowercase singular, normalized singular, category]. You can access it in a more convenient way using the Lemmatizer.deserialize() method, which returns a LemmaInformation instance that provides accessors.
Several annotations can be added if there is any ambiguity. In that case, use a Part of Speech Tagger processor next to disambiguate.
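As an illustration, here is a minimal sketch of how you might locate the lemma annotation on an AnnotatedToken tok (as in the print() method above); the deserialization call is left as a comment because its exact signature should be checked in the Javadoc:
for (Annotation ann : tok.annotations) {
    if ("lemmainformation".equals(ann.tag)) {
        // Raw serialized form, e.g. ['book', 'book', 'n', 'p', 'book', 'book', 'noun'].
        System.out.println("Lemma annotation: " + ann.displayForm);
        // Assumption: Lemmatizer.deserialize() parses this annotation into a
        // LemmaInformation object exposing accessors (lemma, gender, number, category).
        // LemmaInformation info = Lemmatizer.deserialize(...);
    }
}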
Lemmatization with an XML Configuration File
<ling:MOTConfig xmlns="exa:com.exalead.linguistic.v10">
<!-- Tokenizers -->
<ling:StandardTokenizer />

<!-- Normalizer -->
<ling:NormalizerConfig />

<!-- Semantic Processors -->
<ling:LanguageDetector name="languageDetector" />
<ling:Lemmatizer name="englishLemmatizer" language="en" />
<ling:Lemmatizer name="frenchlemmatizer" language="fr" />
</ling:MOTConfig>
Explore the Configuration File
Each item in the configuration file may accept many parameters.
<ling:MOTConfig xmlns="exa:com.exalead.linguistic.v10">
<!-- Tokenizers -->
<ling:StandardTokenizer >
<ling:charOverrides>
<ling:StandardTokenizerOverride type="token" toOverride=":" />
</ling:charOverrides>
<ling:patternOverrides>
<ling:StandardTokenizerOverride type="token" toOverride="[[:alnum:]][&amp;][[:alnum:]]" />
<ling:StandardTokenizerOverride type="token" toOverride="[[:alnum:]]*[.]net" />
<ling:StandardTokenizerOverride type="token" toOverride="[[:alnum:]]+[+]+" />
<ling:StandardTokenizerOverride type="token" toOverride="[[:alnum:]]+#" />
</ling:patternOverrides>
</ling:StandardTokenizer>

<!-- Normalizer -->
<ling:NormalizerConfig>
<ling:NormalizerIndexLower language="fr" word="thé" />
<ling:NormalizerIndexLower language="fr" word="maïs" />
</ling:NormalizerConfig>

<!-- Semantic Processors -->
<ling:Lemmatizer name="englishLemmatizer" language="en" />
<ling:Lemmatizer name="frenchLemmatizer" language="fr" />
</ling:MOTConfig>
You can see that the Standard Tokenizer accepts specific overrides:
For characters: the : character is considered an alphabetical character rather than punctuation. For example, a:b is a single token.
For patterns: any sequence of characters matching one of the specified regular expressions is considered a single token. For example, R&D matches the first pattern.
You can also configure the Normalizer with normalization exceptions. Here, the forms thé, Thé, or THÉ are all normalized to thé and not the.
Extract Named Entities
You can extract more complex information such as Named Entities.
Note: Dependencies that are not declared in the configuration are implicitly included during initialization.
A dedicated processor is available for this task: the NamedEntitiesMatcher. For more information, see the CloudView Configuration Reference and "Named Entities Matcher" in the Exalead CloudView Configuration Guide.
To use it, configure it in the XML configuration file as follows:
<NamedEntitiesMatcher name="neMatcher" prefix="NE" />
Required dependencies are automatically added during initialization (RelatedTerms and recursively PartOfSpeechTagger). You do not need to include them explicitly in the configuration.
The NamedEntitiesMatcher processor adds named entities annotations on the tokens.
Note: Annotations are added only on the first token of each entity.
The nbTokens property indicates the scope of the annotation. The result looks like the output below:
Token[Barack] kind[ALPHA] lng[xx] offset[0]
Annotation[barack] tag[LOWERCASE] nbTokens[1]
Annotation[barack] tag[NORMALIZE] nbTokens[1]
Annotation[barack obama] tag[relatedTerm] nbTokens[3]
Annotation[Barack Obama] tag[relatedTermDisplay] nbTokens[3]
Annotation[Barack Obama] tag[exalead.people] nbTokens[3]
Annotation[] tag[exalead.nlp.firstnames] nbTokens[1]
Annotation[famouspeople] tag[NE] nbTokens[3]
Annotation[1] tag[sub] nbTokens[3]
Annotation[Barack Obama] tag[NE.famouspeople] nbTokens[3]
Token[ ] kind[SEP_SPACE] lng[xx] offset[6]
Token[Obama] kind[ALPHA] lng[xx] offset[7][...]
Here is a summary of the tag values:
relatedTerm = noun phrase in a lemmatized + normalized form
relatedTermDisplay = noun phrase full-text form
sub = internal indicator that you can ignore
exalead.nlp.firstnames = internal indicator (presence of a first name)
exalead.people = internal indicator (entity present in the Exalead persons dictionary built from wikipedia/freebase/dbpedia)
NE and NE.famouspeople = result of named entity detection under two different forms based on the specified prefix (that is, "NE" in our example)
The last annotation is the most useful. It contains the canonical form of the entity as well as a tag specifying its type. You can rely on the tag prefix (defined in the configuration file) to locate named entities annotations. When possible, the content of the annotation is a canonical form of the entity that is specified in a thesaurus (for example, "United States" and "U.S." are normalized to "USA"), or inferred with linguistic rules.
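For example, a sketch of how you could collect these named-entity annotations by filtering on the configured prefix ("NE" in this example), assuming tokens is the array returned by MOTPipe.process():
// List the named entities found in the token stream by relying on the
// tag prefix configured on the NamedEntitiesMatcher ("NE" here).
for (AnnotatedToken tok : tokens) {
    for (Annotation ann : tok.annotations) {
        if (ann.tag.startsWith("NE.")) {
            System.out.println("Entity of type " + ann.tag + ": " + ann.displayForm
                    + " (spans " + ann.nbTokens + " token(s))");
        }
    }
}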
Use Case
This use case describes how to extract vehicle registration plates.
For such a task, a simple RulesMatcher processor is adequate. This processor matches patterns expressed as rules against a token stream. Rules are described in a dedicated XML configuration file. For the syntax description, see "Rules Matcher (rule-based)" in the Exalead CloudView Configuration Guide.
Example: the following sample shows how to extract French vehicle registration plates.
<TRules xmlns="exa:com.exalead.mot.components.transducer">
<!-- SIV (e.g. AA-229-AA) -->
<TRule priority="0">
<MatchAnnotation kind="NE.plates.SIV"/>
<Seq>
<Or>
<TokenRegexp value="[A-Za-z]{2}"/>
<Word value="W" level="exact"/>
<!-- to match temporary plates starting with only one W (e.g. W-001-AA) -->
</Or>
<Opt> <Word value="-" level="exact"/> </Opt>
<TokenRegexp value="[1-9]{3}"/>
<Opt> <Word value="-" level="exact"/> </Opt>
<TokenRegexp value="[A-Za-z]{2}"/>
</Seq>
</TRule>

<!-- FNI (e.g. 1233 CD 33) -->
<TRule priority="1">
<MatchAnnotation kind="NE.plates.FNI"/>
<Or>
<Seq>
<TokenRegexp value="[1-9]{1,4}"/>
<TokenRegexp value="[A-Za-z]{1,3}"/>
<Or>
<TokenRegexp value="[1-9]{2}"/>
<TokenRegexp value="2[AB]"/> <!-- Corse -->
<TokenRegexp value="97[1-8]"/> <!-- DOM-TOM -->
</Or>
</Seq>
<Seq>
<TokenRegexp value="[1-9]{1,6}"/>
<Or>
<Word value="NC" level="exact"/> <!-- New Caledonia -->
<Word value="P" level="exact"/> <!-- French Polynesia -->
</Or>
</Seq>
<Seq> <!-- TAAF - Kerguelen islands -->
<TokenRegexp value="[05-9][1-9]"/>
<TokenRegexp value="[1-9]{4}"/>
</Seq>
<Seq> <!-- Wallis-and-Futuna -->
<TokenRegexp value="[1-9]{1,4}"/>
<Word value="WF" level="exact"/>
</Seq>
</Or>
</TRule>
</TRules>
To use it, add a RulesMatcher in the configuration file:
<RulesMatcher name="platesMatcher" resourceFile="resource:///tutorial-mot/plates.xml" />
In this example, we used the Exalead resource protocol, which means that the resource path is relative to the NGRESOURCEPATH directory (see Java Project Requirements). The file protocol is also supported; use it to specify an absolute path. The result looks like the output below.
Token[AA] kind[ALPHA] lng[fr] offset[0]
Annotation[aa] tag[LOWERCASE] nbTokens[1]
Annotation[aa] tag[NORMALIZE] nbTokens[1]
Annotation[AA-123-BB] tag[NE.plates.SIV] nbTokens[5]
Token[-] kind[DASH] lng[fr] offset[2]
Token[123] kind[NUMBER] lng[fr] offset[3][...]