 
Rules Matcher (Rule-Based)
 
Dependencies
Basics of Creating Rules
Sample Rules Matcher XML File
Rules Syntax
Rules Best Practices
Caveats
Limitations
Create a Rules Matcher Resource File
Map the Annotation to a Category Facet
The Rules Matcher allows you to define complex rules for matching patterns against a token stream. Use this processor when you need proximity matching rules or complex rules involving Opt or Iter.
Unlike the Fast Rules Matcher, the Rules Matcher does not support numerical operators or matching on prefixes.
Results are tagged by annotations spanning the range of tokens that have been matched. At indexing time, these annotations are mapped to category or index fields.
Dependencies
If the matching rules for this processor depend on phonetic, stem, or lemma matching, you must add the corresponding processor above this one in the pipeline.
For example, if your rules require phonetic forms, place the Phonetizer processor above this processor in the analysis pipeline.
Basics of Creating Rules
Rules matching is based on an XML file that lists the patterns to identify. Patterns are defined using regexp-like expressions.
The XML file may contain several rules. Each rule has a priority and a specific annotation.
Use priority to disambiguate multiple matches for the same content.
The best match is determined by the criteria below, applied in this order:
1. Leftmost match
2. Longest match
3. Highest priority value, where 0 = lowest priority.
4. The order in which the rules are defined in the rules file.
The annotation marks the matched tokens.
For example, a Named-Entity matcher could tag people's names with the annotation (NE, people), and the Rules Matcher could then tag first names with (sub, 1) and last names with (sub, 2), thus allowing match normalization.
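For instance, here is a minimal sketch of a rules file with two rules competing for the same tokens (the annotation kinds and word values are placeholders). When both rules match, the second wins because its priority value is higher:
<TRules xmlns="exa:com.exalead.mot.components.transducer">
<!-- Tags "big apple" with the annotation NE.fruit -->
<TRule priority="0">
<MatchAnnotation kind="NE.fruit"/>
<Seq>
<Word value="big"/>
<Word value="apple"/>
</Seq>
</TRule>
<!-- Tags "big apple" with the annotation NE.city; wins by priority -->
<TRule priority="1">
<MatchAnnotation kind="NE.city"/>
<Seq>
<Word value="big"/>
<Word value="apple"/>
</Seq>
</TRule>
</TRules>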
Note: For an explanation of the available Boolean atoms and operators, see Rules Syntax.
Sample Rules Matcher XML File
In this example, we want to identify email addresses, long dates, and short dates in French and English. We use the Rules Matcher to create an annotation for a document when it identifies an email address, based on the character sequence defined in the Rules XML file.
Important: This example also relies on a day names ontology provided by the Named Entities matcher: <Annotation kind="exalead.nlp.date.days"/>. For this example to work, you must place the Named Entities matcher before the Rules Matcher processor.
<TRules xmlns="exa:com.exalead.mot.components.transducer">
<!-- 1st rule tags emails with the annotation NE.email -->
<TRule priority="0">
<MatchAnnotation kind="NE.email"/>
<Seq>
<TokenRegexp value="[0-9A-Za-z_]+"/>
<Noblank/>
<Iter min="0" max="4">
<Or>
<Word value="-" level="exact"/>
<Word value="." level="exact"/>
</Or>
<Noblank/>
<TokenRegexp value="[0-9A-Za-z_]+"/>
<Noblank/>
</Iter>
<Word value="@" level="exact"/>
<Noblank/>
<TokenRegexp value="[0-9A-Za-z_]+"/>
<Noblank/>
<Iter min="0" max="6">
<Or>
<Word value="-" level="exact"/>
<Word value="." level="exact"/>
</Or>
<Noblank/>
<TokenRegexp value="[0-9A-Za-z_]+"/>
<Noblank/>
</Iter>
<Word value="." level="exact"/>
<Noblank/>
<Or>
<TokenRegexp value="[A-Za-z][A-Za-z]"/>
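<!-- two-letter country-code TLDs, for example, fr, de, uk -->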
<Word value="gov"/>
<!-- government entities in the US -->
<Word value="com"/>
<!-- commercial entities -->
<Word value="net"/>
<!-- network providers -->
<Word value="org"/>
<!-- non-profit organizations -->
<Word value="edu"/>
<!-- educational institutions -->
<Word value="info"/>
<!-- informative websites -->
<Word value="biz"/>
<!-- business -->
<Word value="pro"/>
<!-- business use by qualified professionals -->
<Word value="name"/>
<!-- individuals' real names, nicknames, pseudonyms -->
<Word value="aero"/>
<!-- aviation-related businesses -->
<Word value="asia"/>
<!-- region of Asia, Australia, and the Pacific -->
<Word value="cat"/>
<!-- Catalan language and culture -->
<Word value="coop"/>
<!-- cooperatives -->
<Word value="int"/>
<!-- international treaty-based organizations -->
<Word value="jobs"/>
<!-- employment-related sites -->
<Word value="mil"/>
<!-- US Department of Defense -->
<Word value="tel"/>
<!-- publishing contact data -->
<Word value="museum"/>
<!-- museums -->
<Word value="travel"/>
<!-- travel industry -->
</Or>
</Seq>
</TRule>
<!-- 2nd rule tags dates with the annotations NE.date.full and NE.date.compact -->
<TRule priority="9">
<!-- All that's been matched by the captures, French way -->
<MatchAnnotation kind="NE.date.full" value="%1 %2 %3 %4"/>
<!-- Don't use day of week -->
<MatchAnnotation kind="NE.date.compact" value="%2/%3/%4"/>
<!-- matches dates like Samedi 31 janvier, 2004 or 16th of November 2003 -->
<Seq>
<!-- 1st capture: optional week day -->
<Opt>
<Sub no="1">
<!-- Day names ontology, taken from Named Entities matcher -->
<Annotation kind="exalead.nlp.date.days"/>
</Sub>
<Opt>
<Word value=","/>
</Opt>
<Opt>
<Word value="the"/>
</Opt>
</Opt>
<!-- 2nd capture: day number/ordinal -->
<Sub no="2">
<Or>
<!-- day ordinals: 16th, 28th, 1st, 3rd... -->
<Annotation kind="exalead.nlp.date.ordinals"/>
<!-- day number with no ordinals: 16, 28, 1 ... -->
<TokenRegexp value="0?[1-9]"/>
<TokenRegexp value="[12][[:digit:]]" />
<TokenRegexp value="3[01]" />
</Or>
</Sub>
<Opt>
<Or>
<Word value="of"/>
<Word value="-"/>
</Or>
</Opt>
<!-- 3rd capture: month name -->
<Sub no="3">
<Annotation kind="exalead.nlp.date.months"/>
</Sub>
<Opt>
<Word value="-"/>
</Opt>
<!-- 4th capture: year -->
<Sub no="4">
<Or>
<!-- year like 06 -->
<TokenRegexp value="[[:digit:]]{2}"/>
<!-- full year [1000, 2999] -->
<TokenRegexp value="[12][[:digit:]]{3}" />
</Or>
</Sub>
</Seq>
</TRule>
</TRules>
Rules Syntax
Booleans
Booleans express constraints on a single token. These constraints can be combined in a tree using the classic operators AND, OR, and NOT. The individual leaf conditions must be met to continue matching. These conditions can be about the format of the token, its possible annotations, its type or language, etc.
Although Booleans express constraints on a single token, a match may span more than one token: matching a token that bears a specific annotation results in a match of all the tokens delimited by that annotation, which can be more than one token long. This applies to the Annotation and Path conditions (see the table below); all other conditions match at most one token.
Boolean Operators
Description
Boolean OR
An Or matches if at least one of its sub expressions matches.
The length of the annotation matched is the longest of the sub expression matches.
Boolean AND
An And matches a token if all its sub expressions match.
The length of the annotation matched is the longest of all sub expression matches.
Boolean NOT
A Not matches a token if its sub expression does not match.
The length of the annotation matched is 1.
Boolean NOR
A Nor matches a token if none of its sub expressions match (a combined Not and Or).
The length of the annotation matched is 1.
Boolean Atoms
Description
TokenRegexp
A TokenRegexp matches if the regular expression matches the exact, anchored token string. This is the default behavior. It is, however, possible to define the match as normalized or case-insensitive. The following regexp expressions are not implemented:
assertions like \b, \B, ?=, ?!, ?<=, ?<!
back references \1, \2, ...
support for UNICODE like \u0020 or \p{name}
nongreedy repeat operators like ??, *?, +?
octal notation like \0333
For example,
<TokenRegexp value="0?[1-9]|[12][[:digit:]]|3[01]"/>
Word
A Word matches if its value matches the normalized form of the token string. This is the default behavior. It is, however, possible to define the match as "exact" or "case-insensitive".
For example,
<Word value="-" level="exact"/>
Annotation
An Annotation matches if the token bears an annotation of the specified kind and, optionally, a specific value.
For example,
<And>
<Annotation kind="some"/>
<Annotation kind="other"/>
</And>
Path
A Path matches a path value in an ontology. The implementation relies on annotations emitted by an OntologyMatcher somewhere upstream in the analysis pipeline.
AnyToken
AnyToken matches any token.
Noblank
An assertion that there is no blank token at the current position in the stream. Its use is restricted to the root of a Boolean expression.
Digit
A Digit matches a token whose kind is TOKEN_NUMBER (set by the tokenizer for tokens made of a sequence of one or more digits). This is semantically equivalent to using the regular expression [0-9]+ but is more efficient since the work has already been done by the tokenizer.
Alpha
Alpha matches a token made of uppercase or lowercase letters.
Alnum
Alnum matches a token made of uppercase/lowercase letters or digits.
Paragraph
A Paragraph matches a token whose kind is TOKEN_SEP_PARAGRAPH (set by the tokenizer).
TokenLanguage
A TokenLanguage matches a token with a specific language ID. This allows you to write rules that are triggered only for certain languages.
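For example, a hypothetical sketch matching the word "the" only on tokens identified as English. Both the value attribute and the language ID format are assumptions here; check the reference documentation for the exact syntax:
<And>
<TokenLanguage value="en"/>
<Word value="the"/>
</And>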
Punct
A Punct matches a token whose kind is TOKEN_SEP_PUNCT (set by the tokenizer).
Dash
A Dash matches a token whose kind is TOKEN_SEP_DASH (set by the tokenizer).
Sentence
A Sentence matches a token whose kind is TOKEN_SEP_SENTENCE (set by the tokenizer).
TokenKind
A TokenKind matches a token whose kind matches the specified value (set by the tokenizer). Allowed values are:
SEP_PARAGRAPH
SEP_SENTENCE
SEP_PUNCT
SEP_QUOTE
SEP_DASH
NUMBER
ALPHANUM. Note that this means alphabetical and numerical, not alphabetical or numerical (for the latter, use <Alnum> instead)
ALPHA
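For example, a sketch matching a sentence separator, assuming the kind is passed in a value attribute as the description above suggests:
<TokenKind value="SEP_SENTENCE"/>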
Operators
Rules operators resemble those of standard regular expressions, but use an XML syntax. Only the <Near> operator has been added to the usual set.
Operators
Description
Concatenation
<Seq> is a concatenation pattern.
Disjunction
<Or> is a disjunction pattern.
Proximity
<Near> matches its subpatterns A and B within a maximum distance of n nonblank tokens.
By default, the order of the subpatterns is free, but you can impose it by setting the Boolean attribute ordered to true. This pattern matches the longest possible match.
Use the slop attribute to set the maximum number of nonblank tokens allowed between A and B (default 0).
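For example, a minimal sketch matching two words within at most three nonblank tokens of each other, in the given order (the word values are placeholders):
<Near slop="3" ordered="true">
<Word value="storage"/>
<Word value="virtualization"/>
</Near>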
Option
<Opt> matches its subpattern zero or one time.
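For example, this pattern from the sample file above optionally consumes the word "the":
<Opt>
<Word value="the"/>
</Opt>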
Bounded Repetition
<Iter> matches the sequence of its subpatterns between min and max times. The maximum is 128.
<Iter min="0" max="6">
<Or>
<Word value="-" level="exact"/>
<Word value="." level="exact"/>
</Or>
<Noblank/>
<TokenRegexp value="[0-9A-Za-z_]+"/>
<Noblank/>
</Iter>
Capture
<Sub> tags subparts of a match with an annotation (kind, value) for later retrieval. For example, in the date rule above, the day of the week is annotated (sub, 1), the day of the month (sub, 2), the month (sub, 3), and the year (sub, 4). Concatenating the subs in increasing order yields normalized dates.
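For example, a minimal sketch that captures a digit token and emits it as the annotation value (the annotation kind NE.day and the word value are placeholders):
<TRule priority="0">
<MatchAnnotation kind="NE.day" value="%1"/>
<Seq>
<Word value="day"/>
<Sub no="1">
<Digit/>
</Sub>
</Seq>
</TRule>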
Pattern referencing
Each operator can have a name (name attribute) that can be referenced later with the <PatternRef> operator.
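For example, a hypothetical sketch, assuming that <PatternRef> takes the referenced pattern's name in a name attribute (check the reference documentation for the exact syntax):
<!-- Define a named pattern once... -->
<Or name="dash-or-dot">
<Word value="-" level="exact"/>
<Word value="." level="exact"/>
</Or>
<!-- ...and reference it later by name (hypothetical syntax) -->
<PatternRef name="dash-or-dot"/>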
Pattern reuse
The <TImport> operator allows you to reuse existing patterns from another file (specified in a file name attribute). To be reusable, a pattern must have a name.
Including Rules
The <TInclude> operator works like a #include in C/C++: it adds all the TRule objects found in the specified file (file name attribute) to the current TRules set.
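For example, hypothetical sketches of both operators. The exact attribute name for the file, and the file names themselves, are assumptions; check the reference documentation:
<!-- Reuse named patterns defined in another file -->
<TImport file="common-patterns.xml"/>
<!-- Add all TRule objects from another file to the current TRules set -->
<TInclude file="extra-rules.xml"/>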
Rules Best Practices
Do the following:
Use the <Digit> operator, rather than a regular expression, to match tokens made of a sequence of digits. It relies on the token kind already computed by the tokenizer and is therefore more efficient.
Use a normalizer. You are likely to need one somewhere upstream in the analysis pipeline, because the <Word> and <TokenRegexp> operators may match against the lowercase or normalized forms of tokens.
Use the <Noblank> operator if you do NOT want to skip spaces. The Rules Matcher skips spaces and tabs so that rules are not littered with hundreds of references to blanks. You can, however, assert that there is no blank token between two tokens at a precise position in the stream with the <Noblank> operator (see the sketch after this list).
Avoid the following:
Using the <Near> operator too often. It uses repeat operators and therefore matches the longest possible match. This can be costly in terms of compilation time and RAM consumption, so keep the subpatterns A and B as simple as possible and limit overlapping <Near> patterns.
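To illustrate <Digit> and <Noblank>, here is a sketch matching decimal numbers such as 3.14 while forbidding spaces around the dot (the annotation kind NE.decimal is a placeholder):
<TRule priority="0">
<MatchAnnotation kind="NE.decimal"/>
<Seq>
<Digit/>
<Noblank/>
<Word value="." level="exact"/>
<Noblank/>
<Digit/>
</Seq>
</TRule>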
Caveats
The Rules Matcher has the following caveats:
It does not report overlapping or embedded matches; only the earliest and longest match is reported. If there are ambiguities, the tokens matched by the highest-priority rule are kept. If two or more rules have the same priority, the first rule in declaration order takes precedence. For the repeat operators (<Opt>, <Iter>) and <Near>, the longest possible match is preferred.
The <Sub> operator allows retrieval of sub expression matches. It has two attributes, kind and value, defining the annotation emitted each time the sub expression matches. These "numerical subs" are useful in match normalization: they are defined so that concatenating the text they have matched, in increasing order of their value, yields normalized matches. For example, rules for detecting people's names could mark first names with (sub, 1) and last names with (sub, 2), thus giving equal results after concatenation for "John Smith" and "Smith John". Of course, these submatch annotations are only emitted when the overall pattern matches.
During compilation, the Rules Matcher performs a number of optimizations. In particular, Boolean ORs are, whenever possible, replaced with a single regular expression that indicates whether any of the conditions in the list match. For example, the first expression below is automatically rewritten as the second, more efficient one. The same optimization is applied to <Word> atoms, which are transformed into regular expressions.
<Or>
<TokenRegexp value="0?[1-9]"/>
<TokenRegexp value="[12][[:digit:]]" />
<TokenRegexp value="3[01]" />
</Or>
<TokenRegexp value="0?[1-9]|[12][[:digit:]]|3[01]"/>
The <And> operator does not require the lengths of submatches to be equal. A match is found if all subpatterns match, irrespective of the length of each match. The following example matches even if one annotation does not have the same length as the other, provided both are present on the same token. The match length is the length of the longest annotation. For example:
<And>
<Annotation kind="some"/>
<Annotation kind="other"/>
</And>
Limitations
The Rules Matcher has the following limitations:
The window size is 200 tokens. This is the maximum length of a match.
The upper bound of the <Iter> operator must remain as low as possible, as high values are costly in terms of resources. The maximum is 128.
UNICODE is not handled; matching is done on UTF-8 strings without specific processing (at the byte level). Consequently:
Accented characters do not match when used in case-insensitive or normalized mode.
The dot wildcard matches a single byte, and therefore does not match multibyte UTF-8 characters.
Create a Rules Matcher Resource File
Create a Resource File from the Administration Console
The most convenient method is to create an empty resource file in the Administration Console and define its content in the Business Console. See Create a Resource File from the Administration Console.
To Create a Resource File Manually
1. Create a rule XML file and save it in the resource directory. For more information, see Basics of Creating Rules.
2. In the Administration Console, select Index > Data processing > Pipeline name > Semantic Processors.
3. Drag the Rules Matcher to the required position in the Current Processors list.
4. Enter the Resource file path.
Map the Annotation to a Category Facet
We now need to configure the Rules Matcher processor to map the NE.email annotation to a category facet that represents the email address. This allows the document to be related to all email addresses found in it.
1. In the Administration Console, select Index > Data processing > Pipeline name > Semantic Processors.
2. On the Mappings subtab, click Add mapping source.
a. Name: Enter the annotation name that you created in the rules file, for example, NE.email for the sample file above.
b. Type: Select Annotation.
3. (Optional) In the Input from field of the mapping, restrict the mapping so that it only applies to a subset of comma-separated metas (also known as contexts) associated with this annotation.
4. Click Add mapping target and add a category target.
5. Modify the category-mapping properties.
In our example, set the Create categories under this root property to Top/Email.
6. Go to Search > Search Logics > Your_Search_Logic > Facets and add a category group.
a. Click Add facet and enter the name to display in the Mashup UI Refinements panel.
b. For Root, enter the value entered for Create categories under this root in step 5, for example, Top/Email.
7. Click Apply.