• Extracts the following entities based on triggers and units.
• Interprets matched entities with rewrite rules.
Entities and Attributes
The following entities are available:
| Entity type | Description |
| --- | --- |
| <TextEntity> | Extracts string values |
| <BooleanEntity> | Extracts Boolean values |
| <IntegerEntity> | Extracts integer values |
| <FloatingPointEntity> | Extracts floating-point values |
| <RangeEntity> | Extracts ranged string values |
| <RegexpEntity> | Extracts string values following a pattern |
Entities are defined with the following common attributes and sets of specific attributes:
Common Entity Attributes
| Attribute | Type | Description |
| --- | --- | --- |
| trigger | string | An optional left context triggering the entity detection. |
| triggers | NamedObjectList | An optional list of left contexts triggering the match. |
| leftContext | string | Alias for trigger. |
| leftContexts | NamedObjectList | Alias for triggers. |
| annotation | string | The annotation set by the processor in case of a match. |
| display | string | The annotation value in case of a match. Each ? in your custom display form is replaced by the matching value. |
| matchMode | string | The level used to match the feature: normalized, lowercase, or exact. |
| name | string | Unique entity reference. Only used by the GUI. |
| unit | string | An optional condition on the right context. |
| units | NamedObjectList | An optional list of conditions on the right context. |
| rightContext | string | Alias for unit. |
| rightContexts | NamedObjectList | Alias for units. |
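The following is a minimal sketch of how the common attributes might combine on a single entity. It assumes that a <units> list accepts the same <bee:StringValue> children as the <triggers> list used in the Text entity examples of this document. The annotation name, trigger, and unit strings are illustrative only.

```xml
<!-- Assumption: <units> takes <bee:StringValue> children, like <triggers> -->
<!-- Intended to match "Capacity: 4 GB" or "Capacity: 4 Go" -->
<IntegerEntity trigger="capacity" annotation="capacity" matchMode="normalized" display="? GB">
  <units>
    <bee:StringValue value="GB" />
    <bee:StringValue value="Go" />
  </units>
</IntegerEntity>
```

Because the display form contains a ?, the matched number replaces it, so an input value of 4 would be expected to display as "4 GB".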
Text Entity Specific Attributes
| Attribute | Type | Description |
| --- | --- | --- |
| value | string | The values to match, separated with the vertical bar (pipe) character. |
| maxValueSize | int | Maximum number of tokens for the value. |
| lang | ISO code | Restricts matching to a specific input language. |
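As an illustration of the lang attribute, the sketch below restricts a text entity to French input. It assumes that lang accepts a two-letter ISO 639-1 code such as "fr"; the values and annotation name are illustrative.

```xml
<!-- Assumption: lang takes a two-letter ISO 639-1 code such as "fr" -->
<!-- Intended to match "Carte SD" or "Carte mémoire" in French documents only -->
<TextEntity value="Carte SD|Carte mémoire" annotation="sd_card" matchMode="normalized" lang="fr" />
```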
Boolean Entity Specific Attributes
| Attribute | Type | Description |
| --- | --- | --- |
| yes | string | The value for true. |
| no | string | The value for false. |
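A short sketch of the yes and no attributes with non-English values. The trigger and annotation name are illustrative.

```xml
<!-- Intended to match "Bluetooth: oui" (true) or "Bluetooth: non" (false) -->
<BooleanEntity trigger="Bluetooth" annotation="bluetooth" matchMode="normalized" yes="oui" no="non" />
```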
Integer and Floating Point Entity Specific Attributes
| Attribute | Type | Description | Default |
| --- | --- | --- | --- |
| step | int | If defined, normalizes the value to the nearest step. | 0 |
| min | int | Minimum value for a match. | -2147483648 |
| max | int | Maximum value for a match. | 2147483647 |
| coefficient | double | The normalization coefficient used to multiply the matched value. | 1 |
| precision | int | The floating-point precision in output. | 0 |
| handleOutOfBoundValues | Boolean | If false, ignores values lower than min or greater than max. | true |
| addExactValue | Boolean | If true, adds the original value (before normalization) in an extra annotation TAG.exact. | false |
| truncateTrailingZeros | Boolean | If true, removes the trailing zeros after the decimal point. | false |
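The sketch below shows how min, max, and handleOutOfBoundValues might be combined to discard implausible numbers. The trigger, unit, and bounds are illustrative.

```xml
<!-- "HDD: 320 Go" is within [0, 2000] and is annotated -->
<!-- "HDD: 5000 Go" is greater than max; with handleOutOfBoundValues="false" it is ignored -->
<IntegerEntity trigger="HDD" unit="Go" annotation="hdd_capacity" matchMode="normalized" min="0" max="2000" handleOutOfBoundValues="false" display="? Go" />
```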
Range Entity Specific Attributes
| Attribute | Type | Description | Default |
| --- | --- | --- | --- |
| dimension | int | Dimension count. | 2 |
| delimiter | string | Numbers delimiter. | "x" |
Regexp Entity Specific Attributes
| Attribute | Type | Description |
| --- | --- | --- |
| value | string | Perl-5 regular expression. |
| endTrigger | string | An optional left context. |
| endTriggers | NamedObjectList | An optional list of left contexts. |
Rule Attributes
A Rule is defined by the following attributes:
• name
• mode
• value (pattern)
• output string
• trustLevel
Dependencies
If the matching rules for this processor depend on phonetic, stem, or lemma matching, you must add the corresponding processor above this one in the pipeline.
For example, if your rules require phonetic forms, place the Phonetizer processor above this processor in the analysis pipeline.
Sample Semantic Extractor XML File
The Semantic Extractor configuration consists of two optional parts: a list of entity definitions and a list of rules. There are also two operators, one for macro definitions and one for including external configurations.
Each matching entity generates a semantic annotation, which can be mapped using the standard annotation mappings. Similarly, each matching rule generates a semantic annotation identified by the rule's name.
Processing the text "SD Card 4 GB" with the sample configuration generates the following annotations:
• "sdcard", because the processor detected the expression "SD". It displays as "SD Card".
• "capacity", because the processor initially detected the integer "4". It displays as "4 GB".
• "type", because the processor detected both the "sdcard" and "capacity" annotations, but did not detect the "hddtype" annotation. It displays as "Camera".
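The sample file itself is not reproduced in this section. The following is a hypothetical reconstruction that would be consistent with the three annotations listed above. The root element name, the <Rule> element, and its attribute spellings are assumptions rather than confirmed syntax, and an hddtype entity is assumed to be defined elsewhere.

```xml
<!-- Hypothetical reconstruction: root element, <Rule> element, and rule
     attribute spellings are assumptions, not confirmed syntax -->
<SemanticExtractor>
  <!-- "SD" is detected and displayed as "SD Card" -->
  <TextEntity value="SD" annotation="sdcard" matchMode="normalized" display="SD Card" />
  <!-- An integer followed by the unit "GB"; the input "4" displays as "4 GB" -->
  <IntegerEntity unit="GB" annotation="capacity" matchMode="normalized" display="? GB" />
  <!-- Matches when sdcard and capacity are present and hddtype is absent -->
  <Rule name="type" value="sdcard capacity hddtype{not}" output="Camera" />
</SemanticExtractor>
```

The rule relies on the default AND operator, so the sdcard and capacity annotations may appear in any order, while the not option excludes inputs where hddtype was detected.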
Entities Syntax
Text
Extracts values according to the triggers (left context), values, and units (right context) that you specify. For example:
<!-- match the expressions "SD card", "sd card", "SD CaRd", ... --> <TextEntity value="SD card" annotation="sd_card" matchMode="normalized" />
To specify multiple values in the value attribute, use the '|' character as a separator:
<!-- match the expression "SD card" or "Carte SD" --> <TextEntity value="SD card|Carte SD" annotation="sd_card" matchMode="exact"/>
To restrict matching, use triggers to specify a left context:
<!-- match the expressions "Operating System: Mac OS X" or "Operating System: Windows XP" --> <TextEntity trigger="Operating System" annotation="os" matchMode="normalized" value="Mac OS X|Windows XP" />
To specify multiple triggers:
<!-- match the expressions "Operating System: Mac OS X", "Operating System: Windows XP", "Système d'exploitation: Mac OS X" or "Système d'exploitation: Windows XP" -->
<TextEntity annotation="os" matchMode="normalized" value="Mac OS X|Windows XP">
  <triggers>
    <bee:StringValue value="Operating System" />
    <bee:StringValue value="Système d'exploitation" />
  </triggers>
</TextEntity>
To specify how many tokens are used to build the value, omit the value attribute and use the maxValueSize attribute:
<!-- match the expressions "Operating System: something really unknown" --> <TextEntity trigger="Operating System" annotation="os_unknown" matchMode="normalized" maxValueSize="10" />
Boolean
This entity performs a Boolean match. For example, the following entity matches the expression "SD card: YES".
<!-- match the expressions "SD card: YES" --> <BooleanEntity trigger="SD card" annotation="sd_card" matchMode="exact" yes="YES" no="NO" />
Integers
The integer entity is often used with normalization. It uses the following attributes:
Integer Entity Attributes
| Attribute | Description |
| --- | --- |
| min | The minimum value the extracted number must have. |
| max | The maximum value the extracted number must have. |
| coefficient | The coefficient applied to the extracted number. Default: 1 (no coefficient). |
| precision | The precision used to generate the display form (optional). |
| step | The step used to generate the display form (optional). |
| handleOutOfBoundValues | If true, generates a display form with "<" or ">" for numbers outside the [min-max] range. If false, skips numbers that are not in the [min-max] range. Default: true. |
| addExactValue | If true, adds the value of the extracted number before any normalization in another annotation suffixed by ".exact". Default: false. |
<!-- match the expressions "Port USB: 2", and generate the display form "2" with an implicit display="?" --> <IntegerEntity trigger="port usb" annotation="port_usb" matchMode="normalized" />
<!-- match the expressions "Size: 1 Ko", and generate the display form "1024 octet" --> <IntegerEntity trigger="size" unit="Ko" annotation="size" matchMode="normalized" coefficient="1024" display="? octet" />
<!-- match the expressions "HDD: 320 Go", and generate the display form "300-400 Go" --> <IntegerEntity trigger="HDD" unit="Go" annotation="hdd_capacity" matchMode="normalized" min="0" max="2000" step="100" display="? Go" />
<!-- match the expressions "HDD: 320 Go", and generate the display form "300-400 Go" + an exact annotation with display form "320" --> <IntegerEntity trigger="HDD" unit="Go" annotation="hdd_capacity" matchMode="normalized" min="0" max="2000" step="100" display="? Go" addExactValue="true" />
Floating-Point
<FloatingPointEntity> has the same attributes as <IntegerEntity>, but uses floating-point values instead of integers.
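A minimal sketch of a floating-point entity that converts inches to centimeters. It assumes that decimal values written with a point (15.6) are recognized; the trigger, unit, and annotation name are illustrative.

```xml
<!-- Intended to match "Screen size: 15.6 inch" -->
<!-- 15.6 x 2.54 = 39.624, so with precision="2" the expected display form is "39.62 cm" -->
<FloatingPointEntity trigger="screen size" unit="inch" annotation="screen_size" matchMode="normalized" coefficient="2.54" precision="2" display="? cm" />
```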
Ranged
The <RangeEntity> is used to extract ranged expressions.
Range Entity Attributes
| Attribute | Description |
| --- | --- |
| dimension | The number of dimensions in your entity. |
| delimiter | The delimiter used to separate dimensions. |
<!-- match the expressions "Resolution: 800 x 600" --> <RangeEntity trigger="resolution" annotation="resolution" matchMode="normalized" dimension="2" delimiter="x" />
<!-- match the expressions "Size: 42 x 12 x 32" --> <RangeEntity trigger="size" annotation="size" matchMode="normalized" dimension="3" delimiter="x" />
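Since dimension defaults to 2 and delimiter defaults to "x", a two-dimensional range can presumably be declared without either attribute. A sketch under that assumption:

```xml
<!-- Relies on the documented defaults: dimension="2" and delimiter="x" -->
<!-- Intended to match "Resolution: 1920 x 1080" -->
<RangeEntity trigger="resolution" annotation="resolution" matchMode="normalized" />
```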
RegExp
The <RegexpEntity> extracts expressions matching a pattern.
<!-- will match the expressions "Opening hours: 09:00 - 18:00" --> <RegexpEntity trigger="Opening hours" annotation="opening" matchMode="normalized" value="[[:digit:]]+:[[:digit:]]+ - [[:digit:]]+:[[:digit:]]+" />
<!-- will match the expressions "Start at: 09:00 am", "Start at: 01:30 pm" -->
<RegexpEntity trigger="Start at" annotation="start_at" matchMode="normalized" value="[[:digit:]]{2}:[[:digit:]]{2}">
  <endTriggers>
    <bee:StringValue value="am" />
    <bee:StringValue value="pm" />
  </endTriggers>
</RegexpEntity>
Rules Syntax
Mode
The mode specifies if the rule is a positive one (value match) or an exception (value filter). A matching exception prevents any rule from matching.
Pattern
A pattern is made of atoms and operators. An atom defines a basic matching element in a rule. It can take the following forms:
• A reference to an entity through its annotation (country).
• A regular expression surrounded by slashes possibly capturing parts of a match (/[pP].*(ern)/).
• A full-text string surrounded by quotes ("sequence of words").
• A reference to a MOT annotation (NE.people).
• An assertion enforcing match constraints:
◦ :WORD matches any word in a sequence.
◦ :ENDLINE matches carriage returns.
◦ :START matches at the beginning of the input text.
◦ :END matches at the end of the input text.
Each atom may have a set of options specified in curly braces immediately following it. That set is a combination of 0 or more elements separated with commas.
Atom Options
| Option | Description |
| --- | --- |
| name | Allows the atom to be referenced in the output format of a rule, for example through $(country). |
| opt | Matching of this atom is not required. |
| not | The pattern matches if this atom does not. |
| max | Maximum number of words allowed for a :WORD atom to match. |
| original | Uses the matching text when building the output, rather than the output value of an entity or the display form of an annotation. |
| default | Defines a default value for an optional atom (option opt) to use when building the output. |
| exact | Requires the exact matching level. |
| lower | Requires the lowercase matching level. |
| norm | Requires the normalized matching level. |
| skipPunct | Requires punctuation-insensitive matching. |
| fuzzy | Requires approximate matching. |
Atom Operators
| Operator | Description |
| --- | --- |
| AND | Default operator. Matches atoms in any order and positions. |
| [ ] | Matches the enclosed sequence of atoms in the order specified. |
| ( ) | Matches the enclosed atoms at the same position (cumulative constraints on a single position). |
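The operators and options above might combine as in the following sketch. The <Rule> element and its attribute spellings are assumptions, and brand, model, capacity, and hddtype are hypothetical entity annotations.

```xml
<!-- Assumption: rules are written as <Rule> elements with name, value, and output attributes -->
<!-- [brand model] must appear in that order; capacity may appear anywhere (default AND); -->
<!-- hddtype must not be present (option not) -->
<Rule name="product" value="[brand model] capacity hddtype{not}" output="Known product" />
```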
Output
The final annotation has the name of the matching rule and the rule trust level if any (default is 100). The display form is defined with an output format that may contain:
• A reference to what has been matched by an atom, through its name ($(country)).
• A reference to a regular expression capture through the atom name and an array access syntax specifying the capture number ($(regexp[0])).
• A carriage return \n.
• Any character sequence, which is used as-is in the output (the street is:$(street) and the city is:$(city)).
If defined, the processor's prefix parameter forces the output annotation to be prefixed with its value.
Macros
For often-used rule parts, use the <Define> element to write macros that you can then reference in a rule value.
A macro has a name and a value. The value is substituted each time the macro is referenced with a # followed by the macro’s name.
• The name can only contain the characters [a-zA-Z0-9].
• The macro can be used anywhere in a rule, such as inside a regular expression.
• To disable macro substitution, insert a backslash before the # symbol.
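A sketch of a macro reused inside a regular expression atom. The name and value attributes on <Define>, and the <Rule> element with its attribute spellings, are assumptions based on the descriptions in this document. The macro name time is illustrative.

```xml
<!-- Assumption: <Define> and <Rule> take name and value attributes as sketched here -->
<Define name="time" value="[[:digit:]]{2}:[[:digit:]]{2}" />
<!-- Each #time is replaced by the macro value inside the regular expression atom -->
<Rule name="opening" value="/#time - #time/" output="Opening hours found" />
```

Writing \#time instead of #time would leave the text unexpanded, following the backslash rule above.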
Create the Semantic Extractor Resource File
This section explains how to compile the semantic extractor’s resource file, and describes the parameter options available in the Administration Console.
Create a Resource File from the Administration Console
1. Start the cvconsole command-line tool:
cvconsole
2. At the cvadmin prompt, compile the XML file:
cvadmin> linguistic compile-semantic-extractor input="<PATH TO XML FILE>" output="<PATH TO OUTPUT FILE>"
3. In the Administration Console, select Index > Data processing > Pipeline name > Semantic Processors.
4. Drag the Semantic Extractor processor to the required position in the Current Processors list, then specify the parameters:
| Option | Description |
| --- | --- |
| Resource directory (required) | The URL of the compiled semantic extractor file. Use the format data://, file://, or resource://. |
| Break on sentence | If true, allows at most one match per sentence and no match across sentences. Default: false. |
| Break on paragraph | If true, allows at most one match per paragraph and no match across paragraphs. Default: true. |
| Break on line | If true, allows at most one match per line and no match across lines. Default: false. |
| Match all rules | If true, returns the matches for all rules. Otherwise, stops after the first matching rule. Default: true. |
Map the Annotation to a Category Facet
1. In the Administration Console, select Index > Data processing > Pipeline name > Semantic Processors.
2. On the Mappings subtab, click Add mapping sources.
a. Name: Enter the annotation name that you created in the rules file.
b. Type: Select Annotation.
3. (Optional) In the Input from field of the mapping, restrict the mapping so that it only applies to a subset of comma-separated metas (also known as contexts) associated with this annotation.
4. Click Add mapping target and add a category target.
5. Modify the Category Mapping properties. For example, set the Create categories under this root property to Top/Megapixel.
6. Go to Search > Search Logics > Your_Search_Logic > Facets and add a category group.
a. Click Add facet and enter the name to display in the Mashup UI Refinements panel, for example Megapixel.
b. For Root, enter the value you entered for Create categories under this root in step 5, for example, Top/Megapixel.