Semantic Extractor

The Semantic Extractor performs the following:

• Extracts entities based on triggers and units.

• Interprets matched entities with rewrite rules.

Entities and Attributes

The following entities are available:

Entity type	Description
<TextEntity>	extracts string values
<BooleanEntity>	extracts Boolean values
<IntegerEntity>	extracts integer values
<FloatingPointEntity>	extracts floating point values
<RangeEntity>	extracts ranged string values
<RegexpEntity>	extracts string values following a pattern

Entities are defined with the following common attributes and sets of specific attributes:

Common Entity Attributes
Attribute	Type	Description
trigger	string	An optional left context triggering the entity detection.
triggers	NamedObjectList	An optional list of left contexts triggering the match.
leftContext	string	Alias for trigger.
leftContexts	NamedObjectList	Alias for triggers.
annotation	string	The annotation set by the processor in case of a match.
display	string	The annotation value in case of a match. Each ? in your custom display form is replaced by the matching value.
matchMode	string	The level used to match the feature: normalized, lowercase, or exact.
name	string	Unique entity reference. Only used by the GUI.
unit	string	An optional condition on the right context.
units	NamedObjectList	An optional list of conditions on the right context.
rightContext	string	Alias for unit.
rightContexts	NamedObjectList	Alias for units.
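
As a brief illustration of the alias and list attributes above, the following sketch declares the same kind of entity twice: once with leftContext and rightContext, and once with the list forms. The <units> child element with bee:StringValue entries is an assumption that mirrors the <triggers> form shown under Entities Syntax, and the trigger and unit strings are illustrative only.

```xml
<!-- Alias form: leftContext stands for trigger, rightContext stands for unit. -->
<IntegerEntity leftContext="HDD" rightContext="Go" annotation="hdd_capacity" matchMode="normalized" />

<!-- List form: several left and right contexts.
     The <units> layout is assumed to mirror <triggers>. -->
<IntegerEntity annotation="hdd_capacity" matchMode="normalized">
  <triggers>
    <bee:StringValue value="HDD" />
    <bee:StringValue value="Disque dur" />
  </triggers>
  <units>
    <bee:StringValue value="Go" />
    <bee:StringValue value="GB" />
  </units>
</IntegerEntity>
```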

Text Entity Specific Attributes
Attribute	Type	Description
value	string	The values to match, separated with '|'.
maxValueSize	int	Maximum number of tokens for the value.
lang	iso code	Restricts matching to a specific input language.
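
The lang attribute can be sketched as follows. The two-letter code "fr" is an assumption about the accepted ISO code form; the value and annotation are borrowed from the examples under Entities Syntax.

```xml
<!-- Matches "Carte SD" only when the input language is French.
     "fr" is assumed to be an accepted ISO code form. -->
<TextEntity value="Carte SD" annotation="sd_card" matchMode="normalized" lang="fr" />
```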

Boolean Entity Specific Attributes
Attribute	Type	Description
yes	string	The value for true
no	string	The value for false

Integer and Floating Point Entity Specific Attributes
Attribute	Type	Description
step	int	If defined, normalizes the value to the nearest step. Default: 0.
min	int	Minimum value for a match. Default: -2147483648.
max	int	Maximum value for a match. Default: 2147483647.
coefficient	double	The normalization coefficient used to multiply the matched value. Default: 1.
precision	int	The floating-point precision in output. Default: 0.
handleOutOfBoundValues	Boolean	If false, ignores values lower than min or greater than max. Default: true.
addExactValue	Boolean	If true, adds the original value (before normalization) in an extra annotation TAG.exact. Default: false.
truncateTrailingZeros	Boolean	If true, removes the trailing zeros after the decimal point. Default: false.

Range Entity Specific Attributes
Attribute	Type	Description
dimension	int	Dimension count. Default: 2.
delimiter	string	Delimiter between numbers. Default: "x".

Regexp Entity Specific Attributes
Attribute	Type	Description
value	string	Perl-5 regular expression
endTrigger	string	An optional left context
endTriggers	NamedObjectList	An optional list of left contexts

Rule Attributes

A Rule is defined by the following attributes:

• name

• mode

• value (pattern)

• output string

• trustLevel

Dependencies

If the matching rules for this processor depend on phonetic, stem, or lemma matching, you must add the corresponding processor above this one in the pipeline.

For example, if your rules require phonetic forms, place the Phonetizer processor above this processor in the analysis pipeline.

Sample Semantic Extractor XML File

The Semantic Extractor configuration consists of two optional parts: a list of entity definitions and a list of rules. Two additional operators handle macro definitions and the inclusion of external configurations.

Each matching entity generates a semantic annotation, which can be mapped using the standard annotation mappings. Similarly, each matching rule generates a semantic annotation identified by the rule's name.

Processing the text "SD Card 4 GB" with the sample configuration generates the following annotations:

• "sdcard", because the processor detected the expression "SD". It displays as "SD Card".

• "capacity", because the processor initially detected the integer "4". It displays as "4 GB".

• "type", because the processor detected both the "sdcard" and "capacity" annotations, but did not detect the "hddtype" annotation. It displays as "Camera".
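
The sample file itself is not reproduced in this section. As a rough, hypothetical sketch only, entities and a rule consistent with the three annotations described above might look like the following. The entity elements and their attributes are those documented under Entities and Attributes. The <Rule> element name, the attribute spellings name, value, and output, the omission of mode, the values of the "hddtype" entity, and the root and wrapper elements left out here are all assumptions, not taken from the product's sample.

```xml
<!-- Hypothetical sketch; not the product's actual sample file. -->

<!-- "SD" yields the sdcard annotation, displayed as "SD Card". -->
<TextEntity value="SD" annotation="sdcard" display="SD Card" matchMode="normalized" />

<!-- An integer followed by the unit "GB" yields capacity; "?" is replaced by the matched value. -->
<IntegerEntity unit="GB" annotation="capacity" display="? GB" matchMode="normalized" />

<!-- Illustrative values only: the real hddtype definition is not shown in this section. -->
<TextEntity value="IDE|SATA" annotation="hddtype" matchMode="normalized" />

<!-- Assumed <Rule> element: matches when sdcard and capacity are present and hddtype is not. -->
<Rule name="type" value="sdcard capacity hddtype{not}" output="Camera" />
```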

Entities Syntax

Text

Extracts values according to the triggers (left context), values, and units (right context) you specify, following this pattern:

(trigger1 | trigger2 ...)? (separators)* VALUE_TO_EXTRACT (separators)* (unit1 | unit2 ...)?

A simple example:

<TextEntity value="SD card" annotation="sd_card" matchMode="normalized" />

To specify multiple values in the value attribute, use the '|' character as a separator:

<TextEntity value="SD card|Carte SD" annotation="sd_card" matchMode="exact"/>

To restrict matching, use triggers to specify a left context:

<TextEntity trigger="Operating System" annotation="os" matchMode="normalized" value="Mac OS X|Windows XP" />

To specify multiple triggers:

<TextEntity annotation="os" matchMode="normalized" value="Mac OS X|Windows XP">
  <triggers>
    <bee:StringValue value="Operating System" />
    <bee:StringValue value="Système d'exploitation" />
  </triggers>
</TextEntity>

To specify how many tokens are used to build the value, omit the value attribute and use the maxValueSize attribute:

<TextEntity trigger="Operating System" annotation="os_unknown" matchMode="normalized" maxValueSize="10" />

Boolean

This entity performs a Boolean match. For example, the following entity matches the expression "SD card: YES".

<BooleanEntity trigger="SD card" annotation="sd_card" matchMode="exact" yes="YES" no="NO" />

Integers

The integer entity is often used with normalization. It uses the following attributes:

Integer Entity Attributes
Attribute	Description
min	The minimum value the extracted number must have.
max	The maximum value the extracted number must have.
coefficient	The coefficient applied to the extracted number, default: 1 (no coefficient).
precision	The precision used to generate the display form (optional).
step	The step used to generate the display form (optional).
handleOutOfBoundValues	If true, generates the display form with "<" or ">" for numbers outside the [min-max] range. If false, skips numbers that are not in the [min-max] range. Default: true.
addExactValue	If true, adds the value of the extracted number before any normalization in another annotation suffixed by ".exact". Default: false.

<IntegerEntity trigger="port usb" annotation="port_usb" matchMode="normalized" />

<IntegerEntity trigger="size" unit="Ko" annotation="size" matchMode="normalized" coefficient="1024"
    display="? octet" />

<IntegerEntity trigger="HDD" unit="Go" annotation="hdd_capacity" matchMode="normalized" min="0" max="2000"
    step="100" display="? Go" />

<IntegerEntity trigger="HDD" unit="Go" annotation="hdd_capacity" matchMode="normalized" min="0" max="2000"
    step="100" display="? Go" addExactValue="true" />
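
To make the normalization attributes more concrete, the comments below walk through two inputs. The results are a literal reading of the attribute descriptions in this section, not verified product output, and the input texts are illustrative.

```xml
<!-- coefficient: for the input "size 4 Ko", the matched value 4 is multiplied by 1024,
     giving 4096, so the display form "? octet" is expected to read "4096 octet". -->
<IntegerEntity trigger="size" unit="Ko" annotation="size" matchMode="normalized" coefficient="1024"
    display="? octet" />

<!-- step and addExactValue: for the input "HDD 320 Go", the nearest multiple of 100 is 300,
     so the display form is expected to read "300 Go"; the value before normalization (320)
     is expected in the extra annotation hdd_capacity.exact. -->
<IntegerEntity trigger="HDD" unit="Go" annotation="hdd_capacity" matchMode="normalized" min="0" max="2000"
    step="100" display="? Go" addExactValue="true" />
```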

Floating-Point

<FloatingPointEntity trigger="port usb" annotation="port_usb" matchMode="normalized" />

<FloatingPointEntity trigger="size" unit="Ko" annotation="size" matchMode="normalized" coefficient="1024"
    display="? octet" />

<FloatingPointEntity trigger="HDD" unit="Go" annotation="hdd_capacity" matchMode="normalized" min="0"
    max="2000" step="100" display="? Go" />

<FloatingPointEntity trigger="HDD" unit="Go" annotation="hdd_capacity" matchMode="normalized" min="0"
    max="2000" step="100" display="? Go" addExactValue="true" />

<FloatingPointEntity> has the same attributes as <IntegerEntity>, but uses floating-point values instead of integers.
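
A sketch of the two attributes that mainly matter for floating-point output, precision and truncateTrailingZeros, could look like this. The trigger, unit, and annotation are illustrative, and reading precision as a number of decimal places is an assumption.

```xml
<!-- precision="2": assumed to mean two decimal places in the output.
     truncateTrailingZeros="true": removes trailing zeros after the point. -->
<FloatingPointEntity trigger="weight" unit="kg" annotation="weight" matchMode="normalized"
    precision="2" truncateTrailingZeros="true" display="? kg" />
```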

Ranged

The <RangeEntity> is used to extract ranged expressions.

Range Entity Attributes
Attribute	Description
dimension	The number of dimensions in your entity
delimiter	The delimiter used to separate dimensions

<RangeEntity trigger="resolution" annotation="resolution" matchMode="normalized" dimension="2" delimiter="x" />


<RangeEntity trigger="size" annotation="size" matchMode="normalized" dimension="3" delimiter="x" />

RegExp

The <RegexpEntity> extracts expressions matching a pattern.

<RegexpEntity trigger="Opening hours" annotation="opening" matchMode="normalized"
    value="[[:digit:]]+:[[:digit:]]+ - [[:digit:]]+:[[:digit:]]+" />

<RegexpEntity trigger="Start at" annotation="start_at" matchMode="normalized"
    value="[[:digit:]]{2}:[[:digit:]]{2}">
  <endTriggers>
    <bee:StringValue value="am" />
    <bee:StringValue value="pm" />
  </endTriggers>
</RegexpEntity>

Rules Syntax

Mode

The mode specifies if the rule is a positive one (value match) or an exception (value filter). A matching exception prevents any rule from matching.

Pattern

A pattern is made of atoms and operators. An atom defines a basic matching element in a rule. It can take the following forms:

• A reference to an entity through its annotation (country).

• A regular expression surrounded by slashes possibly capturing parts of a match (/[pP].*(ern)/).

• A full-text string surrounded by quotes ("sequence of words").

• A reference to a MOT annotation (NE.people).

• An assertion enforcing match constraints:

◦ :WORD matches any word in a sequence.

◦ :ENDLINE matches carriage returns.

◦ :START matches at the beginning of the input text.

◦ :END matches at the end of the input text.

Each atom may have a set of options specified in curly braces immediately following it. That set is a combination of 0 or more elements separated with commas.

Atom Options
Option	Description
name	allows the atom to be referenced in the output format of a rule, for example through $(country) for an atom named country.
opt	matching of this atom is not required.
not	the pattern matches if this atom does not.
max	maximum number of words allowed for a :WORD atom to match.
original	use the matching text when building the output rather than the output value of an entity or the display form of an annotation.
default	defines a default value for an optional atom (option opt) to use when building the output.
exact	requires exact matching level.
lower	requires lowercase matching level.
norm	requires normalized matching level.
skipPunct	requires punctuation-insensitive matching.
fuzzy	requires approximate matching.

Atom Operators
Operator	Description
AND	Default operator. Matches atoms in any order and positions.
[ ]	Matches the enclosed sequence of atoms in the order specified.
( )	Matches the enclosed atoms at the same position (cumulative constraints on a single position).
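
To make the atom options and operators above more concrete, here is a hedged sketch of three rule definitions. The <Rule> element name and the attribute spellings name, value, and output are assumptions based on the Rule Attributes list. The mode attribute is omitted because its accepted values are not given in this section. The annotations country, city, and capacity are assumed to be produced by entities defined elsewhere in the same file.

```xml
<!-- Hypothetical rules: the element name and attribute spellings are assumptions. -->

<!-- Default AND operator: country must match, in any order and position;
     {opt} makes the city atom optional. -->
<Rule name="location" value="country city{opt}" output="Location" />

<!-- [ ] sequence: the quoted full-text string, then the capacity entity, in this order.
     The attribute is single-quoted so that the pattern can contain double quotes. -->
<Rule name="card_size" value='["memory card" capacity]' output="Memory card" />

<!-- ( ) same position: the matched text must carry the NE.people annotation and also
     match the regular expression at the exact matching level. -->
<Rule name="capitalized_person" value="(NE.people /[A-Z].*/{exact})" output="Person" />
```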

Output

The final annotation has the name of the matching rule and the rule trust level if any (default is 100). The display form is defined with an output format that may contain:

• A reference to what has been matched by an atom through its name ($(country)).

• A reference to a regular expression capture through the atom name and an array access syntax specifying the capture number ($(regexp[0])).

• A carriage return \n.

• Any character sequence, which is used as-is in the output (the street is:$(street) and the city is:$(city)).

If the processor's prefix parameter is defined, the output annotation is prefixed with its value.

Macros

For often-used rule parts, use the <Define> element to write macros that you can then reference in a rule value.

A macro has a name and a value. These are substituted each time the macro is referenced with a #, followed by the macro’s name.

• This name can only contain the characters a-z, A-Z, and 0-9.

• The macro can be used anywhere in a rule, such as inside a regular expression.

• To disable macro substitution, insert a backslash before the # symbol.
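
As an illustration of the substitution described above, the following sketch defines a macro and references it inside a regular-expression atom. Only the <Define> element name and the # reference syntax come from the text; the name and value attribute spellings on <Define>, and the <Rule> element with its attributes, are assumptions.

```xml
<!-- Hypothetical: attribute spellings on <Define> and the <Rule> element are assumptions. -->
<Define name="hhmm" value="[[:digit:]]{2}:[[:digit:]]{2}" />

<!-- Each #hhmm is replaced by the macro value, so the pattern reads
     /[[:digit:]]{2}:[[:digit:]]{2} - [[:digit:]]{2}:[[:digit:]]{2}/ -->
<Rule name="opening_hours" value="/#hhmm - #hhmm/" output="Opening hours" />

<!-- A backslash before # disables substitution: the pattern keeps the literal text "#hhmm". -->
<Rule name="literal_hash" value="/\#hhmm/" output="Literal tag" />
```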

Create the Semantic Extractor Resource File

This section explains how to compile the semantic extractor’s resource file, and describes the parameter options available in the Administration Console.

Create a Resource File from the Administration Console

The most convenient method is to create an empty resource file in the Administration Console and define its content with the Business Console. See Create a Resource File from the Administration Console.

To Compile a Resource File from the Command Line

1. Create a rule XML file and save it in a temporary directory. For an example, see Sample Semantic Extractor XML File.

2. Compile the XML file.

a. Go to <DATADIR>/bin/

b. Run the following cvadmin command:

cvconsole cvadmin> linguistic compile-semantic-extractor input="<PATH TO XML FILE>" output="<PATH TO OUTPUT FILE>"

3. In the Administration Console, select Index > Data processing > Pipeline name > Semantic Processors.

4. Drag the Semantic Extractor processor to the required position in the Current Processors list, then specify the parameters:

Option	Description
Resource directory (required)	Enter the URL of the compiled semantic extractor file. Use the format data://, file://, or resource://.
Break on sentence	If true, allows at most one match per sentence and no match across sentences. Default: false.
Break on paragraph	If true, allows at most one match per paragraph and no match across paragraphs. Default: true.
Break on line	If true, allows at most one match per line and no match across lines. Default: false.
Match all rules	If true, returns the matches for all rules; otherwise, stops after the first matching rule. Default: true.

Map the Annotation to a Category Facet

1. In the Administration Console, select Index > Data processing > Pipeline name > Semantic Processors.

2. On the Mappings subtab, click Add mapping sources.

a. Name: Enter the annotation name that you created in the rules file.

b. Type: select Annotation.

3. (Optional) In the Input from field of the mapping, restrict the mapping so it only applies to a subset of comma-separated metas (also known as contexts) associated with this annotation.

4. Click Add mapping target and add a category target.

5. Modify the Category Mapping properties. For example, set the Create categories under this root property to Top/Megapixel.

6. Go to Search > Search Logics > Your_Search_Logic > Facets and add a category group.

a. Click Add facet and enter the name to display in the Mashup UI Refinements panel, for example Megapixel.

b. For Root, enter the value you entered for Create categories under this root in step 5, for example, Top/Megapixel.

7. Click Apply.