 
Named Entities Matcher
 
When to Use
Which Entities are extracted?
Filtering Options
Named Entities Classes and Subclasses
Extract Your Own Named Entities
Set Block Lists and Allow Lists for Named Entities Extraction
The Named Entities Matcher is used for named entity detection, typically people, organizations, places, and events.
See Named Entities Classes and Subclasses for a list of detected entities.
When to Use
The primary goal of named entity extraction is to enrich documents with valuable labels. These labels are useful on their own.
For example, when mapped to facets, they provide useful navigation entry points. They can also serve as input for further processing, such as relation discovery, or be used to highlight relevant keywords.
Which Entities are extracted?
The Set of rules to use option defines the resource that is used to produce the annotations. Each value matches a different resource.
The default value ne triggers the extraction of people, organizations, places, and events. It is the only resource that is entirely pre-configured.
For all other resources, you must map the allowed NE annotations to categories (see the tooltip of each resource to find out which annotations can be mapped).
See Named Entities Classes and Subclasses for a list of detected entities.
Note: The value ne-all triggers the extraction of all types of entities. The value ne-basic2 triggers the extraction of extra entities.
Filtering Options
You can also use the following options:
Filter out dubious NEs using part-of-speech – to discard NE annotations for parts of text made of a name followed by a verb or an adverb with the first letter in uppercase. This filter is useful if your documents contain a lot of titles with several capitalized words (what is called "Title Case" or "Headline Style"). It applies to NE.person, NE.place and NE.organization.
For example, let us consider the text "John Make":
"Make" is tagged as a "verb" token by the part-of-speech tagger embedded within this processor, and therefore "John Make" is identified as dubious, and the NE.person annotation is discarded.
Use resource of known words to disambiguate NE candidates – This option is only available for English and French. Be aware that it can be very restrictive; it is mainly useful for reducing Named Entity "noise".
It uses a precompiled dictionary resource of known words to disambiguate named entity candidates (the precompiled resource files, known_words.fr and known_words.en, are located under <INSTALLDIR>/resource/all-arch/namedentities). These known words cover all word types of a language: nouns, verbs, adverbs, articles, etc.
For example, if this option is selected, "J. Brown" for "James Brown" is not detected as a Named Entity. The combination of the initial "J." and the adjective "Brown" is considered too ambiguous to be a Named Entity.
However, disambiguation does not ignore title abbreviations. For example, "Miss Smith", "Mr Smith", and "Dr Brown" are detected as Named Entities. This also works if a first-name initial stands between the title abbreviation and the known word: for example, "Dr J. Brown" is detected as a Named Entity too.
Note: These filters involve additional processing, which makes the global process consume more resources.
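The part-of-speech filter can be pictured with a minimal Python sketch. This is not the product's implementation: the small POS_LEXICON dictionary is a hypothetical stand-in for the embedded part-of-speech tagger, and the function names are invented for illustration. The sketch only models the stated rule (a name followed by a capitalized verb or adverb is dubious) and does not model how titles or initials affect the decision.

```python
# Hypothetical stand-in for the embedded part-of-speech tagger.
# Assumption: the real tagger is context-sensitive and far richer.
POS_LEXICON = {
    "make": "verb",
    "told": "verb",
    "quickly": "adverb",
    "brown": "adjective",
}

# Tags that make a capitalized token after a name look suspicious.
DUBIOUS_TAGS = {"verb", "adverb"}


def is_dubious(candidate: str) -> bool:
    """Return True when a capitalized verb or adverb follows the first token."""
    tokens = candidate.split()
    for token in tokens[1:]:
        tag = POS_LEXICON.get(token.lower())
        if token[:1].isupper() and tag in DUBIOUS_TAGS:
            return True
    return False


def filter_person_candidates(candidates):
    """Keep only the candidates that the filter does not flag as dubious."""
    return [c for c in candidates if not is_dubious(c)]


if __name__ == "__main__":
    # "John Make" is dropped because "Make" is tagged as a verb.
    print(filter_person_candidates(["John Make", "John Smith"]))
```

With this toy lexicon, "John Make" is discarded while "John Smith" is kept, which mirrors the example given for the filter.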
Example of Named Entities with Filters
Text | Filter using POS | Use Known words | Both filters | No filter
--- | --- | --- | --- | ---
J. Brown | NE.person | no annotation | no annotation | NE.person
J. Told | NE.person | no annotation | no annotation | NE.person
Mr Brown | NE.person | NE.person | NE.person | NE.person
Mr Told | no annotation | NE.person | no annotation | NE.person
Mr J. Brown | NE.person | NE.person | NE.person | NE.person
Mr J. Told | NE.person | NE.person | NE.person | NE.person
Teddy Brown | NE.person | NE.person | NE.person | NE.person
Teddy Told | no annotation | NE.person | no annotation | NE.person
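The "Use Known words" column of the example above is consistent with a simple rule: without a title, an initial followed by a known word is rejected, and everything else is kept. The following Python sketch models that reading. It is an assumption drawn from the examples, not the product's algorithm; KNOWN_WORDS and TITLES are hypothetical excerpts, and the real resources are the precompiled known_words.en and known_words.fr files.

```python
import re

# Hypothetical excerpts used for illustration only.
KNOWN_WORDS = {"brown", "told"}
TITLES = {"mr", "mrs", "miss", "dr"}

# A first-name initial such as "J."
INITIAL = re.compile(r"^[A-Z]\.$")


def keep_with_known_words(candidate: str) -> bool:
    """Return True when the candidate survives known-word disambiguation."""
    tokens = candidate.split()
    # A leading title abbreviation removes the ambiguity.
    if tokens[0].rstrip(".").lower() in TITLES:
        return True
    # Without a title, "initial + known word" is considered too ambiguous.
    ambiguous = (
        len(tokens) == 2
        and INITIAL.match(tokens[0]) is not None
        and tokens[1].lower() in KNOWN_WORDS
    )
    return not ambiguous


if __name__ == "__main__":
    for text in ["J. Brown", "Mr Brown", "Mr J. Brown", "Teddy Brown"]:
        label = "NE.person" if keep_with_known_words(text) else "no annotation"
        print(f"{text}: {label}")
```

Under these assumptions, the sketch gives the same result as the "Use Known words" column for all eight sample texts.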
Named Entities Classes and Subclasses
These Named Entities classes and subclasses are based on several standard schemas defined on http://schema.org.
List of Named Entities Classes and Subclasses Detected by the Named Entities Matcher
NE Type | Annotations | Description | Examples
--- | --- | --- | ---
People | NE.person | Rule-based matching and an ontology of first names and titles | "John Smith"
People (subclass) | NE.famousperson | Exact name matching based on an ontology and rules | "Albert Einstein"
People (subclass) | NE.partialperson | Patterns in a rules matcher | "Mr Smith" or "J. Smith"
Organization | NE.organization | Based on ontology and rules | "EXALEAD", "Independent Human Rights Commission"
Organization (subclass) | NE.organization.corporation | | "EXALEAD", "Walt Disney Company", "Burger King"
Organization (subclass) | NE.organization.governmentorganization | | "NATO", "Department of Defense", "The Supreme Court"
Organization (subclass) | NE.organization.nongovernmentorganization | | "Greenpeace", "Sea Shepherd Conservation Society"
Organization (subclass) | NE.organization.educationalorganization | | "Harvard", "MIT", "Science-Po Paris"
Organization (subclass) | NE.organization.sportsteam | | "Arsenal", "PSG", "Lakers"
Organization (subclass) | NE.organization.miscellaneousorganization | | "PADI", "Ju-Jitsu Association"
Place | NE.place | Ontology-based matching | "New Orleans"
Place (subclass) | NE.place.city | | "Cambridge"
Place (subclass) | NE.place.country | | "United Kingdom"
Place (subclass) | NE.place.state | | "California"
Place (subclass) | NE.place.otheradministrativearea | | "Greater London"
Place (subclass) | NE.place.landform | | "Mediterranean Sea", "The Highlands"
Place (subclass) | NE.place.civicstructure | | "Madison Square Garden", "Royal Albert Hall"
Event | NE.event | Rule-based matching | "2nd New York Jazz Festival", "London 2012"
Event (subclass) | NE.event.cultural | | "Avignon Theater Festival", "Asian Regional Meeting", "Cuba's Bishops Conference"
Event (subclass) | NE.event.military | | "Falklands War", "World-War-II", "Battle of Waterloo"
Event (subclass) | NE.event.natural | | "Hurricane Katrina", "Blizzard of 1993"
Event (subclass) | NE.event.political | | "French presidential election", "Inauguration of Barack Obama"
Event (subclass) | NE.event.religious | | "Easter Monday", "Aïd el Kebir", "Pessah"
Event (subclass) | NE.event.social | | "Independence Day", "World Day for Migrants and Refugees"
Event (subclass) | NE.event.sport | | "2008 Summer Olympics", "Football World Cup", "Moto GP Championship"
Event (subclass) | NE.event.security | | "Suicide bombing", "Spinboldak attack"
Date | There are several annotations, see below | Rule-based matching |
Date | NE.date | Normalized to European numerical standard "day month year" with two-digit days and months | "14 06 1982", "05 12 2003"
Date | NE.date.full | If found, the normalized day of the week is prepended | "Mon 13 02 1977" (English), "Lun 13 02 1977" (French)
Date | NE.date.uk, NE.date.us | For English text, two annotations are set for ambiguous dates. Use the annotation NE.date.uk for British texts and NE.date.us for American texts |
Price | NE.money | Rule-based matching and ontology for currencies | "$2.73", "4,5€", "three hundred million dollars"
Price (subclass) | currency.unity | Subclass intended to simplify currency conversions | dollar US
Price (subclass) | currency.quantity | Subclass intended to simplify currency conversions | 150
French postal address | NE.address.fr | Rule-based matching and ontology of French cities | "10 place de la Madeleine, 75008 Paris"
French phone number | NE.phone.fr | Rule-based matching | "(+33)6.82.33.15.12", "05 64 222 222"
Time, duration, and time ranges in French | NE.time | Rule-based matching | "13h45", "3 h 56 min 12 sec", "de 7h03 à 17h28"
Email | NE.email | Rule-based matching | john.smith@gmail.com
URL | NE.url | Rule-based matching | "https://www.exalead.com"
IP v4 address | NE.ip | Rule-based matching | "192.168.204.120"
Credit card | NE.creditcard | Generated by Basis Tech tokenizer and rule-based matching | "378282246310005" (American Express)
Note:  
The following formats are not supported:
Australian BankCard: 5610591081018250
Some VISA number patterns: 4222222222222
Dankort (PBS): 76009244561
Dankort (PBS): 5019717010103742
Switch/Solo (Paymentech): 6331101999990016
For an example of Named Entities processing, see Test the Semantic Processing of your Analysis Pipeline.
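To make the date normalization described for NE.date, NE.date.full, and NE.date.uk / NE.date.us more concrete, here is a minimal Python sketch. It is only an illustration: the function names are hypothetical, and it assumes that each of the two annotations set for an ambiguous English date carries the normalized form of its own reading (day first for British text, month first for American text).

```python
from datetime import date

# English weekday abbreviations, indexed like date.weekday() (Monday is 0).
WEEKDAYS_EN = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def normalize(d: date) -> str:
    """Format as "day month year" with two-digit days and months."""
    return f"{d.day:02d} {d.month:02d} {d.year:04d}"


def normalize_full(d: date) -> str:
    """Prepend the abbreviated day of the week to the normalized date."""
    return f"{WEEKDAYS_EN[d.weekday()]} {normalize(d)}"


def ambiguous_english_date(first: int, second: int, year: int) -> dict:
    """Return both readings of a numeric date such as 03/04/2005.

    Both numbers must be valid months (1 to 12) for the date to be
    ambiguous; otherwise date() raises ValueError for one reading.
    """
    return {
        "NE.date.uk": normalize(date(year, second, first)),  # day first
        "NE.date.us": normalize(date(year, first, second)),  # month first
    }


if __name__ == "__main__":
    print(normalize(date(1982, 6, 14)))
    print(normalize_full(date(1982, 6, 14)))
    print(ambiguous_english_date(3, 4, 2005))
```

In this sketch, 03/04/2005 yields "03 04 2005" for the British reading and "04 03 2005" for the American one, so the choice of annotation determines which date is indexed.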
Extract Your Own Named Entities
The rules for the Named Entities Matcher are packaged in the product. Depending on the entity type, matching is based on:
a predefined ontology,
predefined rules,
or a combination of the two.
To enrich matches for certain entities, you can add your own Ontology Matcher and Rules Matcher processors to the analysis pipeline. The following table helps you define the best configuration.
Extract entities with... | When... | For example
--- | --- | ---
rules | Entities are either numerical or textual, and their values are countless. Context can be identified: for example, in "$100", the "$" symbol shows that 100 is a price. Some parts of your entities are already annotated by the Named Entities Matcher or another resource. | Dates, phone numbers, emails, URLs, addresses, prices, etc.
ontology resources | Entities are textual and their values can be listed (not infinite). A resource already exists (employees, categories). The context does not help to identify them. Listing them is not a big challenge. You need to normalize output values. | First names, cities, days, months, etc. To normalize a group of variants such as USA, United States, United States of America, Etats Unis, Estados Unidos to United States of America.
both rules and ontology resources | The number of values to extract is countless, but part of each entity is a clue for recognizing it. | Mr Obama. We need: a resource to annotate Mr, Mister, Miss, etc.; a rule to extract persons' names when we have an annotation next to a capitalized word.
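The two ideas in this table (an ontology that normalizes listed values, and a rule that builds on an existing annotation) can be sketched in a few lines of Python. This is a conceptual illustration only, not how the Ontology Matcher or Rules Matcher are configured: COUNTRY_ONTOLOGY, TITLE_FORMS, and both function names are hypothetical, and a regular expression stands in for a real rule.

```python
import re

# Hypothetical ontology resource: each listed form maps to one normalized value.
COUNTRY_ONTOLOGY = {
    "usa": "United States of America",
    "united states": "United States of America",
    "united states of america": "United States of America",
    "etats unis": "United States of America",
    "estados unidos": "United States of America",
}

# Hypothetical resource of title forms (the "clue" part of the entity).
TITLE_FORMS = ["Mr", "Mister", "Miss"]

# Stand-in for a rule: a title form followed by a capitalized word.
PERSON_RULE = re.compile(
    r"\b(?:" + "|".join(TITLE_FORMS) + r")\.?\s+[A-Z][a-z]+"
)


def normalize_country(text: str):
    """Return the normalized form of a listed country name, or None."""
    return COUNTRY_ONTOLOGY.get(text.strip().lower())


def extract_persons(text: str):
    """Return every span made of a title form next to a capitalized word."""
    return [match.group(0) for match in PERSON_RULE.finditer(text)]


if __name__ == "__main__":
    print(normalize_country("Estados Unidos"))
    print(extract_persons("Yesterday Mr Obama met Miss Smith."))
```

The ontology handles values that can be listed, while the rule handles the open-ended part (any surname) by relying on the listed titles as context.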
Set Block Lists and Allow Lists for Named Entities Extraction
Block List People's Names
1. In the Administration Console, add an Ontology Matcher to the semantic pipeline after the Named Entities Matcher.
a. Expand the Ontology Matcher configuration panel, and click Create new to create a new ontology resource.
b. Click Apply.
c. Click Edit.
The Business Console opens to let you configure the ontology resource.
2. Create a blocklist.person ontology annotation that lists all the names you do not want to index.
a. Click Add annotation, and give it a name, for example, blocklist.person.
b. For this annotation, click Add display form, enter the name to block list in Match text form and click Add text form.
c. Repeat the previous substep for all the names you want to block list.
d. Click Go Live.
3. Create an XML file in your DATADIR/resource directory, for example myannotationmanager.xml, and copy the following code into it:
<AnnotationManager name="blocklist remover" xmlns="exa:com.exalead.linguistic.v10"
    ignoreInvalidOperations="true">
  <!-- Remove NE.person annotations that match a blocklist.person annotation -->
  <Remove annotation="NE.person" ifMatchWith="blocklist.person" />
</AnnotationManager>
4. Go back to the Administration Console, and add an Annotation Manager to the semantic pipeline after the Ontology Matcher.
a. Expand the Annotation Manager.
b. Click Browse and select the path of the XML file.
c. Click Apply.
5. Reindex all data.
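The effect of the Remove operation configured in step 3 can be pictured with the following Python sketch. It is a conceptual model, not the Annotation Manager's implementation: the Annotation class and the function are hypothetical, and the sketch assumes that "match" means the two annotations cover exactly the same span of text, which the product may define differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    tag: str
    start: int  # offset of the first character
    end: int    # offset just after the last character


def remove_if_match_with(annotations, annotation, if_match_with):
    """Drop `annotation` entries whose span also carries `if_match_with`.

    Assumption: two annotations match when they cover the same span.
    """
    blocked_spans = {
        (a.start, a.end) for a in annotations if a.tag == if_match_with
    }
    return [
        a for a in annotations
        if not (a.tag == annotation and (a.start, a.end) in blocked_spans)
    ]


if __name__ == "__main__":
    # Annotations for the text "John Doe met Jane Roe",
    # where "John Doe" is on the block list.
    found = [
        Annotation("NE.person", 0, 8),
        Annotation("blocklist.person", 0, 8),
        Annotation("NE.person", 13, 21),
    ]
    for a in remove_if_match_with(found, "NE.person", "blocklist.person"):
        print(a)
```

In this model, the NE.person annotation on "John Doe" is removed, while the blocklist.person annotation and the NE.person annotation on "Jane Roe" remain. This ordering is why the Annotation Manager must come after both the Named Entities Matcher and the Ontology Matcher in the pipeline.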
Test Your Block List Configuration
1. In the Business Console, select Semantic > Resources.
2. Select your ontology block list.
3. Use the Test tab to test the semantic pipeline behavior.
a. Enter one of the block listed names in the text field.
b. The Annotations panel displays the blocklist.person tag.