A dictionary is a structure, separate from the index, that stores all the words from the indexed documents, together with their number of occurrences in the corpus. It supports linguistic expansion mechanisms such as spell-checking and regular expression matching.
During installation, features requiring a dictionary are set up with the default dictionary, dict0. You can change the configuration of dictionary resources in the default dictionary or create additional dictionaries to suit your needs.
Dictionary Resources
All these resources are already configured for the default dictionary, dict0. Use this list to change default settings or to build new dictionaries.
Resource | Description
Words | Stores words and their frequencies, which are used to calculate relevance and for term expansion. Words that occur fewer times than the specified Min frequency do not appear in the dictionary.
Ngrams | Used to improve spell-check accuracy. PREREQUISITE: Select the Extract spell check ngrams option for the semantic types associated with this dictionary.
Phonetic Forms | Used to improve spell-check accuracy and to calculate relevance and term expansion. Required for phonetic term expansion. PREREQUISITE: Define a phonetic semantic processor in the pipeline, or select the Extract phonetic forms option for the semantic types associated with this dictionary.
Related Terms | Required to provide related terms in this language. PREREQUISITE: Define a related terms semantic processor in the pipeline.
Multiple Dictionaries
Exalead CloudView supports multiple dictionaries. Each dictionary is configured separately with its own name, maximum size, and so forth.
• On the indexing side, you can configure a semantic type to use a specific dictionary. When you associate a data model property with that semantic type, the generated index field is associated with the same dictionary. The dictionary can therefore be limited to the words likely to appear in that field.
• Symmetrically, at search time, each prefix handler can target a specific dictionary (for example, for regexp search).
Each dictionary also lets you define filtering rules that control which words it stores. For example, you can store only words with a minimum number of characters, or only words matching a regular expression.
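The filtering rules described above can be illustrated with a minimal Python sketch. This is not CloudView code: the function name keep_word, its parameters, and the choice to apply both rules together are assumptions made for illustration only.

```python
import re

def keep_word(word, min_length=3, pattern=None):
    """Return True if `word` passes the (hypothetical) filtering rules."""
    # Rule 1: reject words shorter than the minimum number of characters.
    if len(word) < min_length:
        return False
    # Rule 2: if a pattern is given, the whole word must match it.
    if pattern is not None and re.fullmatch(pattern, word) is None:
        return False
    return True

candidate_words = ["a", "of", "index", "dict0", "search"]
# Keep words of at least 3 characters made only of lowercase letters.
kept = [w for w in candidate_words if keep_word(w, min_length=3, pattern=r"[a-z]+")]
print(kept)  # ['index', 'search']
```

In this sketch, "dict0" is rejected because the digit does not match the pattern, and "a" and "of" are rejected because they are too short.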
Setting Up a Dictionary
Create a New Dictionary
1. In the Administration Console, go to Index > Linguistics > Dictionaries.
2. Click Add Dictionary.
TIP: For Creation mode, select copy.
To determine which elements you need in this dictionary, see About Dictionaries.
3. Click Apply.
Associate a Dictionary to Metas via Semantic Types
1. In the Administration Console, go to Index > Data Model > Semantic Type.
2. Expand a semantic type, and in the Dictionary field, select the dictionary.
Note: If you do not want to store words in a dictionary, select None.
3. Select the prerequisite options, depending on which elements are in your dictionary. See About Dictionaries.
4. Click Apply.
Associate a Dictionary to Metas via Mappings
1. In the Administration Console, go to Index > Data processing > pipeline name > Mappings.
2. Under the Mapping sources column, expand the meta you want to associate with a dictionary.
3. Under the Mapping targets column, select the dictionary name. Then, under the Details column, select the dictionary elements in which you want this meta to be stored. See About Dictionaries.
4. Repeat for all the mappings you want to associate with dictionaries.
5. Click Apply.
Change the Default Dictionary
The first dictionary in your list of dictionaries is the default dictionary. Since a new Exalead CloudView installation only includes one dictionary, dict0, it automatically becomes the default dictionary.
1. To set another default dictionary, use the Default dictionary list under Dictionary.
Set Up a Dictionary Resource
This procedure shows how to set up the Words resource. You can configure other resources similarly.
1. In the Administration Console, go to Index > Linguistics > Dictionaries.
2. Select (or add) your dictionary.
3. Expand Words.
4. Under Actions, click the edit tool next to the language you want to configure.
5. From the Edit language config dialog box, configure:
◦ Max No. terms: Set the maximum number of terms allowed for the selected language.
◦ Min frequency: Set the minimum number of occurrences a word needs in order to be stored in the dictionary for that language.
◦ Regexp filter: Define a pattern of words to exclude from the dictionary for this language.
6. Click Accept.
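To show how the three Words settings could interact, here is a minimal Python sketch. It is not the product's implementation: the function build_words_resource is hypothetical, and keeping the most frequent terms when the maximum is exceeded is an assumption, since this guide does not say how the limit is enforced.

```python
import re
from collections import Counter

def build_words_resource(tokens, max_terms, min_frequency, exclude_regexp=None):
    """Hypothetical model of the Words resource for one language."""
    counts = Counter(tokens)
    excluded = re.compile(exclude_regexp) if exclude_regexp else None
    kept = {
        word: freq
        for word, freq in counts.items()
        # Min frequency: drop words that occur too rarely.
        if freq >= min_frequency
        # Regexp filter: drop words that match the exclusion pattern.
        and not (excluded and excluded.fullmatch(word))
    }
    # Max No. terms: keep the most frequent terms first (assumption),
    # breaking ties alphabetically so the result is deterministic.
    ranked = sorted(kept.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:max_terms])

tokens = ["cloud", "cloud", "cloud", "view", "view",
          "index", "index", "2024", "2024", "rare"]
words = build_words_resource(tokens, max_terms=2, min_frequency=2,
                             exclude_regexp=r"\d+")
print(words)  # {'cloud': 3, 'index': 2}
```

Here "rare" is dropped by the minimum frequency, "2024" by the exclusion pattern, and "view" by the term limit.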
Compacting and Building Dictionaries
Dictionaries support compact and build policies.
• Compact policies: Dictionary data is regularly compacted after N import operations and/or N seconds, to keep a single file per resource.
• Build policies: Dictionaries are regularly rebuilt after N compact operations and/or N seconds, so that they stay up to date.
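The trigger logic of such a policy can be sketched in Python as follows. This is an illustration only, not CloudView code: the CompactPolicy class and its method names are hypothetical, and it assumes that "and/or" means the operation runs as soon as either threshold is reached.

```python
class CompactPolicy:
    """Hypothetical trigger: compact after N imports and/or N seconds."""

    def __init__(self, max_imports=None, max_seconds=None):
        self.max_imports = max_imports      # None disables the count rule
        self.max_seconds = max_seconds      # None disables the time rule
        self.imports_since_compact = 0
        self.last_compact_time = 0.0

    def record_import(self):
        self.imports_since_compact += 1

    def should_compact(self, now):
        by_count = (self.max_imports is not None
                    and self.imports_since_compact >= self.max_imports)
        by_time = (self.max_seconds is not None
                   and now - self.last_compact_time >= self.max_seconds)
        # Assumption: either condition is enough to trigger a compact.
        return by_count or by_time

    def mark_compacted(self, now):
        self.imports_since_compact = 0
        self.last_compact_time = now

policy = CompactPolicy(max_imports=3, max_seconds=60)
policy.record_import()
policy.record_import()
print(policy.should_compact(now=10.0))  # False: 2 imports, 10 s elapsed
policy.record_import()
print(policy.should_compact(now=10.0))  # True: 3 imports reached
policy.mark_compacted(now=10.0)
print(policy.should_compact(now=75.0))  # True: 65 s since last compact
```

A build policy could be modeled the same way, counting compact operations instead of import operations.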
The following procedures explain how to configure compact and build operations.
Compact Individual Dictionaries
1. In the Administration Console, go to Index > Linguistics > Dictionaries > Dictionary > dictn > Configuration.
2. Select Enable compact and specify the compact policy.
◦ Choose to compact after N import streams have completed since the last compact operation.
◦ Choose to compact every N seconds.
3. Click Save and Apply.
Fine-Tune the Compact Size
1. Edit the <DATADIR>/config/Dictionary.xml file.
2. Add a FrequencyCompactFilter to the CompactPolicies node, as shown in the following example.