Compound Words Splitter

The Compound Words Splitter processor splits CamelCase, lowerCamelCase, and underscored_case words into separate words.

Example

Input	Output
SearchServer	Search Server
simpleSearchServer	simple Search Server
simple_value	simple value
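
The splitting rule shown in the table can be approximated with a short Python sketch. This is an illustration only, not the processor's actual implementation: the function name split_compound and the regular expression are assumptions, and the processor may treat digits, acronyms, or consecutive uppercase letters differently.

```python
import re

# Assumed boundary rule: a lowercase letter or digit followed by an uppercase letter.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def split_compound(word: str) -> str:
    """Hypothetical helper: split a CamelCase or underscored word into separate words."""
    # Treat underscores as word separators, then insert a space at each case boundary.
    spaced = _CAMEL_BOUNDARY.sub(" ", word.replace("_", " "))
    # Collapse any repeated whitespace left by consecutive underscores.
    return " ".join(spaced.split())


for value in ("SearchServer", "simpleSearchServer", "simple_value"):
    print(value, "->", split_compound(value))
# SearchServer -> Search Server
# simpleSearchServer -> simple Search Server
# simple_value -> simple value
```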

To allow searching for these words individually, you must enable the Tokenize annotations option. It creates a token for each root word of the compound word. You need this option to index the individual values because annotations are otherwise not tokenized (the same behavior as the spellchecker).

For example:

Input	Output
SearchServer	• Search • Server
simpleSearchServer	• simple • Search • Server
simple_value	• simple • value
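
To see why tokenization matters for search, the following Python sketch builds a toy inverted index with and without one token per root word. It is a conceptual model only: root_words, build_index, and the tokenize flag are hypothetical names that stand in for the Tokenize annotations option and do not reflect the engine's internals.

```python
import re
from collections import defaultdict

# Assumed boundary rule: a lowercase letter or digit followed by an uppercase letter.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def root_words(value: str) -> list[str]:
    """Hypothetical helper: return the root words of a compound word."""
    return _CAMEL_BOUNDARY.sub(" ", value.replace("_", " ")).split()


def build_index(docs: dict[int, str], tokenize: bool) -> dict[str, set[int]]:
    """Map each indexed term to the ids of the documents that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, value in docs.items():
        # Without tokenization, the whole value is indexed as a single term.
        terms = root_words(value) if tokenize else [value]
        for term in terms:
            index[term].add(doc_id)
    return dict(index)


docs = {1: "SearchServer", 2: "simple_value"}
untokenized = build_index(docs, tokenize=False)
tokenized = build_index(docs, tokenize=True)

print("Server" in untokenized)   # False: only the full value "SearchServer" is indexed
print(tokenized["Server"])       # {1}: the root word can be searched on its own
```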

When to Use

This processor is useful in many situations, including:

• Agglutinated data coming from a database, for example, names such as JohnSteed, EmmaPeel, and JohnGambitt.

• Source code, where you want to search for variable, class, and function names. Searching is more convenient when these compound names are split into multiple words, for example, when you want the query search to retrieve a document containing SearchServer.

Note: If you need to index "real" compound words that contain no uppercase letters or underscores (for example, wheelchair or editor-in-chief), use standard tokenization. For more information, see Customizing the Tokenization Config.

Dependencies

Add a Normalizer processor to the analysis pipeline if you want to index not only the exact forms of the split words but also their lowercase and normalized forms.
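
The sketch below illustrates the idea of indexing both the exact form and a normalized form of a split word. It assumes that "normalized" means lowercased with diacritics removed, which this page does not specify; normalize and index_forms are hypothetical names, not the Normalizer processor's API.

```python
import unicodedata


def normalize(word: str) -> str:
    """Assumed normalization: strip diacritics, then lowercase."""
    decomposed = unicodedata.normalize("NFKD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def index_forms(word: str) -> set[str]:
    """Return the exact form plus its normalized form, both to be indexed."""
    return {word, normalize(word)}


print(sorted(index_forms("Server")))  # ['Server', 'server']
```

With only the exact form indexed, a query for server would not match the root word Server in this model.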