The Indexing and Search Processes

The Exalead CloudView index is generational. Documents to be indexed are divided into batches known as jobs.

Each indexed job creates a new generation of the index. Exalead CloudView stores this new generation in a data structure called a slot. Each new slot is appended to the original index.

Once you commit the latest generation (slot) to the index, the index replicas are updated.

At search time, Exalead CloudView searches in all these slots of all the index replicas, and merges the results to return the final result set.

From time to time, slots are compacted to create an index with fewer slots, for more efficient searching.
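The generation, slot, and compaction concepts above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class and method names (Slot, GenerationalIndex, commit, search, compact) are hypothetical and are not part of any Exalead CloudView API, and matching is reduced to a whole-word lookup.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    """One index generation: the documents added by a single committed job."""
    generation: int
    docs: dict[str, str]  # document id -> text


@dataclass
class GenerationalIndex:
    """Hypothetical model of a generational index (not the CloudView implementation)."""
    slots: list[Slot] = field(default_factory=list)
    next_generation: int = 1

    def commit(self, job: dict[str, str]) -> Slot:
        """Append the job as a new slot; existing slots are left untouched."""
        slot = Slot(self.next_generation, dict(job))
        self.slots.append(slot)
        self.next_generation += 1
        return slot

    def search(self, term: str) -> list[str]:
        """Search every slot and merge the hits into one sorted result list."""
        wanted = term.lower()
        hits: set[str] = set()
        for slot in self.slots:
            for doc_id, text in slot.docs.items():
                if wanted in text.lower().split():
                    hits.add(doc_id)
        return sorted(hits)

    def compact(self) -> None:
        """Merge all slots into one so that later searches scan fewer slots."""
        if not self.slots:
            return
        merged: dict[str, str] = {}
        for slot in self.slots:  # later generations overwrite earlier ones
            merged.update(slot.docs)
        self.slots = [Slot(self.slots[-1].generation, merged)]


index = GenerationalIndex()
index.commit({"doc1": "red apple"})    # generation 1
index.commit({"doc2": "green apple"})  # generation 2
before = index.search("apple")
index.compact()
after = index.search("apple")
```

In this model, compaction changes the number of slots but not the search results, which is the property the text describes.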

Indexing is the process of scanning a document to create an index and to store its metas (content such as fields) into a collection.

The following diagram explains the process to update an index and its replicas.

The connectors retrieve documents from data sources, convert them into PAPI documents, and send these documents to the Indexing Server process.

For each document, the Indexing Server assigns:

Important: You can add the Consolidation Phase between Phase 1 and Phase 2. It allows you to transform and aggregate documents coming from different sources before pushing them to the Indexing Server. For more details on consolidation, see the Exalead CloudView Consolidation Server Guide.

Analysis is the process of formatting content and extracting information from documents pushed by connectors before storing them in the index.

In Exalead CloudView, an indexing job starts as soon as the Indexing Server receives a document. The analysis pipeline processes the document immediately, using several threads for better performance.

Processors play an important role during the analysis phase. The document and semantic processors parse each document in the job to perform text extraction, semantic processing, custom operations, and mapping.

Commit Triggers define the conditions that prompt the saving of the analysis to the index.

When you commit, the results of the analysis create:

• An import to the index, which merges the data computed during the analysis with the data present in the index. This results in a new generation of the index. The new data resides in a new, separate slot in the index.

• Semantic annotations (linguistic statistical data) about the corpus, which are sent to the dictionary builder of the Indexing Server process. This data can be used for query expansion and index-time semantic processing.

The index is now committed to disk.

After the new index data is committed, the new index generation is replicated on all index slices in the deployment. Once fully replicated, the new documents are available for searching.

Once the dictionary builder has received new semantic annotations, it updates the dictionary (or dictionaries, if you configured multiple ones) on the search server.

Search is triggered when end users (or a third-party application) submit a query to Exalead CloudView.

The following diagram shows how Exalead CloudView parses and expands queries before searching for matches on the index replicas.

The user enters a query in UQL (User Query Language). The Mashup UI (or a custom search API application) forwards the query to the Search API. If you configured security, the query includes the security tokens.

Parsing involves checking whether the query includes words and operators. If there are no operators in the original query, this step inserts the default operator (AND) between multiple words, unless the words are enclosed in quotation marks.
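The default-operator rule can be sketched as follows. This is an illustration only, not the actual UQL parser: the function name insert_default_operator is hypothetical, and the operator set is an assumed subset used to show the rule.

```python
import re

# Assumed subset of operators, for illustration only.
OPERATORS = {"AND", "OR", "NOT"}

# A token is either a double-quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'"[^"]*"|\S+')


def insert_default_operator(query: str, default: str = "AND") -> str:
    """Insert the default operator between words when the query has no operator.

    Quoted phrases are kept as a single token, so nothing is inserted inside them.
    """
    tokens = TOKEN.findall(query)
    if any(token in OPERATORS for token in tokens):
        return query  # the user already supplied operators; leave the query alone
    return f" {default} ".join(tokens)


plain = insert_default_operator("red apple")
quoted = insert_default_operator('"red apple" pie')
explicit = insert_default_operator("red OR apple")
```

Here plain becomes `red AND apple`, quoted becomes `"red apple" AND pie`, and explicit is returned unchanged because it already contains an operator.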

To linguistically expand this query, Exalead CloudView consults the dictionary to check for words available in the corpus, then sends a fully expanded query to the index slices.

The fully expanded query is broken down into a more granular query language known as ELLQL (Exalead Lower-level Query Language) so the index slices can understand it. During this step, the ELLQL query includes the security tokens so that the index slices can verify whether you have access to the matching documents.
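The role of security tokens at match time can be pictured with a simple filter. This is a sketch under loud assumptions: the IndexedDoc type, its allowed_tokens field, and the rule "a document is visible when it shares at least one token with the user" are hypothetical and chosen only for illustration; the source does not describe how index slices actually evaluate tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDoc:
    """Hypothetical indexed document carrying the tokens allowed to read it."""
    doc_id: str
    text: str
    allowed_tokens: frozenset[str]


def visible_hits(docs: list[IndexedDoc], term: str, user_tokens: set[str]) -> list[str]:
    """Return ids of documents that match the term AND share a token with the user."""
    wanted = term.lower()
    return [
        doc.doc_id
        for doc in docs
        if wanted in doc.text.lower().split()
        and doc.allowed_tokens & user_tokens  # assumed rule: any shared token grants access
    ]


docs = [
    IndexedDoc("public-1", "apple harvest report", frozenset({"everybody"})),
    IndexedDoc("hr-7", "apple bonus plan", frozenset({"group:hr"})),
]
as_employee = visible_hits(docs, "apple", {"everybody"})
as_hr = visible_hits(docs, "apple", {"everybody", "group:hr"})
```

Both documents match the term, but the first user only sees the document whose tokens overlap with their own.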

Query execution consists of:

• Searching for the most relevant matches in all index slices.

All slices receive the query because each slice only contains a portion of the corpus. The hits from each slice are merged in the search server before returning all the matching hits to the user.
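The scatter-and-merge pattern described above can be sketched as follows. The function names search_slice and search_all_slices are hypothetical, each slice is modeled as a plain dictionary, and term frequency stands in for relevance; none of this reflects the actual CloudView scoring or transport.

```python
from concurrent.futures import ThreadPoolExecutor


def search_slice(slice_docs: dict[str, str], term: str) -> list[tuple[int, str]]:
    """Search one slice; return (score, doc_id) pairs, scored by term frequency."""
    wanted = term.lower()
    hits = []
    for doc_id, text in slice_docs.items():
        score = text.lower().split().count(wanted)
        if score:
            hits.append((score, doc_id))
    return hits


def search_all_slices(slices: list[dict[str, str]], term: str) -> list[tuple[int, str]]:
    """Send the query to every slice, then merge the per-slice hits by score."""
    with ThreadPoolExecutor() as pool:
        per_slice = list(pool.map(lambda docs: search_slice(docs, term), slices))
    merged = [hit for hits in per_slice for hit in hits]
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(merged, key=lambda hit: (-hit[0], hit[1]))


slices = [
    {"a": "apple apple pie"},           # slice 1 holds part of the corpus
    {"b": "apple tart", "c": "pear"},   # slice 2 holds the rest
]
results = search_all_slices(slices, "apple")
```

Because no single slice holds the whole corpus, skipping a slice would silently drop its hits, which is why every slice receives the query before the merge.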