Documents are all the objects to be indexed by Exalead CloudView, regardless of file or entity type in the data source. For example, HTML, JPG, and CSV files and database records are all considered documents within Exalead CloudView, since they are all converted into an Exalead CloudView-specific document format (also known as a PAPI document) after being scanned by a connector.
Items are the objects to be indexed by Exalead CloudView, regardless of file or entity type in the data source. For example, in OnePart, 3D CAD files, JPGs, and PDFs are all considered items in the index.
A PAPI document is the exchange format between the connectors and Exalead CloudView. It is an abstraction that lets all connectors speak the same language to the index. The Push API handles documents that contain the following elements:
• URI
• Stamp (optional)
• Metas
• Parts
• Directives
URI
URI is the unique identifier of the document inside the indexed corpus of the connector.
Note: The "URI" described in this document is an opaque string (with an optional "/" character hierarchy), and is NOT necessarily a URI as per RFC 2396, although connectors may use regular Internet URIs. For example:
Sample URI: a/b/doc
Interpreted as:
Folder: a
  |_ Folder: b
    |_ Document: doc
Sample URI: a/b///doc
Interpreted as:
Folder: a
  |_ Folder: b
    |_ Folder: (empty name)
      |_ Folder: (empty name)
        |_ Document: doc
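The interpretation of "/" in an opaque URI can be illustrated with a minimal sketch. It is not part of the Push API: the class UriHierarchy and its methods segments and describe are hypothetical names introduced for illustration, and the only assumption is that every "/" separates one level, so consecutive slashes produce folders with an empty name.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not an Exalead CloudView class.
final class UriHierarchy {

    /** Splits an opaque URI on "/" and keeps empty segments. */
    static List<String> segments(String uri) {
        // A limit of -1 keeps empty strings, so "a/b///doc" yields two empty names.
        return Arrays.asList(uri.split("/", -1));
    }

    /** Renders one line per level: folders first, the document last. */
    static String describe(String uri) {
        List<String> parts = segments(uri);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < parts.size(); i++) {
            boolean last = (i == parts.size() - 1);
            String name = parts.get(i).isEmpty() ? "(empty name)" : parts.get(i);
            if (i > 0) {
                out.append("|_ ");
            }
            out.append(last ? "Document: " : "Folder: ").append(name).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe("a/b///doc"));
    }
}
```

Keeping the empty segments matters here: collapsing repeated slashes would make a/b///doc and a/b/doc look like the same document, whereas they are distinct identifiers.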
Stamps
A stamp is a fingerprint that represents the state, or version, of the document. Stamps are stored by Exalead CloudView and retrieved by the connector to determine which version of the document has been indexed, and whether it should be updated. The document is updated if the new stamp is not equal to the previous one.
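The comparison rule can be sketched as follows. This is a self-contained illustration, not the Push API: the class StampSync, its in-memory map (a stand-in for the stamps stored by Exalead CloudView), and the stamp format built from a modification time and a size are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical model of stamp-based change detection.
final class StampSync {

    // Stand-in for the stamps that the index stores per document URI.
    private final Map<String, String> indexedStamps = new HashMap<>();

    /** Builds a stamp from file attributes (assumed format: "lastModified:size"). */
    static String stampOf(long lastModifiedMillis, long sizeBytes) {
        return lastModifiedMillis + ":" + sizeBytes;
    }

    /** A document is pushed when it was never indexed or its stamp differs. */
    boolean needsUpdate(String uri, String newStamp) {
        if (!indexedStamps.containsKey(uri)) {
            return true; // never indexed
        }
        return !Objects.equals(indexedStamps.get(uri), newStamp);
    }

    /** Records the stamp once the document has been sent. */
    void recordIndexed(String uri, String stamp) {
        indexedStamps.put(uri, stamp);
    }
}
```

Any string works as a stamp as long as it changes whenever the document changes; the comparison is a plain equality test, not an ordering.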
Metas
Document metas, not to be confused with hit metas, are pieces of text belonging to a document that have associated values, such as title or size. Document metas are stored either as an index field or as a category. Context is sometimes used as a synonym for document meta.
Parts
Parts represent the binary parts of the content to be converted and indexed like a file. Usually, only one part is needed, but you may need to link some attachments to the content. All parts are merged together and are associated to the same URI.
Note that:
• A PAPI Part has a name (in all Exalead CloudView versions)
• The default Part name is master
• There must be one master Part per PAPI document (for preview)
Thus when a PAPI document has several parts:
• They must all have different names
• One of them must be named master. To set it, you can use the com.exalead.papi.helper.part.setAsMaster() member method.
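The two naming rules above can be checked before a document is pushed. The following is a minimal sketch under stated assumptions: PartNameCheck and isValid are hypothetical names, and part names are modeled as plain strings rather than Push API Part objects.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical pre-push check, not an Exalead CloudView class.
final class PartNameCheck {

    static final String MASTER = "master";

    /**
     * Returns true when all names are distinct and one of them is "master".
     * An empty list is rejected here; whether a document without parts is
     * acceptable is outside the scope of this sketch.
     */
    static boolean isValid(List<String> partNames) {
        Set<String> seen = new HashSet<>();
        for (String name : partNames) {
            if (!seen.add(name)) {
                return false; // duplicate part name
            }
        }
        return seen.contains(MASTER);
    }
}
```

Because names must be unique, checking that "master" is present is enough to guarantee there is exactly one master part.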
The Part name can be set with the following member methods:
Directives
Directives are internal properties embedded in an Exalead CloudView document. They specify either instructions on how to treat the document, or information on how to index the document.
Some directives are available at document level:
• datamodel_class: determines the data model class of the document. If this directive is not found, the data model class specified in the source connector configuration is used. If the source connector does not have a class, the default data model class is used. For example:
final Document myDocument = new Document("docId");
myDocument.setCustomDirective("datamodel_class", "myDocumentClass");
• forcedSlice: overrides the automatic load balancing of documents in the Exalead CloudView slices, by forcing the slice on which documents will be stored.
• sameSlice: (for V6R2014 and higher) forces the document to use the slice of another document by specifying the URI of this document.
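The fallback order described for datamodel_class (document directive, then connector configuration, then data model default) can be sketched as a simple lookup. This is an illustration only: DataModelClassResolver and resolve are hypothetical names, and a plain Map stands in for the directives of a document.

```java
import java.util.Map;

// Hypothetical illustration of the datamodel_class fallback order.
final class DataModelClassResolver {

    /**
     * @param documentDirectives directives set on the document (may lack datamodel_class)
     * @param connectorClass     class from the source connector configuration, or null
     * @param defaultClass       default class of the data model
     */
    static String resolve(Map<String, String> documentDirectives,
                          String connectorClass,
                          String defaultClass) {
        String fromDirective = documentDirectives.get("datamodel_class");
        if (fromDirective != null) {
            return fromDirective;   // 1. directive on the document
        }
        if (connectorClass != null) {
            return connectorClass;  // 2. source connector configuration
        }
        return defaultClass;        // 3. data model default class
    }
}
```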
Some directives are available at the part level to help the converter determine the content type. Note that the values of these directives cannot be null. Examples of supported directives:
• filename: the filename of the document
• mimeHint: a hint for the MIME type
• mime: the forced MIME type (use with caution)
• encoding: the encoding of the document
The analysis pipeline takes both metas and directives into account to determine how to process a document. For example, to get the file name of a document part, it looks for both the file_name meta and the filename directive, if any. We recommend using the meta when data must be indexed.
Important: When there are several directives in a document, delete operations are processed BEFORE add operations.
Consolidation Server directives
The following table shows the hard-coded order of Consolidation Server directive operations. These directives are created automatically by the Consolidation Server when you push methods to the transformation processors.
To add these directives (using com.exalead.cloudview.consolidationapi.PushAPITransformationHelpers), you can either:
• Include them directly within your documents.
• Use pre-aggregation transformation rules in the Exalead CloudView > Consolidation config.
Note: For more details on the Consolidation Server, see the Consolidation Server Guide.
After it has sent all documents from a data source to Exalead CloudView, a connector must generally keep the index up to date. This process is called synchronization. You can use either stamp-based or checkpoint-based synchronization to synchronize a data source. For more details, see Implementing Synchronization.
Supported Text Encodings
The binary content of parts in the text MIME subset may use any recognized encoding (see the list of available encodings below). The proper encoding should be specified in the part metadata (encoding or encoding hint).
All other elements must use UTF-8 (or its 7-bit subset, ASCII) as their sole encoding, in particular all Push API multipart commands.
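The reason a text part needs its encoding declared can be seen in a short sketch that uses only the standard Java charset classes. PartEncoding and decode are hypothetical names, and falling back to UTF-8 when no encoding is declared is an assumption of this example, not documented converter behavior.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of decoding part bytes with a declared encoding.
final class PartEncoding {

    /** Decodes text bytes with the declared encoding, or UTF-8 if none is given (assumption). */
    static String decode(byte[] content, String declaredEncoding) {
        Charset charset = (declaredEncoding == null)
                ? StandardCharsets.UTF_8
                : Charset.forName(declaredEncoding);
        return new String(content, charset);
    }

    public static void main(String[] args) {
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        // With the right encoding, the accented character survives.
        System.out.println(decode(latin1, "ISO-8859-1"));
        // Without it, the byte 0xE9 is not valid UTF-8 and is replaced.
        System.out.println(decode(latin1, null));
    }
}
```

The same bytes give different text depending on the charset used, which is why the encoding belongs in the part metadata rather than being left to guesswork.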