Configuring the XML Advanced Connector

About the Processor Pipeline

Configuring the PAPI Document Processor

Configuring the XML Element Processor

Configuring the XPath Processor

Configuring the XSLT Processor

Configuring the Tee Processor

Configuring the XML Attach Processor

Configuring the Child Split Processor

Configuring the XPath Split Processor

Configuring the Custom Processor

This connector lets you run several processors on XML documents. Some processors in the pipeline can be chained, meaning that the output of one processor is fed to another. Processors fall into two categories:

• Regular processors that handle one document and extract data from it.

• Split processors that extract chunks of XML data from one (possibly large) document and delegate chunk processing to other processors.

This section details the processors bundled with the connector. These are all regular processors unless stated otherwise.

About the Processor Pipeline

To configure the XML advanced connector, you need to:

• Configure the Global Configuration properties. See Property Descriptions.

• Create a pipeline with the processors described in this section.

In the pipeline:

• Every processor has an id.

• The connector needs to know where the chain of processors begins, so an Entry point id has to be supplied.

• If a processor is not the entry point, and not referenced by another processor, it will never be run.
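The reachability rule above can be pictured with a short sketch. This is not the connector's implementation: the dictionary layout, the processor ids, and the function name are all hypothetical and only model "a processor runs if it is the entry point or is referenced, directly or indirectly, from it".

```python
from collections import deque

def reachable_processors(entry_point_id, next_ids):
    """Return the ids of the processors that can run from the entry point.

    next_ids maps a processor id to the ids of the processors it feeds
    (a hypothetical model of the pipeline configuration).
    """
    seen = {entry_point_id}
    queue = deque([entry_point_id])
    while queue:
        current = queue.popleft()
        for following in next_ids.get(current, []):
            if following not in seen:
                seen.add(following)
                queue.append(following)
    return seen

# Hypothetical pipeline: a split processor feeds an XPath processor.
pipeline = {
    "split": ["titles"],
    "titles": [],
    "orphan": [],  # neither the entry point nor referenced: never runs
}

runnable = reachable_processors("split", pipeline)
never_run = sorted(set(pipeline) - runnable)
print(never_run)  # prints ['orphan']
```

A check of this kind can help spot processors that were configured but are not wired into the chain.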

Configuring the PAPI Document Processor

This processor extracts data from documents that look like:

<PAPI_document>
<PAPI_meta name="name">value</PAPI_meta>

</PAPI_document>

1. Expand PAPI document processors, and click Add item.

2. In Processor's id, specify the identifier of the processor that must process documents.

3. Click Apply.
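To make the expected input format concrete, here is a minimal sketch of the kind of extraction described above, written with the Python standard library. It only illustrates the document shape; the function name and the sample values are assumptions, and the connector's own logic may differ.

```python
import xml.etree.ElementTree as ET

def extract_papi_metas(xml_text):
    """Return (name, value) pairs for each PAPI_meta child of PAPI_document."""
    root = ET.fromstring(xml_text)
    if root.tag != "PAPI_document":
        raise ValueError("expected a <PAPI_document> root element")
    return [(meta.get("name"), meta.text or "") for meta in root.findall("PAPI_meta")]

# Sample values are made up for illustration.
sample = """<PAPI_document>
<PAPI_meta name="title">Annual report</PAPI_meta>
<PAPI_meta name="author">J. Smith</PAPI_meta>
</PAPI_document>"""

metas = extract_papi_metas(sample)
print(metas)
```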

Configuring the XML Element Processor

This processor selects nodes (or attributes) by name and pushes their values as metas. The connector selects nodes and attributes according to an include list and an exclude list.

The following rules apply:

• Children of included nodes are pushed as separate metas.

• Children of excluded nodes are completely ignored (as well as their attributes).

1. Expand XML element processor, and click Add item.

2. Select Concatenate text element to prevent the text from being split when the XML code contains predefined character entities (such as &amp;, &gt;, and &quot;).

For example, if you index the tag <THEME_FR>Global Procurement &amp; Supply chain</THEME_FR> without the Concatenate text element option, then instead of a single metadata, THEME_FR = Global Procurement & Supply chain, you get three:

◦ THEME_FR = Global Procurement

◦ THEME_FR = &

◦ THEME_FR = Supply chain

3. In Processor's id, specify the identifier of the processor that must process documents. This id must be identical to the Entry point id.

4. In Include elements, specify element mappings:

a. Element: Name of the XML element to extract. To extract an attribute instead, use the form node@attribute, for example book@title.

b. Meta: Name of the meta associated with the node content. If you do not specify a meta, the connector infers the meta name from the element or attribute name. For example, if Element is set to book, the node text is pushed to the book meta; if Element is set to book@title, the attribute text is pushed to the title meta.

5. Optionally, in Exclude elements, you can specify a list of node names that must be skipped by the processor.

6. Click Apply.
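The Element and Meta mapping from step 4 and the exclude list from step 5 can be sketched as follows. This is one interpretation of the rules, not the connector's code: the function names and the sample document are hypothetical, an excluded node is skipped together with its whole subtree, and the rule about children of included nodes is not modeled.

```python
import xml.etree.ElementTree as ET

def meta_name(element_spec, meta=None):
    """An explicit meta wins; otherwise use the attribute name for
    node@attribute, or the element name for a plain element."""
    if meta:
        return meta
    node, _, attribute = element_spec.partition("@")
    return attribute or node

def extract(xml_text, include, exclude=()):
    """include: list of (element_spec, meta_or_None); exclude: node names to skip."""
    metas = []

    def walk(node):
        if node.tag in exclude:
            return  # skip the node, its attributes, and its subtree
        for spec, meta in include:
            name, _, attribute = spec.partition("@")
            if name != node.tag:
                continue
            if attribute:
                if attribute in node.attrib:
                    metas.append((meta_name(spec, meta), node.attrib[attribute]))
            elif node.text and node.text.strip():
                metas.append((meta_name(spec, meta), node.text.strip()))
        for child in node:
            walk(child)

    walk(ET.fromstring(xml_text))
    return metas

# Made-up sample document.
sample = """<library>
<book title="Dune"><summary>Desert planet</summary></book>
<draft><book title="Unfinished"/></draft>
</library>"""

metas = extract(
    sample,
    include=[("book@title", None), ("summary", "abstract")],
    exclude=["draft"],
)
print(metas)
```

Here book@title has no explicit meta, so the value lands in a meta named title, while summary is explicitly mapped to abstract. The book under draft is ignored because draft is excluded.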

Configuring the XPath Processor

This processor associates results of XPath selection to metas.

1. Expand XPath processors, and click Add item.

2. In Processor's id, specify the identifier of the processor that must process documents. This id must be identical to the Entry point id.

3. In XPath, specify XPath mappings:

a. XPath: XPath expression to evaluate.

b. Meta: Name of the meta associated with the XPath results.

4. Click Apply.
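As an illustration of XPath-to-meta mappings, the sketch below uses the limited XPath subset of Python's xml.etree.ElementTree, which expects paths relative to the root element (./title rather than /book/title). The function name, the sample document, and the meta names are assumptions; the connector's XPath engine is not described here and may support more.

```python
import xml.etree.ElementTree as ET

def apply_xpath_mappings(xml_text, mappings):
    """mappings: list of (xpath, meta). Each matching node yields one meta."""
    root = ET.fromstring(xml_text)
    metas = []
    for xpath, meta in mappings:
        for node in root.findall(xpath):
            metas.append((meta, (node.text or "").strip()))
    return metas

# Made-up sample document.
sample = "<book><title>Sample title</title><tag>fiction</tag><tag>classic</tag></book>"

metas = apply_xpath_mappings(sample, [("./title", "title"), ("./tag", "keyword")])
print(metas)
```

An expression that matches several nodes produces several values for the same meta, as with keyword in this sample.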

Configuring the XSLT Processor

This processor performs XSL transformations on XML documents and passes the result to the processors that follow it.

1. Expand XSLT processors, and click Add item.

2. In Stylesheet, specify the file path of the stylesheet file.

3. In Processor's id, specify the identifier of the processor that must process documents. This id must be identical to the Entry point id.

4. In Next processor’s id, specify the list of processor identifiers to run with the transformed XML.

5. Click Apply.

Configuring the Tee Processor

This processor relays processing events to a number of following processors. No data extraction is performed on the XML document.

1. Expand Tee processors, and click Add item.

2. In Processor's id, specify the identifier of the processor that must process documents. This id must be identical to the Entry point id.

3. In Following processor ids, specify the list of the next processor identifiers to run.

4. Click Apply.
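Conceptually, a tee is a fan-out: the same document is handed to every following processor, and the tee itself extracts nothing. The sketch below models processors as plain Python functions keyed by id; these names are hypothetical and do not reflect the connector's API.

```python
import xml.etree.ElementTree as ET

def make_tee(following):
    """following: dict of processor id -> callable taking the XML text."""
    def tee(xml_text):
        # Relay the unchanged document to each following processor.
        return {pid: process(xml_text) for pid, process in following.items()}
    return tee

def titles(xml_text):
    return [node.text for node in ET.fromstring(xml_text).iter("title")]

def authors(xml_text):
    return [node.text for node in ET.fromstring(xml_text).iter("author")]

tee = make_tee({"titles": titles, "authors": authors})
results = tee("<book><title>Sample title</title><author>Sample author</author></book>")
print(results)
```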

Configuring the XML Attach Processor

This processor dumps the current XML document and adds it as a part or a meta. You can insert this processor anywhere in the pipeline. It relays processing events to the following processor.

1. Expand XML attach processors, and click Add item.

2. In Next processor's id, specify the identifier of the next processor in the pipeline.

3. In Part name, specify the name of the part that will include the current document.

4. In Meta name, specify the name of the meta that will include the current document.

5. In Processor's id, specify the identifier of the processor that must process documents. This id must be identical to the Entry point id.

6. Click Apply.
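The behavior can be sketched as "serialize, attach, relay". In this sketch the record dictionary, the function names, and the part name raw_xml are all hypothetical stand-ins for whatever the connector uses internally.

```python
import xml.etree.ElementTree as ET

def attach_xml(node, record, next_processor, part_name=None, meta_name=None):
    """Dump the current XML node, attach it as a part and/or a meta,
    then relay the node to the next processor."""
    dumped = ET.tostring(node, encoding="unicode")
    if part_name:
        record["parts"][part_name] = dumped
    if meta_name:
        record["metas"].append((meta_name, dumped))
    return next_processor(node, record)

def push_title(node, record):
    """Hypothetical next processor: pushes the title as a meta."""
    record["metas"].append(("title", node.findtext("title")))
    return record

record = {"parts": {}, "metas": []}
book = ET.fromstring("<book><title>Sample title</title></book>")
result = attach_xml(book, record, push_title, part_name="raw_xml")
print(result)
```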

Configuring the Child Split Processor

This processor splits XML documents into several chunks. Every chunk is then handled as a separate document by a child processor. Every first-level child of the XML document is extracted.

For example, if you want to index all books from the following file:

<books>
<book><title>title..</title></book>
<book><title>title..</title></book>
</books>

You can use a Child split processor to extract every <book/> node, and use an XPath processor to extract the title with the XPath expression /book/title.

1. Expand Child split processors, and click Add item.

2. In Next processor's id, specify the identifier of the next processor in the pipeline that must process chunks.

3. In Processor's id, specify the identifier of the processor that must process documents. This id must be identical to the Entry point id.

4. Click Apply.
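The books example can be reproduced outside the connector with a few lines of Python, as a rough model of what "split, then delegate" means. The function names and sample titles are assumptions. Because ElementTree evaluates paths relative to the chunk's root element, the expression /book/title from the text becomes simply title here.

```python
import xml.etree.ElementTree as ET

def child_split(xml_text, next_processor):
    """Hand every first-level child to next_processor as its own document."""
    root = ET.fromstring(xml_text)
    results = []
    for child in root:
        chunk = ET.tostring(child, encoding="unicode")
        results.append(next_processor(chunk))
    return results

def title_processor(chunk):
    """Hypothetical child processor: the chunk root is <book>."""
    book = ET.fromstring(chunk)
    return book.findtext("title")

books = """<books>
<book><title>First title</title></book>
<book><title>Second title</title></book>
</books>"""

titles = child_split(books, title_processor)
print(titles)
```

Each chunk is parsed on its own, so the child processor never sees the enclosing <books> element or the sibling chunks.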

Configuring the XPath Split Processor

This processor splits XML documents into several chunks according to a small subset of an XPath expression.

Your XPath expression must only include direct child selection. For example, /a/b/c selects the <c> node in the document <a><b><c></c></b></a>.

1. Expand XPath split processors, and click Add item.

2. In XPath, specify a simple XPath expression.

3. In Next processor's id, specify the identifier of the next processor that must process chunks.

4. In Processor's id, specify the identifier of the processor that must process documents. This id must be identical to the Entry point id.

5. Click Apply.
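The "direct child selection only" restriction can be illustrated with a small path walker. This is a sketch under the assumption that a simple expression is just a slash-separated list of element names starting at the root; the function names are hypothetical and the connector may accept a somewhat different subset.

```python
import xml.etree.ElementTree as ET

def xpath_split(xml_text, simple_path, next_processor):
    """Select nodes with a path made only of direct child steps (e.g. /a/b/c)
    and hand each selected node to next_processor as a separate chunk."""
    steps = [step for step in simple_path.split("/") if step]
    root = ET.fromstring(xml_text)
    if not steps or root.tag != steps[0]:
        return []
    current = [root]
    for name in steps[1:]:
        # Keep only direct children with the requested name.
        current = [child for node in current for child in node if child.tag == name]
    return [next_processor(ET.tostring(node, encoding="unicode")) for node in current]

def text_of(chunk):
    """Hypothetical chunk processor: returns the text of the chunk root."""
    return ET.fromstring(chunk).text

doc = "<a><b><c>one</c><c>two</c></b><c>not selected</c></a>"
chunks = xpath_split(doc, "/a/b/c", text_of)
print(chunks)
```

The <c> element directly under <a> is not selected, because /a/b/c only follows the a, then b, then c child steps.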

Configuring the Custom Processor

This processor uses user-supplied Java classes to handle XML documents.

For instructions on how to write such processors, see Develop Regular and Split Processors.

1. Expand Custom processors, and click Add item.

2. In Java class, specify the Java class of your processor.

3. In Processor's id, specify the identifier of the processor that must process documents. This id must be identical to the Entry point id.

4. Optionally, in Custom parameters, specify a list of key-value pairs to pass to your processor as additional configuration.

5. Click Apply.