Configuring and Using Similarity Measures
 
Configure the Index for Similarity Queries
Use the #attrsimilar Function in the Search API
Code Samples to Create Similarity Query Prefix Handlers
The #attrsimilar function calculates similarity between a given vector and vectors in the index. For example, you can use it to detect 3D parts with similar shape or size.
#attrsimilar is a query node in the index, which returns all the documents matching the similarity query and calculates the similarity measure. As it does not filter search results at all, you must combine it with a #filter to return only the documents having a similarity higher than a given threshold value.
Note: Similarity is the complement of distance and is calculated as follows: similarity = 1 - distance
Important: The standard way to use #attrsimilar is inside a query template. See Defining Query Templates.
Configure the Index for Similarity Queries
This section describes how to index and process signature values to be able to enter similarity queries in the Search API and calculate similarity measures.
Configure the Data Model and the Data Processing
The following procedure explains how to store a signature in an index field represented by the SIGNATURE_INDEX_FIELD variable.
Note: If you need to store multiple signatures, use a dynamic field. To do so, follow step 2 and in the field Advanced options, select the Multivalued and Store meta names properties.
1. In the Administration Console, go to Index > Data Model > Advanced Schema.
2. Add a SIGNATURE_INDEX_FIELD to store signature values.
a. Click Add field.
b. Enter a name, for example, my_signature_bin and set the type to Binary.
c. Set the new field as RAM-based for performance reasons.
3. Go to Index > Data Processing > Analysis pipeline > Document Processors.
4. Add a SimilarStringToPart document processor for part conversion to the pipeline, and in Input from, enter the name of the SIGNATURE_INPUT_META containing all values of the signature vector, for example, my_signature_meta.
This document processor can:
Parse signature values and convert them into a binary blob ready for use by the index.
Delete the meta to create a part with the same name.
5. In the Mappings tab, create the mapping between the SIGNATURE_INPUT_META and the SIGNATURE_INDEX_FIELD.
a. Add a mapping source. Give it a name, for example, my_signature_meta and set its type to Part.
b. Add the SIGNATURE_INDEX_FIELD as mapping target. For example, target the my_signature_bin index field.
6. Click Apply.
Test the Configuration
1. Go to the API Console to push a test document.
a. In URI, enter a document name, for example, doctest.
b. In Metas, add your SIGNATURE_INPUT_META in the Name column and a space-separated list of floats in the Value column.
For example, Name = my_signature_meta, Value = 0.458 -1.68 2
c. Click Push document.
The result must be "The document was successfully pushed."
2. Open the Search API and test the #attrsimilar function with the following syntax:
http://HOSTNAME:BASEPORT+10/search-api/search?eq=%23attrsimilar{name=SIGNATURE_SCORE_OUTPUT}(SIGNATURE_INDEX_FIELD, SIGNATURE_FLOAT_VALUES)&hit_meta.SIGNATURE_OUTPUT_META_NAME.expr=@SIGNATURE_SCORE_OUTPUT.value
In this example:
The SIGNATURE_FLOAT_VALUES variable is set to the float values 0 2 3 (you do not need to surround these values with double quotes "").
The SIGNATURE_SCORE_OUTPUT.value returns a numerical value, which is the similarity score calculated by the similarity function.
Each hit in the response then carries the score in the SIGNATURE_OUTPUT_META_NAME meta, for example:
<metas>
<Meta name="url">
<MetaString name="value">doctest</MetaString>
</Meta>
<Meta name="SIGNATURE_OUTPUT_META_NAME">
<MetaString name="value">0.4833253026008606</MetaString>
</Meta>
</metas>
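The test URL above can also be assembled programmatically. The following is a minimal sketch, not an official client: the class name, the method names, the host and port (localhost:10010) and the meta and field names are illustrative assumptions. It only shows how the signature values and the #attrsimilar expression might be URL-encoded with the standard java.net.URLEncoder.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a Search API test URL for #attrsimilar.
public class SimilarityUrlSketch {

    /** Joins the vector components with single spaces, the format used for signature values. */
    public static String formatSignature(double[] values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(values[i]);
        }
        return sb.toString();
    }

    /** URL-encodes a query parameter value (# becomes %23, spaces become +). */
    public static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Builds the search URL with the eq parameter and one hit meta exposing the score. */
    public static String buildUrl(String searchUrl, String indexField, String scoreName,
                                  String outputMeta, double[] signature) {
        String eq = "#attrsimilar{name=" + scoreName + "}(" + indexField + ", "
                + formatSignature(signature) + ")";
        String expr = "@" + scoreName + ".value";
        return searchUrl + "?eq=" + encode(eq)
                + "&hit_meta." + outputMeta + ".expr=" + encode(expr);
    }

    public static void main(String[] args) {
        // Assumed host and port; replace with your own HOSTNAME and BASEPORT+10.
        System.out.println(buildUrl("http://localhost:10010/search-api/search",
                "my_signature_bin", "sim", "similarity", new double[] {0, 2, 3}));
    }
}
```

Encoding the whole expression avoids having to remember which characters (#, braces, spaces) need escaping when the URL is not typed by hand.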
For more details about the use of #attrsimilar in the Search API, see the following section.
Use the #attrsimilar Function in the Search API
This section describes how to use the #attrsimilar function in Search API queries, that is, in the part of the URL that follows http://HOSTNAME:BASEPORT+10/search-api/search?eq=%23. Because %23 is the URL-encoded form of #, do not repeat the # before attrsimilar in the URL.
#attrsimilar Syntax
You can call the #attrsimilar function in a query using the following Search API syntax:
#attrsimilar{name=SIGNATURE_SCORE_OUTPUT}(SIGNATURE_INDEX_FIELD,SIGNATURE_FLOAT_VALUES)
Where:
The SIGNATURE_FLOAT_VALUES (for example, 0 2 3) is compared with all the signatures stored in the SIGNATURE_INDEX_FIELD.
SIGNATURE_SCORE_OUTPUT is the name of the ranking key that stores the similarity measure.
This value can be:
Displayed in all hit metas with this meta value: &hit_meta.SIGNATURE_OUTPUT_META_NAME.expr=@SIGNATURE_SCORE_OUTPUT.value.
Used as a sorting key to display best values first: &s.SIGNATURE_SORT_KEY_NAME.expr=@SIGNATURE_SCORE_OUTPUT.value&s=desc(SIGNATURE_SORT_KEY_NAME).
To use #attrsimilar with a dynamic field containing multiple signature values, use the following syntax:
#attrsimilar{...}(MULTICONTEXT_INDEX_FIELD, "context_signature_1", SIGNATURE_FLOAT_VALUES)
Where:
The MULTICONTEXT_INDEX_FIELD variable corresponds to the dynamic field name containing the signatures.
context_signature_1 is the name of a context in this dynamic field.
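The syntax variants described above (single field, dynamic field with a context, hit meta, and sort key) can be generated by a small string builder. This is only a sketch under assumptions: the class and method names are hypothetical, and the field, context, and key names in the usage example are placeholders. It produces query text only and does not call CloudView.

```java
// Hypothetical helper: assembles #attrsimilar expressions and the related URL parameters.
public class AttrSimilarExpression {

    /**
     * Builds the #attrsimilar expression.
     * Pass context = null for a plain binary field, or a context name for a dynamic field.
     */
    public static String attrSimilar(String scoreName, String indexField,
                                     String context, String signatureValues) {
        StringBuilder sb = new StringBuilder("#attrsimilar{name=");
        sb.append(scoreName).append("}(").append(indexField).append(',');
        if (context != null) {
            // The context name is quoted, as in the multi-context syntax.
            sb.append('"').append(context).append("\",");
        }
        sb.append(signatureValues).append(')');
        return sb.toString();
    }

    /** Parameter that exposes the score in a hit meta. */
    public static String hitMetaParam(String outputMeta, String scoreName) {
        return "&hit_meta." + outputMeta + ".expr=@" + scoreName + ".value";
    }

    /** Parameters that sort hits by descending score. */
    public static String sortParams(String sortKey, String scoreName) {
        return "&s." + sortKey + ".expr=@" + scoreName + ".value&s=desc(" + sortKey + ")";
    }

    public static void main(String[] args) {
        System.out.println(attrSimilar("sim", "my_signature_bin", null, "0 2 3"));
        System.out.println(attrSimilar("sim", "my_signatures", "context_signature_1", "0 2 3"));
        System.out.println(hitMetaParam("similarity", "sim") + sortParams("by_sim", "sim"));
    }
}
```

The returned strings are not URL-encoded; encode them before placing them in a request URL.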
Similarity Functions
The similarity measure varies depending on the function used to compare vectors two by two.
Important: With most similarity functions, it is not possible to compare two vectors that do not have the same size. In that case, indexed documents whose signature vector does not have the same size as the query vector are not returned by the #attrsimilar node.
To choose a function, use the following syntax:
#attrsimilar{name=SIGNATURE_SCORE_OUTPUT,
function=euclidian_normed}(SIGNATURE_INDEX_FIELD,SIGNATURE_FLOAT_VALUES)
Similarity is calculated as follows: similarity = 1 - distance. For all _normed functions, we can summarize the calculation as:
similarity = 1 <--> close; similarity = 0 <--> far
dist = 1 <--> far; dist = 0 <--> close
For non-normed similarity functions (for example manhattan_dist or euclidian_dist), the calculation is identical but the distance range changes from [0;1] to [0;Infinity] and the similarity range becomes [-Infinity;1].
The cosine similarity function is the exception, with bounds of -1 (dissimilar) and 1 (similar). The angular similarity function brings cosine similarity into the range 0 to 1, which makes it consistent with the other similarity functions.
Function
Use
manhattan (default function)
For L1-normalized vectors.
Formula: sim = 1 - (Sum{abs(x1[i] - x2[i])}/2)
The similarity is between 0 and 1.
manhattan_normed
Same as manhattan, but the vectors are L1-normalized first.
Formula: sim = 1 - (Sum{abs(x1[i]/NormL1(x1) - x2[i]/NormL1(x2))}/2), NormL1(x)=sum_i{abs(x[i])}
The similarity is between 0 and 1.
manhattan_dist
For any vectors.
Formula: dist = Sum {abs(x1[i] - x2[i])}
The distance is between 0 and infinity.
multi_manhattan_normed
Compares 2 sets of vectors having the same dimension.
For example, 2 vectors of 8 floats and 3 vectors of 8 floats, using the exclusive min between all MANHATTAN_NORMED distances.
The similarity is between 0 and 1.
euclidian
For L2-normalized vectors.
Formula: sim = 1 - sqrt((Sum_i{(x1[i]-x2[i])^2})/2)
The similarity is between 0 and 1.
euclidian_normed
Same as euclidian, but the vectors are L2-normalized first.
Formula: sim = 1 - sqrt((Sum_i{(x1[i]/NormL2(x1) - x2[i]/NormL2(x2))^2})/2), NormL2(x)=sqrt(sum_i{x[i]^2})
The similarity is between 0 and 1.
euclidian_dist
For any vectors.
Formula: dist = sqrt(Sum_i{(x1[i]-x2[i])^2})
The distance is between 0 and infinity.
cosine
Angle between 2 vectors.
Formula: COSINE = (Sum {x1[i]*x2[i]/(NormL2(x1)*NormL2(x2))})
The similarity is between -1 and 1, where -1 is dissimilar and 1 is similar.
angular
Formula: arccos(COSINE) / PI
The similarity is between 0 and 1.
dice
For binary bit strings. It compares the bits set to 1 that two sequences have in common with the total number of bits set to 1 in both sequences.
Formula: D = (2*|X inter Y| / (|X| + |Y|))
The similarity is between 0 and 1.
jaccard
For binary bit strings. It divides the number of bits set to 1 in both sequences (intersection) by the number of bits set to 1 in at least one of them (union).
Formula: J = (|X inter Y| / (|X| + |Y| - |X inter Y|))
The similarity is between 0 and 1.
Note: jaccard is sometimes called TANIMOTO.
hamming
For binary bit strings. It counts the bits set to 1 in the XOR of the two bit sequences.
Formula: H = 1 - (|XOR(X,Y)|/lenBit(X))
The distance is between 0 and length(vectors).
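To make the formulas in this table concrete, here is a minimal Java sketch that transcribes four of them (manhattan_normed, euclidian_dist, cosine, and dice). It is an illustration written from the formulas as printed, not the index's actual implementation. The class and method names are hypothetical, bit strings are assumed to fit in a long, and a size mismatch is reported with an exception, whereas the index simply does not return such documents.

```java
// Illustrative transcription of some formulas from the table; not the product implementation.
public class SimilaritySketch {

    /** NormL1(x) = sum_i{abs(x[i])} */
    static double normL1(double[] x) {
        double sum = 0;
        for (double v : x) {
            sum += Math.abs(v);
        }
        return sum;
    }

    /** NormL2(x) = sqrt(sum_i{x[i]^2}) */
    static double normL2(double[] x) {
        double sum = 0;
        for (double v : x) {
            sum += v * v;
        }
        return Math.sqrt(sum);
    }

    static void requireSameSize(double[] x1, double[] x2) {
        if (x1.length != x2.length) {
            throw new IllegalArgumentException("Vectors must have the same size");
        }
    }

    /** manhattan_normed: sim = 1 - Sum{abs(x1[i]/NormL1(x1) - x2[i]/NormL1(x2))}/2 */
    public static double manhattanNormed(double[] x1, double[] x2) {
        requireSameSize(x1, x2);
        double n1 = normL1(x1);
        double n2 = normL1(x2);
        double sum = 0;
        for (int i = 0; i < x1.length; i++) {
            sum += Math.abs(x1[i] / n1 - x2[i] / n2);
        }
        return 1 - sum / 2; // NaN if one vector is all zeros
    }

    /** euclidian_dist: dist = sqrt(Sum_i{(x1[i]-x2[i])^2}) */
    public static double euclidianDist(double[] x1, double[] x2) {
        requireSameSize(x1, x2);
        double sum = 0;
        for (int i = 0; i < x1.length; i++) {
            double d = x1[i] - x2[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** cosine: Sum{x1[i]*x2[i]} / (NormL2(x1)*NormL2(x2)) */
    public static double cosine(double[] x1, double[] x2) {
        requireSameSize(x1, x2);
        double dot = 0;
        for (int i = 0; i < x1.length; i++) {
            dot += x1[i] * x2[i];
        }
        return dot / (normL2(x1) * normL2(x2));
    }

    /** dice on bit strings packed in a long: D = 2*|X inter Y| / (|X| + |Y|) */
    public static double dice(long x, long y) {
        return 2.0 * Long.bitCount(x & y) / (Long.bitCount(x) + Long.bitCount(y));
    }

    public static void main(String[] args) {
        double[] a = {0.458, 1.68, 2};
        double[] b = {0, 2, 3};
        System.out.println("manhattan_normed = " + manhattanNormed(a, b));
        System.out.println("euclidian_dist   = " + euclidianDist(a, b));
        System.out.println("cosine           = " + cosine(a, b));
    }
}
```

Computing a score offline in this way can help choose a reasonable threshold before filtering on the score in a query.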
Combine #attrsimilar with a Filter
To combine #attrsimilar with a filter, use the following syntax:
#filter("@SIGNATURE_SCORE_OUTPUT.value>SIGNATURE_SCORE_THRESHOLD",
#attrsimilar{name=SIGNATURE_SCORE_OUTPUT,function=euclidian_normed}
(SIGNATURE_INDEX_FIELD,SIGNATURE_FLOAT_VALUES))
This syntax allows you to keep only the documents with a similarity measure higher than (>) the SIGNATURE_SCORE_THRESHOLD. For example, you could use a float value like 0.55.
You can also combine several signature computations in one #filter expression. For example:
#filter("@SIGNATURE_1_SCORE_OUTPUT.value>SIGNATURE_1_SCORE_THRESHOLD&&
@SIGNATURE_2_SCORE_OUTPUT.value>SIGNATURE_2_SCORE_THRESHOLD",
#and(#attrsimilar{name=SIGNATURE_1_SCORE_OUTPUT,function=euclidian_normed}
(SIGNATURE_1_INDEX_FIELD,SIGNATURE_1_FLOAT_VALUES),
#attrsimilar{name=SIGNATURE_2_SCORE_OUTPUT,function=euclidian_normed}
(SIGNATURE_2_INDEX_FIELD,SIGNATURE_2_FLOAT_VALUES)))
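Because these nested expressions are easy to unbalance when typed by hand, it can help to generate them. The sketch below is one possible way to do so. The class name, method names, and the sample field and score names are hypothetical, and the code only builds the query text for a single filter or for several signatures combined with #and.

```java
// Hypothetical helper: builds #filter expressions around #attrsimilar nodes.
public class FilterExpression {

    /** Builds one #attrsimilar node with an explicit similarity function. */
    public static String attrSimilar(String scoreName, String function,
                                     String indexField, String signatureValues) {
        return "#attrsimilar{name=" + scoreName + ",function=" + function + "}("
                + indexField + "," + signatureValues + ")";
    }

    /** Keeps only documents whose score is strictly higher than the threshold. */
    public static String filter(String scoreName, double threshold, String node) {
        return "#filter(\"@" + scoreName + ".value>" + threshold + "\"," + node + ")";
    }

    /** Combines several scored nodes: every score must exceed its own threshold. */
    public static String filterAll(String[] scoreNames, double[] thresholds, String[] nodes) {
        if (scoreNames.length != thresholds.length || scoreNames.length != nodes.length) {
            throw new IllegalArgumentException("Arrays must have the same length");
        }
        StringBuilder condition = new StringBuilder();
        StringBuilder and = new StringBuilder("#and(");
        for (int i = 0; i < scoreNames.length; i++) {
            if (i > 0) {
                condition.append("&&");
                and.append(',');
            }
            condition.append('@').append(scoreNames[i]).append(".value>").append(thresholds[i]);
            and.append(nodes[i]);
        }
        and.append(')');
        return "#filter(\"" + condition + "\"," + and + ")";
    }

    public static void main(String[] args) {
        String shape = attrSimilar("s1", "euclidian_normed", "shape_bin", "0 2 3");
        String size = attrSimilar("s2", "euclidian_normed", "size_bin", "1 1 1");
        System.out.println(filter("s1", 0.55, shape));
        System.out.println(filterAll(new String[] {"s1", "s2"},
                new double[] {0.5, 0.7}, new String[] {shape, size}));
    }
}
```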
Code Samples to Create Similarity Query Prefix Handlers
The standard use of #attrsimilar is inside a query template using the ELLQL language. Advanced Exalead CloudView users who want to manage similarity queries in UQL can instead create a custom similarity query prefix handler.
To do so, adapt the following code samples to your use case and package your custom prefix handler as a CVPlugin. For more information, see the Exalead CloudView Programmer's Guide.
Code for simple attrsimilar prefix handler (SimpleAttrSimilarPrefixHandler.java)
package com.exalead.example.search;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.log4j.Logger;

import com.exalead.mercury.component.CVComponent;
import com.exalead.mercury.component.config.CVComponentConfigClass;
import com.exalead.search.query.QueryContext;
import com.exalead.search.query.QueryProcessingException;
import com.exalead.search.query.node.AttrSimilar;
import com.exalead.search.query.node.If;
import com.exalead.search.query.node.IndexOptions;
import com.exalead.search.query.node.Node;
import com.exalead.search.query.node.NodeVisitor;
import com.exalead.search.query.node.PrefixNode;
import com.exalead.search.query.node.UserQueryChunk;
import com.exalead.search.query.prefix.CustomPrefixHandler;
import com.exalead.search.query.util.LongOrDouble;

@CVComponentConfigClass(configClass=SimpleAttrSimilarPrefixHandlerConfig.class)
public class SimpleAttrSimilarPrefixHandler extends CustomPrefixHandler implements
        CVComponent {
    private static final Logger log = Logger.getLogger(SimpleAttrSimilarPrefixHandler.class);
    private final SimpleAttrSimilarPrefixHandlerConfig config;

    public SimpleAttrSimilarPrefixHandler(SimpleAttrSimilarPrefixHandlerConfig config) {
        super(config);
        this.config = config;
    }

    @Override
    public Node handlePrefix(Phase phase, PrefixNode node,
            NodeVisitor parentVisitor, QueryContext queryContext)
            throws QueryProcessingException {
        if (phase == Phase.POST_PARSE) {
            if (node.content instanceof UserQueryChunk) {
                UserQueryChunk uqc = (UserQueryChunk) node.content;
                // Expected prefix value: "<score name>,<space-separated signature values>"
                String[] tokens = uqc.value.split(",");
                if (tokens.length < 2) {
                    throw new QueryProcessingException("Expected <score name>,<signature values>");
                }
                IndexOptions options = new IndexOptions();
                LongOrDouble filterValue = null;

                String signatureField = config.getIndexField();
                String signatureContext = null;
                String function = config.getDistance();

                // Parse the node options; they override the configured values
                if (uqc.indexOptions != null && uqc.indexOptions.getRawOptions() != null) {
                    for (Map.Entry<String, String> entry : uqc.indexOptions.getRawOptions().entrySet()) {
                        if ("filter_value=".equals(entry.getKey())) {
                            filterValue = new LongOrDouble(Double.parseDouble(entry.getValue()));
                        } else if ("index_field=".equals(entry.getKey())) {
                            signatureField = entry.getValue();
                        } else if ("function=".equals(entry.getKey())) {
                            function = entry.getValue();
                        } else {
                            options.addRawOptions(entry.getKey(), entry.getValue());
                        }
                    }
                }

                // Fall back to the configured threshold when the query does not set one
                if (filterValue == null && config.getFilterValue() != null) {
                    filterValue = new LongOrDouble(config.getFilterValue());
                }
                queryContext.query.hitOrder.clone();
                options.addRawOptions("function=", function);

                String vfName = tokens[0];
                List<LongOrDouble> signature = parseSignature(tokens[1]);

                // Dynamic field syntax: signatureName@indexFieldName
                if (signatureField != null && signatureField.contains("@")) {
                    String[] signatureFieldTokens = signatureField.split("@");
                    signatureField = signatureFieldTokens[1];
                    signatureContext = signatureFieldTokens[0];
                }

                return createAttrSimilarNode(vfName, options, signatureField, signatureContext,
                        signature, filterValue);
            }
        }
        return node;
    }

    private Node createAttrSimilarNode(String vfName, IndexOptions options,
            String signatureField, String signatureContext, List<LongOrDouble> signature,
            LongOrDouble filterValue) throws QueryProcessingException {
        Node res = null;
        options.addRawOptions("name=", vfName);
        if (signatureField == null) {
            log.error("Missing signature field config or option");
            throw new QueryProcessingException("Missing signature field config or option");
        }
        // The #attrsimilar node
        res = new AttrSimilar(signatureField, signatureContext, signature, null, null, null, options);

        if (filterValue != null) {
            // Wrap the #attrsimilar node in the surrounding filter node
            String filter = "@" + vfName + ".value>=" + filterValue.toString();
            IndexOptions opts = new IndexOptions();
            res = new If(res, filter, opts);
        }
        return res;
    }

    /**
     * @param signature a whitespace-separated list of doubles
     * @return the List<LongOrDouble> containing the signature
     */
    private List<LongOrDouble> parseSignature(String signature) {
        // Split on any run of whitespace so that repeated spaces do not yield empty tokens
        String[] tokens = signature.trim().split("\\s+");
        List<LongOrDouble> res = new ArrayList<LongOrDouble>(tokens.length);
        for (int i = 0; i < tokens.length; i++) {
            res.add(new LongOrDouble(Double.parseDouble(tokens[i])));
        }
        return res;
    }

}
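The string handling in this handler can be tried outside CloudView. The following standalone sketch mirrors only the parsing steps (splitting the prefix value on the comma, splitting signatureName@indexFieldName, and parsing the signature) using plain JDK types. The class and method names are hypothetical and none of the CloudView classes are involved, so it says nothing about how the resulting node is evaluated.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone illustration of the handler's parsing logic; uses no CloudView classes.
public class PrefixValueParsingSketch {

    /** Returns {context, indexField}; context is null when the field is not a dynamic field. */
    public static String[] splitField(String field) {
        if (field != null && field.contains("@")) {
            String[] tokens = field.split("@");
            return new String[] {tokens[0], tokens[1]};
        }
        return new String[] {null, field};
    }

    /** Parses a whitespace-separated list of doubles. */
    public static List<Double> parseSignature(String signature) {
        List<Double> res = new ArrayList<>();
        for (String token : signature.trim().split("\\s+")) {
            // Throws NumberFormatException on an empty or malformed token
            res.add(Double.parseDouble(token));
        }
        return res;
    }

    public static void main(String[] args) {
        // Assumed prefix value: "<score name>,<signature values>"
        String[] tokens = "sim,0.458 -1.68 2".split(",");
        System.out.println("score name = " + tokens[0]);
        System.out.println("signature  = " + parseSignature(tokens[1]));

        String[] field = splitField("context_signature_1@my_signatures");
        System.out.println("context = " + field[0] + ", field = " + field[1]);
    }
}
```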
Code for simple attrsimilar prefix handler configuration (SimpleAttrSimilarPrefixHandlerConfig.java)
package com.exalead.example.search;

import com.exalead.config.bean.IsHidden;
import com.exalead.config.bean.IsMandatory;
import com.exalead.config.bean.PropertyDescription;
import com.exalead.mercury.component.config.CVComponentConfig;
import com.exalead.search.query.util.LongOrDouble;

public class SimpleAttrSimilarPrefixHandlerConfig implements
        CVComponentConfig {

    public SimpleAttrSimilarPrefixHandlerConfig() {
    }

    private String indexField;
    private String distance = "manhattan_normed";
    private LongOrDouble filterValue;

    @IsMandatory(false)
    @PropertyDescription("The binary index field that contains the signatures. "
            + "If it's a dynamic field (multi valued and storing meta names) "
            + "use the following syntax: signatureName@indexFieldName. "
            + "This value can be overridden in query option \"index_field\".")
    public void setIndexField(String indexField) {
        this.indexField = indexField;
    }

    public String getIndexField() {
        return indexField;
    }

    @PropertyDescription("The distance function to use. "
            + "Some possible values: manhattan, manhattan_normed, euclidian, "
            + "euclidian_normed, cosine. "
            + "Normed versions of the distance must be used when the signatures "
            + "in the index have not been normed before indexing. "
            + "This value can be overridden in query option \"function\".")
    public void setDistance(String distance) {
        this.distance = distance;
    }

    public String getDistance() {
        return distance;
    }

    @IsMandatory(false)
    @PropertyDescription("The minimum similarity score that a hit must have "
            + "to match the query. "
            + "This value will generally be between 0 and 1. "
            + "It can be overridden in query option \"filter_value\". "
            + "If empty, there will be no filtering based on the score.")
    public void setFilterValue(Double filterValue) {
        this.filterValue = new LongOrDouble(filterValue);
    }

    public Double getFilterValue() {
        if (filterValue == null) {
            return null;
        }
        return filterValue.getDouble();
    }

    @IsHidden
    public LongOrDouble getLongOrDoubleFilterValue() {
        return filterValue;
    }

}