Advanced Configuration
 
Crawl Rules
Crawl Rule Actions
How Priorities Work
Error Handling
You can configure the Crawler connector using the Crawl Manager of the Exalead CloudView Management APIs. To perform advanced configuration of the Crawler connector, edit the <DATADIR>/config/CrawlConfig.xml file.
You can edit this XML file from the Administration Console by clicking the Edit as XML (Experts only) link.
Crawl Rules
Crawl rules perform Boolean matching on URL strings. The rules operate on any part of a URL (scheme, host, path, query, fragment), and match strings or regular expressions.
You can configure the behavior of the match to define whether it is case-sensitive, and whether it is left- or right-anchored. For example, a Length rule matches the length of a part of the URL against an integer range. Shortcuts are available, as shown in the table below.
Note: For all parts of the URL, the val attribute is a regular expression. You must escape the regular expression special characters ^$.*+?[](){}| with a backslash (\).
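For instance, a minimal sketch (the path shown is hypothetical) of a Path rule whose value contains literal parentheses, escaped with backslashes:

<Path val="/archive/\(old\)" />

Without the backslashes, the parentheses would be interpreted as a regexp group instead of literal characters.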
Rule category
Rule key
Example
Description and parameters
Boolean
And, Not, Or
<Or>...</Or>
Performs Boolean matching for the nested rules defined. Commonly used for Boolean matching on URL strings.
generic
Atom
Syntax:
<Atom field="" kind="" norm="" value="" />
Example:
<Atom field="path" kind="prefix" norm="none" value="/watch" />
This generic rule needs to define the type of match:
field defines the type of element to match. Possible values: url|scheme|host|path|query.
kind defines the part of the field to use for the match. Possible values: exact|prefix|suffix|inside|length, where:
inside: you specify a regexp and its anchoring in the value attribute.
length: you specify the length of a field ([:10], [11:12], [30:]) in the value attribute.
norm sets the normalization level. The default is a case-insensitive match, which corresponds to norm. Possible values: norm|lower|none.
value is a regexp that is matched against the links found during the crawl, based on the field and kind parameters.
shortcut
Domain
Exact match on domain. For example,
<Domain val="foo.com" />
Matches http://foo.com/ and http://bar.foo.com but not http://barfoo.com
This is a shortcut for the following combination of rules:
<Or><Atom field="host" kind="suffix" value=".foo.com" /><Atom field="host" kind="exact" value="foo.com" /></Or>
Path
<Path val="/cgi-bin" />
This rule is a shortcut for atom path-prefix. It is a left anchored match on the path.
Ext
<Ext val=".gif" />
This rule is a shortcut for atom path-suffix. It is a right anchored match on the path.
Host
<Host val="www.wikipedia.org" />
Performs an exact match on host.
This rule is a shortcut for atom host-exact.
Url
<Url val="http://en.wikipedia.org/wiki" />
This rule is a shortcut for atom url-exact.
Scheme
<Scheme val="http" />
This rule is a shortcut for atom scheme-exact. Possible values: http|https
Query
<Query val="q=foo" />
This rule is a shortcut for atom query-exact. It performs an exact match on the query.
InQuery
<InQuery val="q=foo" />
This rule is a shortcut for atom query-inside. It performs a non-anchored match on the query.
Length
<Length field="path" val="[30:]" /> matches URLs with a path length >= 30
This rule is a shortcut for atom field-length. It matches the length of the specified field against an integer range.
Note: All objects belong to the namespace exa:com.exalead.actionrules.v21.
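Shortcut rules can be combined with Boolean rules in the same way as Atom rules. As a sketch (the domain and extension values are hypothetical, and this assumes shortcut rules may be nested inside Boolean rules), the following combination matches URLs on foo.com except GIF images:

<And>
  <Domain val="foo.com" />
  <Not>
    <Ext val=".gif" />
  </Not>
</And>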
Crawl Rule Actions
The table below lists the various crawl rule actions that you can use with their corresponding XML tags.
For example, a crawl rule with an index and follow action is written as follows:
<Rules group="example" key="auto">
  <Rule>
    <ar:Atom litteral="true" value="http://www.example.com/" norm="none" kind="prefix" field="url"/>
    <Index/>
    <Follow/>
    <Accept/>
  </Rule>
</Rules>
Action
Adds XML tags
Index and follow
<Index/>
<Follow/>
<Accept/>
Index and don’t follow
<Index/>
<NoFollow/>
<Accept/>
Follow but don’t index
<NoIndex/>
<Follow/>
<Accept/>
Index
<Index/>
<Accept/>
Follow
<Follow/>
<Accept/>
Don’t index
<NoIndex/>
Don’t follow
<NoFollow/>
Ignore
<NoIndex/>
<NoFollow/>
<Ignore/>
Source
<Source name=""/>
Add meta
<AddMeta value="" name=""/>
Priority
<Priority shift=""/>
Possible values:
-2 = Highest
-1 = Higher
0 = Normal
1 = Lower
2 = Lowest
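As an illustrative sketch (the group name, path prefix, source name, and meta name/value are hypothetical), a rule could combine the Source and AddMeta actions with the index and follow tags to tag documents from a given path and route them to a dedicated source:

<Rules group="example" key="auto">
  <Rule>
    <ar:Atom field="path" kind="prefix" norm="none" value="/products/"/>
    <Index/>
    <Follow/>
    <Accept/>
    <Source name="products"/>
    <AddMeta name="section" value="products"/>
  </Rule>
</Rules>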
How Priorities Work
When Smart refresh is enabled, the crawler scheduler may contain up to 6 URL sources: 5 fifos and 1 refresh source. If Smart refresh is disabled, you can use the Exalead CloudView scheduler to refresh sources at a specific time; URLs are then sent to the fifo: index source.
URL source
Number
Content
Default weight (priority)
fifo: user
0
Only user-submitted root URLs with priority 0, and roots with default priority.
10000
fifo: redir
1
Targets of redirections.
2000
fifo: index
2
Documents that are indexed but whose links are not followed.
1000
fifo: index_follow
3
Documents that are indexed and whose links are followed.
100
fifo: follow
4
Documents whose links are followed, but which are not indexed.
10
smart refresh source
5
Documents to refresh.
1
The crawler scheduler picks URLs from each URL source according to the source's weight. The higher the weight of a fifo, the more links are picked from it.
If you define a crawl rule with a priority action, the priority shift raises or lowers the URL's priority depending on the value (see the example after this list):
-2 = Highest,
-1 = Higher,
0 = Normal,
1 = Lower,
2 = Lowest
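For example, a sketch of a rule (the path prefix is hypothetical) that raises the priority of URLs under /news/ by one level (shift -1 = Higher):

<Rules group="example" key="auto">
  <Rule>
    <ar:Atom field="path" kind="prefix" norm="none" value="/news/"/>
    <Index/>
    <Follow/>
    <Accept/>
    <Priority shift="-1"/>
  </Rule>
</Rules>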
Error Handling
Many errors can occur when crawling a URL.
These errors are split into the following categories:
Permanent errors
HTTP 404 Not Found
HTTP 304 Not Modified when GET was not conditional (inconsistent server behavior)
Redirection to malformed URL
HTTP 5XX server errors
HTTP 4XX content errors
DNS permanent errors (host not found, etc.)
Other connection errors
Temporary errors
Connection time out, connection reset by peer, connection refused
HTTP 503 error with Retry-After header
DNS temporary errors (no answer)
The error status is remembered for each URL. When a URL triggers a permanent error, if a document was indexed for that URL, a deletion order is immediately sent to the Push API.
Documents in error are refreshed like other documents:
When a URL is refreshed and triggers too many temporary errors, if a document was indexed for that URL, a deletion order is sent to the Push API, and its status is removed too. It will not be crawled again unless the crawler comes across a new link to it.
When a URL is refreshed and triggers permanent errors, the URL status is removed. It will not be crawled again unless the crawler comes across a new link to it.