Configuring the Crawler
 
This section describes how to configure the Crawler connector options and presents several sample configurations for common use cases.
Configuring the Crawler
Define the Groups of URLs to Crawl
Define Crawl Rules
Specify the File Name Extension and MIME Types to Crawl
Sample Configurations for Different Use Cases
Configuring the Crawler
This section assumes that the Crawler connector has already been added.
See Creating a Standard Connector.
Crawler General Options
Option
Description
Smart refresh
When enabled, the crawler continuously goes through the list of already crawled URLs and adds them to the crawling queue so they are refreshed.
URLs that have been refreshed or crawled recently, that is, URLs that have not yet reached the Min age threshold, are not added to the crawling queue immediately.
The URLs added by the refresh loop have a lower priority than the new URLs discovered during the crawl, so the refresh process does not slow down the discovery of new pages. For more details, see How Priorities Work.
Min age / Max age
Documents older than Min age and younger than Max age are automatically refreshed depending on their update frequency: documents that change often are refreshed more frequently than documents that never change.
In other words, documents are automatically refreshed according to their update frequency, but never more often than once per Min age, and never less often than once per Max age.
Reasonable values for a small intranet site are a minimum of 1 hour and a maximum of 1 week. Reasonable values for a big web crawl are much higher: at least 1 day for the minimum, and 6 months to 1 year for the maximum.
No. of crawler threads
Specifies the number of crawler threads to run simultaneously.
If the crawler overloads your network, specify fewer threads.
Web server throttle time (ms)
Specifies the time to wait between two queries sent to the same server, in milliseconds.
By default, the connector sends one query at a time to a given HTTP server. This avoids overloading the HTTP server.
Note: The throttle time is ignored if you select the Aggressive mode. However, this may generate a heavy load on Web servers. Use it only when crawling your own servers.
Ignore robots.txt
Ignores the robots.txt that may be defined on the site you want to crawl.
Caution! By default, the crawler follows the rules defined in the site's robots.txt file when crawling the HTTP server.
Detect near duplicates
Detects similar documents.
When a document is identical to a previously crawled page (same content), it is not indexed and its links are not followed.
When it is only similar to a previously crawled document, it is indexed but its links are not followed.
This can avoid looping on infinitely growing URLs with duplicate content, and speeds up the crawl on sites where content can be accessed from different URLs.
Detect patterns
Detects sites based on specific patterns, such as CMS, parking site, SPAM, and adult content. It then applies the specific crawl rules defined for these sites. For example, it can detect search or login pages on forums that the connector must not crawl.
Convert processor
Searches for links in binary documents.
To save CPU and memory, disable this option when crawling sites where there are no new links in binary documents.
Site collapsing
Activates site collapsing.
By default, the site collapsing ID is based on the host part of the URL. All URLs belonging to the same host share the same ID and are shown as one collapsed result. You can configure the site ID to include more segments of the URL's path, to distinguish sites on the same host, by specifying the recursion with the Depth parameter. The default value of 0 corresponds to the host part of the URL.
For more details, see About Site Collapsing.
Default behavior
Allows you to specify a combination of index and follow rules for URLs that do not have any rules.
Accept – crawls URLs that do not match any rules. Beware: this can end up crawling the entire internet.
Index – indexes the contents of the documents at this URL.
Follow – follows the hyperlinks found in the documents at this URL. This finds content outside of this URL. Beware: this can end up crawling the entire internet.
Follow roots – follows root URLs that do not match any rules.
Default data model class
Allows you to specify the class in which the documents must be indexed for URLs that do not have any data model class rules.
License document type
Select the type of documents that this connector produces.
Blacklisted session ID
Allows you to specify the session IDs to be excluded (blacklisted) from the matching URLs.
Many sites use session IDs, which add parameters to URLs more or less randomly. This is troublesome because it creates different links for the same pages.
For example, to blacklist the parameter 's' from URLs, enter s in the field. Then, if the crawler finds a link to:
http://foo.com/toto?s=123456789&t=2
... it crawls:
http://foo.com/toto?t=2
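To make this normalization concrete, here is a minimal Python sketch of the behavior described above (an illustration only, not the connector's implementation): it strips a blacklisted query parameter such as s before comparing or queueing URLs.

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    def strip_blacklisted_params(url, blacklist):
        # Remove blacklisted query parameters (for example, session IDs) from a URL.
        parts = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                if k not in blacklist]
        return urlunsplit((parts.scheme, parts.netloc, parts.path,
                           urlencode(kept), parts.fragment))

    print(strip_blacklisted_params("http://foo.com/toto?s=123456789&t=2", {"s"}))
    # prints: http://foo.com/toto?t=2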
Define the Groups of URLs to Crawl
The Crawler connector can have one or more groups containing root URLs.
Adding groups with Add group is a good means of organizing your sources by topic. For example, you can create a news group, to gather the source URLs of online newspapers.
1. In the Groups section, click Add group to add a new group.
2. In the URLs pane, enter the URLs of the sites, or the parts of sites, that you want to crawl.
Select Site to crawl the entire site. If you crawl an entire site, there is no need to define crawl rules.
- if the URL is http://foo.com/bar, it crawls only pages whose URLs begin with http://foo.com/bar
- if the URL is http://foo.com, it crawls only URLs beginning with http://foo.com/ or http://www.foo.com/ (this scoping is illustrated in the sketch after step 4)
Note: If the home page is a redirection or a meta redirection to another site, it follows it and crawls the destination site. Except for the home page, it does not follow redirections to pages external to the site.
If you want to restrict the crawl, clear the Site option and define crawl rules.
Note: If you have used the Site option and then want to use advanced rules, clear the crawler documents first, otherwise unexpected behavior may occur.
3. If you define several source URLs, you can sort their crawling priority from the Priority select box. See How Priorities Work.
4. Add more groups and define the URLs to crawl.
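The following Python sketch illustrates the Site option scoping described in step 2 (an illustration under simple assumptions, not the crawler's actual code): a root with a path restricts the crawl to that prefix, while a bare host also accepts its www. variant.

    from urllib.parse import urlsplit

    def in_site_scope(root, url):
        # Illustrative check of the Site option:
        # - root with a path (http://foo.com/bar): only URLs starting with that prefix match;
        # - bare host (http://foo.com): both foo.com and www.foo.com are accepted.
        r, u = urlsplit(root), urlsplit(url)
        if r.path in ("", "/"):
            return u.netloc in (r.netloc, "www." + r.netloc)
        return u.netloc == r.netloc and u.path.startswith(r.path)

    print(in_site_scope("http://foo.com/bar", "http://foo.com/bar/page.html"))  # True
    print(in_site_scope("http://foo.com", "http://www.foo.com/index.html"))     # True
    print(in_site_scope("http://foo.com/bar", "http://foo.com/other"))          # False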
Define Crawl Rules
If you do not want to crawl an entire site, you can define precisely how to crawl each root URL in the HTTP source, using crawl rules.
These rules allow you to specify the action to perform for each URL, for example, follow the hyperlinks and index the contents found on the pages.
Keep in mind that the crawler applies the rules to all URLs in all groups. The pattern is a prefix matched against the beginning of the URL. It must only match the URLs of its own group, otherwise unexpected behavior may occur.
1. Expand Advanced rules.
2. Add as many rules as required and for each:
a. Specify a URL pattern.
b. Define the action to take when crawling URLs that match this specific pattern.
Action
Description
Index and follow
Indexes the contents at this URL. The links found in the page are followed and crawled.
Index and don't follow
Indexes the contents but ignores the hyperlinks found in the page.
Follow but don't index
Follows the hyperlinks found in the pages at this URL but does not index the content.
Index
Indexes the contents of the pages at this URL.
Follow
Follows the hyperlinks found in the pages at this URL. This finds content outside of this URL.
Don’t index
Does not index the contents at this URL.
Don’t follow
Ignores the links found in the page.
Ignore
Ignores the defined URL completely.
Source
For compatibility with version 5.1, in which several sources could be defined for the same crawler.
Add meta
Adds a meta as a key/value pair to flag the contents and hyperlinks of the pages at this URL.
Priority
If you define several crawl rules, you can sort their priority from the Priority select box. This changes the priority of URLs matching the pattern. See How Priorities Work.
Data model class
Allows you to specify the data model class of the documents pushed to the index.
For example, we can crawl a single URL “http://www.example.com” and define the following patterns and actions:
http://www.example.com/ index and follow
http://www.example.com/test ignore
http://www.example.com/test/new index and follow
http://www.example.com/listings/ don’t index
Note: You can quickly check the effect of your crawl rules (whether they work or not) in the Test rules section. This is useful when rules are complex and you want to make sure that they do not break anything before applying the configuration to the crawler. Select Advanced mode to specify the expected behavior for each URL, and test rules without applying the configuration. Check boxes turn green when the actual behavior matches the expected one; otherwise, they turn red.
3. Use the up and down arrows on the right of the Actions field to sort the rules. Precedence is given to the last matching rule (the last rule has the highest priority); this resolution is illustrated in the sketch after step 4.
4. Expert users can also click Edit rules as xml to fine-tune rules manually. See Advanced Configuration.
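The following Python sketch mimics the rule resolution described in step 3 for the example rules above, assuming plain prefix patterns and last-match precedence (an illustration only, not the product's code).

    # Example rules in configuration order; the last matching rule wins.
    rules = [
        ("http://www.example.com/",          "index and follow"),
        ("http://www.example.com/test",      "ignore"),
        ("http://www.example.com/test/new",  "index and follow"),
        ("http://www.example.com/listings/", "don't index"),
    ]

    def resolve(url, default="default behavior"):
        # Return the action of the last rule whose pattern is a prefix of the URL.
        action = default
        for pattern, rule_action in rules:
            if url.startswith(pattern):
                action = rule_action
        return action

    print(resolve("http://www.example.com/test/old"))    # ignore
    print(resolve("http://www.example.com/test/new/1"))  # index and follow
    print(resolve("http://www.example.com/listings/2"))  # don't index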
Specify the File Name Extension and MIME Types to Crawl
Exalead CloudView lets you specify the file name types that the crawler must take into account. This allows you to focus only on interesting files, and reduce the amount of information to crawl.
By file name types, we refer to:
The file 'extensions' that you can filter before crawling URLs.
The 'MIME types' that Exalead CloudView identifies once a document has been fetched.
1. In Filename extension and MIME types, select the extensions and MIME types to include or exclude.
2. Click Apply.
Sample Configurations for Different Use Cases
Crawl One Intranet Site
When crawling only intranet sites, the configuration is simple: enter the root URLs and, if needed, some rules to avoid indexing useless documents.
Since the server is friendly, you can remove crawl speed limits. In that case, you can configure the crawler in "aggressive" mode, which means that there is no throttle time between requests. This allows a faster crawl and refresh.
You can also enable "smart refresh" to index new documents quickly. This option automatically refreshes more often the pages that change often, though it needs some adaptation time.
If required, you can configure a deterministic refresh to ensure that all pages are refreshed within a given period. Check that the refresh period is long enough to recrawl all pages, based on the server response time. You can schedule deterministic refresh using the product's scheduler. See Refresh.
Crawl an Intranet Site with SSO Authentication
The crawler configuration is the same as above, but the fetcher also needs some authentication configuration.
Follow the instructions; usually, you can copy and paste the URL patterns from the crawl rules into the authentication configuration.
Crawl Many Intranet Sites
Follow these recommendations when crawling many different intranet sites.
Usually, one crawler containing all the sites is more efficient than many crawlers: a single crawler can share resources (CPU, memory, threads), whereas two independent crawlers consume twice the resources without being faster.
The only good reasons to separate the sites in several crawlers are the following:
You need different crawler options (throttle time, aggressive mode, refresh).
You need documents indexed in a different build group.
You deploy different crawlers on different hosts for performance.
Deterministic refresh is still an option; see Crawl One Intranet Site.
The number of threads is important: too many threads use too much memory, while too few cannot crawl fast enough. To find the appropriate number, remember that a host can be crawled by only one thread at a time, so any threads beyond the total number of hosts are useless. You can also fine-tune the number using the performance graphs in the Monitoring Console: the crawler has a graph showing the average number of idle threads. If many threads are idle, you can lower their number.
Crawl Specific Internet Sites
If you need to crawl specific pages on a small number of sites (for example, product pages on e-commerce sites, or articles on a news site), write crawl rules carefully to define precisely which URLs to index, and which URLs to follow but not index.
The easiest way is to use the Site option. This way, the crawler crawls every URL in the site. With some additional rules, you can select which pages to index among the crawled documents:
A rule with the same pattern as the whole site and a don't index action overrides the default behavior of the Site option: the site is still crawled, but no longer indexed.
Enough rules with the index action to match the pages you want to index. These override the previous rule only for the wanted pages. You can also use the index and don't follow action if the links from these pages are not required.
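For example, for a hypothetical site http://shop.example.com/ declared with the Site option, the following pair of rules crawls the whole site but indexes only its product pages (the URLs and paths here are purely illustrative):
http://shop.example.com/ don't index
http://shop.example.com/product/ index and don't follow
The first rule disables indexing for the whole crawled site, and the second one, being the last matching rule for product URLs, re-enables indexing for those pages only.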
To discover new links earlier without missing any, use Smart refresh or a specific deterministic refresh on intermediate listing pages. For more information about deterministic refresh, see Refresh.
Be aware of the limited crawl speed on the internet (usually 1 document every 2.5 seconds), which defines an upper bound on the number of crawled pages per day. You can tune the refresh periods with the Min/Max document age options. To optimize the crawl behavior:
Increase the acceptable age to slow down the refresh,
Decrease it if you have available resources, to improve freshness.
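As a rough order of magnitude, at 1 document every 2.5 seconds a single site yields at most 86400 / 2.5 = 34560 fetched pages per day; use this bound when sizing the Min/Max document age values.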
Crawl Many Internet Sites
If you need to crawl a large or very large list of websites, configuring them in the Administration Console becomes a chore.
The easiest way to achieve this task involves several features:
Site mode,
Root sets.
The Site mode allows you to avoid setting crawl rules per site.
The root set configuration allows the crawler to refresh the list of sites to crawl dynamically, from a file outside the configuration.
The root set URL locates the file containing the site list. It can be a regular file:
On the local filesystem (file://),
In the product's data directory (data://),
In the configuration directory (config://latest/master/),
Or even a remote server (http://).
The file can be as simple as one URL per line.
You can also store the root set in the resource manager, using a resourcemanager:// URL. For more information, see the resource manager's documentation.
Refresh works better with Smart refresh and longer time periods (1 week to 1 month), to avoid unnecessarily aggressive refreshing.
The crawler can have as many crawl threads as the hardware resources allow. 20 to 100 threads are generally used in large crawls.
You can categorize sites using the root set group (split root sets into several smaller sets, identified by their group), or on a per-site basis using forced metas. Example of a root set using forced metas:
http://www.example1.com/ category=news theme=sports relevancy=100
http://www.example2.com/ category=blogs relevancy=10
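Assuming the simple whitespace-separated layout shown above (a root URL optionally followed by key=value pairs), a minimal Python sketch of a root set reader could look like this; it is an illustration of the file format, not the product's parser.

    def read_rootset(path):
        # Each non-empty line: a root URL, optionally followed by key=value forced metas.
        roots = []
        with open(path) as f:
            for line in f:
                fields = line.split()
                if not fields:
                    continue
                url = fields[0]
                metas = dict(field.split("=", 1) for field in fields[1:])
                roots.append((url, metas))
        return roots

    # read_rootset("rootset.txt") would return, for the example above:
    # [("http://www.example1.com/", {"category": "news", "theme": "sports", "relevancy": "100"}),
    #  ("http://www.example2.com/", {"category": "blogs", "relevancy": "10"})]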
To optimize resource usage, you can configure a maximum number of crawled pages per site. The default value is 1 million pages.