 
Troubleshooting the Crawler Connector
 
This section describes how to monitor and troubleshoot the Crawler connector. It covers the following topics:
Use the Crawler http.log
Dump the Crawler Repository
Performance Monitoring
Use the Crawler http.log
In the Administration Console, the Crawler connector's crawl logs are available in the CRAWLER CONNECTOR > Logs tab.
They record all the actions performed for the URLs crawled by the crawler, with their HTTP response status and configuration messages. Stack traces are also printed when unexpected exceptions occur.
These logs are saved in:
<DATADIR>/run/crawler-<crawler-server>/<crawler-name>.http.log
For example, for the default crawler server exa0, the path is: <DATADIR>/run/crawler-exa0/HTTP_Crawler.http.log
Figure 4. Display of Crawler Connector logs in the Administration Console
The http.log file contains one line for each processed URL, with detailed information about the result of its processing. This log is produced by a logger named crawllog-<crawlername> at the info level; setting the default log level above info disables the http.log.
Field         Example
date          2013/01/11-14:32:10
doc_status    OK
http_code     200
http_message  OK
url           http://www.reddit.com/r/announcements/
referrer      http://www.reddit.com/
messages      [default: accept,index,follow (site)] [content-type=text/html] [content-length=13347] [mime=text/html] [size=91204B] [fetchDuration=1655ms] [langInMeta=en] [lang=en] [mime.html.simhash=8857254644444405862] [mime.html.nbToken=1138] [postedLinks=297]
More information about these log fields:
doc_status possible values are:
OK
REDIR
IGNORED
TEMPORARY_ERROR
PERMANENT_ERROR
referrer is "-" when unavailable
messages are printed chronologically as follows:
1. Preprocessor rules (default rules or <name/preprocessor> rules); later rules override earlier ones, and accept/ignore decides whether the document is fetched.
2. If the document is not ignored, the fetch result contains the content-type and content-length returned by the server, the verified MIME type, the real document size (content-length is the size before any decompression), and the fetchDuration in milliseconds.
3. If the document is fetched, the processor outputs the detected language and the simhash (similarity hash).
4. Postprocessor rules: index/noindex decides whether the document must be indexed; keep track stores the document in the box only.
5. postedLinks indicates the number of followed links (follow rules decide whether to follow any link from the current document).
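For quick triage, you can aggregate the http.log with standard shell tools. The sketch below assumes the fields are whitespace-separated in the order shown above, so doc_status is the second field and http_code the third; adjust the field numbers if your layout differs:

# Count documents per doc_status (field 2, assuming the field order above)
awk '{print $2}' <DATADIR>/run/crawler-exa0/HTTP_Crawler.http.log | sort | uniq -c | sort -rn

# Count documents per HTTP response code (field 3)
awk '{print $3}' <DATADIR>/run/crawler-exa0/HTTP_Crawler.http.log | sort | uniq -c | sort -rn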
You can also consult the process logs of the Crawler Server from the Troubleshooting > Logs menu by selecting a crawler server (for example, crawler-exa0) from Processes and clicking Add. These logs are saved in the <DATADIR>/run/crawler-<crawler-server>/log.log file.
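For example, to follow the process log of the default crawler server live from a shell:

tail -f <DATADIR>/run/crawler-exa0/log.log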
Dump the Crawler Repository
You can use the cvdebug command-line tool, which performs debugging operations, to dump all or part of the crawler repository.
Go to the <DATADIR>/bin directory and run the following command:
cvdebug box dump crawler=<CRAWLER NAME> [prefix]
where:
crawler – the name of the crawler to dump.
[prefix] – optional; only dumps URLs beginning with this prefix.
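For example, to dump only the entries of the default HTTP_Crawler whose URLs begin with a given prefix (the prefix below is illustrative):

cvdebug box dump crawler=HTTP_Crawler http://www.example.com/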
The output is space-separated, with a header line.
Field                  Description
url                    Document URL
redirUrl               Redirection URL, if any
docType                Document type (HTML, redirect, or ANY if unknown)
httpStatus             HTTP status: STATUS_OK, STATUS_PERMANENT_ERROR, STATUS_FORBIDDEN
language               Document language in ISO 639-1 alpha-2 format
lastSeenChangeDate     The last time the crawler saw the document change
refreshDate            The last time the crawler refreshed the document
firstCrawlDate         The first time the document was crawled
size                   Document size in bytes
normalizedContentType  Document MIME type (text/html, flash, etc.)
encoding               Document encoding (utf-8, iso-8859-1, etc.)
truncated              Whether the document was truncated; this occurs when the document size exceeds the limits defined in <DATADIR>/config/FetchConfig.xml
siteRoot               Specifies whether the document is a site root
site                   Specifies under which site root this page was crawled
rootGroup              Specifies the group this document belongs to
rulesGroup             Specifies which crawl rules group applies to this document
source                 Specifies the source meta value as defined by crawl rules for which the selected action is Source. See Crawl Rule Actions.
dataModelClass         Specifies the dataModelClass meta value as defined by crawl rules for which the selected action is Data model class. See Crawl Rule Actions.
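You can combine the dump with standard shell tools to answer common questions. The sketch below assumes the header occupies a single line and that the fields appear in the order of the table above (url first, httpStatus fourth); adjust the field numbers if your output differs:

# List URLs in permanent error (url = field 1, httpStatus = field 4, skipping the header line)
cvdebug box dump crawler=HTTP_Crawler | awk 'NR>1 && $4 == "STATUS_PERMANENT_ERROR" {print $1}'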
Performance Monitoring
In the Operations & Inspection tab, you can see the crawler status, as well as counts of indexed documents, crawled documents, and remaining URLs to crawl, plus many probes.
The Number of documents in crawler storage is the number of crawled documents. Usually, this number is higher than the number of documents indexed by the crawler: some documents may not have been indexed because they were empty, returned errors, or were redirections. The crawler storage keeps a copy of all crawled documents so that the crawler knows which pages must be updated.
Note: In the Monitoring Console, crawlers have many graphs showing their overall state and activity under HOSTNAME > Services > Exalead > Connectors > CRAWLER NAME. You can see all values through the probes.