 
Troubleshooting the Crawler Connector
 
This section describes how to monitor and troubleshoot the Crawler connector. It covers the following topics:
Use the Crawler http.log
Dump the Crawler Repository
Performance Monitoring
Use the Crawler http.log
In the Administration Console, the Crawler connector's crawl logs are available in the CRAWLER CONNECTOR > Logs tab.
They record all the actions performed for the URLs crawled by the crawler, with their HTTP response status and configuration messages. Stack traces are also printed when unexpected exceptions occur.
These logs are saved in:
<DATADIR>/run/crawler-<crawler-server>/<crawler-name>.http.log
For example, for the default crawler server exa0, the path is: <DATADIR>/run/crawler-exa0/HTTP_Crawler.http.log
Figure 4. Display of Crawler Connector logs in the Administration Console
The http.log file contains one line for each processed URL, with detailed information about the result of its processing. This log is produced by a logger named crawllog-<crawlername> at the info level; setting the default log level above info disables the http.log.
Field         Example
date          2013/01/11-14:32:10
doc_status    OK
http_code     200
http_message  OK
url           http://www.reddit.com/r/announcements/
referrer      http://www.reddit.com/
messages      [default: accept,index,follow (site)] [content-type=text/html] [content-length=13347] [mime=text/html] [size=91204B] [fetchDuration=1655ms] [langInMeta=en] [lang=en] [mime.html.simhash=8857254644444405862] [mime.html.nbToken=1138] [postedLinks=297]
More information about these log fields:
doc_status possible values are:
OK
REDIR
IGNORED
TEMPORARY_ERROR
PERMANENT_ERROR
referrer is "-" when unavailable
messages are printed chronologically as follows:
1. Preprocessor rules (default rules or <name/preprocessor> rules); later rules override earlier ones, and accept/ignore decides whether the document is fetched.
2. If the document is not ignored, the fetch result contains the content-type and content-length returned by the server, the verified MIME type, the real document size (content-length is the size before any decompression), and the fetchDuration in milliseconds.
3. If the document is fetched, the processor outputs the detected language and the simhash (similarity hash).
4. Postprocessor rules: index/noindex decides whether the document must be indexed; keep track stores the document in the box only.
5. postedLinks indicates the number of followed links (follow rules decide whether to follow any link from the current document).
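For quick triage, you can aggregate the http.log with standard shell tools. The sketch below assumes the fields are whitespace-separated in the order shown above, so doc_status is the second field and http_code the third; adjust the field numbers if your layout differs:

# Count documents per doc_status (field 2, assuming the field order above)
awk '{print $2}' <DATADIR>/run/crawler-exa0/HTTP_Crawler.http.log | sort | uniq -c | sort -rn

# Count documents per HTTP response code (field 3)
awk '{print $3}' <DATADIR>/run/crawler-exa0/HTTP_Crawler.http.log | sort | uniq -c | sort -rn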
You can also consult the process logs of the Crawler Server from the Troubleshooting > Logs menu by selecting a crawler server (for example, crawler-exa0) from Processes and clicking Add. These logs are saved in the <DATADIR>/run/crawler-<crawler-server>/log.log file.
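For example, to follow the process log of the default crawler server live from a shell:

tail -f <DATADIR>/run/crawler-exa0/log.log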
Dump the Crawler Repository
You can use the cvdebug command-line tool, which performs debugging operations, to dump all or part of the crawler repository.
Go to the <DATADIR>/bin directory and run the following command:
cvdebug box dump crawler=<CRAWLER NAME> [prefix]
where:
crawler – the name of the crawler to dump.
[prefix] – optional; only dumps URLs beginning with this prefix.
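For example, to dump only the entries of the default HTTP_Crawler whose URLs begin with a given prefix (the prefix below is illustrative):

cvdebug box dump crawler=HTTP_Crawler http://www.example.com/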
The output is space-separated, with a header line.
Field                  Description
url                    Document URL
redirUrl               Redirection URL, if any
docType                Document type (HTML, redirect, or ANY if unknown)
httpStatus             HTTP status: STATUS_OK, STATUS_PERMANENT_ERROR, STATUS_FORBIDDEN
language               Document language in ISO 639-1 alpha-2 format
lastSeenChangeDate     The last time the crawler saw the document change
refreshDate            The last time the crawler refreshed the document
firstCrawlDate         The first time the document was crawled
size                   Document size in bytes
normalizedContentType  Document MIME type (text/html, flash, etc.)
encoding               Document encoding (utf-8, iso-8859-1, etc.)
truncated              Whether the document was truncated; this occurs when the document size exceeds the limits defined in <DATADIR>/config/FetchConfig.xml
siteRoot               Specifies whether the document is a site root
site                   Specifies under which site root this page was crawled
rootGroup              Specifies the group this document belongs to
rulesGroup             Specifies which crawl rules group applies to this document
source                 Specifies the source meta value as defined by crawl rules for which the selected action is Source. See Crawl Rule Actions.
dataModelClass         Specifies the dataModelClass meta value as defined by crawl rules for which the selected action is Data model class. See Crawl Rule Actions.
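You can combine the dump with standard shell tools to answer common questions. The sketch below assumes the header occupies a single line and that the fields appear in the order of the table above (url first, httpStatus fourth); adjust the field numbers if your output differs:

# List URLs in permanent error (url = field 1, httpStatus = field 4, skipping the header line)
cvdebug box dump crawler=HTTP_Crawler | awk 'NR>1 && $4 == "STATUS_PERMANENT_ERROR" {print $1}'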
Performance Monitoring
In the Operations & Inspection tab, you can see the crawler status, as well as counts of indexed documents, crawled documents, and remaining URLs to crawl, plus many probes.
The Number of documents in crawler storage is the number of crawled documents. Usually, this number is higher than the number of documents indexed by the crawler: some documents may not have been indexed because they were empty, returned errors, or were redirections. The crawler storage keeps a copy of all crawled documents so that the crawler knows which pages must be updated.
Note: In the Monitoring Console, crawlers have many graphs showing their overall state and activity under HOSTNAME > Services > Exalead > Connectors > CRAWLER NAME. You can see all values through the probes.