Crawler
com.exalead.mercury.mami.crawl.v21.Crawler
A crawler configuration. A crawler may contain a CrawlSchedulerConfig to override the default FIFO priorities, and a CustomCrawlConfig to enable custom processors.
Parent elements:
com.exalead.mercury.mami.crawl.v21.CrawlConfig (as CrawlConfig)
Attributes:
| Name | Type | Default value | Description |
|---|---|---|---|
| name | string | | The crawler name. It must be unique across all crawlers. |
| documentsType | string | | The type of documents pushed by this connector. It must match one of the types declared in your CloudView license file. |
| fetcher | string | | Which fetcher to use. |
| crawlerServer | string | | Crawler server hosting this crawler. See Deployment configuration. |
| connectorServer | string | | Connector server hosting the indexing part of this crawler. See Deployment configuration. |
| buildGroup | string | | Target build group. |
| dataModel | string | | The default data model for documents indexed by this crawler. |
| storeTextOnly | boolean | True | Whether to store only the converted text rather than the original binary documents. |
| nthreads | int | 1 | The number of crawl threads. Must be strictly positive. |
| aggressive | boolean | | Whether to enable aggressive crawling, which never sleeps between two requests to the same host. |
| throttleTimeMS | int | 2500 | For non-aggressive crawls, the sleep interval between requests to the same host, in milliseconds. |
| ignoreRobotsTxt | boolean | | Whether to ignore robots.txt rules. Not recommended. |
| enableConvertProcessor | boolean | True | Whether to enable the remoteconvert-based processor for extracting links from binary documents. |
| nearDuplicateDetector | boolean | True | Whether to enable the near-duplicate content detector. |
| patternsDetector | boolean | True | Whether to enable pattern detection in pages. |
| crawlSitemaps | boolean | True | Whether to crawl sitemaps. |
| disableConditionalGet | boolean | | Whether to always fetch documents, even if the server reports that they have not changed. |
| defaultAccept | boolean | | Whether to crawl a URL by default when it matches no accept rule. |
| defaultIndex | boolean | | Whether to index a URL by default when it matches no index rule. |
| defaultFollow | boolean | | Whether to follow a URL by default when it matches no follow rule. |
| defaultFollowRoots | boolean | True | Whether to automatically follow root URLs. |
| enableSimpleSiteCollapsing | boolean | True | Whether to generate a site ID suitable for document collapsing. |
| simpleSiteCollapsingDepth | int | | How many path segments to use to generate the site collapsing ID. |
| mimeTypesMode | string | exclude | Whether the MIME types list is treated as a whitelist or a blacklist. |
| smartRefresh | boolean | True | Whether to crawl a fraction of refreshed URLs. |
| smartRefreshMinAgeS | int | 3600 | Age in seconds at which old URLs may be refreshed. |
| smartRefreshMaxAgeS | int | 604800 | Age in seconds at which the refresh of old URLs is forced. |
| archiveDocuments | boolean | | When enabled, deleted documents are not removed but kept along with their deletion date. |
| enableConsolidation | boolean | True | Whether to use a consolidation PAPI instead of a standard PAPI. |
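
To illustrate how the attributes listed above fit together, the fragment below sketches a Crawler element for a non-aggressive crawl. It is a minimal sketch rather than a validated configuration: the attribute names and default values come from this reference, but every other value (the crawler name, the server names, the build group, the thread count) is a hypothetical placeholder. Namespace declarations, the lowercase spelling of boolean values, and any required nested elements are assumptions to check against your own installation.

```xml
<!-- Sketch only. Namespace declarations are omitted, and the name, server,
     and build group values below are hypothetical placeholders. -->
<CrawlConfig>
  <!-- Boolean values are written in lowercase here; that spelling is an assumption. -->
  <Crawler name="intranet-crawler"
           crawlerServer="crawler-server-0"
           connectorServer="connector-server-0"
           buildGroup="build-group-0"
           nthreads="4"
           aggressive="false"
           throttleTimeMS="2500"
           ignoreRobotsTxt="false"
           crawlSitemaps="true"
           smartRefresh="true"
           smartRefreshMinAgeS="3600"
           smartRefreshMaxAgeS="604800"/>
</CrawlConfig>
```

In this sketch, aggressive crawling is disabled, so throttleTimeMS applies between requests to the same host. The smart refresh ages shown are the documented defaults: 3600 seconds is one hour and 604800 seconds is seven days.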
Nested elements:
| Name | Type | Description |
|---|---|---|
| mimeTypes | exa.bee.StringConstantValue* | |
| sessionIdBlacklist | exa.bee.StringConstantValue* | SessionId blacklist. These parameters are removed from URLs with a path or query part containing them. |
| PushAPIFilter | exa.bee.KeyValue* | |
| roots | com.exalead.mercury.mami.crawl.v21.Root* | A list of root URLs to start the crawl from. |
| rootsets | com.exalead.mercury.mami.crawl.v21.RootSet* | A list of files to load URLs/sites from. |
| CrawlSchedulerConfig | com.exalead.mercury.mami.crawl.v21.CrawlSchedulerConfig | |
| CustomCrawlConfig | com.exalead.mercury.mami.crawl.v21.CustomCrawlConfig | |
| Rules | com.exalead.mercury.mami.crawl.v21.Rules* | |
| UrlTesterData | com.exalead.mercury.mami.crawl.v21.UrlTesterData | |