Crawler

Name	Type	Default value	Description
name	string		The crawler name. It must be unique across all crawlers.
documentsType	string		The type of documents pushed by this connector. The type of documents must match one of the types declared in your CloudView license file.
fetcher	string		Which fetcher to use.
crawlerServer	string		Crawler server hosting this crawler. See Deployment configuration.
connectorServer	string		Connector server hosting the indexing part of this crawler. See Deployment configuration.
buildGroup	string		Target build group.
dataModel	string		The default data model for documents indexed by this crawler.
storeTextOnly	boolean	True	Whether to store only the converted text instead of the original binary documents.
nthreads	int	1	The number of crawl threads. Must be strictly positive.
aggressive	boolean		Whether to enable aggressive crawling, which never sleeps between two requests to the same host.
throttleTimeMS	int	2500	For non-aggressive crawls, the sleep interval in milliseconds between requests to the same host.
ignoreRobotsTxt	boolean		Whether to ignore robots.txt rules. Not recommended.
enableConvertProcessor	boolean	True	Whether to enable the remoteconvert-based processor for extracting links from binary documents.
nearDuplicateDetector	boolean	True	Whether to enable the near-duplicate content detector.
patternsDetector	boolean	True	Whether to enable patterns detection in pages.
crawlSitemaps	boolean	True	Whether to crawl sitemaps.
disableConditionalGet	boolean		Whether to always fetch documents, even if the server reports that they have not changed.
defaultAccept	boolean		Whether to crawl a URL by default when it matches no accept rule.
defaultIndex	boolean		Whether to index a URL by default when it matches no index rule.
defaultFollow	boolean		Whether to follow a URL by default when it matches no follow rule.
defaultFollowRoots	boolean	True	Whether to automatically follow root URLs.
enableSimpleSiteCollapsing	boolean	True	Whether to generate a site ID suitable for document collapsing.
simpleSiteCollapsingDepth	int		How many path segments to use to generate the site collapsing ID.
mimeTypesMode	string	exclude	Whether the mimeTypes list is treated as a whitelist or a blacklist.
smartRefresh	boolean	True	Whether to crawl a fraction of refreshed URLs.
smartRefreshMinAgeS	int	3600	Minimum age, in seconds, at which an old URL may be refreshed.
smartRefreshMaxAgeS	int	604800	Age, in seconds, at which the refresh of an old URL is forced.
archiveDocuments	boolean		When enabled, deleted documents are not removed but kept along with their deletion date.
enableConsolidation	boolean	True	Whether to use a consolidation PAPI instead of a standard PAPI.
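
The interaction between aggressive, throttleTimeMS, smartRefreshMinAgeS and smartRefreshMaxAgeS can be pictured with the short Python sketch below. It is only an illustration of the descriptions in the table, not the crawler's actual code: the function names, the state labels and the exact boundary handling (inclusive or exclusive) are assumptions made for the example. The default values are the ones listed above.

```python
def request_delay_ms(aggressive: bool, throttle_time_ms: int = 2500) -> int:
    """Delay to observe before the next request to the same host.

    Hypothetical helper: an aggressive crawl never sleeps, a
    non-aggressive crawl waits throttleTimeMS milliseconds.
    """
    return 0 if aggressive else throttle_time_ms


def refresh_state(age_s: int, min_age_s: int = 3600, max_age_s: int = 604800) -> str:
    """Classify a known URL by the time elapsed since its last fetch.

    The labels are made up for this sketch:
      "too-recent": younger than smartRefreshMinAgeS, not refreshed yet
      "eligible":   may be picked as part of the refreshed fraction
      "forced":     reached smartRefreshMaxAgeS, refresh is forced
    Treating both thresholds as inclusive is an assumption.
    """
    if age_s < min_age_s:
        return "too-recent"
    if age_s >= max_age_s:
        return "forced"
    return "eligible"


if __name__ == "__main__":
    print(request_delay_ms(aggressive=False))
    for age in (60, 7200, 700000):
        print(age, refresh_state(age))
```

With the defaults, a URL fetched less than one hour ago is left alone, and a URL not fetched for seven days (604800 seconds) is always refreshed.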

Name	Type	Description
mimeTypes	exa.bee.StringConstantValue*
sessionIdBlacklist	exa.bee.StringConstantValue*	Session ID blacklist. The parameters listed here are removed from any URL whose path or query part contains them.
PushAPIFilter	exa.bee.KeyValue*
roots	com.exalead.mercury.mami.crawl.v21.Root*	A list of root URLs to start the crawl from.
rootsets	com.exalead.mercury.mami.crawl.v21.RootSet*	A list of files to load URLs/sites from.
CrawlSchedulerConfig	com.exalead.mercury.mami.crawl.v21.CrawlSchedulerConfig
CustomCrawlConfig	com.exalead.mercury.mami.crawl.v21.CustomCrawlConfig
Rules	com.exalead.mercury.mami.crawl.v21.Rules*
UrlTesterData	com.exalead.mercury.mami.crawl.v21.UrlTesterData
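
As an illustration of what sessionIdBlacklist is described as doing, the Python sketch below removes blacklisted parameters from both the path part (";name=value" suffixes) and the query part of a URL. This is a minimal sketch under stated assumptions, not the crawler's implementation: the function name strip_session_ids is hypothetical, and exact, case-insensitive matching on the parameter name is an assumption, since the table does not specify how entries are matched.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_session_ids(url: str, blacklist: list[str]) -> str:
    """Return url without the path and query parameters named in blacklist."""
    names = {name.lower() for name in blacklist}  # assumed: case-insensitive
    parts = urlsplit(url)

    # Path part: drop ";name=value" parameters attached to any segment.
    segments = []
    for segment in parts.path.split("/"):
        head, *params = segment.split(";")
        kept = [p for p in params if p.split("=", 1)[0].lower() not in names]
        segments.append(";".join([head] + kept))
    path = "/".join(segments)

    # Query part: drop "name=value" pairs whose name is blacklisted.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    query = urlencode([(k, v) for k, v in pairs if k.lower() not in names])

    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


if __name__ == "__main__":
    print(strip_session_ids(
        "http://example.com/a;jsessionid=123/b?x=1&PHPSESSID=abc&y=2",
        ["jsessionid", "PHPSESSID"],
    ))
```

Removing such parameters lets two URLs that differ only by a session identifier be treated as the same page.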