Name | Type | Default value | Description |
---|---|---|---|
name | string | | The crawler name. It must be unique across all crawlers. |
documentsType | string | | The type of documents pushed by this connector. The type of documents must match one of the types declared in your CloudView license file. |
fetcher | string | | Which fetcher to use. |
crawlerServer | string | | Crawler server hosting this crawler. See Deployment configuration. |
connectorServer | string | | Connector server hosting the indexing part of this crawler. See Deployment configuration. |
buildGroup | string | | Target build group. |
dataModel | string | | The default data model for documents indexed by this crawler. |
storeTextOnly | boolean | True | Whether to store original binary documents, or only converted text. |
nthreads | int | 1 | The number of crawl threads. Must be strictly positive. |
aggressive | boolean | | Whether to enable aggressive crawl, which never sleeps between two requests to the same host. |
throttleTimeMS | int | 2500 | For a non-aggressive crawl, the sleep interval in milliseconds between requests to the same host. |
ignoreRobotsTxt | boolean | | Whether to ignore robots.txt rules. Not recommended. |
enableConvertProcessor | boolean | True | Whether to enable the remoteconvert-based processor for extracting links from binary documents. |
nearDuplicateDetector | boolean | True | Whether to enable the near-duplicate content detector. |
patternsDetector | boolean | True | Whether to enable patterns detection in pages. |
crawlSitemaps | boolean | True | Whether to crawl sitemaps. |
disableConditionalGet | boolean | | Whether to always fetch documents, even if the server reports that they have not changed. |
defaultAccept | boolean | | Whether to crawl a URL by default when it matches no other accept rule. |
defaultIndex | boolean | | Whether to index by default when a URL matches no index rule. |
defaultFollow | boolean | | Whether to follow by default when a URL matches no follow rule. |
defaultFollowRoots | boolean | True | Whether to automatically follow root URLs. |
enableSimpleSiteCollapsing | boolean | True | Whether to generate a site ID suitable for document collapsing. |
simpleSiteCollapsingDepth | int | | How many path segments to use to generate the site collapsing ID. |
mimeTypesMode | string | exclude | Whether the MIME types list is used as a whitelist or a blacklist. |
smartRefresh | boolean | True | Whether to crawl a fraction of refreshed urls. |
smartRefreshMinAgeS | int | 3600 | Age in seconds at which we may refresh old urls. |
smartRefreshMaxAgeS | int | 604800 | Age in seconds at which we force the refresh of old urls. |
archiveDocuments | boolean | | When enabled, deleted documents are not removed, but kept with their deletion date. |
enableConsolidation | boolean | True | Whether to use a consolidation PAPI instead of a standard PAPI. |
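
The interaction between `smartRefreshMinAgeS` and `smartRefreshMaxAgeS` can be pictured with a small sketch. This is not the crawler's actual implementation: the function name, the `RefreshDecision` enum, and the exact boundary handling (inclusive thresholds) are assumptions made for illustration. Only the two default values come from the table above.

```python
from enum import Enum

# Defaults taken from the table above.
SMART_REFRESH_MIN_AGE_S = 3600      # 1 hour
SMART_REFRESH_MAX_AGE_S = 604800    # 7 days


class RefreshDecision(Enum):
    """Hypothetical outcome labels, not part of the product."""
    SKIP = "skip"          # too recent, do not refresh
    ELIGIBLE = "eligible"  # old enough that it may be refreshed
    FORCE = "force"        # so old that a refresh is forced


def refresh_decision(age_s: int,
                     min_age_s: int = SMART_REFRESH_MIN_AGE_S,
                     max_age_s: int = SMART_REFRESH_MAX_AGE_S) -> RefreshDecision:
    """Classify a URL by the time elapsed since it was last crawled.

    Assumption: both thresholds are treated as inclusive.
    """
    if age_s >= max_age_s:
        return RefreshDecision.FORCE
    if age_s >= min_age_s:
        return RefreshDecision.ELIGIBLE
    return RefreshDecision.SKIP


if __name__ == "__main__":
    for age in (600, 7200, 700000):
        print(age, refresh_decision(age).value)
```

How the crawler picks which eligible URLs to refresh between the two thresholds is not described here, so the sketch stops at classification.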

Name | Type | Description |
---|---|---|
mimeTypes | exa.bee.StringConstantValue* | |
sessionIdBlacklist | exa.bee.StringConstantValue* | SessionId blacklist. These parameters are removed from URLs with a path or query part containing them. |
PushAPIFilter | exa.bee.KeyValue* | |
roots | com.exalead.mercury.mami.crawl.v21.Root* | A list of root urls to start the crawl from. |
rootsets | com.exalead.mercury.mami.crawl.v21.RootSet* | A list of files to load urls/sites from. |
CrawlSchedulerConfig | com.exalead.mercury.mami.crawl.v21.CrawlSchedulerConfig | |
CustomCrawlConfig | com.exalead.mercury.mami.crawl.v21.CustomCrawlConfig | |
Rules | com.exalead.mercury.mami.crawl.v21.Rules* | |
UrlTesterData | com.exalead.mercury.mami.crawl.v21.UrlTesterData | |
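
To make the effect of `sessionIdBlacklist` concrete, here is a minimal sketch of removing blacklisted parameters from the path and query parts of a URL. It is an illustration only, not the crawler's code: the function name `strip_session_ids`, the case-insensitive matching, and the treatment of path parameters as `;name=value` suffixes are all assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_session_ids(url: str, blacklist: list[str]) -> str:
    """Return the URL with blacklisted parameters removed (hypothetical helper)."""
    # Assumption: parameter names are matched case-insensitively.
    names = {name.lower() for name in blacklist}
    parts = urlsplit(url)

    # Path part: parameters such as /page;jsessionid=ABC are attached
    # to a path segment with ';'. Keep only the non-blacklisted ones.
    segments = []
    for segment in parts.path.split("/"):
        head, *params = segment.split(";")
        kept = [p for p in params if p.split("=", 1)[0].lower() not in names]
        segments.append(";".join([head, *kept]))
    path = "/".join(segments)

    # Query part: drop every key/value pair whose key is blacklisted.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    query = urlencode([(k, v) for k, v in pairs if k.lower() not in names])

    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


if __name__ == "__main__":
    print(strip_session_ids(
        "http://example.com/a;jsessionid=123/b?x=1&PHPSESSID=abc&y=2",
        ["jsessionid", "phpsessid"],
    ))
```

Note that `urlencode` re-encodes the remaining query values, so a URL with unusual escaping may come back in a normalized form.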