You can configure the fetcher of the Crawler connector in the CRAWLER CONNECTOR > Fetcher tab as described below. This subsection details how to configure the fetcher general parameters as well as the rules for the fetcher headers (the HTTP headers that are sent to all URLs).
Configure the Fetcher
1. In the Fetcher tab, expand More options.
2. Select Enable cookies for all sites to enable cookies for all sites, not only sites authenticated with an HTML form and a cookie.
3. In User agent, enter the User-Agent header string to send in HTTP requests.
4. In From, enter the email address sent in the HTTP header.
5. In Proxy address, enter the proxy server address.
6. Define your proxy server credentials in Proxy username and Proxy password.
7. Define the timeout parameters:
◦ Read timeout (s): The read timeout expressed in seconds.
◦ Write timeout (s): The write timeout expressed in seconds.
◦ Connect timeout (s): The connect timeout expressed in seconds.
◦ Download timeout (s): The maximum connection time, in seconds, before a download is canceled.
8. Click Apply.
Configure the Fetcher Headers
1. Click Add header to add a new header.
For all header rules, specify the header name and value.
2. In Name, enter the header name. For example, Accept-Language
3. In Value, enter the header value. For example, en
4. Click Apply.
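To make these settings concrete, the following sketch shows roughly what they correspond to at the HTTP level, using Python's standard library rather than the connector itself. The names FetcherSettings, build_request, and build_opener, and all sample values, are illustrative assumptions and not part of the product. A single timeout value stands in for the separate read, write, connect, and download timeouts, which urllib does not distinguish.

```python
import urllib.request
from dataclasses import dataclass, field


@dataclass
class FetcherSettings:
    """Hypothetical container mirroring the Fetcher tab fields."""
    user_agent: str = "ExampleCrawler/1.0"          # User agent
    from_email: str = "crawler-admin@example.com"   # From
    proxy_address: str = ""                         # Proxy address (empty means no proxy)
    timeout_s: float = 30.0                         # stand-in for the timeout fields
    extra_headers: dict = field(default_factory=dict)  # header rules: Name -> Value


def build_request(url: str, settings: FetcherSettings) -> urllib.request.Request:
    """Build a request carrying the configured headers. Nothing is sent here."""
    headers = {"User-Agent": settings.user_agent, "From": settings.from_email}
    headers.update(settings.extra_headers)
    return urllib.request.Request(url, headers=headers)


def build_opener(settings: FetcherSettings) -> urllib.request.OpenerDirector:
    """Create an opener that routes traffic through the proxy, if one is set."""
    handlers = []
    if settings.proxy_address:
        proxies = {"http": settings.proxy_address, "https": settings.proxy_address}
        handlers.append(urllib.request.ProxyHandler(proxies))
    return urllib.request.build_opener(*handlers)


settings = FetcherSettings(extra_headers={"Accept-Language": "en"})
request = build_request("http://example.com/", settings)
opener = build_opener(settings)

# urllib normalizes header names with str.capitalize(), e.g. "User-agent".
for name, value in request.header_items():
    print(f"{name}: {value}")
```

To actually fetch the page with these settings, one would call opener.open(request, timeout=settings.timeout_s).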
Crawl Secured Sites
This subsection describes the parameters that you can configure to crawl secured sites, from Fetcher > Rules Configuration.
Some configuration parameters can be rule-based. Each configuration whose rules match the URL to be fetched is applied.
Ensure that there are no conflicts in the definition of your Fetcher configuration rules. For example, do not define two different authentication schemes with overlapping rules. Each rule-based configuration depends on the type of rule.
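One way to check for such conflicts before entering the rules is to compare their URL prefixes. The sketch below assumes that rules are plain URL prefixes, as in the examples of this section; the AuthRule type and the function names are hypothetical helpers, not part of the connector.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class AuthRule:
    """Hypothetical description of one authentication rule."""
    name: str
    auth_type: str       # "Basic", "Digest", "Ntlm" or "Form"
    url_prefixes: tuple  # site prefixes the rule applies to


def prefixes_overlap(a: str, b: str) -> bool:
    """Two prefixes overlap when one of them is a prefix of the other."""
    return a.startswith(b) or b.startswith(a)


def find_conflicts(rules):
    """Return pairs of rule names that use different auth types on overlapping prefixes."""
    conflicts = []
    for r1, r2 in combinations(rules, 2):
        if r1.auth_type == r2.auth_type:
            continue  # this sketch only flags different authentication schemes
        if any(prefixes_overlap(p1, p2)
               for p1 in r1.url_prefixes
               for p2 in r2.url_prefixes):
            conflicts.append((r1.name, r2.name))
    return conflicts


rules = [
    AuthRule("basic-area", "Basic", ("http://example.com/private/basic",)),
    AuthRule("whole-site", "Form", ("http://example.com/",)),
    AuthRule("other-site", "Digest", ("http://other.example.com/",)),
]
for first, second in find_conflicts(rules):
    print(f"Conflict between rules {first!r} and {second!r}")
```

Here the Basic rule and the Form rule would be reported, because every URL under http://example.com/private/basic also starts with http://example.com/.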
Choose an Authentication Type
From the Auth type select box, you can choose one of the following authentication types.
Auth type: Requires
Basic:
◦ Valid user name
◦ Clear text password
Digest:
◦ Valid user name
◦ Clear text password
◦ Realm
Ntlm:
◦ Valid user name
◦ Clear text password
◦ NTLM (Microsoft Windows NT LAN Manager) domain and host to authenticate users to Microsoft Windows 2000 domains.
Form:
◦ A login through an HTML authentication form with cookie handling.
Configure Authentication Parameters
Once an authentication type is selected, configure its related parameters. Below are example procedures for the Basic and Form authentication types.
Configure Basic Authentication
1. Click Add rule.
◦ Enter a rule name, for example user.
◦ For Type, select Authentication.
◦ Click Accept.
2. From Auth type, select Basic.
3. In Username, enter a valid user name for the authentication type.
4. In Password, enter a valid password for the user name.
5. In URL patterns, enter the site prefix to be authenticated. For example, http://example.com/private/basic
6. Click Add pattern to enter additional site prefixes using the current authentication configuration.
7. Click Apply.
The Digest and Ntlm authentication types are configured in a similar way. For Digest, also define the authentication realm. For Ntlm, also define the domain name and the host name.
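For reference, Basic authentication sends the user name and password base64-encoded (not encrypted) in an Authorization header, as defined in RFC 7617. The sketch below shows that encoding together with a simple prefix test for the URL pattern. The function names and sample credentials are illustrative assumptions, and the connector's own matching logic may differ.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Return the Authorization header value for Basic authentication (RFC 7617)."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


def needs_auth(url: str, url_patterns) -> bool:
    """Assumed rule: a URL is covered when it starts with one of the site prefixes."""
    return any(url.startswith(prefix) for prefix in url_patterns)


patterns = ["http://example.com/private/basic"]
url = "http://example.com/private/basic/report.html"

headers = {}
if needs_auth(url, patterns):
    # Sample credentials only; base64 is trivially reversible.
    headers["Authorization"] = basic_auth_header("user", "secret")
print(headers)
```

Because the encoding is reversible, Basic authentication is usually only reasonable over HTTPS.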
Configure Form Authentication
1. Click Add rule.
◦ Enter a rule name, for example user.
◦ For Type, select Authentication.
◦ Click Accept.
2. From Auth type, select Form.
3. In Login page URL, enter the URL of the page on which to find the authentication form.
4. If there are several HTML forms on the login page, identify the form to submit by entering its id, class, or name in Form Id, Form Class, or Form Name. These parameters are not required if the page contains a single form.
5. From the Form fields pane, configure the fields required to authenticate on the authentication form, using the Name and Value fields. Specify the input name of the field, not its label.
Note: The best way to find input names is to open the source HTML code of the page and search for input nodes with attributes type="text" and type="password", and copy the values of their name="<input name>" attributes.
Click Add form field for additional fields or Add encrypted form field if required.
For example, you can add:
◦ username – John
◦ password – ******
◦ lang – EN
Note: These fields are case-sensitive.
6. The Crawler Connector must know whether a fetched page is successfully authenticated or not. This is expressed with a condition based on Authentication success or Authentication failure. Pages that fail the test trigger a new authentication and are fetched again.
Select the Success/Failure condition based on authentication success or failure:
◦ if a redirection occurs
◦ if status is <value>
◦ if header with name <name> is present or equals <value>
◦ if body contains <value>
Alternatively, click Edit condition as xml (experts only) to enter a complex condition.
Examples:
◦ Set the condition to authentication failure if the redirection URL matches "/login.php", that is, if a failed authentication redirects you to the login page.
◦ For a private forum, you can set the condition to: authentication failure if body contains "Please login" OR authentication success if body contains "Welcome John".
7. In URL patterns, enter the site prefixes to be authenticated. For example, http://www.example.com/.
Note: This might seem redundant with the URL specified in the Configuration tab, but the fetcher needs to know which URLs of the site require authentication.
8. Click Add pattern for additional site prefixes.
9. Click Apply.
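The note in step 5 suggests reading the login page source to find the input names. A minimal sketch of that lookup, using Python's standard html.parser on a made-up login page, is shown below. The HTML sample, the InputNameCollector class, and the field names are assumptions for illustration only.

```python
from html.parser import HTMLParser


class InputNameCollector(HTMLParser):
    """Collect (form id, input name, input type) for every named <input> tag."""

    def __init__(self):
        super().__init__()
        self.current_form = None
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # Remember which form we are in, identified by id or name.
            self.current_form = attrs.get("id") or attrs.get("name") or ""
        elif tag == "input" and attrs.get("name"):
            input_type = attrs.get("type", "text")
            self.inputs.append((self.current_form, attrs["name"], input_type))

    def handle_endtag(self, tag):
        if tag == "form":
            self.current_form = None


# Made-up page with two forms, to show why a form id may be needed.
SAMPLE_LOGIN_PAGE = """
<form id="search"><input type="text" name="q"></form>
<form id="login" action="/login.php" method="post">
  <label>User</label> <input type="text" name="username">
  <label>Password</label> <input type="password" name="password">
  <input type="hidden" name="lang" value="EN">
  <input type="submit" value="Log in">
</form>
"""

collector = InputNameCollector()
collector.feed(SAMPLE_LOGIN_PAGE)
for form_id, name, input_type in collector.inputs:
    print(f"form={form_id} name={name} type={input_type}")
```

In this sample, the values to enter in the Form fields pane would be the name attributes (username, password, lang), not the visible labels (User, Password).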
About Success/Failure Conditions
The Crawler connector must know whether a fetched page is successfully authenticated or not. This is expressed with a condition based on authentication success or authentication failure. Pages that fail the test trigger a new authentication and are then fetched again.
Condition: Description
redirection:
◦ Redirections include HTTP status codes 301, 302, 303, and 307.
◦ The connector can also test for redirections to a URL containing a specific string.
status:
◦ Checks the HTTP status code for a specific value.
◦ For example, you can specify the 200 OK status to indicate that pages are considered as authenticated.
header with name:
◦ Checks for a specific header in the HTTP response.
◦ is present: checks whether the header is present.
◦ contains: checks whether the header is present with a given value.
body contains:
◦ Looks for a specific string in the body of the HTTP response.
◦ Applies only to non-empty documents with text/* content types.
◦ The string can be one that appears only on non-authenticated pages (for example, a login form or prompt) or only on authenticated pages (for example, "welcome $username" or a logout form).
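The four condition types can be pictured as simple predicates over an HTTP response. The sketch below is an illustration only: the Response type and the function names are hypothetical, header lookups are simplified, and it does not reproduce the connector's XML condition syntax.

```python
from dataclasses import dataclass, field
from typing import Optional

REDIRECT_CODES = {301, 302, 303, 307}


@dataclass
class Response:
    """Hypothetical, simplified view of a fetched HTTP response."""
    status: int
    headers: dict = field(default_factory=dict)
    body: str = ""


def is_redirection(resp: Response, url_contains: Optional[str] = None) -> bool:
    """True for a redirect status, optionally only if the target URL contains a string."""
    if resp.status not in REDIRECT_CODES:
        return False
    if url_contains is None:
        return True
    return url_contains in resp.headers.get("Location", "")


def header_matches(resp: Response, name: str, value: Optional[str] = None) -> bool:
    """True if the header is present and, when a value is given, contains it."""
    for key, val in resp.headers.items():
        if key.lower() == name.lower():  # HTTP header names are case-insensitive
            return value is None or value in val
    return False


def body_contains(resp: Response, text: str) -> bool:
    """Only non-empty text/* documents are inspected."""
    content_type = resp.headers.get("Content-Type", "")
    return bool(resp.body) and content_type.startswith("text/") and text in resp.body


def authentication_failed(resp: Response) -> bool:
    """Sample failure condition: redirect to /login.php or a login prompt in the body."""
    return is_redirection(resp, "/login.php") or body_contains(resp, "Please login")


redirected = Response(302, {"Location": "http://www.example.com/login.php"})
welcome = Response(200, {"Content-Type": "text/html"}, "<p>Welcome John</p>")

for label, resp in (("redirected", redirected), ("welcome", welcome)):
    print(label, "-> re-authenticate" if authentication_failed(resp) else "-> keep page")
```

A page for which the failure condition holds would trigger a new authentication and be fetched again, as described above.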