The Search Crawler scrapes the content of another website and indexes it as a custom collection in the platform's search. The standard Search template can then display results from, and links to, the third-party site. Multiple sites can be crawled, with separate search collections and configurations for each.
Search Results
In this example the site search shows results from the current site and from an "archive" site that is not part of the iCM installation. Following an "archive" result takes the user to that website. Links open in the same tab/window as the current site.
The Crawling Process
The crawl is started by calling the performcrawl.cfm template, passing the following attributes:
- CrawlerFolder - Full path to the crawler folder
- IndexNames - Comma separated list of index names that correspond to index configuration files
There is an optional DebugMode attribute that can be set to blank, "debug" or "debugcontent". Debug mode records details of each URL inspected, showing whether it is allowed or blocked by the index configuration. Debugcontent records the same URL details but also the content extracted from each page. This can be useful in determining the best ContentExtractionMode for a site (see the sketch below).
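For example, while tuning a new index you might run the crawl with full content debugging. This is a minimal sketch, assuming the crawler is installed in the iCM custom folder as described later in this section; the index name "archive" is a placeholder:

```cfml
<!--- Hypothetical debug run: "archive" is a placeholder index name --->
<cfset CrawlerPath = "#Application.CustomFilePath#crawler">
<cfmodule template="./crawler/performcrawl.cfm"
	CrawlerFolder="#CrawlerPath#/"
	IndexNames="archive"
	DebugMode="debugcontent">
```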
Index Fields
The following search index fields are populated by the crawling process:
Field | Description |
---|---|
keyid | URL of the indexed page |
url | URL of the indexed page |
title | HTML title of the indexed page |
body | Extracted body text |
summary | A summary generated from the extracted text |
custom1 | Crawler4j docid value. This is a unique identifier for this page within the crawl |
custom2 | Crawl depth |
parentdata | Crawler4j docid value of the page that linked to this page |
creationdate | Date/time of the crawl |
modificationdate | Date/time of the crawl |
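These fields are stored in the platform's SOLR index alongside the standard collections. As a rough illustration of what ends up in the index (not the supported way to query it, which is via the Search template described below), the sketch below fetches the url, title and summary fields directly from SOLR; the host, port and collection name are assumptions for illustration only:

```cfml
<!--- Hypothetical direct SOLR query; host, port and collection name
      ("archive") are placeholders for illustration only --->
<cfhttp url="http://localhost:8983/solr/archive/select" method="get" result="solrResult">
	<cfhttpparam type="url" name="q" value="body:newsletter">
	<cfhttpparam type="url" name="fl" value="url,title,summary">
	<cfhttpparam type="url" name="wt" value="json">
</cfhttp>
<cfset response = deserializeJSON(solrResult.fileContent)>
<cfoutput>
	<cfloop array="#response.response.docs#" index="doc">
		<a href="#doc.url#">#doc.title#</a><br>
	</cfloop>
</cfoutput>
```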
Installation
The Search Crawler is packaged in a single zip file. This should be extracted into the iCM custom folder, creating a crawler folder containing the following subdirectories:
- conf - contains the index configuration files in the form <name>.props.cfm
- data - used by the crawler during the crawling process; a subdirectory is created within here for each index
- log - log files are written to this folder
The zip file also contains an example template that starts a crawl. In the example below, the search collection name is passed in as an attribute and used as the name of the index to crawl:
<!--- The collection name is passed in as an attribute --->
<cfparam name="ATTRIBUTES.SearchCollectionName">
<cfoutput>Indexing #ATTRIBUTES.SearchCollectionName# by crawling<br></cfoutput>
<cftry>
	<!--- The crawler is installed in the iCM custom folder --->
	<cfset CrawlerPath = "#Application.CustomFilePath#crawler">
	<!--- Start the crawl; the index name matches the collection name --->
	<cfmodule template="./crawler/performcrawl.cfm" CrawlerFolder="#CrawlerPath#/" IndexNames="#ATTRIBUTES.SearchCollectionName#" DebugMode="debug">
	<cfoutput><br>Crawler started and running in background for index: [#ATTRIBUTES.SearchCollectionName#]<br></cfoutput>
	<cfcatch type="any">
		<cfoutput><strong><br>Crawler failed to start: [#cfcatch.message#]</strong><br></cfoutput>
	</cfcatch>
</cftry>
Configuration
Each index is configured independently in a <name>.props.cfm file in the crawler's conf folder.
The installation zip includes a fully annotated example file. Copy this file and rename it for your index, for example archive.props.cfm for the "archive" site used in this section.
Properties you'll need to modify for each index are described below. There are also many optional properties in the config file that set things like crawl depth, maximum number of pages, whether http/https pages are crawled, and timeouts.
Property | Description |
---|---|
Description | A description of this configuration |
SolrKeyType | The value written to SOLR. This should match the key type set up in your custom search collection (in the example in this section, "archive") |
SitesToSearch | A comma separated list of URLs to crawl. These URLs will also need to be added to the AllowedPrefixes property |
FileTypeFilter | A regular expression to block certain extensions being indexed |
QueryStringBlockFilter | A regular expression to block certain query strings being indexed |
AllowedPrefixes | A comma separated list of allowed URL prefixes. This must include the URLs in SitesToSearch |
ContentExtractionMode | One of Raw, ArticleExtractor, ArticleSentencesExtractor, DefaultExtractor, KeepEverythingExtractor, LargestContentExtractor, NumWordsRulesExtractor. See https://boilerpipe-web.appspot.com/ for descriptions. ArticleExtractor is normally the best option and avoids indexing headers, footers and navigation |
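As an illustration, a configuration for the "archive" site used in this section might look like the sketch below. This assumes a simple key=value properties format matching the annotated example file in the zip; all of the values (the URLs, the regular expressions and the key type) are placeholders to be replaced with your own:

```properties
# Hypothetical archive.props.cfm - all values are placeholders
Description=Crawl of the archive site
SolrKeyType=archive
SitesToSearch=https://archive.example.com/
AllowedPrefixes=https://archive.example.com/
FileTypeFilter=.*(\.(css|js|gif|jpe?g|png|ico|zip|gz))$
QueryStringBlockFilter=.*(\?|&)(sessionid|print)=.*
ContentExtractionMode=ArticleExtractor
```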
Search Collection
Once the crawler has been installed and configured you need to create a custom search collection in iCM. See Search Collections for the full documentation.
Using the configuration example above, the search collection is set up with a matching key type. The key type should match the name of the config file and the SolrKeyType property set within it.
Run the SearchIndex scheduled task to start the indexing.
The Search Template
Once the content has been indexed you need to tell the Search template about it.
In the standard Search template article extras you can choose whether article or media results are returned.
An option for your custom collection needs to be added. To do this, edit the SEARCHX form (or whatever extras form your Search template uses) and add an option to the COLLECTIONS field:
The "Value" must match your search collection key type.
The new collection will then appear as an option in the article extras. Select it for results from the crawled site to come back in the search results.