
Search Crawler

The Search Crawler scrapes the content of another website and indexes it as a custom collection in the platform's search. The standard Search template can then display results from, and links to, the third-party site. Multiple sites can be crawled, with separate search collections and configurations for each.

Search Results

In this example the site search is showing results from the current site and an "archive" site that is not part of the iCM installation. Following a link to the "archive" site takes the user to that website. Links open in the same tab/window as the current site.

[Image: Search Results]

The Crawling Process

The searchindex.cfm file calls a crawler-specific module called performcrawl.cfm, located within the icm/custom/crawler folder. It takes two mandatory attributes:

  • CrawlerFolder - Full path to the crawler folder
  • IndexNames - Comma separated list of index names that correspond to index configuration files

There is an optional DebugMode attribute that can be set to blank, "debug" or "debugcontent". Debug mode records details of each URL inspected, showing whether it is allowed or blocked by the index configuration. Debugcontent records the same URL details and also the content extracted from each page, which is useful when determining the best ContentExtractionMode for a particular site.
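For example, a minimal sketch of a call with full content debugging (the "archive" index name is illustrative; a complete searchindex.cfm example appears under Installation below):

<cfset CrawlerPath = "#Application.CustomFilePath#crawler">
<!--- Start a background crawl of the "archive" index, recording the content extracted from each page --->
<cfmodule template="./crawler/performcrawl.cfm" CrawlerFolder="#CrawlerPath#/" IndexNames="archive" DebugMode="debugcontent">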

Performcrawl.cfm creates a new Java process to perform the actual crawl. It does not wait for this process to exit; it waits a couple of seconds to check that the process has actually started and then returns, so the crawl runs in the background. A crawl can take considerable time to complete because the crawler needs to "be polite" in terms of how frequently each site is accessed.

Index Fields

The following search index fields are populated by the crawling process:

  • keyid - URL of the indexed page
  • url - URL of the indexed page
  • title - HTML title of the indexed page
  • body - Extracted body text
  • summary - A summary generated from the extracted text
  • custom1 - Crawler4j docid value. This is a unique identifier for this page within the crawl
  • custom2 - Crawl depth
  • parentdata - Crawler4j docid value of the page that linked to this page
  • creationdate - Date/time of the crawl
  • modificationdate - Date/time of the crawl
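As a hypothetical illustration, a single crawled page might be indexed with values like these (the URL, docids and dates are invented):

keyid            https://archive.example.org/news/2019/story.htm
url              https://archive.example.org/news/2019/story.htm
title            Archive News Story
body             Full text extracted from the page by the chosen ContentExtractionMode
summary          Short summary generated from the extracted text
custom1          42
custom2          3
parentdata       17
creationdate     2023-02-17 04:00:12
modificationdate 2023-02-17 04:00:12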

Installation

The Search Crawler is packaged in a single zip file. This should be extracted into the iCM custom folder as in the image below.

[Image: Crawler Directory]

  • conf - Contains the index configuration files in the form <name>.props.cfm
  • data - Used by the crawler during the crawl process; a subdirectory is created here for each index
  • log - Log files are written to this folder

The zip file also contains an example searchindex.cfm file. This should be merged with the existing file in the custom folder.

In the example described below, the searchindex.cfm looks like this:

<!--- The collection name is passed in by the SearchIndex scheduled task --->
<cfparam name="ATTRIBUTES.SearchCollectionName">
<cfoutput>Indexing #ATTRIBUTES.SearchCollectionName# by crawling<br></cfoutput>
<cftry>
    <cfset CrawlerPath = "#Application.CustomFilePath#crawler">
    <!--- Starts the background Java crawl process and returns once it is running --->
    <cfmodule template="./crawler/performcrawl.cfm" CrawlerFolder="#CrawlerPath#/" IndexNames="#ATTRIBUTES.SearchCollectionName#" DebugMode="debug">
    <cfoutput><br>Crawler started and running in background for index: [#ATTRIBUTES.SearchCollectionName#]<br></cfoutput>
    <cfcatch type="any">
        <cfoutput><strong><br>Crawler failed to start: [#cfcatch.message#]</strong><br></cfoutput>
    </cfcatch>
</cftry>

Configuration

Each index is configured independently in a <name>.props.cfm file in the conf directory. <name> should match the key type set up in your custom search collection and the SolrKeyType set in the file itself.

The installation zip includes a fully annotated example file. Copy this file and rename it, for example:

[Image: Config File]

Properties you'll need to modify for each index are described below. The config file also contains many optional properties that control things like crawl depth, maximum number of pages, whether http and https pages are crawled, and timeouts.

  • Description - A description of this configuration, for example API Docs Crawler
  • SolrKeyType - The value written to SOLR. This should match the key type set up in your custom search collection, for example icmapidocs
  • SitesToSearch - A comma separated list of URLs to crawl. These URLs will also need to be added to the AllowedPrefixes property, for example https://docs.gossinteractive.com/staticfiles/javaapidocs/10060/index.html
  • FileTypeFilter - A regular expression to block certain extensions being indexed
  • QueryStringBlockFilter - A regular expression to block certain query strings being indexed
  • AllowedPrefixes - A comma separated list of allowed prefixes, for example https://docs.gossinteractive.com/staticfiles/
  • ContentExtractionMode - One of Raw, ArticleExtractor, ArticleSentencesExtractor, DefaultExtractor, KeepEverythingExtractor, LargestContentExtractor, NumWordsRulesExtractor. See https://boilerpipe-web.appspot.com/ for descriptions. ArticleExtractor is normally the best option and avoids indexing headers, footers and navigation
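Putting those together, a hypothetical icmapidocs.props.cfm might contain entries like the following. The property names and the SolrKeyType, SitesToSearch and AllowedPrefixes values come from the table above; the two filter regexes are illustrative assumptions, and the simple key=value layout is also an assumption, so check the annotated example file in the zip for the exact syntax:

Description=API Docs Crawler
SolrKeyType=icmapidocs
SitesToSearch=https://docs.gossinteractive.com/staticfiles/javaapidocs/10060/index.html
AllowedPrefixes=https://docs.gossinteractive.com/staticfiles/
FileTypeFilter=.*\.(css|js|gif|jpe?g|png|ico|zip)$
QueryStringBlockFilter=.*[?&](session|sort)=.*
ContentExtractionMode=ArticleExtractor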

Search Collection

Once the crawler has been installed and configured you need to create a custom search collection in iCM. See Search Collections for the full documentation.

Using the configuration example above, the search collection looks like this:

[Image: Search Collection]

The key type should match the name of the config file and the SolrKeyType within the config.

Run the SearchIndex scheduled task to start the indexing.

The Search Template

Once the content has been indexed you need to tell the Search template about it.

In the standard Search template article extras you can choose whether article or media results are returned:

[Image: Article and Media Extras]

An option for your custom collection needs to be added. To do this, edit the SEARCHX form (or whatever extras form your Search template uses) and add an option to the COLLECTIONS field:

[Image: Search Options]

The "Value" must match your search collection key type.

This will give you article extras that look like this:

[Image: Custom Extras]

Select the new collection to include its results in the search.

