
Search Crawler

The Search Crawler scrapes the content of another website and indexes it as a custom collection in the platform's search. The standard Search template can then display results from, and links to, the third-party site. Multiple sites can be crawled, with separate search collections and configurations for each.

Search Results

In this example the site search is showing results from the current site and an "archive" site that is not part of the iCM installation. Following a link to the "archive" site takes the user to that website. Links open in the same tab/window as the current site.

[Image: Search Results]

The Crawling Process

The searchindex.cfm file calls a crawler-specific module called performcrawl.cfm, located within the icm/custom/crawler folder. It takes two mandatory attributes:

  • CrawlerFolder - Full path to the crawler folder
  • IndexNames - Comma separated list of index names that correspond to index configuration files

There is an optional DebugMode attribute that can be set to blank, "debug" or "debugcontent". Debug mode records details of each URL inspected, showing whether it is allowed or blocked by the index configuration. Debugcontent records the same URL details and also the content extracted from each page, which is useful when determining the best ContentExtractionMode for a particular site.
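For example, a minimal sketch of a call with full content debugging (the "archive" index name is illustrative; a complete searchindex.cfm example appears under Installation below):

<cfset CrawlerPath = "#Application.CustomFilePath#crawler">
<!--- Start a background crawl of the "archive" index, recording the content extracted from each page --->
<cfmodule template="./crawler/performcrawl.cfm" CrawlerFolder="#CrawlerPath#/" IndexNames="archive" DebugMode="debugcontent">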

Performcrawl.cfm creates a new Java process to perform the actual crawl. It does not wait for this process to exit; it waits a couple of seconds to check that the process has actually started and then returns, so the crawl runs in the background. A crawl can take considerable time to complete because the crawler needs to "be polite" in terms of how frequently each site is accessed.

Index Fields

The following search index fields are populated by the crawling process:

  • keyid - URL of the indexed page
  • url - URL of the indexed page
  • title - HTML title of the indexed page
  • body - Extracted body text
  • summary - A summary generated from the extracted text
  • custom1 - Crawler4j docid value. This is a unique identifier for this page within the crawl
  • custom2 - Crawl depth
  • parentdata - Crawler4j docid value of the page that linked to this page
  • creationdate - Date/time of the crawl
  • modificationdate - Date/time of the crawl
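As a hypothetical illustration, a single crawled page might be indexed with values like these (the URL, docids and dates are invented):

keyid            https://archive.example.org/news/2019/story.htm
url              https://archive.example.org/news/2019/story.htm
title            Archive News Story
body             Full text extracted from the page by the chosen ContentExtractionMode
summary          Short summary generated from the extracted text
custom1          42
custom2          3
parentdata       17
creationdate     2023-02-17 04:00:12
modificationdate 2023-02-17 04:00:12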

Installation

The Search Crawler is packaged in a single zip file. This should be extracted into the iCM custom folder as in the image below.

[Image: Crawler Directory]

  • conf - Contains the index configuration files in the form <name>.props.cfm
  • data - Used by the crawler during the crawl process; a subdirectory is created here for each index
  • log - Log files are written to this folder

The zip file also contains an example searchindex.cfm file. This should be merged with the existing file in the custom folder.

In the example described below, the searchindex.cfm looks like this:

<!--- The collection name is passed in by the SearchIndex scheduled task --->
<cfparam name="ATTRIBUTES.SearchCollectionName">
<cfoutput>Indexing #ATTRIBUTES.SearchCollectionName# by crawling<br></cfoutput>
<cftry>
    <cfset CrawlerPath = "#Application.CustomFilePath#crawler">
    <!--- Starts the background Java crawl process and returns once it is running --->
    <cfmodule template="./crawler/performcrawl.cfm" CrawlerFolder="#CrawlerPath#/" IndexNames="#ATTRIBUTES.SearchCollectionName#" DebugMode="debug">
    <cfoutput><br>Crawler started and running in background for index: [#ATTRIBUTES.SearchCollectionName#]<br></cfoutput>
    <cfcatch type="any">
        <cfoutput><strong><br>Crawler failed to start: [#cfcatch.message#]</strong><br></cfoutput>
    </cfcatch>
</cftry>

Configuration

Each index is configured independently in a <name>.props.cfm file in the conf directory. <name> should match the key type set up in your custom search collection and the SolrKeyType set in the file itself.

The installation zip includes a fully annotated example file. Copy this file and rename it, for example:

[Image: Config File]

Properties you'll need to modify for each index are described below. The config file also contains many optional properties that control things like crawl depth, maximum number of pages, whether http and https pages are crawled, and timeouts.

  • Description - A description of this configuration, for example API Docs Crawler
  • SolrKeyType - The value written to SOLR. This should match the key type set up in your custom search collection, for example icmapidocs
  • SitesToSearch - A comma separated list of URLs to crawl. These URLs will also need to be added to the AllowedPrefixes property, for example https://docs.gossinteractive.com/staticfiles/javaapidocs/10060/index.html
  • FileTypeFilter - A regular expression to block certain extensions being indexed
  • QueryStringBlockFilter - A regular expression to block certain query strings being indexed
  • AllowedPrefixes - A comma separated list of allowed prefixes, for example https://docs.gossinteractive.com/staticfiles/
  • ContentExtractionMode - One of Raw, ArticleExtractor, ArticleSentencesExtractor, DefaultExtractor, KeepEverythingExtractor, LargestContentExtractor, NumWordsRulesExtractor. See https://boilerpipe-web.appspot.com/ for descriptions. ArticleExtractor is normally the best option and avoids indexing headers, footers and navigation
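Putting those together, a hypothetical icmapidocs.props.cfm might contain entries like the following. The property names and the SolrKeyType, SitesToSearch and AllowedPrefixes values come from the table above; the two filter regexes are illustrative assumptions, and the simple key=value layout is also an assumption, so check the annotated example file in the zip for the exact syntax:

Description=API Docs Crawler
SolrKeyType=icmapidocs
SitesToSearch=https://docs.gossinteractive.com/staticfiles/javaapidocs/10060/index.html
AllowedPrefixes=https://docs.gossinteractive.com/staticfiles/
FileTypeFilter=.*\.(css|js|gif|jpe?g|png|ico|zip)$
QueryStringBlockFilter=.*[?&](session|sort)=.*
ContentExtractionMode=ArticleExtractor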

Search Collection

Once the crawler has been installed and configured you need to create a custom search collection in iCM. See Search Collections for the full documentation.

Using the configuration example above, the search collection looks like this:

[Image: Search Collection]

The key type should match the name of the config file and the SolrKeyType within the config.

Run the SearchIndex scheduled task to start the indexing.

The Search Template

Once the content has been indexed you need to tell the Search template about it.

In the standard Search template article extras you can choose whether article or media results are returned:

[Image: Article and Media Extras]

An option for your custom collection needs to be added. To do this, edit the SEARCHX form (or whatever extras form your Search template uses) and add an option to the COLLECTIONS field:

[Image: Search Options]

The "Value" must match your search collection key type.

This will give you article extras that look like this:

[Image: Custom Extras]

Select the new collection to include its results in the search.

