This article looks at one way very large datasets can be processed in batches, with the state of the job stored by the Site Session Store worker, which can then be queried for progress updates.
The API Server supports asynchronous requests, with the result of a request being passed in a callback to another worker. This allows us to create a looping process where one end point calls another, then passes the result back to itself, where a decision can be made whether or not to make the call again. Each pass of the loop processes the next set of data. When all of the data has been processed the loop ends and your end point "does something" with the result.
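The loop can be sketched in plain JavaScript, outside the API Server. In the sketch below callAsync, drainQueue, target and runJob are hypothetical stand-ins (a simple queue takes the place of the real asynchronous request handling), so treat it as an illustration of the pattern rather than framework code.

```javascript
// Hypothetical stand-in for the asynchronous machinery: calls are queued,
// then run one at a time, each handing its result to a callback.
var queue = [];

function callAsync(method, params, callback, additionalParams) {
    queue.push(function () {
        var result = method(params);
        // The callback receives the additional parameters plus the result.
        callback(Object.assign({}, additionalParams, { response: { result: result } }));
    });
}

function drainQueue() {
    while (queue.length > 0) {
        queue.shift()();
    }
}

// Hypothetical target: doubles `batchSize` numbers per call.
function target(params) {
    var previous = params.previousResult;
    var remaining = previous ? previous.remaining : params.endPointParams.items.slice();
    var processed = previous ? previous.processed : [];
    remaining.splice(0, params.endPointParams.batchSize).forEach(function (item) {
        processed.push(item * 2);
    });
    if (remaining.length === 0) {
        return { complete: true, finalResult: { processed: processed } };
    }
    return { complete: false, remaining: remaining, processed: processed };
}

function runJob(endPointParams) {
    var outcome = { passes: 0, finalResult: null };
    var additionalParams = { endPointParams: endPointParams };

    function continueCB(params) {
        var result = params.response.result;
        outcome.passes += 1;
        if (result.complete) {
            outcome.finalResult = result.finalResult; // the loop ends here
            return;
        }
        // Not finished: call the target again, with continueCB as the callback.
        callAsync(target, {
            endPointParams: params.endPointParams,
            previousResult: result
        }, continueCB, additionalParams);
    }

    callAsync(target, { endPointParams: endPointParams }, continueCB, additionalParams);
    drainQueue();
    return outcome;
}

var outcome = runJob({ items: [1, 2, 3, 4, 5], batchSize: 2 });
console.log(outcome.passes, JSON.stringify(outcome.finalResult));
```

Five items in batches of two take three passes of the loop. The callback decides on each pass whether to go round again.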
The framework can be downloaded from this page and used as is. The zip file includes several example "target" end points, which do the actual work of processing the data and can be copied and modified as needed.
Quick Start
Download the end point group from the downloads area and import it. You'll find the end points in a group called formstraining.ajaxpolling.
The start end point has several tests you can run. The Article Query test is looked at in more detail below.
To add your own end point into the framework, its name needs to be passed to the start end point as the
The example framework includes a state storage mechanism which uses the Site Session Store worker. This standard worker is not enabled by default in all sites. You may need to add it to the config file of your API Server as follows:
{
    "name": "sitesessionstore",
    "instances": 1
}
Overview
The core of the framework uses three end points: start, continueCB, and your own target end point. The end points that interact with the Session Store worker are optional. You can remove them, along with the references to them in the other end points and their schema.
End Points
Start
The start end point first creates a session which is used to record the progress of the job. It then calls your target end point asynchronously, which will actually do the work that's needed.
resp = this.callWorkerMethod("serverlibrary", params.endPointName, {
    "sessionData": sessionData,
    "endPointParams": endPointParams,
    "_async": {
        "asyncCallerId": params.id,
        "callback": {
            "methodName": "formstraining.ajaxpolling.continueCB",
            "workerName": "serverlibrary",
            "additionalParams": {
                "endPointName": params.endPointName,
                "id": params.id,
                "endPointParams": endPointParams,
                "sessionData": sessionData
            }
        }
    }
});
The result of the call to your target end point is passed to continueCB, along with the parameters of the original request in the
Parameters
The start end point requires the following parameters.
{
    "endPointName": "formstraining.examples.articleQuery",
    "endPointParams": {
        "fromEmail": "test@testsite.com",
        "batchSize": 10,
        "toEmail": "me@testsite.com"
    },
    "id": "1"
}
- endPointName: the name of your target end point that will do the work in batches
- endPointParams: the parameters object included with the initial request to the target end point
- id: recorded throughout the processing
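A request can be checked against this shape before it is sent. The helper below is a hypothetical sketch, not part of the framework. It assumes all three parameters are required and that id is a string, as in the example above. The framework's own schema may differ.

```javascript
// Hypothetical helper: lists the problems found in a start request.
// Assumes endPointName, endPointParams and id are all required.
function validateStartParams(params) {
    var problems = [];
    if (typeof params.endPointName !== "string" || params.endPointName === "") {
        problems.push("endPointName must be a non-empty string");
    }
    if (typeof params.endPointParams !== "object" ||
            params.endPointParams === null ||
            Array.isArray(params.endPointParams)) {
        problems.push("endPointParams must be an object");
    }
    if (typeof params.id !== "string" || params.id === "") {
        problems.push("id must be a non-empty string");
    }
    return problems;
}

var goodRequest = {
    "endPointName": "formstraining.examples.articleQuery",
    "endPointParams": { "batchSize": 10 },
    "id": "1"
};
var badRequest = { "endPointParams": [] };

console.log(validateStartParams(goodRequest).length); // no problems found
console.log(validateStartParams(badRequest));         // three problems listed
```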
Target
The target end point actually performs the job you need doing (database queries, requests to other services, sending spam etc). It is also responsible for batching up the data to process. It is called in two different situations:
- An initial request is made by the start end point. This request must include all of the parameters your end point needs in the endPointParams object.
- Subsequent requests are made by the continueCB end point. This end point maintains the endPointParams of the initial request and a previousResult object which holds the result of the initial and subsequent calls to the target end point.
The response from your target end point must return an object. This object should include:
- A "complete" boolean to indicate whether or not processing is complete
- An optional "progress" : { "progressPercent" : 50 } property
- An optional finalResult property (which will be stored in the session data)
- Parameters needed for the next pass of the loop
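A target that follows this contract might look like the sketch below. It is a hypothetical example (sumInBatches, numbers, position and total are made-up names) written as a plain function rather than an API Server end point. It assumes, as described above, that previousResult is absent on the initial request and holds the last response on later ones.

```javascript
// Hypothetical target: sums a list of numbers, batchSize at a time.
function sumInBatches(params) {
    var previous = params.previousResult;
    var numbers = params.endPointParams.numbers;

    // Initial request: start from the beginning.
    // Subsequent requests: pick up where the previous pass stopped.
    var position = previous ? previous.position : 0;
    var total = previous ? previous.total : 0;

    var end = Math.min(position + params.endPointParams.batchSize, numbers.length);
    for (var i = position; i < end; i++) {
        total += numbers[i];
    }

    if (end >= numbers.length) {
        return { "complete": true, "finalResult": { "total": total } };
    }
    return {
        "complete": false,
        "progress": { "progressPercent": (end * 100) / numbers.length },
        // Parameters needed for the next pass of the loop
        "position": end,
        "total": total
    };
}

// Drive the target by hand, feeding each response back in as previousResult.
var endPointParams = { "numbers": [1, 2, 3, 4, 5], "batchSize": 2 };
var responses = [];
var response = sumInBatches({ "endPointParams": endPointParams });
responses.push(response);
while (!response.complete) {
    response = sumInBatches({ "endPointParams": endPointParams, "previousResult": response });
    responses.push(response);
}
console.log(responses.length, response.finalResult.total);
```

Keeping everything the next pass needs in the response itself means the target holds no state of its own between calls.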
Example - Article Query
In reality this example wouldn't need to be processed in batches. Querying one or two properties of items in the iCM database is fairly quick, even when dealing with thousands of items. However, if you are requesting object data, or article content, responses will be slower and use more memory, so making queries in batches of 50 or so makes sense.
This end point queries the iCM database for all article headings and IDs. Articles are queried in batches of ten, and the result of each query is added to a CSV file. The CSV is emailed at the end of the process and then deleted.
The articleQuery end point in the download zip has inline comments you should read alongside this documentation.
Initial Request
The first time this end point is called it returns:
{
    "exportFileName": exportFileName,
    "articleIDs": articleIDs,
    "totalNumArticles": totalNumArticles,
    "progress": {
        "progressPercent": (((totalNumArticles - articleIDs.length) * 100) / totalNumArticles)
    },
    "complete": false
}
- exportFileName: the path to a temporary file created on the file system using Node's "fs"
- articleIDs: an array of the IDs of all of the (non-secure and live) articles in the database
- totalNumArticles: a count of the article IDs
- progress: the percentage of articles processed so far
- complete: false, we need to make another call
This is passed to the continueCB end point as part of the callback from the start end point.
Subsequent Requests
Further calls to the end point process the next batch of articles in the array. The array of current article IDs is "spliced" (remember that splice changes the original array), so our
If there are still articles to process, the end point returns an updated version of the response object above.
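The effect of splice on the array, and on the progress calculation, can be seen in a small standalone sketch. The nextBatch helper and the sample IDs are hypothetical. Only the progressPercent formula is taken from the example end point.

```javascript
// Hypothetical helper: removes the next batch from articleIDs and reports progress.
function nextBatch(articleIDs, totalNumArticles, batchSize) {
    // splice removes the batch from the original array and returns it
    var batch = articleIDs.splice(0, batchSize);
    return {
        "batch": batch,
        "progressPercent": ((totalNumArticles - articleIDs.length) * 100) / totalNumArticles,
        "complete": articleIDs.length === 0
    };
}

// 25 made-up article IDs, processed in batches of ten.
var articleIDs = [];
for (var id = 1; id <= 25; id++) {
    articleIDs.push(id);
}
var totalNumArticles = articleIDs.length;

var passes = [];
var pass;
do {
    pass = nextBatch(articleIDs, totalNumArticles, 10);
    passes.push(pass);
    console.log(pass.batch.length + " processed, " + pass.progressPercent + "% done");
} while (!pass.complete);
```

Because the array shrinks on every pass, the number of IDs left is all that is needed to work out both the percentage and whether the job has finished.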
Final Request
When there are no more article IDs to process (when
{
    "complete": true,
    "finalResult": {
        "summary": "Finished processing all articles"
    }
}
continueCB
When called for the first time this end point receives the result from the target end point, plus all of the
If the "complete" flag is set as false in the response from the target, continueCB sets up another asynchronous call to the target, with itself as the end point to handle the callback (creating a loop). The
resp = this.callWorkerMethod("serverlibrary", params.endPointName, {
    "sessionData": params.sessionData,
    "endPointParams": params.endPointParams, // Provided by the start end point on the first call, then the callback.additionalParams below on subsequent calls
    "previousResult": params.response.result, // Returned from the target end point
    "_async": {
        "asyncCallerId": params.id,
        "callback": {
            "methodName": "formstraining.ajaxpolling.continueCB",
            "workerName": "serverlibrary",
            "additionalParams": {
                "endPointName": params.endPointName,
                "id": params.id,
                "endPointParams": params.endPointParams,
                "sessionData": params.sessionData
            }
        }
    }
});
Session Data
The example framework uses the Session Store worker to store progress information while data is being processed. If your target end point returns a "progress" object you can use the framework as is.
The start end point creates a session then immediately records a progress update:
var sessionData = this.invokeEP(".initProgress", {
    "id": params.id
});
sessionData = this.invokeEP(".updateProgress", {
    "sessionData": sessionData,
    "status": "RUNNING",
    "progressPercent": 0,
    "description": "Running"
});
Your target end point should return a progress update to the continueCB end point. The examples calculate a percentage:
"progress": {
    "progressPercent": (((totalNumArticles - articleIDs.length) * 100) / totalNumArticles)
}
The continueCB end point then passes that to the session end points for recording:
params.sessionData = this.invokeEP(".updateProgress", {
    "sessionData": params.sessionData,
    "status": "RUNNING",
    "progressPercent": Math.floor(progressPercent),
    "description": "Running",
    "accumulate": accumulate
});
The final pass of the loop records that the job is complete:
params.sessionData = this.invokeEP(".completeProgress", {
    "sessionData": params.sessionData,
    "description": "Completed",
    "finalResult": finalResult
});
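The life cycle of a progress record (created, updated on each pass, then completed) can be illustrated with an in-memory stand-in. Everything in this sketch is hypothetical: the functions only borrow the names of the session end points, the sessions object takes the place of the Session Store worker, and the "CREATED" and "COMPLETE" status values are assumptions (only "RUNNING" appears in the framework code above).

```javascript
// Hypothetical in-memory stand-in for the session end points.
var sessions = {};

function initProgress(id) {
    sessions[id] = { "id": id, "status": "CREATED", "progressPercent": 0, "description": "" };
    return sessions[id];
}

function updateProgress(sessionData, status, progressPercent, description) {
    var session = sessions[sessionData.id];
    session.status = status;
    session.progressPercent = Math.floor(progressPercent); // whole percentages only
    session.description = description;
    return session;
}

function completeProgress(sessionData, description, finalResult) {
    var session = sessions[sessionData.id];
    session.status = "COMPLETE"; // assumed status value
    session.progressPercent = 100;
    session.description = description;
    session.finalResult = finalResult;
    return session;
}

function queryProgress(id) {
    return sessions[id] || null;
}

// One job, from start to finish.
var sessionData = initProgress("1");
sessionData = updateProgress(sessionData, "RUNNING", 0, "Running");
sessionData = updateProgress(sessionData, "RUNNING", 42.7, "Running");
var midway = JSON.parse(JSON.stringify(queryProgress("1"))); // snapshot while running
sessionData = completeProgress(sessionData, "Completed", { "summary": "Finished processing all articles" });
var finished = queryProgress("1");

console.log(midway.status, midway.progressPercent);
console.log(finished.status, finished.finalResult.summary);
```

A caller polling for progress only ever reads the record. All writes come from the start and continueCB side of the loop.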
Progress can be queried by calling the queryProgress End point with the