For deduplication to operate effectively, certain parameters have to be set. Most of these are configured using the Deduplication Preferences form (DEDUPPREFERENCES). To set the preferences, publish the form as an iCM Form App shortcut.
Overview
This form lets you define an ordered list of keyfields found in your user profiles and set the comparator to use with each. To save your changes, submit the form via the left-hand action panel.
The Initial Filter
The initial filter drop-down is present to maintain backwards compatibility with older versions. It should be set to family/last name. In the most recent versions of the Machine Learning worker (version 1.0.8, released December 2023), the top-most keyfield is used as the filter when performing full deduplication.
Keyfields
The main part of the preferences form lists all of the properties in your site user profile.
The purpose of a keyfield is to group the filtered records containing the same value into buckets. The aim is to have several equally sized buckets. However, as the best keyfield may not be available for every search, you need to sort the list to prioritise the order in which fields will be compared.
The fields should be sorted with the best option first - the best option giving you a reasonable number of buckets each containing similar numbers of users.
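To judge whether a field makes a good keyfield, it can help to see how a sample of profiles would fall into buckets. The following is a minimal sketch only, not the worker's own code: the function names and the sample records are hypothetical, and it assumes values are grouped after trimming whitespace and lower-casing.

```python
from collections import defaultdict


def bucket_by_keyfield(records, keyfield):
    """Group records by the normalised value of one keyfield (illustrative only)."""
    buckets = defaultdict(list)
    for record in records:
        # Assumption: values are compared after trimming and lower-casing.
        value = (record.get(keyfield) or "").strip().lower()
        buckets[value].append(record)
    return dict(buckets)


def bucket_summary(buckets):
    """Report how many buckets there are and how evenly they are filled."""
    sizes = sorted(len(members) for members in buckets.values())
    if not sizes:
        return {"buckets": 0, "smallest": 0, "largest": 0}
    return {"buckets": len(sizes), "smallest": sizes[0], "largest": sizes[-1]}


# Hypothetical sample profiles.
users = [
    {"LASTNAME": "Smith", "CITY": "Plymouth"},
    {"LASTNAME": "smith", "CITY": "Plymouth"},
    {"LASTNAME": "Jones", "CITY": "Plymouth"},
    {"LASTNAME": "Patel", "CITY": "Plymouth"},
]

for field in ("LASTNAME", "CITY"):
    print(field, bucket_summary(bucket_by_keyfield(users, field)))
```

In this sample, LASTNAME splits the users into several small buckets, while CITY puts everyone into a single bucket, which is why it would sit lower in the list.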
Our recommended ordering and the comparators to use are:
- LASTNAME - single term
- FIRSTNAME - single term
- POSTCODE - exact term
- PREFNAME - name term
- FULLNAME - name term
- DISPLAYNAME - name term
- CITY - single term
- EMAIL - exact term
- TELEPHONE - exact term
- MOBILE - exact term
Other profile fields listed in the table are best set as "do not compare" and moved to the end of the list.
Comparators
The final consideration is how each of the fields should be compared. There are five comparators and an exclusion available:
- single term: Uses the Jaro-Winkler similarity measure, which many studies have found to be the best available general string comparator for deduplication. Use this for short strings such as given names and family names.
- multi term: Uses q-gram comparison. A phrase is split into a number of q-grams, which are then compared; confidence is based on the overlap between the two sets of q-grams. Useful for job titles and road names.
- exact term: True if the values match (case insensitive). Use for values such as a National Insurance number or UPRN.
- name term: A rule-based comparator which understands the structure of personal names. It knows about things like initials, middle names, and that given and family names sometimes get reversed. Uses Levenshtein distance to account for typos.
- phonetic term: Uses a Metaphone comparator, which is similar to Soundex but uses the finer-grained Metaphone code scheme. Use for family or given names, where it copes with spelling variations.
- do not compare: The field is ignored in the comparison. Typically these are fields that are known to be unique or are sparsely populated.
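To illustrate the difference between the multi term and exact term descriptions above, here is a rough sketch of a q-gram overlap score next to a case-insensitive exact match. It is an assumption-laden illustration rather than the worker's implementation: the function names are hypothetical, q is fixed at 2, and overlap is measured as shared q-grams divided by all distinct q-grams, which may differ from the formula the worker actually uses.

```python
def qgrams(text, q=2):
    """Split text into its set of overlapping q-character sequences."""
    text = text.strip().lower()
    if not text:
        return set()
    if len(text) <= q:
        return {text}
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def qgram_similarity(a, b, q=2):
    """Score 0.0 to 1.0 from the overlap between the two q-gram sets (assumed formula)."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def exact_similarity(a, b):
    """Score 1.0 only when the values match, ignoring case and surrounding spaces."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


print(qgram_similarity("High Street", "High St"))    # partial overlap
print(exact_similarity("High Street", "High St"))    # no credit for a near match
print(exact_similarity("AB123456C", "ab123456c"))    # case is ignored
```

The q-gram score gives partial credit to "High St" against "High Street", whereas the exact comparison treats them as different, which is why exact term suits identifiers and multi term suits free-text phrases.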
Each comparator gives the value of a field/column a score between 0.0 (completely different) and 1.0 (exactly the same). As more fields are compared the overall score for a record is pushed up or down.
Comparators also include high and low values. These values define the contribution of a field towards the final score for a pair of records. For example, exact term will give a score of 1.0 if the content of the field in the two records being compared is identical, but if you have set
The default high and low values set in the worker are:
Comparator | Low | High |
---|---|---|
JaroWinkler (for single term) | 0.5 | 0.7 |
QGramComparator (for multi term) | 0.5 | 0.7 |
ExactComparator (for exact term) | 0.2 | 0.9 |
PersonNameComparator (for name term) | 0.4 | 0.8 |
MetaphoneComparator (for phonetic term) | 0.2 | 0.9 |
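As a rough illustration of how low and high values can steer the overall score, the sketch below follows the approach used by the Duke engine referenced in the links on this page. This page does not confirm that the worker behaves identically, so treat the formulas as assumptions: a similarity of 0.5 or more is mapped into the range 0.5 to high, anything lower counts as low, and the per-field results are combined pairwise with Bayes' rule starting from a neutral 0.5. All function names are hypothetical.

```python
# Default (low, high) values taken from the table above.
DEFAULTS = {
    "single term": (0.5, 0.7),
    "multi term": (0.5, 0.7),
    "exact term": (0.2, 0.9),
    "name term": (0.4, 0.8),
    "phonetic term": (0.2, 0.9),
}


def field_probability(similarity, low, high):
    """Map a 0.0-1.0 comparator similarity onto the field's low/high range (assumed rule)."""
    if similarity >= 0.5:
        return (high - 0.5) * similarity ** 2 + 0.5
    return low


def combine(p1, p2):
    """Combine two probabilities with Bayes' rule (assumed rule)."""
    return (p1 * p2) / (p1 * p2 + (1 - p1) * (1 - p2))


def record_score(field_results):
    """Fold (comparator name, similarity) pairs into one score for a record pair."""
    score = 0.5  # neutral starting point
    for comparator, similarity in field_results:
        low, high = DEFAULTS[comparator]
        score = combine(score, field_probability(similarity, low, high))
    return score


# Same last name and same postcode: the score is pushed well above 0.5.
print(record_score([("single term", 1.0), ("exact term", 1.0)]))
# Same last name but a different postcode: the exact term's low value pulls it down.
print(record_score([("single term", 1.0), ("exact term", 0.0)]))
```

Under these assumptions, a field with a high value close to 1.0 is strong evidence for a match, a low value close to 0.0 is strong evidence against one, and a value of 0.5 leaves the score unchanged.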
The following links provide more information:
- Introduction to scoring and thresholds https://github.com/larsga/Duke/wiki/HowItWorks and https://github.com/larsga/Duke/wiki/TuningGuide
- Comparators https://github.com/larsga/Duke/wiki/Comparator