Normalize Address and Detect Duplicates
How to Normalize Addresses and Detect hidden Duplicates using Functions in Arbutus Analyzer.
What control risks can we address?
- Vendor management controls may not detect multiple vendors using the same address or vendors and employees using the same addresses; possible risk of fraud
- Vendor management controls to maintain up-to-date addresses may have failed
- Compliance controls may not detect counterparties that may be on watch lists
When to use it?
- Identifying multiple instances of the same address in one file, for example, vendors
- Identifying possible address matches between two files (vendors-employees or vendors-watch lists)
How it works
We want to detect duplicate addresses in the Vendor master file, both to clean up our data and identify potential double-billing frauds. We'll do this by normalizing the addresses so they are more comparable and then look for duplicates in the normalized addresses, rather than the addresses as recorded. It is important to normalize the addresses as fraud perpetrators may try to mask their efforts by making the addresses appear different.
Remove all non-alphanumeric characters except blanks
Replace all international characters (containing accents, etc.) with the English base character
Trim leading blanks
Compact contiguous blanks into a single blank
Upper-case the remaining data
Allow specification of one or more substitution files to standardize alternate valid representations of values within the data (e.g. “Rd” for “Road”, “St” for “Street“) - See the "ADDR.txt" file in the downloads
Sorts the remaining normalized whole words (including number sets) separated by blanks in descending order
- Duplicates command
- Join command
What is the ADDR.TXT File?
This is a simple text file with two columns of words. The words on the left are those that can be found in most addresses, including abbreviations. The words on the right are the result of the transformation by SortNormalize(). For example, the following instances of "avenue" are all transformed to the text on the right. As well, this will remove the "noise" words that have no partner in the list, such as "THE" and "OF".
This is a plain text file, so users can easily update it by adding another line containing a new replacement pair. This file can be referenced in the computed field expression without a full path only if it is located in the Arbutus project folder. If the text file is located in other locations, you must specify the full path.
Exact Matches - Single File Testing
Create a computed field (Edit >> Table Layout >> Add a New Expression) in your file that will transform the address using the SortNormalize function. The expression should reference the address substitution file ADDR.txt.
Although it is not necessary, you can add the computed field to the View next to the original address field for comparison.
Run the Duplicates command using the new computed field and the zip code as the key fields. Including the zip code helps reduce false positives by ensuring that two identical addresses in different zip codes will not be included in the output. In the "List fields" area below, select the original address field and any other fields you would like in your output.
Name the output file and click "OK". The result shows all normalized matches that are identical. Note how different the original addresses are from each other, and how they are identical when normalized.
Exact Matches - Multiple File Testing
- Create the computed fields in each of the two files.
- Create a Matched Many-to-Many Join between the two files, using the two new computed fields as the key fields as well as the zip codes. (It does not matter which file is the Primary or Secondary for our purposes.)
- Select any other fields you need in your output from the "Primary fields to output" and "Secondary fields to output" lists
- Name the output file and click "OK"
- The result will show all normalized matches that are identical; the possible matches are side-by-side. It is possible for the same vendor address to match to multiple watchlist addresses and vice-versa. Vendor 10088 matches to three watchlist addresses