Technical Insights/Normalize Addresses and Detect Duplicates

How to Normalize Addresses and Detect hidden Duplicates using Functions in Arbutus Analyzer.

This article includes files, scripts and instructions for you to run this yourself in Analyzer. We'll take a look at two examples: first, using a single file, and second, comparing normalized addresses across two files.

What control risks can we address?

Vendor management controls may not detect multiple vendors using the same address or vendors and employees using the same addresses; possible risk of fraud
Vendor management controls to maintain up-to-date addresses may have failed
Compliance controls may not detect counterparties that may be on watch lists

How it works

We want to detect duplicate addresses in the Vendor master file, both to clean up our data and identify potential double-billing frauds. We'll do this by normalizing the addresses so they are more comparable and then look for duplicates in the normalized addresses, rather than the addresses as recorded. It is important to normalize the addresses as fraud perpetrators may try to mask their efforts by making the addresses appear different.

Analyzer Functionality

SortNormalize function. In a computed field, this function will do the following:

Remove all non-alphanumeric characters except blanks
Replace all international characters (containing accents, etc.) with the English base character
Trim leading blanks
Compact contiguous blanks into a single blank
Upper-case the remaining data
Allow specification of one or more substitution files to standardize alternate valid representations of values within the data (e.g. “Rd” for “Road”, “St” for “Street“) - See the "ADDR.txt" file in the downloads
Sorts the remaining normalized whole words (including number sets) separated by blanks in descending order
Duplicates command
Join command

What is the ADDR.TXT File?

This is a simple text file with two columns of words. The words on the left are those that can be found in most addresses, including abbreviations. The words on the right are the result of the transformation by SortNormalize(). For example, the following instances of "avenue" are all transformed to the text on the right. As well, this will remove the "noise" words that have no partner in the list, such as "THE" and "OF".

ADDRESS TEXT Image 1

This is a plain text file, so users can easily update it by adding another line containing a new replacement pair. This file can be referenced in the computed field expression without a full path only if it is located in the Arbutus project folder. If the text file is located in other locations, you must specify the full path.

Run the Duplicates command using the new computed field and the zip code as the key fields. Including the zip code helps reduce false positives by ensuring that two identical addresses in different zip codes will not be included in the output. In the "List fields" area below, select the original address field and any other fields you would like in your output.

Exact Match 3

In this example, we will compare our vendor addresses to addresses on a US watchlist.

Create the computed fields in each of the two files.
Create a Matched Many-to-Many Join between the two files, using the two new computed fields as the key fields as well as the zip codes. (It does not matter which file is the Primary or Secondary for our purposes.)
Select any other fields you need in your output from the "Primary fields to output" and "Secondary fields to output" lists
Name the output file and click "OK"
The result will show all normalized matches that are identical; the possible matches are side-by-side. It is possible for the same vendor address to match to multiple watchlist addresses and vice-versa. Vendor 10088 matches to three watchlist addresses

Exact Match 5

Exact Match image

Products

What's New

Global Partners

By Industry

By Business Processes

By Popularity

Resources

Webinars

Arbutus Services

Normalize Address and Detect Duplicates

How to Normalize Addresses and Detect hidden Duplicates using Functions in Arbutus Analyzer.

What control risks can we address?

When to use it?

How it works

Analyzer Functionality

What is the ADDR.TXT File?

Exact Matches - Single File Testing

Exact Matches - Multiple File Testing

Support Documents

USA.FIL

USA.FMT

Master.FIL

Master-FMT

ADDR.TXT

Analytics By Business Process

Analytics By Business Process

Arbutus Technical Insights

Explore more of our Technical Resources

Request Trial

Help & Documentation

Resources

Training & Consulting