How to Normalize Addresses & Detect Duplicates in Arbutus

      Normalize Addresses

      In this article, you'll learn how to normalize addresses and detect hidden duplicates using functions in Arbutus Analyzer.  Also included are files, scripts and instructions for you to run this yourself in Analyzer.  We'll take a look at two examples: first, using a single file, and second, comparing normalized addresses across two files.


      What Control Risks Can We Address?

      • Vendor management controls may not detect multiple vendors using the same address or vendors and employees using the same addresses; possible risk of fraud
      • Vendor management controls to maintain up-to-date addresses may have failed
      • Compliance controls may not detect counterparties that may be on watch lists

      When To Use It

      • Identifying multiple instances of the same address in one file, for example, vendors
      • Identifying possible address matches between two files (vendors-employees or vendors-watch lists)

      How It Works

      We want to detect duplicate addresses in the Vendor master file, both to clean up our data and identify potential double-billing frauds.  We'll do this by normalizing the addresses so they are more comparable and then look for duplicates in the normalized addresses, rather than the addresses as recorded. It is important to normalize the addresses as fraud perpetrators may try to mask their efforts by making the addresses appear different.

      Analyzer Functionality

      • SortNormalize function. In a computed field, this function will do the following:
        • Remove all non-alphanumeric characters except blanks
        • Replace all international characters (containing accents, etc.) with the English base character
        • Trim leading blanks
        • Compact contiguous blanks into a single blank
        • Upper-case the remaining data
        • Allow specification of one or more substitution files to standardize alternate valid     representations of values within the data (e.g. “Rd” for “Road”, “St” for “Street“) - See the "ADDR.txt" file in the downloads
        • Sorts the remaining normalized whole words (including number sets) separated by blanks in descending order
      • Duplicates command
      • Join command

      https://cdn2.hubspot.net/hubfs/3336879/background%20image%20.png

      What Is The ADDR.txt file?

      This is a simple text file with two columns of words.  The words on the left are those that can be found in most addresses, including abbreviations.  The words on the right are the result of the transformation by SortNormalize().  For example, the following instances of "avenue" are all transformed to the text on the right. As well, this will remove the "noise" words that have no partner in the list, such as "THE" and "OF".

      ADDRESS TEXT Image 1

      This is a plain text file, so users can easily update it by adding another line containing a new replacement pair.

      This file can be referenced in the computed field expression without a full path only if it is located in the Arbutus project folder.  If the text file is located in other locations, you must specify the full path.

       

      Exact Matches, Single-File Testing

      • Create a computed field (Edit >> Table Layout >> Add a New Expression) in your file that will transform the address using the SortNormalize function.  The expression should reference the address substitution file ADDR.txt.
      Exact Match
      Although it is not necessary, you can add the computed field to the View next to the original address field for comparison.exact match image 2
       
      • Run the Duplicates command using the new computed field and the zip code as the key fields.  Including the zip code helps reduce false positives by ensuring that two identical addresses in different zip codes will not be included in the output.  In the "List fields" area below, select the original address field and any other fields you would like in your output.

        Exact Match 3
      • Name the output file and click "OK".  The result shows all normalized matches that are identical.  Note how different the original addresses are from each other, and how they are identical when normalized.

        Exact Match 4
      #f4f4f4

      Exact Matches, Multifile Testing

      In this example, we will compare our vendor addresses to addresses on a US watchlist.

      1. Create the computed fields in each of the two files.
      2. Create a Matched Many-to-Many Join between the two files, using the two new computed fields as the key fields as well as the zip codes.  (It does not matter which file is the Primary or Secondary for our purposes.)
      3. Select any other fields you need in your output from the "Primary fields to output" and "Secondary fields to output" lists
      4. Name the output file and click "OK"
      5. The result will show all normalized matches that are identical; the possible matches are side-by-side.  It is possible for the same vendor address to match to multiple watchlist addresses and vice-versa.  Vendor 10088 matches to three watchlist addresses

        Exact Match 5

        Exact Match 6
      https://cdn2.hubspot.net/hubfs/3336879/4.png
      #f5f5f5