Data Quality Management Practical Tests and Scripted Solutions
There are several simple steps you can take to test your data's quality.
Ideally, this should be done immediately after defining a file in Arbutus Analyzer and before beginning your analysis. Running the tests below will document the results in the Command Log. It's a best practice to use a script to execute your tests and have the script direct the results to an output file. This can become part of your workpapers to demonstrate that the tests have been performed.
First Level Testing
Control Totals - Ask your data provider to inform you how many records were in the source data and what the control or hash totals were for numeric fields:
- Count command to document the record count
- Total command on key numeric fields for control/hash totals
Uniqueness of Records
It's possible that a single original record can be written more than once to a file due to processing errors.
To test for record uniqueness, use the Duplicates command, selecting all the physical fields in the "Field(s) to test for Duplicates" section of the command dialog:
Duplicates ON <field list> To "Payments_2019_DupeRecs" Open Presort
Data can become corrupted in many ways as it is being processed. Performing an extraction from a database, zipping up a file, and even sending it as an e-mail attachment are activities that can cause corruption. Numeric and date fields are particularly vulnerable.
Run the Verify command against all the physical fields. The output will tell you if any evidence of corruption has been detected:
- Non-numeric characters in numeric fields
- Invalid/blank dates
- Invisible/non-printing characters (such as tabs)
Field Level TestingNumeric and date fields are the prime candidates for quality testing. They are indicators of the scope (time period) and materiality of the data. Key character fields should also be tested.
The Statistics command provides a wealth of information for both numeric and date fields. For numeric fields, it can rapidly display the following results:
- Number/Total/Average of positives
- Number of zeros
- Number/Total/Average of negatives
- Absolute total
- Standard deviation
- Top X
It's always useful to review the Top X results to help you make your decision to test for outliers.
The Statistics command again provides significant results:
- Blank dates
- Invalid dates
- Weekend dates
- Top X dates
Invalid dates are those which contain content that does not conform to the field's date format. Details of the invalid/blank dates' contents can be displayed by the Verify command. Weekend dates may be of interest to tests of manual journal entries, where weekend or holiday transactions are considered high-risk. The Range result will immediately confirm the span of the dates; a range greater than 365 will tell you that the records cover more than a one-year period. Reviewing the Top X results can identify the existence of outliers. Test logically-related date fields with filters to determine if there are anomalies, such as an invoice date coming after a payment date in a given record. Use a filter with the Count command to document their existence: Count If Invoice_Date > Payment_Date