The Quality Framework will address these dimensions of quality across these different levels and will comprise In-Line tests, Off-Line tests, and automatic fallback processes.
FRAMEWORK OF DATA QUALITY
There are many areas that play into the Framework of Data Quality. A shout out to Daryl Lynch for providing this structure.
1 - Governance
2 - Processing
3 - Data Quality
4 - Controls
5 - Reporting
Governance is not an exclusively Data role. Business teams need to define the Governance, which in turn informs the front-line workers who capture the data at each process step. Governance is less about Data requirements and more about Business Process and Risk Management requirements.
Processing of data and understanding the quality at each step is very important in successfully moving to the next step and in troubleshooting any quality issues at any downstream level.
Data Quality often needs to be viewed and understood on a table by table and field by field basis. This helps when it comes to Governance and any data cleansing efforts you may need to undertake.
Controls and access allow you to better manage the data within your organization and have a direct impact on the Data Quality of the platform.
Reporting, while the final piece, is where most users interact with the data. Providing users clear visibility into the Quality of the data on the platform can be very beneficial in building trust and growing adoption.
FRAMEWORK OF DATA QUALITY TESTS
IN-LINE TESTS
In-Line Tests are designed to test the quality of the data as it flows through the system. These tests add to process load times, so they must have a targeted scope and run within a brief period. Move any test that takes more than a minute to run to an Off-Line test.
At a minimum, In-Line tests should define:
Pass - continue processing
Fail - stop processing
Alert - notify administrators of a potential issue.
A fourth category of 'Retry' may be defined as part of the Fail/Alert status.
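As a rough illustration only, these outcomes (including the optional Retry) could be modeled along the following lines in Python. The check_row_count function, its tolerance parameter, and the row-count comparison are hypothetical examples, not part of any particular platform:

    from enum import Enum

    class TestResult(Enum):
        PASS = "pass"    # continue processing
        FAIL = "fail"    # stop processing
        ALERT = "alert"  # continue, but notify administrators of a potential issue
        RETRY = "retry"  # re-attempt the step before settling on Fail or Alert

    def check_row_count(expected: int, actual: int, tolerance: int = 0) -> TestResult:
        # Hypothetical In-Line count check: compare rows loaded against rows expected.
        if actual == expected:
            return TestResult.PASS
        if abs(actual - expected) <= tolerance:
            return TestResult.ALERT  # small drift: keep processing, flag for review
        return TestResult.FAIL       # large mismatch: stop the load

Whether a given mismatch is a Fail or only an Alert is a business decision, which is why the threshold is a parameter rather than a constant.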
In-Line testing for duplication is critical for both Type 1 (static/current) values and Type 2 (historic) values. While Type 1 failures will have a larger immediate impact, they are also easier to test and fix. Type 2 failures can be tough to resolve and can go unnoticed for extended periods.
Think of In-Line tests as Counts, Amounts, and Hash tests.
Counts- How many records should have been processed?
Amounts- What's the total value of the records processed? (Sum of defined field(s))
Hash- What's the total size of the Data processed? (Can be by file or by record)
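A minimal sketch of what those three checks might look like for one batch, assuming the records arrive as a list of dicts and that the amount field name is purely illustrative:

    import hashlib
    import json

    def inline_checks(records, amount_field="amount"):
        # Counts: how many records were processed in this batch
        counts = len(records)
        # Amounts: total value of the defined field(s)
        amounts = sum(r[amount_field] for r in records)
        # Hash: a fingerprint of the processed data (here per batch; could be per record)
        payload = json.dumps(records, sort_keys=True, default=str).encode()
        digest = hashlib.sha256(payload).hexdigest()
        return {"counts": counts, "amounts": amounts, "hash": digest}

Computing the same three values at the source and comparing them to this result is what turns the checks into a test: any difference means records were dropped, truncated, or altered in flight.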
PROTIP- Many failures can come from locked sources. A retry count of 3 over a 5 min window allows for locks to expire and processes to complete. Anything longer than 5 min can indicate a system issue, and manual review is usually recommended.
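A sketch of that retry pattern, assuming load_fn is whatever callable performs the load and raises when the source is locked (the name and the broad exception handling are placeholders):

    import time

    def load_with_retry(load_fn, retries=3, window_seconds=300):
        # Spread up to 3 attempts across a ~5 minute window so source locks can expire.
        wait = window_seconds // retries
        for attempt in range(1, retries + 1):
            try:
                return load_fn()
            except Exception:  # in practice, catch the specific lock/contention error
                if attempt == retries:
                    raise      # still failing after the window: escalate for manual review
                time.sleep(wait)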
OFF-LINE TESTS
Off-Line Tests are longer-running Data Quality tests that cannot be inserted into In-Line processing without negatively impacting it.
Off-Line tests should start to break out the Counts and Amounts by business value and compare those breakouts to a secondary data feed explicitly designed for validation.
Note- Validation data feeds should be specifically designed to validate the breakout. To avoid circular validation fallacies (validating code/data against the same code/data), validation data should avoid any translations applied during processing and should come from data feeds different from those used by processing.
Common Off-Line tests:
Number of Sales People vs. Sales People in Source
Sales by Sales Person for current day/month/year
Sales by Fiscal Year
Sales by product for current day/month/year
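To make the idea concrete, here is one way a test like "Sales by Sales Person" might be expressed, assuming the processed data and the validation feed are both available as pandas DataFrames; the sales_person and sales column names are assumptions, not a prescribed schema:

    import pandas as pd

    def compare_breakout(processed: pd.DataFrame, validation: pd.DataFrame,
                         dimension: str = "sales_person", measure: str = "sales") -> pd.DataFrame:
        # Break the measure out by the business dimension on both sides...
        left = processed.groupby(dimension)[measure].sum().rename("processed")
        right = validation.groupby(dimension)[measure].sum().rename("validation")
        # ...then line them up; keys missing on either side show up as zeros.
        merged = pd.concat([left, right], axis=1).fillna(0)
        # Any row returned is a breakout that does not reconcile with the validation feed.
        return merged[merged["processed"] != merged["validation"]]

The other tests in the list are variations on the same pattern with a different dimension or aggregate.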
Off-Line validation's initial goal is to have measures broken out by every dimension in your data model. Once this is done, add additional fields from key Dimensions. Validating one field from each dimension against a known source means you first test referential integrity and then test any other mappings.
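The referential-integrity half of that statement can be checked with something as simple as the following sketch, where the fact/dimension tables and the product_id key are illustrative assumptions:

    import pandas as pd

    def find_orphan_keys(fact: pd.DataFrame, dimension: pd.DataFrame,
                         key: str = "product_id") -> pd.Series:
        # Every key referenced by the fact table should exist in the dimension table.
        orphans = fact.loc[~fact[key].isin(dimension[key]), key].unique()
        # An empty result means referential integrity holds for this dimension;
        # anything returned is a fact record pointing at a missing dimension member.
        return pd.Series(orphans, name=f"orphan_{key}")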
Off-Line Test Caution
Off-Line tests still need to be concerned with the time it takes to execute and the ROI of the test. While people often think that the best test is to compare record by record and field by field, the reality is that attempting this takes an inordinate amount of effort, time, and compute. If validation takes 3-9 months, you will be unable to leverage it in real-world situations.
DATA QUALITY BLOG SERIES
The Data Quality Blog post will be released at 8:45 AM each day.
DATA QUALITY - Part 1 January 6th
DATA QUALITY CONCEPTS - Part 2 January 7th
DATA QUALITY FOR EVERYONE - Part 3 January 10th
DATA QUALITY FRAMEWORK - Part 4 January 11th
DATA QUALITY DEVELOPMENT - Part 5 January 12th
QUALITY DATA - Part 6 January 13th