
DATA QUALITY FRAMEWORK - Part 4

Christopher Wagner • January 11, 2022

If you measure it, you change it

The Quality Framework will address these dimensions of quality across these different levels and will comprise In-Line tests, Off-Line tests, and automatic fallback processes.


FRAMEWORK OF DATA QUALITY


There are many areas that play into the Framework of Data Quality. A shout-out to Daryl Lynch for providing this structure.


1 - Governance

2 - Processing

3 - Data Quality

4 - Controls

5 - Reporting


Governance is not an exclusively Data role. Business teams need to define the Governance. This informs the front-line workers who capture the data at each process step. Governance is less about Data requirements and more about Business Process requirements or Risk Management requirements.


Understanding the quality of data at each processing step is very important both to successfully moving to the next step and to troubleshooting quality issues at any downstream level.


Data Quality often needs to be viewed and understood on a table-by-table and field-by-field basis. This helps when it comes to Governance and any data cleansing efforts you may need to undertake.


Controls and access allow you to better manage the data within your organization and have a direct impact on the Data Quality of the platform.


Reporting, while the final piece, is where most users interact with the data. Providing clear visibility to the Quality of the data on the platform can be very beneficial to users in building trust and growing adoption.



FRAMEWORK OF DATA QUALITY TESTS


IN-LINE TESTS 

In-Line Tests are designed to test the quality of the data as it flows through the system. These tests impact the load times of processes, so tests must have a targeted scope to run within a brief period. Move any test that takes more than a minute to run to an Off-Line test. 


At a minimum, In-Line tests should define: 

Pass - continue processing

Fail - stop processing

Alert - notify administrators of a potential issue. 


A fourth category of 'Retry' may be defined as part of the Fail/Alert status.
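As a rough illustration, an In-Line check might return one of these statuses and let the pipeline decide what to do next. This is a minimal sketch in Python; the CheckStatus names, the row-count check, and the tolerance threshold are illustrative assumptions, not a prescribed implementation.

from enum import Enum

class CheckStatus(Enum):
    PASS = "pass"    # continue processing
    FAIL = "fail"    # stop processing
    ALERT = "alert"  # notify administrators of a potential issue
    RETRY = "retry"  # transient problem; attempt the step again

def check_row_count(actual_rows: int, expected_rows: int, tolerance: float = 0.01) -> CheckStatus:
    """Compare the loaded row count to the count expected from the source system."""
    if expected_rows == 0 and actual_rows == 0:
        return CheckStatus.ALERT   # an empty load is suspicious but not necessarily fatal
    deviation = abs(actual_rows - expected_rows) / max(expected_rows, 1)
    if deviation == 0:
        return CheckStatus.PASS
    if deviation <= tolerance:
        return CheckStatus.ALERT   # small drift: flag it, keep loading
    return CheckStatus.FAIL        # large drift: stop the pipeline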

 

In-Line testing for duplication is critical for both Type 1 (static/current values) and Type 2 (historic) values. While Type 1 failures will have a larger immediate impact, they are also easier to test and fix. Type 2 failures can be tough to resolve and can go unnoticed for extended periods.
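A minimal sketch of what those duplicate checks could look like, assuming pandas DataFrames with hypothetical business_key, effective_from, and effective_to columns:

import pandas as pd

def type1_duplicates(df: pd.DataFrame, key: str = "business_key") -> pd.DataFrame:
    """Type 1 check: each business key should appear exactly once in the current-values table."""
    return df[df.duplicated(subset=[key], keep=False)]

def type2_overlaps(df: pd.DataFrame, key: str = "business_key") -> pd.DataFrame:
    """Type 2 check: history rows for a key should not have overlapping validity ranges."""
    df = df.sort_values([key, "effective_from"])
    prev_end = df.groupby(key)["effective_to"].shift()   # end date of the previous row per key
    # A row overlaps its predecessor if it starts before the previous row ends.
    return df[prev_end.notna() & (df["effective_from"] < prev_end)]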

 

Think of In-Line tests as Counts, Amounts, and Hash tests. 

 

Counts- How many records should have been processed? 

 

Amounts- What's the total value of the records processed? (Sum of defined field(s)) 

 

Hash- What's the total size of the Data processed? (Can be by file or by record) 
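A minimal sketch of all three checks over one batch, assuming each record is a dict with a hypothetical 'amount' field; the hash here doubles as an exact-match fingerprint of the payload:

import hashlib
import json

def batch_metrics(records: list[dict]) -> dict:
    """Compute Count, Amount, and Hash metrics for a processed batch."""
    payload = json.dumps(records, sort_keys=True, default=str).encode("utf-8")
    return {
        "count": len(records),                               # how many records were processed
        "amount": sum(r.get("amount", 0) for r in records),  # total value of the defined field
        "bytes": len(payload),                               # total size of the data processed
        "hash": hashlib.sha256(payload).hexdigest(),         # fingerprint for exact comparison
    }

def metrics_match(source: dict, target: dict) -> bool:
    """Compare source-side and target-side metrics; any mismatch indicates a load issue."""
    return all(source[k] == target[k] for k in ("count", "amount", "hash"))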

 

PROTIP- Many failures can come from locked sources. A retry count of 3 over a 5-minute window allows locks to expire and processes to complete. Anything taking longer than 5 minutes can indicate a system issue, and manual review is usually recommended.
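The retry pattern described in the PROTIP might look roughly like this; load_step and the broad exception handling are placeholders for whatever your platform actually provides:

import time

def run_with_retry(load_step, attempts: int = 3, window_seconds: int = 300):
    """Retry a load step up to `attempts` times across a ~5-minute window so source locks can expire."""
    delay = window_seconds // attempts   # 3 tries over 5 minutes -> roughly 100 seconds apart
    for attempt in range(1, attempts + 1):
        try:
            return load_step()
        except Exception:                # in practice, catch the specific lock/timeout error
            if attempt == attempts:
                raise                    # still failing after the window: escalate for manual review
            time.sleep(delay)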

 

OFF-LINE TESTS 

Off-Line Tests are longer-running Data Quality tests that cannot be inserted into In-Line processing without negatively impacting it.

 

Off-Line tests should start to break out the Counts and Amounts by business value and compare these breakouts and values to a secondary data feed designed explicitly for validation.

 

Note- Validation data feeds should be specifically designed to validate the breakout. To avoid circular validation fallacies (validating code/data against the same code/data), validation data should avoid any translations applied by processing and should use different data feeds than those used by processing.
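A sketch of comparing a breakout to an independent validation feed, assuming two pandas DataFrames that both carry a hypothetical sales_person grouping column and a sales_amount value column:

import pandas as pd

def compare_breakout(processed: pd.DataFrame, validation: pd.DataFrame,
                     by: str = "sales_person", amount: str = "sales_amount") -> pd.DataFrame:
    """Break Counts and Amounts out by a business value and diff them against the validation feed."""
    left = processed.groupby(by).agg(proc_count=(amount, "size"), proc_amount=(amount, "sum"))
    right = validation.groupby(by).agg(val_count=(amount, "size"), val_amount=(amount, "sum"))
    merged = left.join(right, how="outer").fillna(0)
    merged["count_diff"] = merged["proc_count"] - merged["val_count"]
    merged["amount_diff"] = merged["proc_amount"] - merged["val_amount"]
    # Rows with a non-zero diff are the business values that need investigation.
    return merged[(merged["count_diff"] != 0) | (merged["amount_diff"] != 0)]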

 

Common Off-Line tests:

Number of Sales People vs. Sales People in Source 

 

Sales by Sales Person for current day/ month/ year. 

 

Sales by Fiscal year 

 

Sales by product for current day/ month/ year.

 

 

Off-Line validation's initial goal is to have measures broken out by every dimension in your data model. Once this is done, add additional fields from key Dimensions. Validating one field from each dimension against a known source means you first test referential integrity and then test any other mappings.
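One way to sketch the referential-integrity part of that goal, assuming pandas DataFrames for the fact table and each dimension; the key column names are illustrative:

import pandas as pd

def missing_dimension_keys(fact: pd.DataFrame, dim: pd.DataFrame,
                           fact_key: str, dim_key: str) -> pd.Series:
    """Return fact-table keys that do not resolve to a row in the dimension."""
    unmatched = ~fact[fact_key].isin(dim[dim_key])
    return fact.loc[unmatched, fact_key].drop_duplicates()

# Example usage (names are illustrative): check every dimension referenced by a sales fact table.
# for dim_name, (dim_df, fact_key, dim_key) in dimensions.items():
#     gaps = missing_dimension_keys(sales_fact, dim_df, fact_key, dim_key)
#     print(dim_name, "unresolved keys:", len(gaps))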

 

 

Off-Line Test Caution

Off-Line tests still need to be concerned with the time it takes to execute and the ROI from the test. While people often think that the best test is to compare record by record and field by field, the reality is that attempting to achieve these results will take an inordinate amount of effort, time, and compute. If validation takes 3-9 months, you will be unable to leverage it in real-world situations.



DATA QUALITY BLOG SERIES

A new Data Quality Blog post will be released each day at 8:45 AM.


DATA QUALITY - Part 1 January 6th

DATA QUALITY CONCEPTS - Part 2 January 7th

DATA QUALITY FOR EVERYONE - Part 3 January 10th

DATA QUALITY FRAMEWORK - Part 4 January 11th

DATA QUALITY DEVELOPMENT - Part 5 January 12th

QUALITY DATA - Part 6 January 13th




CHRIS WAGNER, MBA MVP

Analytics Architect, Mentor, Leader, and Visionary

Chris has been working in the Data and Analytics space for nearly 20 years. Chris has dedicated his professional career to making data and information accessible to the masses. A significant component in making data available is continually learning new things and teaching others from these experiences. To help people keep up with this ever-changing landscape, Chris frequently posts on LinkedIn and to this blog.