By Edmond Leung
Published on November 14, 2024
Data quality is essential for any organization that relies on accurate and reliable information. Poor data quality has a massive financial impact on businesses. In 2021, Gartner estimated that bad data costs organizations an average of $12.9 million annually, and IBM pegged the U.S. cost at $3.1 trillion back in 2016 (TechTarget). While the importance of clean data is clear, how do we ensure it stays that way over time? Monitoring data quality can range from simple checks to increasingly sophisticated strategies that assess both the data itself and the processes generating it.
In this blog, we’ll explore a comprehensive approach to monitoring data quality, starting with basic column-level checks and building up to advanced trend analysis and seasonal monitoring techniques. By following this progression, you'll be able to ensure your data remains consistent, complete, and reliable at every stage.
At its most fundamental level, data quality monitoring starts with validating individual columns in your dataset. These checks ensure that the data falls within expected parameters and doesn’t contain any obvious errors. Types of basic column-level monitoring include the following; a short code sketch of these checks follows the list:
Comparative Checks: Use simple comparisons to ensure values in a column adhere to predefined rules. For numerical or date/time data types, check whether all values in a column fall within a specific range or meet specific conditions. For example, check that age is greater than zero or that order dates fall within a specific year.
Range Analysis: Calculate key metrics such as the minimum (MIN), maximum (MAX), average (AVG), median (MED), mode (MOD), and standard deviation (STD) for a column. For instance, some financial data is expected to have low variation. A high standard deviation could signal an issue such as manual entry errors.
Null Values: Track the number of null or missing values to ensure completeness. Too many nulls make a column unreliable or simply useless. An unexpectedly high number of nulls in a column may indicate a data entry issue, or that a data source is arriving late and causing table joins to fail.
Distinct Values: Monitoring the number of distinct values in a column is useful for detecting inconsistencies. For example, if a column representing customer IDs suddenly shows more distinct values than expected, it could point to a duplication issue or corruption in the data.
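As a rough illustration, here is a minimal sketch of these column-level checks using pandas. The orders-style DataFrame and its column names (`customer_id`, `age`, `order_date`, `order_amount`) are hypothetical; adapt the rules and thresholds to your own schema and tooling.

```python
import pandas as pd

def column_level_checks(df: pd.DataFrame) -> dict:
    """Basic column-level checks on a hypothetical orders table."""
    results = {}

    # Comparative checks: values must satisfy predefined rules
    results["age_greater_than_zero"] = bool((df["age"] > 0).all())
    results["order_dates_in_2024"] = bool(df["order_date"].dt.year.eq(2024).all())

    # Range analysis: MIN, MAX, AVG, MED, MOD, STD for a numeric column
    amount = df["order_amount"]
    results["amount_stats"] = {
        "min": amount.min(),
        "max": amount.max(),
        "avg": amount.mean(),
        "median": amount.median(),
        "mode": amount.mode().iloc[0],
        "std": amount.std(),
    }

    # Null values: count missing entries per column to gauge completeness
    results["null_counts"] = df.isna().sum().to_dict()

    # Distinct values: an unexpected jump can signal duplication or corruption
    results["distinct_customer_ids"] = int(df["customer_id"].nunique())

    return results


# Tiny in-memory sample to illustrate the output
sample = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "age": [34, 29, 29, 41],
    "order_date": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-02-10", "2024-03-15"]),
    "order_amount": [120.0, 85.5, 85.5, None],
})
print(column_level_checks(sample))
```

In practice, checks like these are usually scheduled against production tables and the results fed into alerting, rather than printed ad hoc.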
Once individual columns are validated, it’s important to assess the integrity of the dataset as a whole. These checks focus on the relationships between columns and rows, and on their alignment with the expected structure. Types of table-level monitoring include the following; a code sketch follows the list:
Number of Rows: A sudden increase or decrease in the total number of rows beyond an expected range could indicate missing data or accidental duplications.
Consistency Between Columns: Check that logically related columns agree with each other. For example, ensure that a delivery date is always after the order date, or that a customer’s age aligns with their birth year. Inconsistencies between related columns often signal deeper data quality issues.
Grouped Column Statistics: Analyze column statistics grouped by dimensions drawn from another column. For instance, you could calculate the average sales amount grouped by region, or the number of orders grouped by product category, to detect anomalies or outliers within specific segments.
Row Duplication: Duplicate rows often occur at the overlap between successive ETL batches, and can inflate or distort results. Use checks to detect and remove row duplication, particularly in datasets where each row should be unique.
Data Freshness: Data tables are often refreshed or appended with new data on a regular cadence. If data becomes stale because it is not refreshed on schedule, it is no longer usable even if it was correct when first collected. And if downstream processing depends on the stale table, the problem propagates and degrades the quality of every table built on top of it.
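Here is a similar sketch at the table level, again assuming a hypothetical orders DataFrame. The `delivery_date`, `region`, and timezone-aware `loaded_at` columns, the expected row-count range, and the 24-hour staleness window are illustrative assumptions, not a prescribed schema.

```python
import pandas as pd

def table_level_checks(df: pd.DataFrame,
                       expected_rows: tuple,
                       max_staleness_hours: float = 24.0) -> dict:
    """Table-level checks on a hypothetical orders table."""
    results = {}

    # Number of rows: flag totals outside the expected (low, high) range
    low, high = expected_rows
    results["row_count_ok"] = low <= len(df) <= high

    # Consistency between columns: delivery date must never precede order date
    results["delivery_after_order"] = bool((df["delivery_date"] >= df["order_date"]).all())

    # Grouped column statistics: average order amount grouped by region
    results["avg_amount_by_region"] = df.groupby("region")["order_amount"].mean().to_dict()

    # Row duplication: fully duplicated rows often appear at ETL batch boundaries
    results["duplicate_rows"] = int(df.duplicated().sum())

    # Data freshness: latest load (assumed timezone-aware UTC) must fall
    # within the allowed staleness window
    age_hours = (pd.Timestamp.now(tz="UTC") - df["loaded_at"].max()).total_seconds() / 3600
    results["fresh"] = age_hours <= max_staleness_hours

    return results
```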
Beyond static checks, trends can provide valuable insights into the consistency of data over time. Monitoring key metrics as a time series (often called time series analysis) identifies patterns, trends, and anomalies in a sequence of data points collected over time. This technique allows organizations to detect and address gradual shifts in data quality. Types of trend monitoring include the following; a code sketch follows the list:
Detecting Anomalies and Outliers: Time series analysis can highlight unexpected changes in data, known as anomalies or outliers. These anomalies can indicate data quality issues, such as errors in data collection or processing. For example, if there’s a sudden spike in sales on a random day without any external cause (like a holiday or promotion), it could be an error in data entry or a system malfunction.
Monitoring for Drift: Drift in time series data occurs when statistical properties, like mean or variance, shift gradually over time. This could indicate unexpected changes in the data collection process, a shift in data source quality, or unaccounted changes in the underlying business process. If the average transaction amount of an e-commerce site starts increasing gradually over several months without a clear business reason (e.g., new premium product line), it could be due to incorrect data entry or system glitches inflating transaction values.
Analyzing Data for Cyclical Patterns (Seasonality): Many datasets have expected cyclical patterns, such as weekly, monthly, or seasonal cycles. Analyzing these patterns helps distinguish between regular fluctuations and actual data quality issues. Time series analysis breaks variation down into three main components: trend, seasonality, and residual (random noise), and seasonality itself may combine multiple cyclical patterns. By isolating these components, you can flag unexpected anomalies as potential data quality issues rather than mistaking them for seasonality. For example, if sales volume for a retail business this December is unusually low, it may indicate missing or incomplete data rather than an actual drop in demand.
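The sketch below shows one way these three techniques might look in code, using pandas and statsmodels’ `seasonal_decompose`. It assumes a complete (gap-free) daily metric such as sales volume; the window sizes, z-score threshold, and drift tolerance are arbitrary starting points, not recommendations.

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

def flag_outliers(series: pd.Series, window: int = 30, z_threshold: float = 3.0) -> pd.Series:
    """Flag points that deviate sharply from their recent rolling behaviour."""
    rolling_mean = series.rolling(window).mean()
    rolling_std = series.rolling(window).std()
    z_scores = (series - rolling_mean) / rolling_std
    return z_scores.abs() > z_threshold

def detect_drift(series: pd.Series, window: int = 90, tolerance: float = 0.10) -> bool:
    """Compare the latest window's mean with the preceding window's mean
    to spot gradual drift. Assumes at least 2 * window observations and a
    nonzero prior-window mean."""
    recent = series.iloc[-window:].mean()
    previous = series.iloc[-2 * window:-window].mean()
    return abs(recent - previous) / abs(previous) > tolerance

def flag_seasonal_residuals(series: pd.Series, period: int = 7, z_threshold: float = 3.0) -> pd.Series:
    """Decompose a complete daily series into trend, seasonality, and residual,
    then flag residuals that still look abnormal once seasonality is removed."""
    decomposition = seasonal_decompose(series, model="additive", period=period)
    resid = decomposition.resid.dropna()
    z_scores = (resid - resid.mean()) / resid.std()
    return z_scores.abs() > z_threshold
```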
Data is often consumed in the form of metrics, and perceived data quality problems may come not from the data itself but from how the metrics are defined and calculated. Such problems are often revealed through the higher-level or common-sense observations that data users rely on to judge the trustworthiness of their data. So how can data leaders identify metrics definition problems? It involves checking that metrics are properly correlated and designed in an unbiased way, then comparing them with similar data to ensure consistency. Common sources of metrics definition issues include:
Ambiguity and Inconsistency: If metrics are defined in vague or inconsistent terms, different teams or stakeholders might interpret them differently. Without data governance standards to define and validate metrics consistently, it becomes challenging to maintain data quality across the organization. For instance, a metric like “customer churn rate” may be calculated using various time frames or criteria by different departments. By defining shared, data-centric terms in a centralized location, like a business glossary, businesses can ensure consistency in how these terms are calculated, creating trust.
Data or Definition Bias: Certain groups or segments may be underrepresented, often because of an assumption that “one-size-fits-all” metrics apply universally. For example, unexpectedly high "average customer spending" can be driven by the exceptionally high spending of a few customers from wealthy neighborhoods. Segmenting by demographic group can reveal important patterns between groups, as sketched below.
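As a minimal sketch of that kind of segmentation check, the function below compares a metric’s overall average with its per-segment averages to surface skew that a single aggregate hides. The `neighborhood` and `customer_spend` names are hypothetical.

```python
import pandas as pd

def segment_metric_comparison(df: pd.DataFrame, metric: str, segment: str) -> pd.DataFrame:
    """Compare a metric's overall average with per-segment averages to surface
    cases where a one-size-fits-all definition hides skew from a few segments."""
    overall_avg = df[metric].mean()
    by_segment = df.groupby(segment)[metric].agg(["mean", "count"])
    by_segment["pct_diff_vs_overall"] = (by_segment["mean"] - overall_avg) / overall_avg * 100
    return by_segment.sort_values("pct_diff_vs_overall", ascending=False)

# Example: average customer spending broken down by a hypothetical neighborhood segment
spending = pd.DataFrame({
    "neighborhood": ["A", "A", "B", "B", "C", "C"],
    "customer_spend": [90.0, 110.0, 95.0, 105.0, 900.0, 1100.0],
})
print(segment_metric_comparison(spending, "customer_spend", "neighborhood"))
```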
Another powerful tool for addressing metrics inconsistencies is metrics correlation. Analyzing the correlation between metrics helps reveal unexpected relationships (or the lack thereof) that may indicate errors. Many datasets have metrics that should correlate in predictable ways. For example, sales volume and revenue should be positively correlated, since higher sales generally lead to more revenue. A weak or unexpected correlation might suggest data entry errors, missing data, or calculation errors in one or both metrics.
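A minimal sketch of such a correlation check might look like the following; the `units_sold` and `revenue` column names and the 0.8 threshold are illustrative assumptions.

```python
import pandas as pd

def check_metric_correlation(df: pd.DataFrame, metric_a: str, metric_b: str,
                             min_corr: float = 0.8) -> dict:
    """Flag metric pairs whose observed correlation is weaker than expected."""
    corr = df[metric_a].corr(df[metric_b])  # Pearson correlation by default
    return {"metrics": (metric_a, metric_b), "correlation": round(corr, 3), "ok": corr >= min_corr}

# Example: daily sales volume and revenue should correlate strongly
daily = pd.DataFrame({
    "units_sold": [120, 150, 90, 200, 170],
    "revenue": [2400.0, 3050.0, 1800.0, 4100.0, 3400.0],
})
print(check_metric_correlation(daily, "units_sold", "revenue"))
```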
Effective data quality monitoring is a journey that evolves from essential checks on individual columns to advanced, multi-faceted analyses that track trends, seasonal patterns, and metric integrity. By implementing increasingly sophisticated data monitoring techniques, organizations can stay ahead of potential issues, ensuring their data remains accurate, reliable, and actionable.
Maintaining high-quality data is not a "set-it-and-forget-it" task. Regular monitoring across various dimensions helps ensure that data continues to serve its purpose without introducing risks. As your data strategy evolves, so should your approach to monitoring, making it a key part of your organization’s long-term data management success.
Curious to learn how a data catalog can help you improve data quality? Read about Alation's Open Data Quality Framework, or book a demo with us to learn more.