Why Data Quality Assessment Matters
Imagine building a house on a foundation of sand. The most beautiful architecture, the finest craftsmanship, the best materials—none of it matters if the foundation is unstable. The house will eventually crack and collapse.
This is exactly what happens when you feed poor-quality data into AI systems. The most sophisticated algorithms, the largest models, the most powerful computers—none of it matters if the data is unreliable. The AI system will produce unreliable outputs, no matter how impressive the technology is.
Data quality assessment is about identifying those foundation problems before you build on them. It is the systematic evaluation of whether your data will reliably support the AI systems you want to build. Without this assessment, you are operating blind, hoping the foundation holds without actually checking.
Professional data practitioners approach quality assessment with healthy skepticism. Rather than assuming data is good until proven otherwise, they assume data has problems and work systematically to identify and quantify them. This defensive mindset prevents disasters.
The Seven Dimensions of Data Quality
Data quality is not a single binary property: data is not simply "good" or "bad." Instead, quality exists across seven distinct dimensions, each with its own characteristics, measurement approaches, and remediation strategies. Understanding all seven is essential for comprehensive quality assessment.
Dimension 1: Accuracy
Accuracy measures whether data correctly represents reality. An accurate customer email address is one that actually works. An accurate salary is one that reflects what someone actually earns. An accurate product price is one that matches what customers actually pay.
Inaccurate data leads AI systems to learn incorrect patterns. If you train a sales forecasting model on historical revenue data that contains errors, the model learns from those errors and perpetuates them in predictions.
Assessing accuracy often requires comparing your data against an authoritative source. For customer records, you might validate email addresses by sending a confirmation email. For transaction data, you might reconcile against bank statements. For product information, you might verify against supplier documentation. The validation method depends on the data type and available reference sources.
For many datasets, no perfect reference source exists. Historical customer data cannot be revalidated against the original source. In these cases, you establish accuracy estimates through sampling. You audit a random sample of records against available evidence (customer confirmation, historical records, business logic rules) and project the error rate across the full dataset.
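The sampling approach above can be sketched in code. This is a minimal illustration, not a production auditing tool: `audit_fn` stands in for whatever evidence check you apply to each record, and the margin of error uses the standard normal approximation for a proportion.

```python
import math
import random

def estimate_error_rate(records, audit_fn, sample_size=300, seed=42):
    """Estimate a dataset-wide error rate from a random sample.

    audit_fn(record) -> True if the record fails the accuracy audit.
    Returns (error_rate, margin_of_error) at roughly 95% confidence.
    """
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    errors = sum(1 for r in sample if audit_fn(r))
    p = errors / len(sample)
    # Normal-approximation 95% margin of error for a proportion
    moe = 1.96 * math.sqrt(p * (1 - p) / len(sample))
    return p, moe
```

Projecting the sample error rate across the full dataset is only as good as the audit rule itself, so document exactly what evidence each sampled record was checked against.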
Dimension 2: Completeness
Completeness measures whether all necessary data is present. A complete customer record includes all required fields: name, email, phone, address. A complete transaction record includes date, amount, product, customer, status. Completeness is measured by the percentage of required fields that have values.
Missing data is one of the most common data quality problems. Databases contain null values, blank fields, and records with incomplete information. Some missing data is innocent (a customer didn't provide a phone number). Some is problematic (incomplete transaction records due to system errors). All missing data must be identified, quantified, and resolved before AI processing.
Assess completeness by examining each field that your AI system needs. Calculate the percentage of non-null values for each field. Fields with high missing percentages are either problematic (the source system is dropping data) or unnecessary (they can be removed from your AI dataset). Fields with low missing percentages are acceptable if the missing records can be handled through deletion or imputation strategies covered in Chapter 6.2.
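As a sketch of that calculation, the function below reports the percentage of non-null values per required field. The set of values treated as "missing" (`None`, empty string, a literal `"NULL"`) is an assumption; adjust it to match how your source system encodes absent data.

```python
def completeness_report(records, required_fields):
    """Percentage of non-null, non-blank values for each required field."""
    report = {}
    for field in required_fields:
        filled = sum(
            1 for r in records
            if r.get(field) not in (None, "", "NULL")  # assumed missing markers
        )
        report[field] = round(100 * filled / len(records), 1)
    return report
```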
Dimension 3: Consistency
Consistency measures whether data is formatted uniformly across the dataset and across related datasets. Phone numbers should all follow the same format: either all (555) 555-5555 or all 5555555555, not a mixture. Dates should be in consistent format: either all MM/DD/YYYY or all DD/MM/YYYY. Product names should be spelled the same way throughout: either "iPhone 15" or "iphone15", not a mixture.
Inconsistency creates problems because AI systems interpret formatting as information. If some phone numbers have parentheses and others do not, the model might learn spurious patterns based on formatting rather than the actual phone numbers. If some dates use slashes and others use hyphens, the model must learn to handle both formats, adding unnecessary complexity.
Assess consistency by sampling records and checking formatting uniformity. Look for patterns in how data is entered. Check for spelling variations in categorical fields. Examine number formatting in numerical fields. Look for mixed cases (uppercase, lowercase, mixed case) in text fields. Consistency issues are systematic and remediation follows clear rules—standardize everything to one format.
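One way to make that check concrete is to profile a field against a set of named format patterns: a consistent field puts every value into a single bucket, while a mixture signals standardization work. The pattern names and regexes below are illustrative assumptions for the phone-number example.

```python
import re

def format_profile(values, patterns):
    """Count how many values match each named regex pattern.

    patterns: dict of name -> regex; unmatched values land in "other".
    """
    counts = {name: 0 for name in patterns}
    counts["other"] = 0
    for v in values:
        for name, pat in patterns.items():
            if re.fullmatch(pat, v):
                counts[name] += 1
                break
        else:
            counts["other"] += 1
    return counts

# Example: two competing phone formats plus free-form junk
PHONE_PATTERNS = {
    "parens": r"\(\d{3}\) \d{3}-\d{4}",  # (555) 555-5555
    "digits": r"\d{10}",                 # 5555555555
}
```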
Dimension 4: Timeliness
Timeliness measures whether data is current and updated frequently enough for your use case. A customer phone number from 2020 might be outdated. Employee salary data from last year might no longer reflect current compensation. Inventory counts from yesterday might be stale if inventory turns over hourly.
Timeliness requirements depend on your use case. Marketing analytics can tolerate data refreshed monthly. Real-time fraud detection requires data updated continuously. Strategic planning can use data refreshed quarterly. The key is understanding your tolerance for staleness and ensuring data updates at least that frequently.
Assess timeliness by understanding the data update schedule. When was each record last updated? How long ago was the overall dataset refreshed? Does the update frequency match your use case requirements? Is there a lag between when events occur and when they are recorded in the system? These questions determine whether your data is fresh enough for AI processing.
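A simple staleness check can answer the first of those questions mechanically: given each record's last-updated timestamp and your tolerance for staleness, report the fraction of records that exceed it. This is a sketch; the timestamps and tolerance are assumptions you would pull from your own system.

```python
from datetime import datetime, timedelta

def staleness_rate(last_updated, max_age, now=None):
    """Fraction of records whose last update is older than max_age."""
    now = now or datetime.now()
    stale = sum(1 for ts in last_updated if now - ts > max_age)
    return stale / len(last_updated)
```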
Dimension 5: Validity
Validity measures whether data conforms to the required format and structure. A phone number should be numeric and ten digits long. An email address should contain an @ symbol and a domain. An age should be numeric and within a reasonable range (0-120). A date should be in the correct date format and represent a real date.
Invalid data often results from data entry errors, system failures, or data corruption. A phone number like "INVALID" is invalid. A date like "02/30/2026" is invalid (February does not have 30 days). An email like "bob@nodomain" might be invalid if your requirements demand a TLD (top-level domain) like .com.
Assess validity by defining schemas for each field: what format is required, what range of values is acceptable, what rules must data follow? Then examine your dataset against these schemas. Data validation tools can automatically check millions of records against defined rules. Fields with high invalidity rates indicate data entry problems, system bugs, or incorrect schema definitions that need clarification.
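A minimal schema can be expressed as a rule per field. The rules below are illustrative assumptions matching the examples in this section (ten-digit phone, email with a TLD, age 0-120); a real schema would come from your own field definitions.

```python
import re

# Assumed validity rules for the example fields in this section
SCHEMA = {
    "phone": lambda v: isinstance(v, str) and re.fullmatch(r"\d{10}", v),
    "email": lambda v: isinstance(v, str)
             and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v),  # requires a TLD
    "age":   lambda v: isinstance(v, int) and 0 <= v <= 120,
}

def invalid_counts(records, schema=SCHEMA):
    """Count records that fail each field's validity rule."""
    counts = {field: 0 for field in schema}
    for r in records:
        for field, rule in schema.items():
            if field in r and not rule(r[field]):
                counts[field] += 1
    return counts
```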
Dimension 6: Uniqueness
Uniqueness measures whether data contains unnecessary duplicates. A customer database should have one record per customer. A transaction log should not duplicate the same transaction twice. A product catalog should not list the same product multiple times.
Duplicates arise from various sources: data import errors that duplicate records, system bugs that create multiple entries, manual data entry that creates records without checking for existing entries, or merging datasets that contain overlapping information.
Assess uniqueness by examining primary keys and unique identifiers. Check how many records share the same email, phone, tax ID, or other identifier that should be unique. Look for near-duplicates where records are almost identical but not exactly (slightly different spelling, transposed numbers). Quantify the duplicate rate. Even 5-10% duplication can distort AI analysis, so identification is critical before processing.
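Quantifying the exact-duplicate rate on a supposedly unique key is straightforward; the sketch below counts the share of records involved in any key collision. Near-duplicate detection (fuzzy matching on spelling variants) needs more machinery and is not shown here.

```python
from collections import Counter

def duplicate_rate(records, key):
    """Share of records whose value for `key` appears more than once."""
    counts = Counter(r[key] for r in records)
    involved = sum(c for c in counts.values() if c > 1)
    return involved / len(records)
```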
Dimension 7: Integrity
Integrity measures whether data is complete, uncorrupted, and logically consistent across relationships. A customer record's country field should match the country code in their address. An order's total should equal the sum of line items. A product's inventory should never be negative. Parent-child relationships in hierarchical data should be properly maintained.
Integrity violations often indicate data corruption, system bugs, or broken relationships. They are harder to detect than other quality problems because they require understanding the logical relationships between fields and across records. A field can pass format validation yet still violate integrity: a correctly formatted country code is wrong if it contradicts the country in the customer's address.
Assess integrity by understanding your data model and business rules. Check referential integrity: do foreign keys point to existing parent records? Check business rule compliance: do field values make logical sense together? Check calculation accuracy: are computed fields correctly calculated? These deeper quality checks catch problems that simpler validation methods miss.
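The checks above can be sketched for the order example from this section: one referential check (does each order's customer exist?) and one business-rule check (does the order total equal the sum of its line items?). Field names are illustrative.

```python
def integrity_violations(orders, line_items, customer_ids):
    """Flag orders that break referential or business rules.

    customer_ids: set of known customer IDs (the parent table's keys).
    Returns a list of (order_id, reason) tuples.
    """
    items_by_order = {}
    for item in line_items:
        items_by_order.setdefault(item["order_id"], []).append(item)

    violations = []
    for o in orders:
        # Referential integrity: foreign key must point to an existing parent
        if o["customer_id"] not in customer_ids:
            violations.append((o["id"], "missing customer"))
        # Business rule: stored total must equal the sum of line items
        total = sum(i["amount"] for i in items_by_order.get(o["id"], []))
        if abs(total - o["total"]) > 0.01:
            violations.append((o["id"], "total mismatch"))
    return violations
```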
The Data Quality Audit Process
Systematic assessment follows a repeatable process that creates a baseline understanding of your data quality:
Step 1: Define Quality Requirements
What quality do you need? The answer depends on your use case. Financial reporting might require 99.9% accuracy. Marketing audience building might accept 90% accuracy. High-stakes decisions such as hiring might require 99%+ accuracy. Define explicit quality targets for each dimension before assessing current state.
Step 2: Sample Your Data
Audit a random sample of records rather than examining every record. A properly sized sample (typically 200-400 records for most datasets) provides statistically reliable estimates of quality issues while remaining manageable for manual review. Statistical sampling allows you to project findings across the full dataset.
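The 200-400 figure is consistent with the standard sample-size formula for estimating a proportion: at 95% confidence, a ±5% margin of error, and the worst-case proportion of 0.5, it yields 385 records. A sketch:

```python
import math

def sample_size(margin_of_error=0.05, p=0.5, z=1.96):
    """Minimum sample size to estimate a proportion within +/- margin_of_error.

    p=0.5 is the worst case; z=1.96 corresponds to 95% confidence.
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)
```

Tightening the margin of error or confidence level grows the required sample quickly, which is why manual audits rarely go far beyond a few hundred records.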
Step 3: Assess Each Dimension
For each of the seven dimensions, determine whether your data meets your requirements. Calculate metrics: percentage of missing values for completeness, percentage of formatting errors for consistency, percentage of invalid values for validity, percentage of duplicates for uniqueness. Document findings in a quality scorecard.
Step 4: Identify Root Causes
When you find quality problems, investigate why they exist. Is inaccurate data due to manual entry errors or system bugs? Are missing values because the source system does not capture that information? Are duplicates from bad merge logic? Understanding root causes enables targeted remediation rather than treating symptoms.
Step 5: Quantify Impact
Estimate how quality issues will affect your AI results. If 10% of customer records have invalid email addresses, what is the impact on customer communication accuracy? If 5% of records are duplicates, how does this affect segmentation accuracy? Quantifying impact helps prioritize remediation efforts.
Creating a Data Quality Scorecard
A quality scorecard documents assessment findings in a format that communicates clearly to stakeholders. Create a simple scorecard for each dataset:
| Quality Dimension | Target | Current | Status |
|---|---|---|---|
| Accuracy | 98% | 92% | At Risk |
| Completeness | 99% | 96% | At Risk |
| Consistency | 100% | 88% | Critical |
| Timeliness | Weekly refresh | Monthly refresh | At Risk |
| Validity | 100% | 99.5% | Acceptable |
| Uniqueness | 100% | 98% | Acceptable |
| Integrity | 100% | 95% | At Risk |
This scorecard immediately shows which dimensions need attention before data is ready for AI processing. Consistency issues are critical and must be resolved. Multiple at-risk dimensions suggest the data needs significant preparation before use.
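Status labels can be assigned mechanically from the gap between target and current. The thresholds below are assumptions chosen to reproduce the example scorecard above (gap of 2 points or less is Acceptable, 10 or more is Critical); calibrate them to your own risk tolerance.

```python
def status(target, current, acceptable_gap=2.0, critical_gap=10.0):
    """Map a target/current gap (in percentage points) to a status label."""
    gap = target - current
    if gap <= acceptable_gap:
        return "Acceptable"
    if gap >= critical_gap:
        return "Critical"
    return "At Risk"
```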
Key Takeaway
Data quality assessment across seven dimensions—accuracy, completeness, consistency, timeliness, validity, uniqueness, and integrity—is the foundation of successful AI. Before processing any data, systematically evaluate it against these dimensions, create a quality scorecard, and identify which problems must be addressed through data cleaning and transformation. This assessment phase separates professionals who produce reliable AI from amateurs who hope data will cooperate.
The investment in quality assessment pays enormous dividends: fewer surprises in production, faster model development, more reliable AI outputs, and dramatically reduced debugging time when problems arise.
What Comes Next
Armed with a complete understanding of your data quality, Chapter 6.2 moves into remediation. You will learn practical cleaning techniques including deduplication strategies, missing value handling methods, standardization approaches, and outlier treatment. The assessment from this chapter becomes the action plan for Chapter 6.2.