Level 2 · Chapter 6.2

Cleaning & Transformation

Data rarely arrives clean and ready to use. This chapter teaches practical techniques for removing duplicates, handling missing values, standardizing formats, and treating outliers. Learn the hands-on skills that turn messy real-world data into AI-ready datasets.


Why Data Cleaning Is Critical

You assessed your data quality in Chapter 6.1 and discovered the problems. Now comes the hard work: fixing them. Data cleaning is where theory meets reality. It is unglamorous, detailed work that requires patience and systematic thinking. It is also absolutely critical.

The common saying in data science is "garbage in, garbage out." A better version is "garbage in, garbage out, after spending weeks on analysis." Cleaning your data upfront prevents weeks of wasted effort later on models built from unreliable foundations.

This chapter covers the most common cleaning challenges and proven techniques for addressing them. You will learn deduplication strategies that identify duplicate records, missing value handling that prevents analysis from breaking, standardization that brings consistency to messy fields, and outlier treatment that distinguishes unusual-but-valid values from errors that can distort analysis.

Document Your Decisions

Every cleaning decision should be documented. When you delete duplicates, log which fields constituted the identity. When you impute missing values, document the method and the percentage imputed. This documentation becomes crucial if the data needs to be re-cleaned, if stakeholders question your decisions, or if you need to explain your process to data governance teams.

Deduplication: Identifying and Removing Duplicates

Duplicate records inflate dataset size and distort analysis. A customer database with 100,000 records that actually represent only 85,000 unique customers carries 15,000 duplicates, and every analysis built on it inherits that distortion. Models trained on this data will learn patterns from the duplicated examples, overfitting to those specific records.

Exact Duplicates

Exact duplicates are records where every field matches another record perfectly. These are the easiest to identify and remove. Most databases and tools have built-in duplicate detection that identifies rows where all values match. The challenge is deciding which duplicate to keep and which to delete. Keep the first occurrence and delete subsequent ones, or examine timestamps and keep the most recent entry. Document your choice.
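
As a minimal sketch (assuming records are Python dicts, with the "keep the first occurrence" policy), exact-duplicate removal looks like this:

```python
def drop_exact_duplicates(records):
    """Keep the first occurrence of each fully identical record."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))  # identity = every field matching
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

customers = [
    {"name": "John Smith", "email": "john@example.com"},
    {"name": "John Smith", "email": "john@example.com"},  # exact duplicate
    {"name": "Jane Doe", "email": "jane@example.com"},
]
cleaned = drop_exact_duplicates(customers)  # 2 unique records remain
```

To keep the most recent entry instead, sort the records by timestamp (newest first) before deduplicating.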

Near Duplicates

Near duplicates are trickier. A customer named "John Smith" living at "123 Main St, Boston MA" might also appear as "Jon Smith" at "123 Main Street, Boston Massachusetts." These are almost certainly the same person, but not an exact match. Catching them requires fuzzy matching: algorithms that score how similar two records are and flag pairs that are very close but not identical.

Fuzzy matching compares fields using string similarity metrics. If two records have customer names that are 95% similar and addresses that are 90% similar, they are probably the same person. Set thresholds for what constitutes a match based on your domain knowledge. Be conservative at first (require 99% similarity) and gradually relax thresholds if you are not catching obvious duplicates.
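
One way to score string similarity is the standard library's difflib; the field names and thresholds below are illustrative assumptions, not a fixed recipe:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Score string similarity from 0.0 to 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_near_duplicate(rec_a, rec_b, name_threshold=0.90, addr_threshold=0.85):
    """Flag a pair as near duplicates only if both fields clear their thresholds."""
    return (similarity(rec_a["name"], rec_b["name"]) >= name_threshold
            and similarity(rec_a["address"], rec_b["address"]) >= addr_threshold)

score = similarity("John Smith", "Jon Smith")  # high (~0.95)
```

Note that "123 Main Street, Boston Massachusetts" scores much lower against the abbreviated "123 Main St, Boston MA" than a human would expect, which is one reason standardization (covered below) is usually done before fuzzy matching rather than after.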

The Deduplication Process

Step 1: Identify Key Fields - Which fields define uniqueness? For customers, this might be email or phone. For transactions, this might be date, amount, and account. For products, this might be SKU or UPC.

Step 2: Run Exact Match - Use database tools to find exact duplicates on key fields. Remove these first.

Step 3: Run Fuzzy Match - Apply fuzzy matching on key fields with high similarity thresholds. Manually review matches that exceed threshold to confirm before deletion.

Step 4: Document & Validate - Document all duplicates removed, including counts and methods. Validate that remaining records are indeed unique.

Handling Missing Values

Missing values are one of the most common data quality problems. A customer database might have null phone numbers in 25% of records, missing secondary addresses in 40%, and missing email addresses in 5%. Each null represents a decision point: delete the record, estimate the value, or leave it as-is.

Understanding Missing Value Patterns

Before deciding how to handle missing values, understand why they are missing. Missing values fall into three categories: Missing Completely At Random (MCAR) where the missingness has no pattern, Missing At Random (MAR) where missingness relates to other fields, and Missing Not At Random (MNAR) where the missingness itself carries information. A customer's phone number might be MNAR if customers who don't provide phone numbers systematically differ from those who do.

Strategy 1: Deletion

The simplest approach is deleting records with missing values. If only 1% of records have missing data, deletion is reasonable. If 30% of records have some missing value, deletion is wasteful. Use deletion when missing data is rare and the lost information is not critical to your analysis.
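
A sketch of a deletion guard (the function name and 5% cutoff are illustrative assumptions) that refuses to delete when missingness is too common:

```python
def delete_missing(records, field, max_missing_rate=0.05):
    """Drop records missing `field`, but only when missingness is rare."""
    missing = sum(1 for r in records if r.get(field) in (None, ""))
    rate = missing / len(records)
    if rate > max_missing_rate:
        raise ValueError(f"{rate:.0%} of records missing {field!r}: deletion too wasteful")
    return [r for r in records if r.get(field) not in (None, "")]

records = [{"email": f"user{i}@example.com"} for i in range(49)] + [{"email": None}]
kept = delete_missing(records, "email")  # 2% missing: safe to delete
```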

Strategy 2: Imputation

Imputation replaces missing values with estimates. Common methods include:

  • Mean/Median imputation: Replace missing numeric values with the mean (average) or median (middle value) of observed values. Simple but may distort distributions.
  • Forward fill: Carry the previous observed value forward. Works well for time series where values change gradually.
  • Regression imputation: Build a model to predict the missing value based on other fields. More sophisticated but requires building sub-models.
  • Domain-based imputation: Use domain knowledge. Missing age might be imputed with population average age for the region, occupation, or other demographic segment.
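
The first two methods above can be sketched with the standard library (function names are illustrative):

```python
from statistics import mean, median

def impute_numeric(values, method="median"):
    """Fill None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if method == "mean" else median(observed)
    return [fill if v is None else v for v in values]

def forward_fill(values):
    """Carry the last observed value forward (for gradual time series)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled
```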

Strategy 3: Missing Indicator

Rather than imputing the missing value, create a new binary field that indicates whether the original field was missing. This preserves the information that a value was missing, which itself might be predictive. A customer with a missing phone number might have different behavior than one with a phone number.
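
A minimal sketch of a missing indicator, assuming dict records (the flag naming convention is an assumption):

```python
def add_missing_indicator(records, field):
    """Add a binary <field>_missing flag so the fact of missingness survives."""
    for rec in records:
        rec[f"{field}_missing"] = 1 if rec.get(field) in (None, "") else 0
    return records

flagged = add_missing_indicator(
    [{"phone": None}, {"phone": "555-555-5555"}], "phone"
)
```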

Standardization: Bringing Consistency to Data

Consistency problems make data unusable for AI systems. Phone numbers formatted as (555) 555-5555, 555-555-5555, and 5555555555 are identical information in three different formats. AI systems interpret format as information, so standardizing everything to one format is critical.

Text Standardization

Text standardization creates uniform formatting. For phone numbers, choose one format and convert all others. For dates, standardize to ISO 8601 format (YYYY-MM-DD). For addresses, use USPS abbreviations for states (CA not California) and standardize the address format. For names, standardize capitalization (all Title Case or all lowercase), handle titles consistently (Dr., Dr, etc.), and standardize spacing.
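
For the phone number example, one possible normalizer (assuming 10-digit US numbers and choosing 555-555-5555 as the standard):

```python
import re

def standardize_phone(raw):
    """Normalize US phone numbers to one chosen format: 555-555-5555."""
    digits = re.sub(r"\D", "", raw)               # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                        # drop the country code
    if len(digits) != 10:
        return None                                # route to manual review
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
```

Returning None for unparseable values, rather than guessing, keeps bad inputs visible for manual review.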

Categorical Standardization

Categorical fields like product types, regions, or status values often have inconsistent spelling or formatting. A product catalog might list "iphone15", "iPhone 15", "iPhone15Pro", and "iphone 15" as separate products. Standardize these to a single canonical value. Build a standardization lookup table that maps every variation to the correct value.
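
A standardization lookup table can be as simple as a dict keyed on a normalized form of the value (the canonical names below are illustrative):

```python
CANONICAL = {
    "iphone15": "iPhone 15",
    "iphone 15": "iPhone 15",
    "iphone15pro": "iPhone 15 Pro",
}

def standardize_category(value, lookup=CANONICAL):
    """Map a messy categorical value to its canonical form."""
    key = value.strip().lower()
    return lookup.get(key, value)  # unknown values pass through for review
```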

Numeric Standardization

Numeric fields should have consistent units and precision. Prices should be in the same currency. Weights should use the same unit (pounds or kilograms, not both). Percentages should use consistent scales (0-100 or 0-1, not both). Convert everything to a single standard before processing.
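
A unit-conversion sketch for the weight example, assuming kilograms are chosen as the standard:

```python
KG_PER_LB = 0.45359237  # exact pounds-to-kilograms factor

def standardize_weight(value, unit):
    """Convert all weights to a single standard unit (kilograms here)."""
    unit = unit.strip().lower()
    if unit in ("kg", "kilogram", "kilograms"):
        return value
    if unit in ("lb", "lbs", "pound", "pounds"):
        return round(value * KG_PER_LB, 3)
    raise ValueError(f"unknown weight unit: {unit!r}")
```

Raising on unknown units, instead of silently passing the value through, surfaces inconsistencies early.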

Outlier Detection and Treatment

Outliers are extreme values that differ significantly from the rest of the data. A customer with a $1,000,000 purchase when typical purchases are $100 is an outlier. An employee recorded as working 200 hours in a week, when the standard maximum is 40, is an outlier. Some outliers represent real, valid data. Others represent errors or fraud.

Detecting Outliers

Statistical methods identify values that fall outside expected ranges. The 3-sigma rule marks values more than 3 standard deviations from the mean as outliers. The IQR (Interquartile Range) method marks values below Q1-1.5*IQR or above Q3+1.5*IQR as outliers. These methods work well for normally distributed data but less well for skewed distributions.
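
The IQR method can be sketched with the standard library's quartile function:

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

purchases = [100] * 20 + [1_000_000]
extreme = iqr_outliers(purchases)  # flags the $1,000,000 purchase
```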

Domain-based methods use business rules to identify outliers. A salary exceeding $1,000,000 might trigger an outlier flag for most jobs. Employee hours exceeding 60 per week might trigger review. A customer spending 10x their average order value might be flagged for verification.

Treating Outliers

Keep them as-is: If outliers are real and valid, keep them. Do not artificially remove real data from your dataset.

Cap or floor: Replace extreme outliers with reasonable maximum or minimum values. Cap salaries at the 99th percentile, floor ages at 18 for employment records.

Remove them: If outliers represent errors or fraud, delete those records.

Flag them: Create a binary field indicating whether a record is an outlier. Let the AI model learn how to handle them based on the flag.
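
The capping and flagging treatments can be sketched as follows (the percentile and threshold values are illustrative):

```python
from statistics import quantiles

def cap_at_percentile(values, pct=99):
    """Winsorize: replace values above the pct-th percentile with that percentile."""
    cut = quantiles(values, n=100)[pct - 1]
    return [min(v, cut) for v in values]

def flag_outliers(values, threshold):
    """Pair each value with a binary outlier flag based on a domain rule."""
    return [(v, v > threshold) for v in values]

capped = cap_at_percentile([10] * 99 + [1000])   # the extreme value is pulled down
flags = flag_outliers([100, 5000], threshold=1000)
```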

Quality Control During Cleaning

Cleaning data creates risk. You might accidentally delete valid records, impute incorrect values, or introduce new errors through standardization logic. Implement quality controls to catch these problems:

Row counts: Track how many records you started with and how many you have after each cleaning step. Large drops might indicate problems with your cleaning logic.

Value distributions: Compare the distribution of values before and after cleaning. If you are imputing missing values with means, the distribution should be similar but with less variance.
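
A before/after distribution check might look like this (the function name and 5% mean-shift tolerance are assumptions):

```python
from statistics import mean, stdev

def distribution_check(before, after, max_mean_shift=0.05):
    """Compare mean and spread before and after cleaning."""
    shift = abs(mean(after) - mean(before)) / abs(mean(before))
    return {
        "mean_shift": shift,
        "std_before": stdev(before),
        "std_after": stdev(after),
        "mean_ok": shift <= max_mean_shift,
    }

observed = [10, 20, 30]       # values before imputation
imputed = [10, 20, 30, 20]    # one missing value filled with the mean
report = distribution_check(observed, imputed)
```

As the text predicts, mean imputation leaves the mean unchanged here while the standard deviation shrinks.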

Random sampling: After cleaning, sample 50-100 records and manually review them. Look for obvious errors or concerning patterns that your automated checks missed.

Validation against source: Sample records from your cleaned dataset and validate them against the original source system. If 10 sampled records all check out, you probably did not introduce errors.

Documenting Data Lineage

Data lineage is the documented path of how data flowed from source through cleaning to final form. Maintain a lineage document that shows: original data source and quality assessment, cleaning steps applied with dates and responsible party, business logic for handling issues, and final output characteristics. This documentation enables you to:

  • Explain to stakeholders what was done and why
  • Reproduce the cleaning process if needed
  • Debug problems if cleaned data is found to have issues
  • Defend your decisions if challenged by data governance or compliance teams
  • Train others on your cleaning procedures

Key Takeaway

Data cleaning is the bridge between data quality assessment and AI-ready datasets. Deduplication eliminates duplicate records that distort analysis. Missing value handling decides whether to delete, impute, or flag incomplete records. Standardization brings consistency to formatting and categorization. Outlier treatment carefully distinguishes real anomalies from errors. Throughout cleaning, maintain documentation of your decisions and rigorous quality control to ensure that your cleaning process does not introduce new errors.

The time invested in careful, documented cleaning pays enormous dividends in more reliable AI models, faster model development, and fewer surprises in production.

What Comes Next

With clean data in hand, Chapter 6.3 addresses structuring. Different AI systems and use cases require different data formats and structures. You will learn how to convert unstructured data into AI-ready formats, design effective schemas, and prepare data that improves AI processing accuracy and quality.

Chapter Details
Reading Time: ~50 minutes
Difficulty: Intermediate
Prerequisites: Chapter 6.1
