Level 2 · Chapter 6.4

Privacy-Preserving Data Preparation

Using personal data for AI requires protecting individual privacy while preserving analytical value. This chapter teaches techniques for anonymization, pseudonymization, data minimization, and privacy-compliant data governance that enable responsible AI.


Why Privacy Protection is Critical

Privacy is not just an ethical obligation—it is increasingly a legal one. GDPR in Europe, CCPA in California, and emerging regulations globally restrict how organizations can collect, use, and process personal data. These regulations require that data be used only for stated purposes, that individuals can request deletion, and that security measures protect against unauthorized access.

Beyond compliance, privacy protection is fundamental to responsible AI. Data represents real people, and those people deserve protection. AI practitioners must build privacy protection into data preparation, not as an afterthought.

This chapter teaches practical techniques for preparing data in privacy-preserving ways: anonymization that removes personally identifying information, pseudonymization that replaces names with codes while preserving linkages, data minimization that includes only necessary fields, and governance practices that control access and track usage.

Privacy vs Utility Tradeoff

Every privacy protection technique reduces what you can do with data. Anonymization removes the ability to link back to or update individual records. Aggregation sacrifices individual-level insight. This tradeoff is unavoidable. The key is achieving sufficient privacy protection for your use case while retaining enough utility to make the AI valuable. There is no universal answer—only tradeoffs to be understood and explicitly chosen.

Anonymization: Removing Identifying Information

Anonymization removes information that directly identifies individuals: names, social security numbers, email addresses, phone numbers, and dates of birth. Properly anonymized data cannot be linked back to individuals.

Removing Direct Identifiers

Direct identifiers are fields that explicitly identify people: names, SSN, email, phone, government ID. Removing these fields is the first step in anonymization. If your analysis does not need individual names, delete the name field. If you need customer purchase patterns but not individual identity, delete the name but keep purchase amounts and dates.
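Stripping direct identifiers can be as simple as filtering fields against a denylist. A minimal sketch, assuming a record is a Python dict; the field names here are illustrative, not prescribed by any regulation:

```python
# Illustrative set of direct-identifier field names (an assumption,
# not an exhaustive or standard list).
DIRECT_IDENTIFIERS = {"name", "ssn", "email", "phone", "government_id"}

def remove_direct_identifiers(record: dict) -> dict:
    """Return a copy of the record with direct-identifier fields removed."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

record = {
    "name": "John Smith",
    "email": "j@example.com",
    "purchase_amount": 129.99,
    "purchase_date": "2024-03-15",
}
print(remove_direct_identifiers(record))
# Keeps purchase_amount and purchase_date; drops name and email.
```

Note this removes only direct identifiers; quasi-identifiers, discussed next, need additional handling.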

Managing Quasi-Identifiers

Quasi-identifiers are fields that can identify individuals when combined: ZIP code + age + gender might uniquely identify someone in a small town. A patient record with ZIP code + age + admission date might identify a specific person if the hospital is small. Quasi-identifiers require careful handling.

Techniques include: generalization (replace specific values with ranges: age 27 becomes "20-30"), suppression (delete sensitive combinations), and aggregation (report only summary statistics, not individual records). The goal is ensuring that re-identification is impossible or impractical.
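Generalization is straightforward to implement. A sketch of the two generalizations used in this chapter's examples; the bucket width and number of retained ZIP digits are policy choices, not fixed standards:

```python
def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with a range, e.g. 27 -> '20-30'."""
    low = (age // width) * width
    return f"{low}-{low + width}"

def generalize_zip(zip_code: str, keep: int = 3) -> str:
    """Keep only a ZIP prefix, e.g. '94301' -> '943xx'."""
    return zip_code[:keep] + "x" * (len(zip_code) - keep)

print(generalize_age(27))       # "20-30"
print(generalize_zip("94301"))  # "943xx"
```

The wider the ranges, the stronger the privacy protection and the greater the utility loss—the tradeoff described above, made concrete as parameters.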

K-Anonymity Standard

K-anonymity is a formal privacy standard: each combination of quasi-identifiers appears in at least k records. If you have a dataset of 100,000 customers and each ZIP+age+gender combination appears at least 100 times (k=100), an attacker cannot identify individuals even if they know someone's ZIP, age, and gender.

To achieve k-anonymity, generalize quasi-identifiers until each combination is sufficiently common. Age 25, ZIP 94301, Gender Female might be too specific (k=5). Age 20-30, ZIP 943xx, Gender Female might have k=450, meeting the standard.
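Checking whether a dataset meets a k-anonymity target reduces to counting quasi-identifier combinations. A minimal sketch, assuming records are dicts and quasi-identifiers have already been generalized:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the dataset's k: the size of the smallest group sharing
    the same quasi-identifier combination."""
    counts = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return min(counts.values())

records = [
    {"age": "20-30", "zip": "943xx", "gender": "F"},
    {"age": "20-30", "zip": "943xx", "gender": "F"},
    {"age": "30-40", "zip": "943xx", "gender": "M"},
]
print(k_anonymity(records, ["age", "zip", "gender"]))  # 1 -> not even 2-anonymous
```

If the returned k falls below your target, generalize further (wider age ranges, shorter ZIP prefixes) or suppress the rare combinations, then re-check.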

Pseudonymization: Replacing Identifiers with Codes

Pseudonymization replaces direct identifiers (names, SSN) with opaque codes while preserving the ability to link records together. A customer named John Smith becomes Customer_12849. All records for this customer can still be linked using the code, but the code reveals no information about the person's identity.

Implementing Pseudonymization

Step 1: Create a key table mapping original identifiers to pseudonyms. Original SSN "123-45-6789" maps to Pseudonym "PSE_09234". Keep this key table secure and separate from the pseudonymized data.

Step 2: Replace identifiers in your dataset using the key table. Any reference to the original identifier becomes the pseudonym.

Step 3: Secure the key table. Access should be extremely restricted. Ideally, even people running analysis on pseudonymized data do not have access to the key table.

Step 4: Document linkages carefully. Only authorized personnel should know how to re-identify individuals using the key table.
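The steps above can be sketched in a few lines. This toy version keeps the key table in memory for illustration; in practice it must live in separate, access-controlled storage. The `PSE_` prefix and code format are assumptions matching the chapter's example:

```python
import secrets

class Pseudonymizer:
    """Sketch of a key table mapping original identifiers to opaque codes.
    In production, the key table is stored separately from the data and
    access to it is tightly restricted (Steps 3 and 4)."""

    def __init__(self):
        self._key_table = {}  # original identifier -> pseudonym

    def pseudonymize(self, identifier: str) -> str:
        # Random codes reveal nothing about the original identifier.
        if identifier not in self._key_table:
            self._key_table[identifier] = "PSE_" + secrets.token_hex(4)
        return self._key_table[identifier]

p = Pseudonymizer()
code = p.pseudonymize("123-45-6789")
# The same input always maps to the same code, so records stay linkable:
assert p.pseudonymize("123-45-6789") == code
```

Using random codes (rather than a hash of the identifier) matters: a plain hash of a low-entropy field like an SSN can be reversed by brute force, which would defeat the pseudonymization.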

Advantages of Pseudonymization

Pseudonymization allows longitudinal analysis (tracking individuals over time) while protecting identity. You can see that Customer_12849 made 5 purchases and spent $2,000 total, without knowing Customer_12849 is John Smith. This enables personalization and analysis that complete anonymization does not allow.

Data Minimization: Including Only What is Needed

Data minimization is the principle of collecting and retaining only data necessary for your stated purpose. If your purpose is "predict customer churn," do you need phone number? Do you need date of birth? Do you need address? Probably not. Minimize data to only purchase history, support interactions, and account age.
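Minimization is most robust as an allowlist: keep only the fields the stated purpose needs, so new fields are excluded by default. A sketch using the churn example above; the field names are illustrative:

```python
# Fields assumed necessary for the chapter's churn-prediction example.
NEEDED_FOR_CHURN = {"purchase_history", "support_interactions", "account_age"}

def minimize(record: dict, allowed: set) -> dict:
    """Keep only allowlisted fields; everything else is dropped by default."""
    return {k: v for k, v in record.items() if k in allowed}

record = {
    "phone": "555-0100",
    "date_of_birth": "1990-01-01",
    "account_age": 730,
    "support_interactions": 4,
}
print(minimize(record, NEEDED_FOR_CHURN))
# {'account_age': 730, 'support_interactions': 4}
```

Contrast this with the denylist used for direct identifiers earlier: a denylist drops known-bad fields, while an allowlist admits known-needed fields—the safer default for minimization.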

Purpose Limitation

Data collected for one purpose should not be repurposed without additional justification and consent. Customer data collected for "sending order confirmations" should not be repurposed for "marketing campaigns" without asking permission first. Each use of data has implications for what information is needed.

Retention Limits

Do not retain data longer than necessary. After a customer leaves, do you need to keep their purchase history forever? Probably not. Define retention periods: keep operational data for 2 years, archive for legal compliance for 7 years, then delete. Shorter retention means less exposed data if you suffer a breach.
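A retention policy like the one above can be expressed as a simple age check per record. A sketch using the chapter's example periods (2 years operational, 7 years archive); the function and field names are illustrative:

```python
from datetime import date

# Retention periods from the chapter's example policy (assumptions,
# not legal advice): 2 years operational, 7 years archived.
OPERATIONAL_DAYS = 2 * 365
ARCHIVE_DAYS = 7 * 365

def retention_action(last_activity: date, today: date) -> str:
    """Decide whether a record should be retained, archived, or deleted."""
    age_days = (today - last_activity).days
    if age_days > ARCHIVE_DAYS:
        return "delete"
    if age_days > OPERATIONAL_DAYS:
        return "archive"
    return "retain"

today = date(2025, 1, 1)
print(retention_action(date(2024, 6, 1), today))  # "retain"
print(retention_action(date(2021, 1, 1), today))  # "archive"
print(retention_action(date(2015, 1, 1), today))  # "delete"
```

Running such a check on a schedule—rather than relying on manual cleanup—is what turns a retention policy on paper into shorter actual retention.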

Access Control and Governance

Even anonymized or pseudonymized data requires access controls. Who can access the data? For what purposes? How is access logged and audited?

Role-Based Access Control

Grant access based on job role. Data analysts can access aggregated customer data. Product managers can see anonymized usage patterns. Sales leadership can see regional sales summaries. Everyone's access is restricted to what they need for their role. Marketing team members do not need access to customer financial data.

Access Logging and Auditing

Every access to sensitive data should be logged: who accessed it, when, what they accessed, what they did. Regular audits should review access logs for suspicious patterns. Someone accessing thousands of customer records at 3 AM warrants investigation.
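A minimal sketch of such a log and one audit check over it. Here the log is a Python list for illustration; in practice it would be an append-only, tamper-evident store, and the threshold below is an illustrative policy choice:

```python
from collections import Counter
from datetime import datetime, timezone

AUDIT_LOG = []  # sketch only; in practice an append-only audit store

def log_access(user: str, resource: str, action: str) -> None:
    """Record who accessed what, when, and what they did."""
    AUDIT_LOG.append({
        "user": user,
        "resource": resource,
        "action": action,
        "time": datetime.now(timezone.utc).isoformat(),
    })

def flag_bulk_access(log, threshold: int):
    """One simple audit pattern: flag users whose access count
    exceeds a threshold for human review."""
    counts = Counter(entry["user"] for entry in log)
    return [user for user, n in counts.items() if n > threshold]

log_access("analyst_42", "customer_records", "read")
```

Automated checks like `flag_bulk_access` do not replace human audits; they surface the suspicious patterns (such as bulk access at odd hours) for an auditor to investigate.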

Data Classification

Classify data by sensitivity: public (sales metrics), internal (employee data), sensitive (customer financial data), or restricted (health information, biometric data). Different sensitivity levels require different protection levels. Public data might live in databases accessible to many. Restricted data should be encrypted, access logged, and protected with strong authentication.
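One way to make classification actionable is a lookup from tier to required controls, so protection follows automatically from the label. The control sets below are an illustrative policy, not a standard:

```python
# Illustrative mapping from the chapter's four tiers to required controls.
CONTROLS = {
    "public":     {"encryption": False, "access_logging": False, "mfa": False},
    "internal":   {"encryption": False, "access_logging": True,  "mfa": False},
    "sensitive":  {"encryption": True,  "access_logging": True,  "mfa": False},
    "restricted": {"encryption": True,  "access_logging": True,  "mfa": True},
}

def required_controls(classification: str) -> dict:
    """Look up the minimum controls for a given sensitivity tier."""
    return CONTROLS[classification]

print(required_controls("restricted"))
# {'encryption': True, 'access_logging': True, 'mfa': True}
```

Keeping the policy in one table like this makes it easy to audit and to tighten: raising the bar for a tier updates every dataset carrying that label.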

Regulatory Compliance Frameworks

GDPR (European General Data Protection Regulation)

GDPR protects EU residents' data rights. Key obligations: lawful basis for processing (consent, contract, legitimate interest, legal obligation), right to access (individuals can request what data you hold), right to deletion (individuals can request deletion), data portability (individuals can request their data in portable format), and privacy by design (privacy must be built into systems, not added later).

CCPA (California Consumer Privacy Act)

CCPA protects California residents. Similar to GDPR but with some differences: right to know (what data is collected), right to delete, right to opt-out of sale. CCPA defines "sale" broadly to include sharing data for any consideration, not just direct payment.

Compliance Practices

Document your legal basis for processing each dataset. Implement data access requests: when someone requests their data, can you retrieve it? Implement deletion: when someone requests deletion, can you remove their data from all systems? Conduct privacy impact assessments for new data uses. Train staff on data handling policies.

Advanced Privacy Techniques

Differential Privacy

Differential privacy is a mathematical framework ensuring that analysis results do not reveal whether any individual's data was included. By adding carefully calibrated random noise to analysis results, you ensure that attackers cannot determine whether a specific person's data was in the dataset. It works well for aggregate analysis like "what is the average customer age" but less well for individual-level prediction.
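The "average customer age" example can be sketched with the Laplace mechanism, the classic way to privatize a numeric aggregate. This is a teaching sketch, not a hardened implementation—production systems should use a vetted library rather than hand-rolled noise:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_mean(values, lower, upper, epsilon=1.0):
    """Differentially private mean. Values are clamped to [lower, upper],
    so the sensitivity of the mean is (upper - lower) / n; noise is scaled
    to sensitivity / epsilon, where epsilon is the privacy budget
    (smaller epsilon = more noise = stronger privacy)."""
    clamped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clamped)
    return sum(clamped) / len(clamped) + laplace_noise(sensitivity / epsilon)

ages = [25, 31, 47, 52, 38] * 20  # 100 illustrative records
print(dp_mean(ages, lower=18, upper=90, epsilon=1.0))
```

With 100 records the noise is small relative to the answer, which is the point: aggregates over many people tolerate the noise, while a query about one person would be drowned out by it.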

Federated Learning

Rather than centralizing data, federated learning trains models across distributed data sources without moving raw data. A hospital trains a medical AI model using its own data without sending that data to a central server. Model updates are aggregated centrally, but raw data never leaves the hospital. This provides excellent privacy protection for sensitive data.

Encryption and Security

Encrypt sensitive data at rest (when stored) and in transit (when transmitted). Encryption ensures that even if data is stolen, it is unreadable without the encryption key. End-to-end encryption means encryption keys are held by individuals, preventing even service providers from accessing unencrypted data.

Key Takeaway

Privacy-preserving data preparation is not an obstacle to AI—it is essential to responsible, legally compliant AI. Anonymization removes identifying information when you do not need to track individuals. Pseudonymization replaces names with codes while preserving linkages. Data minimization reduces risk by keeping only necessary information. Access controls ensure that sensitive data is handled by authorized users only. Regulatory frameworks like GDPR and CCPA impose legal requirements that must be understood and followed.

The goal is not perfect privacy—that is incompatible with analysis. The goal is responsible protection that respects individual privacy while enabling valuable analysis. Achieving this balance requires understanding your legal obligations, your organizational risk tolerance, and your technical options.

What Comes Next

Lesson 6 is complete. You now understand data preparation from assessment through cleaning, structuring, and privacy protection. These foundational skills enable all subsequent AI work. Lesson 7 shifts to a different dimension: how to share AI knowledge with your organization through training and community building.