Why Data Structure Matters
Clean data can still be poorly structured. A customer database might have every record clean, deduplicated, and standardized, yet stored as a series of emails and attachments. An analytics system might have accurate transaction data recorded as free-form notes rather than discrete fields. Structure determines what AI can do with the data.
Different AI tools require different structures. Large language models work with text, so document-oriented formats work well. Machine learning models typically work with structured tables. Computer vision models require images in specific formats. Getting structure right for your specific use case is critical.
This chapter covers selecting the right data format for your use case, designing schemas that AI systems can work with effectively, techniques for converting unstructured data into structured formats, and enrichment strategies that add context to raw data.
Well-structured data enables automated processing. Structured data allows tools to validate, analyze, and process records without human intervention. Poorly structured data requires manual interpretation and context, limiting what automation can accomplish. Structure is the foundation of scalable AI.
Common Data Formats
CSV (Comma-Separated Values)
CSV is the most universal format for tabular data. Each row is a record, each column is a field, values are separated by commas. Simple, portable, works everywhere. Limitations: CSV cannot represent hierarchical or nested structures, lacks schema information, cannot handle commas or special characters without quoting. Use CSV for simple tabular data without complex relationships.
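The quoting limitation is easy to see in practice. This minimal sketch uses Python's standard csv module to round-trip a value containing an embedded comma; the field names and values are illustrative.

```python
import csv
import io

# One value contains a comma; the csv module quotes it automatically
# so the field survives a write/read round trip.
rows = [
    ["customer_email", "city"],
    ["ana@example.com", "Portland, OR"],  # embedded comma requires quoting
]

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()

# Reading back recovers the original two-column structure.
parsed = list(csv.reader(io.StringIO(text)))
print(parsed[1])  # ['ana@example.com', 'Portland, OR']
```

Without quoting, the same row would split into three fields and silently corrupt downstream processing, which is why hand-rolled CSV writers are a common source of data quality problems.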
JSON (JavaScript Object Notation)
JSON represents data as nested objects and arrays. Flexible, can represent hierarchical structures, human-readable, widely supported by APIs and modern tools. JSON works well for semi-structured data with varied fields across records. Use JSON for API responses, document databases, and complex nested structures.
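A short sketch of what JSON's nesting buys you: one customer record holds a nested contact object and a variable-length array of orders, something CSV cannot express in a single row. The field names here are hypothetical.

```python
import json

# A nested record: contact is an object, orders is an array,
# and fields can vary from record to record.
record = {
    "customer_id": "C-1001",
    "contact": {"email": "ana@example.com", "phone": None},
    "orders": [
        {"order_id": "O-1", "total": 42.50},
        {"order_id": "O-2", "total": 19.99},
    ],
}

text = json.dumps(record, indent=2)   # serialize for storage or an API
restored = json.loads(text)           # parse back into nested dicts/lists

print(restored["orders"][0]["total"])  # 42.5
```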
XML (eXtensible Markup Language)
XML uses tags to describe data meaning. Verbose but very explicit about structure and relationships. Widely used in enterprise systems and legacy applications. Less popular for new AI projects but important for integrating with existing systems. Use XML when working with enterprise data sources or when schema validation is critical.
Parquet (Columnar Format)
Parquet stores tabular data in columnar format, optimized for analytical queries. More efficient than CSV for large datasets because it only loads needed columns. Compressed, with built-in schema. Standard in big data tools like Spark. Use Parquet for large analytical datasets processed by Spark or similar tools.
Choosing Your Format
Choose a format based on your use case: simple tabular data → CSV, hierarchical or API data → JSON, enterprise integration → XML, large analytical datasets → Parquet. Consider tool requirements, file size, complexity, and whether the schema needs to be shared with others.
Schema Design Principles
A schema defines the structure of your data: what fields exist, what type each field is, what fields are required, what values are allowed. Good schema design ensures data is interpretable and processable.
Field Naming
Use clear, descriptive field names. "customer_email" is better than "ce" or "cust_e". Use consistent naming conventions: all lowercase with underscores (customer_email) or camelCase (customerEmail), not mixed. Avoid spaces and special characters in field names. Include units in field names when relevant: "weight_kg" vs just "weight".
Field Types
Define appropriate types for each field. Numeric fields (integer or decimal), text fields (string), dates, booleans, and categorical types. Proper typing enables data validation and processing. A field defined as integer will reject "unknown" entries, helping catch data quality problems.
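As a minimal sketch of how typing catches problems, the validator below checks each field's value against an expected Python type and flags mismatches, so an "unknown" entry in an integer field is reported rather than silently passed through. The schema and field names are illustrative, not a standard library API.

```python
# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"customer_id": str, "age": int, "signed_up": bool}

def validate(record: dict, schema: dict) -> list[str]:
    """Return a list of type errors; empty means the record conforms."""
    errors = []
    for field, expected in schema.items():
        value = record.get(field)
        # bool is a subclass of int in Python, so reject it explicitly
        # for integer fields before the general isinstance check.
        if expected is int and isinstance(value, bool):
            errors.append(f"{field}: expected int, got bool")
        elif not isinstance(value, expected):
            errors.append(
                f"{field}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
    return errors

# A text value in an integer field is caught immediately.
print(validate({"customer_id": "C-1", "age": "unknown", "signed_up": True}, SCHEMA))
```

In production you would typically reach for a schema library rather than hand-rolled checks, but the principle is the same: declared types turn bad values into visible errors.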
Required vs Optional
Mark fields as required or optional. Required fields must have values in every record. Optional fields might be null. This guides your data validation and missing value handling strategies.
Enumerations and Allowed Values
For categorical fields, define allowed values explicitly. Instead of just marking a field as "string type", define that product_type can only be "Electronics", "Clothing", "Food", or "Other". This constrains data and catches typos automatically.
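A sketch of the idea: declare the allowed set once and check every incoming value against it, so a typo fails validation instead of becoming a new phantom category. The category names follow the example above.

```python
# Enumerated categorical field: only these values are valid.
ALLOWED_PRODUCT_TYPES = {"Electronics", "Clothing", "Food", "Other"}

def check_product_type(value: str) -> bool:
    """True if the value is in the allowed set, False otherwise."""
    return value in ALLOWED_PRODUCT_TYPES

print(check_product_type("Clothing"))  # True
print(check_product_type("Clothng"))   # typo is caught: False
```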
Converting Unstructured Data
Much valuable organizational data exists as unstructured text: emails, documents, notes, PDFs. Converting this to structured format unlocks AI capabilities.
Manual Extraction
For small volumes, manually read documents and enter structured data. Labor-intensive but reliable. Use for data that is too complex for automation or for critical data that must be 100% accurate.
Template-Based Extraction
Create form templates that guide humans to extract key information consistently. More structured than free-form notes, faster than building custom extraction logic. Works well for moderately structured sources like customer feedback or incident reports.
Rule-Based Extraction
Write rules to extract patterns from text. "If text contains 'total: $', extract the number following." Works well for semi-structured sources with predictable formats. Limited to predetermined patterns.
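The "total: $" rule from the text can be sketched as a regular expression. The pattern and sample invoice text are illustrative; note that when the pattern is absent, a rule-based extractor simply returns nothing, which is exactly its limitation.

```python
import re

# Rule: if the text contains "total: $", extract the number after it.
# Allows thousands separators and an optional decimal part.
TOTAL_PATTERN = re.compile(r"total:\s*\$([\d,]+\.?\d*)", re.IGNORECASE)

def extract_total(text: str):
    match = TOTAL_PATTERN.search(text)
    if match is None:
        return None  # rule doesn't fire: unpredictable formats are missed
    return float(match.group(1).replace(",", ""))

print(extract_total("Invoice #88. Total: $1,249.50 due on receipt."))  # 1249.5
print(extract_total("No charges this month."))                         # None
```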
ML-Based Extraction
Train machine learning models to extract information from text. Can handle variations and patterns humans might not anticipate. Requires labeled training data. Worth the investment for large-scale extraction projects.
LLM-Based Extraction
Use large language models to extract and structure information from documents. Can handle complex documents and varied formats. Works surprisingly well without training data. Test thoroughly to ensure accuracy before relying on extracted data.
Data Enrichment and Contextualization
Raw data often lacks context. Adding relevant context dramatically improves AI model quality. A purchase record has more predictive value if enriched with customer demographics, previous purchase history, and product information.
Enrichment Through Joins
Combine data from multiple sources using common identifiers. Combine customer purchase data with customer demographic data using customer ID. Combine transaction data with product information using product code. Join geographic data based on addresses or coordinates.
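The customer-ID join above can be sketched in a few lines. This is a plain-Python left join on hypothetical records; real pipelines would typically use a database or a dataframe library's merge, but the mechanics are the same.

```python
# Demographic lookup keyed by customer_id (the join key).
customers = {
    "C-1": {"segment": "premium", "region": "West"},
    "C-2": {"segment": "basic", "region": "East"},
}

purchases = [
    {"customer_id": "C-1", "amount": 120.0},
    {"customer_id": "C-2", "amount": 35.0},
]

# Left join: every purchase is kept; demographics are merged in
# when the customer_id is found, otherwise the purchase passes
# through unenriched.
enriched = [
    {**p, **customers.get(p["customer_id"], {})}
    for p in purchases
]

print(enriched[0]["segment"])  # 'premium'
```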
Calculated Enrichment
Add new fields calculated from existing ones. Days since last purchase, customer lifetime value, product margin, or purchase frequency. These calculated features often have strong predictive power.
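One of the calculated fields mentioned above, days since last purchase, reduces to a date subtraction. A minimal sketch with illustrative dates:

```python
from datetime import date

def days_since(last_purchase: date, today: date) -> int:
    """Derived feature: whole days elapsed since the last purchase."""
    return (today - last_purchase).days

print(days_since(date(2024, 1, 1), date(2024, 1, 31)))  # 30
```

The same pattern applies to the other examples: lifetime value is a sum over a customer's transactions, and purchase frequency is a count divided by a time window.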
Temporal Enrichment
Add time-based context: is it a holiday, is it tax season, is it the customer's birthday, what's the day of week, what's the quarter? Temporal features often capture seasonal patterns important for forecasting.
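Several of these temporal features fall directly out of the date itself. A sketch with assumed feature names (holidays and tax season would need an external calendar):

```python
from datetime import date

def temporal_features(d: date) -> dict:
    """Derive calendar context useful for seasonal models."""
    return {
        "day_of_week": d.strftime("%A"),
        "quarter": (d.month - 1) // 3 + 1,
        "is_weekend": d.weekday() >= 5,  # Saturday=5, Sunday=6
    }

print(temporal_features(date(2024, 12, 25)))
```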
External Data Enrichment
Enrich data with external sources: weather, economic indicators, competitor information, regulatory changes. External data adds signals that might not be obvious in your internal data alone.
Quality Considerations in Structuring
Structure choices affect data quality and your ability to maintain it. Some formats make validation easier. Some structures make it easier to track data lineage. Some schema designs make it obvious when data is wrong.
Include schema documentation that explains field meaning, valid values, and source. Include creation and update timestamps so you can track when data was added or changed. Include a source field indicating where data originated, helpful for debugging issues.
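The timestamp and source fields above can be attached at ingestion time. A minimal sketch; the field names (source, created_at, updated_at) are illustrative conventions, not a standard:

```python
from datetime import datetime, timezone

def with_metadata(record: dict, source: str) -> dict:
    """Stamp a record with its origin and UTC creation/update times."""
    now = datetime.now(timezone.utc).isoformat()
    return {**record, "source": source, "created_at": now, "updated_at": now}

row = with_metadata({"customer_id": "C-1"}, source="crm_export")
print(row["source"])  # 'crm_export'
```

When a record later turns out to be wrong, the source field tells you which upstream system to investigate, and the timestamps tell you which pipeline run produced it.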
Key Takeaway
Data structure determines what AI can accomplish. Select formats appropriate for your use case and tools. Design schemas with clear field names, appropriate types, and explicit constraints. Convert unstructured data systematically, choosing methods appropriate for complexity and volume. Enrich data with relevant context that improves model quality. Structure is not about perfection; it is about clarity, consistency, and enabling automated processing.
Investing time in proper structure upfront saves effort later and enables more sophisticated AI applications than poorly structured raw data could ever support.
What Comes Next
With clean, well-structured data prepared, Chapter 6.4 addresses privacy protection. Before using data in any AI system, you must ensure it complies with privacy regulations and protects individual privacy while preserving analytical value.