Data Validation and Quality Control Guide
This guide explains how data quality validation works in the Pipeline Generator Ingestion system and what happens when data passes or fails validation checks.
Table of Contents
- Overview
- How Data Validation Works
- Data Quality Checks
- Understanding Validation Results
- Configuring Data Quality Tests
- What Happens When Validation Fails
- Best Practices
- Troubleshooting Common Issues
Overview
The Pipeline Generator Ingestion system ensures data quality by validating all incoming data before it is stored in your destination. This "validate-before-commit" approach prevents bad data from entering your data warehouse or analytics systems.
Key Benefits
✅ Data Quality Guarantee: Only validated data reaches your destination
✅ Early Detection: Problems are caught before they affect downstream systems
✅ Automated Checks: No manual intervention required
✅ Complete Transparency: Full visibility into what passed and what failed
✅ All-or-Nothing: Either all data is valid and committed, or none is written
How Data Validation Works
The pipeline follows a strict validation process with six stages:
┌─────────────────────────────────────────────────────────┐
│ Your Data Pipeline │
└─────────────────────────────────────────────────────────┘
Stage 1: EXTRACT
├─ Data is fetched from your source (API, Database, etc.)
├─ Raw data is stored temporarily
└─ Status: Data collected, not yet validated
↓
Stage 2: PREPARE
├─ Data is processed and prepared for validation
├─ Deduplication rules are applied
├─ Schema is checked for compatibility
└─ Status: Data ready for quality checks
↓
Stage 3: VALIDATE ⚡ (CRITICAL CHECKPOINT)
├─ NULL Checks: Are required fields populated?
├─ Uniqueness Checks: Are primary keys unique?
├─ Freshness Checks: Is the data recent enough?
│
├─ ❌ ANY CHECK FAILS
│ ├─ Data is REJECTED
│ ├─ Pipeline stops immediately
│ ├─ Error details are logged
│ ├─ Nothing is written to destination
│ └─ You receive failure notification
│
└─ ✅ ALL CHECKS PASS
├─ Data is approved for storage
└─ Continue to Stage 4 ↓
Stage 4: COMMIT
├─ Validated data is stored in Delta tables
├─ Ingestion timestamp is added
└─ Status: Data safely stored
↓
Stage 5: DESTINATION WRITE
├─ Data is exported to CSV format (always)
├─ Data is written to your database (if configured)
└─ Status: Data available in destination systems
↓
Stage 6: FINALIZE
├─ Metadata and schemas are stored
├─ Test results summary is created
├─ Temporary files are cleaned up
└─ Status: Pipeline complete ✓
The Critical Validation Checkpoint (Stage 3)
Before validation runs:
- Data has been extracted from your source
- Data exists in temporary storage only
- Nothing has been written to your destination yet
During validation:
- Multiple quality checks are performed
- Each check must pass for the pipeline to continue
- Results are logged in real-time
After validation:
- ✅ If all checks pass: Data proceeds to commit and destination write
- ❌ If any check fails: Pipeline stops, no data is written anywhere
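The flow above can be summarized in a few lines of code. The sketch below is illustrative only and assumes each stage can be modeled as a plain Python function; none of these names come from the Pipeline Generator API.

```python
# A minimal sketch of the validate-before-commit flow, under the assumption
# that checks are plain functions returning True (pass) or False (fail).
# All names here are illustrative, not the system's actual API.

def run_quality_checks(records, checks):
    """Run every check; return the names of the checks that failed."""
    return [name for name, check in checks.items() if not check(records)]

def run_pipeline(records, checks, commit):
    failures = run_quality_checks(records, checks)   # Stage 3
    if failures:
        # Any failure rejects the whole batch: nothing reaches the destination.
        raise RuntimeError(f"Validation failed: {failures}")
    commit(records)                                  # Stages 4-5 run only on success

# Example usage with toy checks and an in-memory "destination".
destination = []
checks = {
    "null_check_id": lambda rows: all(r.get("id") is not None for r in rows),
    "uniqueness_check": lambda rows: len({r["id"] for r in rows}) == len(rows),
}
run_pipeline([{"id": 1}, {"id": 2}], checks, destination.extend)
print(destination)  # [{'id': 1}, {'id': 2}] -- committed only after all checks pass
```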
Data Quality Checks
The system performs three types of automated data quality checks:
1. NULL Checks (Required Fields Validation)
Purpose: Ensures that critical fields are populated and not NULL.
What it checks:
- All primary key columns must have values
- Required fields specified in your configuration must not be NULL
Example scenarios:
| Scenario | Result |
|---|---|
| All user IDs are populated | ✅ PASS |
| 5 out of 100 records missing email address | ❌ FAIL |
| All required fields have values | ✅ PASS |
| Primary key column contains NULL values | ❌ FAIL |
When this check fails:
- You'll see: "NULL CHECK FAILED: Found X null values in column 'column_name'"
- Common causes: Missing data at source, incomplete API responses, data extraction errors
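For illustration, a NULL check over a batch of records can be as simple as the hypothetical sketch below; the real system's implementation and exact message format may differ.

```python
# Hypothetical sketch of a NULL check over a batch of records.

def null_check(records, required_columns):
    """Fail if any required column is NULL/missing in any record."""
    failures = []
    for column in required_columns:
        null_count = sum(1 for r in records if r.get(column) is None)
        if null_count:
            failures.append(
                f"NULL CHECK FAILED: Found {null_count} null values in column '{column}'"
            )
    return failures  # an empty list means the check passed

records = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},   # missing required field
]
print(null_check(records, ["customer_id", "email"]))
```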
2. Uniqueness Checks (Duplicate Detection)
Purpose: Ensures that primary key values are unique across all incoming records.
What it checks:
- Primary key values must be unique across all records
- Composite keys (multiple columns) must form unique combinations
Example scenarios:
| Scenario | Result |
|---|---|
| 100 users, 100 unique user IDs | ✅ PASS |
| 100 users, 95 unique user IDs (5 duplicates) | ❌ FAIL |
| Composite key (user_id + order_id) all unique | ✅ PASS |
| Same order ID appears twice for same user | ❌ FAIL |
When this check fails:
- You'll see: "UNIQUENESS CHECK FAILED: Found X duplicate rows"
- Common causes: Source system bugs, race conditions, duplicate API responses
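A uniqueness check, including composite keys, might look like the following hypothetical sketch; the function name and message format are assumptions, not the system's actual code.

```python
# Hypothetical sketch of a uniqueness check, including composite keys
# (several columns that must be unique in combination).

from collections import Counter

def uniqueness_check(records, key_columns):
    """Fail if any combination of the key columns appears more than once."""
    keys = [tuple(r.get(col) for col in key_columns) for r in records]
    duplicates = sum(count - 1 for count in Counter(keys).values() if count > 1)
    if duplicates:
        return [f"UNIQUENESS CHECK FAILED: Found {duplicates} duplicate rows"]
    return []

orders = [
    {"user_id": 1, "order_id": "A"},
    {"user_id": 1, "order_id": "B"},
    {"user_id": 1, "order_id": "A"},   # same (user_id, order_id) twice -> fails
]
print(uniqueness_check(orders, ["user_id", "order_id"]))
```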
3. Freshness Checks (Data Recency Validation)
Purpose: Ensures that data is recent enough relative to your pipeline schedule.
What it checks:
- Data age compared to your pipeline frequency
- Whether new data has arrived since the last run
Freshness Thresholds (automatically set based on your schedule):
| Your Pipeline Runs | Warning After | Error (Reject) After |
|---|---|---|
| Every hour | 2 hours | 6 hours |
| Daily | 25 hours | 48 hours |
| Weekly | 8 days | 14 days |
| Monthly | 32 days | 62 days |
Example scenarios:
| Pipeline Schedule | Last Data Received | Result |
|---|---|---|
| Daily | 1 hour ago | ✅ PASS |
| Daily | 30 hours ago | ⚠️ WARNING (continues) |
| Daily | 50 hours ago | ❌ FAIL (rejects) |
| Hourly | 7 hours ago | ❌ FAIL (rejects) |
When this check fails:
- You'll see: "FRESHNESS CHECK FAILED: Data is X hours old (threshold: Y hours)"
- Common causes: Source system delays, API downtime, network issues, upstream pipeline failures
Important: Warnings don't stop the pipeline - they're just notifications. Only errors stop the pipeline.
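The following sketch shows how a freshness check could apply the warning and error thresholds from the table above. The threshold values mirror that table, but the function and dictionary names are illustrative assumptions.

```python
# Hypothetical freshness check using the warning/error thresholds from the
# table above; names and structure are illustrative only.

from datetime import datetime, timedelta, timezone

THRESHOLDS_HOURS = {            # schedule: (warn_after, reject_after)
    "hourly":  (2, 6),
    "daily":   (25, 48),
    "weekly":  (8 * 24, 14 * 24),
    "monthly": (32 * 24, 62 * 24),
}

def freshness_check(latest_record_time, schedule):
    warn_after, reject_after = THRESHOLDS_HOURS[schedule]
    age_hours = (datetime.now(timezone.utc) - latest_record_time) / timedelta(hours=1)
    if age_hours > reject_after:
        return "FAILED", (f"FRESHNESS CHECK FAILED: Data is {age_hours:.0f} hours old "
                          f"(threshold: {reject_after} hours)")
    if age_hours > warn_after:
        return "WARNING", f"Data is {age_hours:.0f} hours old; warning threshold is {warn_after} hours"
    return "PASSED", ""

# A daily pipeline whose newest record is 30 hours old -> WARNING, pipeline continues.
print(freshness_check(datetime.now(timezone.utc) - timedelta(hours=30), "daily"))
```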
Understanding Validation Results
Test Summary Report
After each pipeline run, you'll receive a test summary showing all validation results:
======================================================
DATA QUALITY TEST SUMMARY
======================================================
Pipeline: customer_data_import
Execution ID: exec-20260111-143022
Status: PASSED ✓
======================================================
Total Tests: 6
Passed: 6
Failed: 0
Warnings: 0
Execution Time: 2.3 seconds
======================================================
MAIN TABLE: customers
✓ null_check_customer_id PASSED
✓ null_check_email PASSED
✓ uniqueness_check PASSED
✓ freshness_check PASSED
RELATED TABLE: customer_orders
✓ null_check_order_id PASSED
✓ freshness_check PASSED
======================================================
RESULT: All data quality checks passed
DATA COMMITTED: 1,247 records written successfully
======================================================
Understanding Test States
| Symbol | State | Meaning | What Happens |
|---|---|---|---|
| ✓ | PASSED | Test passed successfully | Pipeline continues normally |
| ⚠️ | WARNING | Warning threshold exceeded | Pipeline continues, but you should investigate |
| ✗ | FAILED | Test failed | Pipeline stops, data is rejected |
Configuring Data Quality Tests
Enabling Data Quality Checks
When setting up your pipeline, specify which checks to enable:
Option 1: Enable All Checks (Recommended)
{
"data_quality_tests": ["null", "unique", "freshness"]
}
Option 2: Enable Specific Checks
{
"data_quality_tests": ["null", "unique"]
}
Option 3: Disable All Checks (Not Recommended)
{
"data_quality_tests": []
}
⚠️ Warning: Disabling data quality tests means invalid or duplicate data may reach your destination.
What Each Test Requires
| Test Type | Required Configuration | Checks |
|---|---|---|
| null | Primary key fields must be defined | All primary key columns for NULL values |
| unique | Primary key fields must be defined | Primary key uniqueness across all records |
| freshness | Pipeline schedule (cron expression) must be set | Data age against automatic thresholds |
Example Configuration
{
"pipeline_name": "customer_data_sync",
"table_name": "customers",
"dedupe_keys_json_paths": ["customer_id", "email"],
"cron": "0 0 * * *",
"data_quality_tests": ["null", "unique", "freshness"]
}
This configuration will:
- Check that customer_id and email are never NULL
- Check that the combination of customer_id and email is unique
- Check that data is less than 48 hours old (based on the daily schedule)
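To make the mapping from configuration to checks concrete, here is an illustrative sketch that reads a configuration like the example above and decides which checks to run with which parameters. The dispatch logic is an assumption, not the system's actual code.

```python
# Illustrative sketch: selecting checks from a configuration like the example
# above. The keys mirror that example; the dispatch logic is an assumption.

import json

CONFIG = json.loads("""
{
  "pipeline_name": "customer_data_sync",
  "table_name": "customers",
  "dedupe_keys_json_paths": ["customer_id", "email"],
  "cron": "0 0 * * *",
  "data_quality_tests": ["null", "unique", "freshness"]
}
""")

def select_checks(config):
    """Map enabled test names to the parameters each check needs."""
    keys = config["dedupe_keys_json_paths"]
    available = {
        "null": ("null_check", {"required_columns": keys}),
        "unique": ("uniqueness_check", {"key_columns": keys}),
        "freshness": ("freshness_check", {"cron": config["cron"]}),
    }
    return [available[name] for name in config["data_quality_tests"]]

for check_name, params in select_checks(CONFIG):
    print(check_name, params)
```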
What Happens When Validation Fails
Immediate Actions
When any validation check fails:
Pipeline Stops Immediately
- No further processing occurs
- Data is not written to any destination
- All temporary data is preserved for investigation
Detailed Error Logging
- Exact failure reason is recorded
- Number of affected records is logged
- Timestamp of failure is captured
Notification
- Pipeline status is set to FAILED
- Error details are available in logs
- Monitoring systems are alerted
Data Preservation
- Failed data remains in temporary staging area
- You can review the problematic records
- Source data is not affected
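A minimal sketch of this failure path, assuming a standard Python logger and a custom exception type, is shown below; it logs the details, leaves the staged batch untouched for review, and stops before anything is written.

```python
# Minimal sketch of the failure path: log the details, leave staged data in
# place for inspection, and stop. Logger setup and the exception type are
# assumptions, not the system's actual implementation.

import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")

class ValidationError(Exception):
    pass

def handle_validation_failure(pipeline_name, failures, affected_records):
    logger.error(
        "Pipeline %s FAILED at %s: %d records affected; %s",
        pipeline_name,
        datetime.now(timezone.utc).isoformat(),
        affected_records,
        "; ".join(failures),
    )
    # No cleanup here: the staged batch is intentionally preserved for review.
    raise ValidationError(f"{pipeline_name}: {len(failures)} check(s) failed")

# Example (would raise):
# handle_validation_failure("customer_data_import",
#     ["NULL CHECK FAILED: Found 15 null values in column 'email'"], 1000)
```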
Example Failure Scenarios
Scenario 1: NULL Check Failure
What Happened:
NULL CHECK FAILED: Found 15 null values in column 'email'
What This Means:
- 15 customer records are missing email addresses
- Data quality requirement: all customers must have emails
- Pipeline rejected all 1,000 records (not just the 15 bad ones)
What You Should Do:
- Check source system for missing emails
- Verify API is returning complete data
- Fix data quality at source
- Re-run pipeline once source is corrected
Scenario 2: Uniqueness Check Failure
What Happened:
UNIQUENESS CHECK FAILED: Found 8 duplicate rows based on primary key 'customer_id'
What This Means:
- 8 customer IDs appear more than once in the incoming data
- This could indicate duplicate records or improper deduplication
- Pipeline rejected all records to prevent corruption
What You Should Do:
- Investigate why source has duplicates
- Check if source API is being called multiple times
- Verify deduplication logic is correctly configured
- Fix source or adjust deduplication strategy
Scenario 3: Freshness Check Failure
What Happened:
FRESHNESS CHECK FAILED: Data is 52 hours old (threshold: 48 hours for daily pipeline)
What This Means:
- No new data has arrived in over 2 days
- Your daily pipeline expects daily data
- Pipeline rejected the stale data
What You Should Do:
- Check if source system is operational
- Verify source system is generating new data
- Check for upstream pipeline failures
- Investigate network connectivity issues
- Consider if this is expected (e.g., weekend, holiday)
Data Recovery After Failures
Option 1: Fix and Re-run (Recommended)
- Identify and fix the root cause
- Re-run the pipeline
- Fresh data will be extracted and validated
- If validation passes, data will be committed
Option 2: Adjust Configuration
- If validation rules are too strict, adjust thresholds
- Update pipeline configuration
- Re-run pipeline with new settings
Option 3: Temporary Bypass (Use with Caution)
- Temporarily disable specific checks
- Re-run pipeline
- Re-enable checks after resolution
⚠️ Important: Bypassing validation checks should only be done when you're certain the data quality issue is acceptable or has been manually verified.
Best Practices
1. Always Enable Data Quality Tests in Production
✅ Do This:
- Enable all three test types for critical data
- Use NULL and uniqueness checks at minimum
- Monitor test results regularly
❌ Don't Do This:
- Disable validation to "make the pipeline work"
- Ignore warning messages
- Assume source data is always perfect
2. Set Realistic Freshness Thresholds
Your pipeline schedule should match your data frequency:
| Data Updates | Recommended Schedule | Why |
|---|---|---|
| Real-time streaming | Every 15 minutes | Catch delays quickly |
| Daily batch exports | Daily | Match data availability |
| Weekly reports | Weekly | Appropriate for reporting cycles |
| Monthly archives | Monthly | Match archival schedule |
3. Monitor Validation Trends
Track these metrics over time:
- Number of validation failures per month
- Most common failure types
- Time to resolve validation issues
- Data quality improvement trends
4. Document Expected Failures
Some failures may be expected:
- Weekend gaps for business-day-only data
- Holiday periods with no activity
- Scheduled maintenance windows
Document these patterns to avoid false alarms.
5. Set Up Alerting
Configure alerts for:
- ❌ Any validation failure (high priority)
- ⚠️ Repeated warnings (medium priority)
- ℹ️ Unusual data volumes (low priority)
Troubleshooting Common Issues
Problem: NULL Check Keeps Failing
Common Causes:
- Source system has incomplete data
- API authentication is partially failing
- Database view/query is missing joins
- Field mappings are incorrect
How to Diagnose:
- Check recent changes to source system
- Verify API credentials are valid
- Test source query independently
- Review field mapping configuration
Resolution:
- Fix data quality at source (preferred)
- Or adjust which fields are required in configuration
Problem: Uniqueness Check Failing with Duplicates
Common Causes:
- Source has legitimate duplicates
- Incorrect primary key definition
- Race conditions in data extraction
- Time-based data needs timestamp in key
How to Diagnose:
- Query source system directly for duplicates (see the sketch below)
- Review primary key definition - is it truly unique?
- Check if multiple pipeline runs are happening simultaneously
- Verify if records change over time (needs versioning)
Resolution:
- Fix duplicates at source
- Or add timestamp/version to primary key
- Or use different deduplication strategy
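To support the "query source system directly" diagnosis step, a small hypothetical helper like the one below can report which key values repeat and how often, so you can compare the pipeline's view against the source.

```python
# Hypothetical diagnostic helper: report which primary-key values repeat
# and how many times each appears.

from collections import Counter

def find_duplicates(records, key_columns):
    keys = Counter(tuple(r.get(col) for col in key_columns) for r in records)
    return {key: count for key, count in keys.items() if count > 1}

rows = [
    {"customer_id": 42}, {"customer_id": 42}, {"customer_id": 7},
]
print(find_duplicates(rows, ["customer_id"]))   # {(42,): 2}
```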
Problem: Freshness Check Failing Repeatedly
Common Causes:
- Source system is not producing new data
- Pipeline schedule doesn't match data frequency
- Upstream data pipeline is failing
- Time zone mismatches
How to Diagnose:
- Verify source system is operational
- Check when last data was actually created (not extracted)
- Review upstream pipeline status
- Compare timestamps in different time zones
Resolution:
- Fix upstream data generation issues
- Or adjust pipeline schedule to match actual data frequency
- Or correct time zone configurations
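The time zone mismatch cause is easy to reproduce. The small example below (with made-up offsets and timestamps) shows how the same record appears 8 hours older when offsets are ignored, enough to cross the 25-hour warning threshold for a daily pipeline.

```python
# Illustration of the time-zone pitfall: the same record timestamp looks
# 8 hours older if its UTC-8 offset is dropped. Offsets and dates are
# examples only.

from datetime import datetime, timedelta, timezone

source_local = timezone(timedelta(hours=-8))                   # e.g. a UTC-8 source system
record_time = datetime(2026, 1, 11, 6, 0, tzinfo=source_local)
now_utc = datetime(2026, 1, 12, 14, 0, tzinfo=timezone.utc)

correct_age = now_utc - record_time                                             # offsets respected
naive_age = now_utc.replace(tzinfo=None) - record_time.replace(tzinfo=None)     # offsets dropped

print(correct_age)   # 1 day, 0:00:00  -> 24 hours: within a daily pipeline's threshold
print(naive_age)     # 1 day, 8:00:00  -> 32 hours: crosses the 25-hour warning threshold
```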
Problem: Warning Messages Keep Appearing
What Warnings Mean:
- Data quality is borderline but acceptable
- Not severe enough to fail, but needs attention
- Trend may lead to failures soon
What To Do:
- Don't ignore warnings - investigate root cause
- Monitor if warnings are becoming more frequent
- Address issues before they become failures
- Consider if thresholds need adjustment
Problem: All Tests Pass But Data Looks Wrong
Possible Issues:
- Tests are configured but too lenient
- Missing additional validation rules
- Business logic issues not caught by quality checks
What To Do:
- Review test configuration
- Add more specific validation rules
- Consider custom business rule checks
- Implement downstream data validation
Getting Help
If you encounter validation issues you can't resolve:
- Check the logs: Full error details are always logged
- Review test summary: Shows exactly which test failed and why
- Examine sample data: Look at the actual records that failed
- Check documentation: Review configuration requirements
- Contact support: Provide execution ID and error details
Information to Provide When Reporting Issues
- Pipeline name and configuration
- Execution ID (from test summary)
- Complete error message
- When the issue started
- Any recent changes to pipeline or source system
- Sample of affected data (anonymized if sensitive)
Summary
Key Takeaways
✅ Data is validated before being written to destinations
✅ Failed validation = zero data committed (all-or-nothing)
✅ Three check types: NULL, Uniqueness, Freshness
✅ Configure tests based on your data quality requirements
✅ Monitor trends and address warnings proactively
✅ Fix issues at source whenever possible
The Validation Promise
When you see "Pipeline Status: PASSED" with all validation checks marked ✓, you can be confident that:
- All required fields are populated
- All primary keys are unique
- All data is fresh and timely
- All records meet quality standards
- All data has been safely committed
- All data is available in your destination
This guarantee enables you to trust your data and make confident business decisions.
Additional Resources
- Pipeline Configuration Guide: How to set up and configure pipelines
- Monitoring Dashboard: View real-time validation results
- API Documentation: Programmatic access to validation results
- Support Portal: Get help with validation issues