Data Validation and Quality Control Guide
This guide explains how data quality validation works in the Pipeline Generator Ingestion system and what happens when data passes or fails validation checks.
Table of Contents
- Overview
- How Data Validation Works
- Data Quality Checks
- Understanding Validation Results
- Configuring Data Quality Tests
- What Happens When Validation Fails
- Best Practices
- Troubleshooting Common Issues
Overview
The Pipeline Generator Ingestion system ensures data quality by validating all incoming data before it is stored in your destination. This "validate-before-commit" approach prevents bad data from entering your data warehouse or analytics systems.
Key Benefits
✅ Data Quality Guarantee: Only validated data reaches your destination
✅ Early Detection: Problems are caught before they affect downstream systems
✅ Automated Checks: No manual intervention required
✅ Complete Transparency: Full visibility into what passed and what failed
✅ All-or-Nothing: Either all data is valid and committed, or none is written
How Data Validation Works
The pipeline follows a strict validation process with six stages:
┌─────────────────────────────────────────────────────────┐
│ Your Data Pipeline │
└─────────────────────────────────────────────────────────┘
Stage 1: EXTRACT
├─ Data is fetched from your source (API, Database, etc.)
├─ Raw data is stored temporarily
└─ Status: Data collected, not yet validated
↓
Stage 2: PREPARE
├─ Data is processed and prepared for validation
├─ Deduplication rules are applied
├─ Schema is checked for compatibility
└─ Status: Data ready for quality checks
↓
Stage 3: VALIDATE ⚡ (CRITICAL CHECKPOINT)
├─ NULL Checks: Are required fields populated?
├─ Uniqueness Checks: Are primary keys unique?
├─ Freshness Checks: Is the data recent enough?
│
├─ ❌ ANY CHECK FAILS
│ ├─ Data is REJECTED
│ ├─ Pipeline stops immediately
│ ├─ Error details are logged
│ ├─ Nothing is written to destination
│ └─ You receive failure notification
│
└─ ✅ ALL CHECKS PASS
├─ Data is approved for storage
└─ Continue to Stage 4 ↓
Stage 4: COMMIT
├─ Validated data is stored in Delta tables
├─ Ingestion timestamp is added
└─ Status: Data safely stored
↓
Stage 5: DESTINATION WRITE
├─ Data is exported to CSV format (always)
├─ Data is written to your database (if configured)
└─ Status: Data available in destination systems
↓
Stage 6: FINALIZE
├─ Metadata and schemas are stored
├─ Test results summary is created
├─ Temporary files are cleaned up
└─ Status: Pipeline complete ✓
The Critical Validation Checkpoint (Stage 3)
Before validation runs:
- Data has been extracted from your source
- Data exists in temporary storage only
- Nothing has been written to your destination yet
During validation:
- Multiple quality checks are performed
- Each check must pass for the pipeline to continue
- Results are logged in real-time
After validation:
- ✅ If all checks pass: Data proceeds to commit and destination write
- ❌ If any check fails: Pipeline stops, no data is written anywhere
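The flow above can be summarized in a few lines of code. The sketch below is illustrative only and assumes each stage can be modeled as a plain Python function; none of these names come from the Pipeline Generator API.

```python
# A minimal sketch of the validate-before-commit flow, under the assumption
# that checks are plain functions returning True (pass) or False (fail).
# All names here are illustrative, not the system's actual API.

def run_quality_checks(records, checks):
    """Run every check; return the names of the checks that failed."""
    return [name for name, check in checks.items() if not check(records)]

def run_pipeline(records, checks, commit):
    failures = run_quality_checks(records, checks)   # Stage 3
    if failures:
        # Any failure rejects the whole batch: nothing reaches the destination.
        raise RuntimeError(f"Validation failed: {failures}")
    commit(records)                                  # Stages 4-5 run only on success

# Example usage with toy checks and an in-memory "destination".
destination = []
checks = {
    "null_check_id": lambda rows: all(r.get("id") is not None for r in rows),
    "uniqueness_check": lambda rows: len({r["id"] for r in rows}) == len(rows),
}
run_pipeline([{"id": 1}, {"id": 2}], checks, destination.extend)
print(destination)  # [{'id': 1}, {'id': 2}] -- committed only after all checks pass
```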
Data Quality Checks
The system performs three types of automated data quality checks:
1. NULL Checks (Required Fields Validation)
Purpose: Ensures that critical fields are populated and not NULL.
What it checks:
- All primary key columns must have values
- Required fields specified in your configuration must not be NULL
Example scenarios:
| Scenario | Result |
|---|---|
| All user IDs are populated | ✅ PASS |
| 5 out of 100 records missing email address | ❌ FAIL |
| All required fields have values | ✅ PASS |
| Primary key column contains NULL values | ❌ FAIL |
When this check fails:
- You'll see: "NULL CHECK FAILED: Found X null values in column 'column_name'"
- Common causes: Missing data at source, incomplete API responses, data extraction errors
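For illustration, a NULL check over a batch of records can be as simple as the hypothetical sketch below; the real system's implementation and exact message format may differ.

```python
# Hypothetical sketch of a NULL check over a batch of records.

def null_check(records, required_columns):
    """Fail if any required column is NULL/missing in any record."""
    failures = []
    for column in required_columns:
        null_count = sum(1 for r in records if r.get(column) is None)
        if null_count:
            failures.append(
                f"NULL CHECK FAILED: Found {null_count} null values in column '{column}'"
            )
    return failures  # an empty list means the check passed

records = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},   # missing required field
]
print(null_check(records, ["customer_id", "email"]))
```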
2. Uniqueness Checks (Duplicate Detection)
Purpose: Ensures that primary key values are unique across all incoming records.
What it checks:
- Primary key values must be unique across all records
- Composite keys (multiple columns) must form unique combinations
Example scenarios:
| Scenario | Result |
|---|---|
| 100 users, 100 unique user IDs | ✅ PASS |
| 100 users, 95 unique user IDs (5 duplicates) | ❌ FAIL |
| Composite key (user_id + order_id) all unique | ✅ PASS |
| Same order ID appears twice for same user | ❌ FAIL |
When this check fails:
- You'll see: "UNIQUENESS CHECK FAILED: Found X duplicate rows"
- Common causes: Source system bugs, race conditions, duplicate API responses
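A uniqueness check, including composite keys, might look like the following hypothetical sketch; the function name and message format are assumptions, not the system's actual code.

```python
# Hypothetical sketch of a uniqueness check, including composite keys
# (several columns that must be unique in combination).

from collections import Counter

def uniqueness_check(records, key_columns):
    """Fail if any combination of the key columns appears more than once."""
    keys = [tuple(r.get(col) for col in key_columns) for r in records]
    duplicates = sum(count - 1 for count in Counter(keys).values() if count > 1)
    if duplicates:
        return [f"UNIQUENESS CHECK FAILED: Found {duplicates} duplicate rows"]
    return []

orders = [
    {"user_id": 1, "order_id": "A"},
    {"user_id": 1, "order_id": "B"},
    {"user_id": 1, "order_id": "A"},   # same (user_id, order_id) twice -> fails
]
print(uniqueness_check(orders, ["user_id", "order_id"]))
```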
3. Freshness Checks (Data Recency Validation)
Purpose: Ensures that data is recent enough relative to your pipeline schedule.
What it checks:
- Data age compared to your pipeline frequency
- Whether new data has arrived since the last run
Freshness Thresholds (automatically set based on your schedule):
| Your Pipeline Runs | Warning After | Error (Reject) After |
|---|---|---|
| Every hour | 2 hours | 6 hours |
| Daily | 25 hours | 48 hours |
| Weekly | 8 days | 14 days |
| Monthly | 32 days | 62 days |
Example scenarios:
| Pipeline Schedule | Last Data Received | Result |
|---|---|---|
| Daily | 1 hour ago | ✅ PASS |
| Daily | 30 hours ago | ⚠️ WARNING (continues) |
| Daily | 50 hours ago | ❌ FAIL (rejects) |
| Hourly | 7 hours ago | ❌ FAIL (rejects) |
When this check fails:
- You'll see: "FRESHNESS CHECK FAILED: Data is X hours old (threshold: Y hours)"
- Common causes: Source system delays, API downtime, network issues, upstream pipeline failures
Important: Warnings don't stop the pipeline - they're just notifications. Only errors stop the pipeline.
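The following sketch shows how a freshness check could apply the warning and error thresholds from the table above. The threshold values mirror that table, but the function and dictionary names are illustrative assumptions.

```python
# Hypothetical freshness check using the warning/error thresholds from the
# table above; names and structure are illustrative only.

from datetime import datetime, timedelta, timezone

THRESHOLDS_HOURS = {            # schedule: (warn_after, reject_after)
    "hourly":  (2, 6),
    "daily":   (25, 48),
    "weekly":  (8 * 24, 14 * 24),
    "monthly": (32 * 24, 62 * 24),
}

def freshness_check(latest_record_time, schedule):
    warn_after, reject_after = THRESHOLDS_HOURS[schedule]
    age_hours = (datetime.now(timezone.utc) - latest_record_time) / timedelta(hours=1)
    if age_hours > reject_after:
        return "FAILED", (f"FRESHNESS CHECK FAILED: Data is {age_hours:.0f} hours old "
                          f"(threshold: {reject_after} hours)")
    if age_hours > warn_after:
        return "WARNING", f"Data is {age_hours:.0f} hours old; warning threshold is {warn_after} hours"
    return "PASSED", ""

# A daily pipeline whose newest record is 30 hours old -> WARNING, pipeline continues.
print(freshness_check(datetime.now(timezone.utc) - timedelta(hours=30), "daily"))
```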
Understanding Validation Results
Test Summary Report
After each pipeline run, you'll receive a test summary showing all validation results:
======================================================
DATA QUALITY TEST SUMMARY
======================================================
Pipeline: customer_data_import
Execution ID: exec-20260111-143022
Status: PASSED ✓
======================================================
Total Tests: 6
Passed: 6
Failed: 0
Warnings: 0
Execution Time: 2.3 seconds
======================================================
MAIN TABLE: customers
✓ null_check_customer_id PASSED
✓ null_check_email PASSED
✓ uniqueness_check PASSED
✓ freshness_check PASSED
RELATED TABLE: customer_orders
✓ null_check_order_id PASSED
✓ freshness_check PASSED
======================================================
RESULT: All data quality checks passed
DATA COMMITTED: 1,247 records written successfully
======================================================
Understanding Test States
| Symbol | State | Meaning | What Happens |
|---|---|---|---|
| ✓ | PASSED | Test passed successfully | Pipeline continues normally |
| ⚠️ | WARNING | Warning threshold exceeded | Pipeline continues, but you should investigate |
| ✗ | FAILED | Test failed | Pipeline stops, data is rejected |
Configuring Data Quality Tests
Enabling Data Quality Checks
When setting up your pipeline, specify which checks to enable:
Option 1: Enable All Checks (Recommended)
{
"data_quality_tests": ["null", "unique", "freshness"]
}
Option 2: Enable Specific Checks
{
"data_quality_tests": ["null", "unique"]
}
Option 3: Disable All Checks (Not Recommended)
{
"data_quality_tests": []
}
⚠️ Warning: Disabling data quality tests means invalid or duplicate data may reach your destination.
What Each Test Requires
| Test Type | Required Configuration | Checks |
|---|---|---|
| null | Primary key fields must be defined | All primary key columns for NULL values |
| unique | Primary key fields must be defined | Primary key uniqueness across all records |
| freshness | Pipeline schedule (cron expression) must be set | Data age against automatic thresholds |
Example Configuration
{
"pipeline_name": "customer_data_sync",
"table_name": "customers",
"dedupe_keys_json_paths": ["customer_id", "email"],
"cron": "0 0 * * *",
"data_quality_tests": ["null", "unique", "freshness"]
}
This configuration will:
- Check that customer_id and email are never NULL
- Check that the combination of customer_id and email is unique
- Check that data is less than 48 hours old (based on the daily schedule)
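To make the mapping from configuration to checks concrete, here is an illustrative sketch that reads a configuration like the example above and decides which checks to run with which parameters. The dispatch logic is an assumption, not the system's actual code.

```python
# Illustrative sketch: selecting checks from a configuration like the example
# above. The keys mirror that example; the dispatch logic is an assumption.

import json

CONFIG = json.loads("""
{
  "pipeline_name": "customer_data_sync",
  "table_name": "customers",
  "dedupe_keys_json_paths": ["customer_id", "email"],
  "cron": "0 0 * * *",
  "data_quality_tests": ["null", "unique", "freshness"]
}
""")

def select_checks(config):
    """Map enabled test names to the parameters each check needs."""
    keys = config["dedupe_keys_json_paths"]
    available = {
        "null": ("null_check", {"required_columns": keys}),
        "unique": ("uniqueness_check", {"key_columns": keys}),
        "freshness": ("freshness_check", {"cron": config["cron"]}),
    }
    return [available[name] for name in config["data_quality_tests"]]

for check_name, params in select_checks(CONFIG):
    print(check_name, params)
```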
What Happens When Validation Fails
Immediate Actions
When any validation check fails:
Pipeline Stops Immediately
- No further processing occurs
- Data is not written to any destination
- All temporary data is preserved for investigation
Detailed Error Logging
- Exact failure reason is recorded
- Number of affected records is logged
- Timestamp of failure is captured
Notification
- Pipeline status is set to FAILED
- Error details are available in logs
- Monitoring systems are alerted
Data Preservation
- Failed data remains in temporary staging area
- You can review the problematic records
- Source data is not affected
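A minimal sketch of this failure path, assuming a standard Python logger and a custom exception type, is shown below; it logs the details, leaves the staged batch untouched for review, and stops before anything is written.

```python
# Minimal sketch of the failure path: log the details, leave staged data in
# place for inspection, and stop. Logger setup and the exception type are
# assumptions, not the system's actual implementation.

import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")

class ValidationError(Exception):
    pass

def handle_validation_failure(pipeline_name, failures, affected_records):
    logger.error(
        "Pipeline %s FAILED at %s: %d records affected; %s",
        pipeline_name,
        datetime.now(timezone.utc).isoformat(),
        affected_records,
        "; ".join(failures),
    )
    # No cleanup here: the staged batch is intentionally preserved for review.
    raise ValidationError(f"{pipeline_name}: {len(failures)} check(s) failed")

# Example (would raise):
# handle_validation_failure("customer_data_import",
#     ["NULL CHECK FAILED: Found 15 null values in column 'email'"], 1000)
```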
Example Failure Scenarios
Scenario 1: NULL Check Failure
What Happened:
NULL CHECK FAILED: Found 15 null values in column 'email'
What This Means:
- 15 customer records are missing email addresses
- Data quality requirement: all customers must have emails
- Pipeline rejected all 1,000 records (not just the 15 bad ones)
What You Should Do:
- Check source system for missing emails
- Verify API is returning complete data
- Fix data quality at source
- Re-run pipeline once source is corrected
Scenario 2: Uniqueness Check Failure
What Happened:
UNIQUENESS CHECK FAILED: Found 8 duplicate rows based on primary key 'customer_id'
What This Means:
- 8 customer IDs appear more than once in the incoming data
- This could indicate duplicate records or improper deduplication
- Pipeline rejected all records to prevent corruption
What You Should Do:
- Investigate why source has duplicates
- Check if source API is being called multiple times
- Verify deduplication logic is correctly configured
- Fix source or adjust deduplication strategy
Scenario 3: Freshness Check Failure
What Happened:
FRESHNESS CHECK FAILED: Data is 52 hours old (threshold: 48 hours for daily pipeline)
What This Means:
- No new data has arrived in over 2 days
- Your daily pipeline expects daily data
- Pipeline rejected the stale data
What You Should Do:
- Check if source system is operational
- Verify source system is generating new data
- Check for upstream pipeline failures
- Investigate network connectivity issues
- Consider if this is expected (e.g., weekend, holiday)
Data Recovery After Failures
Option 1: Fix and Re-run (Recommended)
- Identify and fix the root cause
- Re-run the pipeline
- Fresh data will be extracted and validated
- If validation passes, data will be committed
Option 2: Adjust Configuration
- If validation rules are too strict, adjust thresholds
- Update pipeline configuration
- Re-run pipeline with new settings
Option 3: Temporary Bypass (Use with Caution)
- Temporarily disable specific checks
- Re-run pipeline
- Re-enable checks after resolution
⚠️ Important: Bypassing validation checks should only be done when you're certain the data quality issue is acceptable or has been manually verified.
Best Practices
1. Always Enable Data Quality Tests in Production
✅ Do This:
- Enable all three test types for critical data
- Use NULL and uniqueness checks at minimum
- Monitor test results regularly
❌ Don't Do This:
- Disable validation to "make the pipeline work"
- Ignore warning messages
- Assume source data is always perfect
2. Set Realistic Freshness Thresholds
Your pipeline schedule should match your data frequency:
| Data Updates | Recommended Schedule | Why |
|---|---|---|
| Real-time streaming | Every 15 minutes | Catch delays quickly |
| Daily batch exports | Daily | Match data availability |
| Weekly reports | Weekly | Appropriate for reporting cycles |
| Monthly archives | Monthly | Match archival schedule |
3. Monitor Validation Trends
Track these metrics over time:
- Number of validation failures per month
- Most common failure types
- Time to resolve validation issues
- Data quality improvement trends
4. Document Expected Failures
Some failures may be expected:
- Weekend gaps for business-day-only data
- Holiday periods with no activity
- Scheduled maintenance windows
Document these patterns to avoid false alarms.
5. Set Up Alerting
Configure alerts for:
- ❌ Any validation failure (high priority)
- ⚠️ Repeated warnings (medium priority)
- ℹ️ Unusual data volumes (low priority)
Troubleshooting Common Issues
Problem: NULL Check Keeps Failing
Common Causes:
- Source system has incomplete data
- API authentication is partially failing
- Database view/query is missing joins
- Field mappings are incorrect
How to Diagnose:
- Check recent changes to source system
- Verify API credentials are valid
- Test source query independently
- Review field mapping configuration
Resolution:
- Fix data quality at source (preferred)
- Or adjust which fields are required in configuration
Problem: Uniqueness Check Failing with Duplicates
Common Causes:
- Source has legitimate duplicates
- Incorrect primary key definition
- Race conditions in data extraction
- Time-based data needs timestamp in key
How to Diagnose:
- Query source system directly for duplicates (see the sketch below)
- Review primary key definition - is it truly unique?
- Check if multiple pipeline runs are happening simultaneously
- Verify if records change over time (needs versioning)
Resolution:
- Fix duplicates at source
- Or add timestamp/version to primary key
- Or use different deduplication strategy
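To support the "query source system directly" diagnosis step, a small hypothetical helper like the one below can report which key values repeat and how often, so you can compare the pipeline's view against the source.

```python
# Hypothetical diagnostic helper: report which primary-key values repeat
# and how many times each appears.

from collections import Counter

def find_duplicates(records, key_columns):
    keys = Counter(tuple(r.get(col) for col in key_columns) for r in records)
    return {key: count for key, count in keys.items() if count > 1}

rows = [
    {"customer_id": 42}, {"customer_id": 42}, {"customer_id": 7},
]
print(find_duplicates(rows, ["customer_id"]))   # {(42,): 2}
```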
Problem: Freshness Check Failing Repeatedly
Common Causes:
- Source system is not producing new data
- Pipeline schedule doesn't match data frequency
- Upstream data pipeline is failing
- Time zone mismatches
How to Diagnose:
- Verify source system is operational
- Check when last data was actually created (not extracted)
- Review upstream pipeline status
- Compare timestamps in different time zones
Resolution:
- Fix upstream data generation issues
- Or adjust pipeline schedule to match actual data frequency
- Or correct time zone configurations
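The time zone mismatch cause is easy to reproduce. The small example below (with made-up offsets and timestamps) shows how the same record appears 8 hours older when offsets are ignored, enough to cross the 25-hour warning threshold for a daily pipeline.

```python
# Illustration of the time-zone pitfall: the same record timestamp looks
# 8 hours older if its UTC-8 offset is dropped. Offsets and dates are
# examples only.

from datetime import datetime, timedelta, timezone

source_local = timezone(timedelta(hours=-8))                   # e.g. a UTC-8 source system
record_time = datetime(2026, 1, 11, 6, 0, tzinfo=source_local)
now_utc = datetime(2026, 1, 12, 14, 0, tzinfo=timezone.utc)

correct_age = now_utc - record_time                                             # offsets respected
naive_age = now_utc.replace(tzinfo=None) - record_time.replace(tzinfo=None)     # offsets dropped

print(correct_age)   # 1 day, 0:00:00  -> 24 hours: within a daily pipeline's threshold
print(naive_age)     # 1 day, 8:00:00  -> 32 hours: crosses the 25-hour warning threshold
```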
Problem: Warning Messages Keep Appearing
What Warnings Mean:
- Data quality is borderline but acceptable
- Not severe enough to fail, but needs attention
- Trend may lead to failures soon
What To Do:
- Don't ignore warnings - investigate root cause
- Monitor if warnings are becoming more frequent
- Address issues before they become failures
- Consider if thresholds need adjustment
Problem: All Tests Pass But Data Looks Wrong
Possible Issues:
- Tests are configured but too lenient
- Missing additional validation rules
- Business logic issues not caught by quality checks
What To Do:
- Review test configuration
- Add more specific validation rules
- Consider custom business rule checks
- Implement downstream data validation
Getting Help
If you encounter validation issues you can't resolve:
- Check the logs: Full error details are always logged
- Review test summary: Shows exactly which test failed and why
- Examine sample data: Look at the actual records that failed
- Check documentation: Review configuration requirements
- Contact support: Provide execution ID and error details
Information to Provide When Reporting Issues
- Pipeline name and configuration
- Execution ID (from test summary)
- Complete error message
- When the issue started
- Any recent changes to pipeline or source system
- Sample of affected data (anonymized if sensitive)
Summary
Key Takeaways
✅ Data is validated before being written to destinations
✅ Failed validation = zero data committed (all-or-nothing)
✅ Three check types: NULL, Uniqueness, Freshness
✅ Configure tests based on your data quality requirements
✅ Monitor trends and address warnings proactively
✅ Fix issues at source whenever possible
The Validation Promise
When you see "Pipeline Status: PASSED" with all validation checks marked ✓, you can be confident that:
- All required fields are populated
- All primary keys are unique
- All data is fresh and timely
- All records meet quality standards
- All data has been safely committed
- All data is available in your destination
This guarantee enables you to trust your data and make confident business decisions.
Additional Resources
- Pipeline Configuration Guide: How to set up and configure pipelines
- Monitoring Dashboard: View real-time validation results
- API Documentation: Programmatic access to validation results
- Support Portal: Get help with validation issues