Data schema evolution in Extract, Load, Transform (ELT) pipelines is a critical aspect of managing data as it grows and changes over time. Source systems and business requirements are dynamic leading to schema updates. Consider the following:
Source systems - Assume your source application added or removed a field from the API. This shouldn't cause your pipelines to come to a stall.
Business requirements - Your business wants to report on new KPIs and requests you to add a few more columns to the data pipelines.
Both the above scenarios are extremely common during the lifecycly of data pipliens and handling schema updates is necessary for operational continuity and data integrity.
Schema evolution in DataStori
DataStori automatically evolves your schema as the data or the business requirements change. Consider the following scenarios:
1. Adding new columns:
DataStori adds the new columns to your dataset. For the existing rows, the data is set to null.
- Consider the following original data:
order_id | customer_id | order_date | total_amount |
---|---|---|---|
1 | 101 | 2024-07-01 | 100.00 |
2 | 102 | 2024-07-02 | 150.00 |
- New data:
order_id | customer_id | order_date | customer_email | total_amount |
---|---|---|---|---|
3 | 103 | 2024-07-03 | unknown@example.com | 200.00 |
- Final dataset: New column added from the new data, col value for old data set to null.
order_id | customer_id | order_date | customer_email | total_amount |
---|---|---|---|---|
1 | 101 | 2024-07-01 | null | 100.00 |
2 | 102 | 2024-07-02 | null | 150.00 |
3 | 103 | 2024-07-03 | unknown@example.com | 200.00 |
2. Dropping columns
DataStori will not drop the column from your dataset but will set the value to null for the rows where the column is missing.
- Consider the following original data:
order_id | customer_id | order_date | customer_email | total_amount |
---|---|---|---|---|
1 | 101 | 2024-07-01 | unknown@example.com | 100.00 |
2 | 102 | 2024-07-02 | unknown@example.com | 150.00 |
- New data drops "total_amount" column
order_id | customer_id | order_date | customer_email |
---|---|---|---|
3 | 103 | 2024-07-03 | unknown@example.com |
4 | 104 | 2024-07-04 | customer@domain.com |
- Final dataset
order_id | customer_id | order_date | customer_email | total_amount |
---|---|---|---|---|
1 | 101 | 2024-07-01 | unknown@example.com | 100.00 |
2 | 102 | 2024-07-02 | unknown@example.com | 150.00 |
3 | 103 | 2024-07-03 | unknown@example.com | null |
4 | 104 | 2024-07-04 | customer@domain.com | null |
3. Renaming Columns
DataStori will handle the renaming of the columns in the following method:
- Consider the following original data:
order_id | customer_id | order_date | customer_email | total_amount |
---|---|---|---|---|
1 | 101 | 2024-07-01 | unknown@example.com | 100.00 |
2 | 102 | 2024-07-02 | unknown@example.com | 150.00 |
- New data renames "total_amount" to "sales_amount" column
order_id | customer_id | order_date | customer_email | sales_amount |
---|---|---|---|---|
3 | 103 | 2024-07-03 | unknown@example.com | 200.00 |
4 | 104 | 2024-07-04 | customer@domain.com | 250.00 |
- Final dataset
order_id | customer_id | order_date | customer_email | total_amount | sales_amount |
---|---|---|---|---|---|
1 | 101 | 2024-07-01 | unknown@example.com | 100.00 | null |
2 | 102 | 2024-07-02 | unknown@example.com | 150.00 | null |
3 | 103 | 2024-07-03 | unknown@example.com | null | 200.00 |
4 | 104 | 2024-07-04 | customer@domain.com | null | 250.00 |
4. Changing Data Types
! DataStori will raise an error and stop the pipeline in case of changing data types.
Changing data types can cause further downstream chaos and we require manual intervention to resolve the changing data types.
Documenting the schema evolution
Version controlled schema: DataStori automatically documents each schema update and you can view how the schema has evolved over a period of time.
Rollback schema changes: In addition, if you want to roll back to a previous version of the data (prior to schema evolution), DataStori enables you to do that.
Monitoring schema changes: DataStori alerts you in case of schema updates.