Data Deduplication
When DataStori fetches and loads source data, it carries out the following transformations:
- Deduplicates data - It eliminates duplicate records and retains a single copy of each.
- Flattens data - The API response may be a nested or complex JSON, which DataStori flattens to make it easy for users to consume.
DataStori supports the following data deduplication strategies:
- Full Refresh
- Full Refresh and Append
- Incremental Load
- Incremental Drop and Load
Based on the data and business needs, users can select a dedupe strategy for each API or table (i.e., for each pipeline). The strategies are detailed below.
Full Refresh
This is a data synchronization method where the entire dataset is refreshed or reloaded from the source system into the destination system in every pipeline run.
This strategy is often used when source data changes infrequently or when it is more efficient to reload the entire dataset rather than tracking individual changes.
Benefits
Simplicity: Since the entire dataset is replaced in every pipeline run, it is easy to manage and troubleshoot data synchronization.
Data Consistency: The reporting database is guaranteed to be consistent with the source data after each refresh, eliminating the need to track and apply incremental updates.
Considerations
Data Volume: Full refreshes can be resource intensive, so it is essential to consider the volume of data being processed and the impact on system resources.
Rate of Change of Source Data: This strategy is most suitable for data sources that change infrequently, as frequent full refreshes can lead to unnecessary resource usage.
Historical Data: Users need to decide whether to retain historical data in the reporting database or to replace it with each refresh.
Example
- [Initial dataset] Consider the following data present in the database before a full refresh deduplication.
1. Name: John Doe, Email: john@example.com, Phone: 123-456-7890
2. Name: Jane Smith, Email: jane@example.com, Phone: 987-654-3210
3. Name: Alice Johnson, Email: alice@example.com, Phone: 555-555-5555
4. Name: John Doe, Email: john@example.com, Phone: 123-456-7890 (Duplicate)
5. Name: Bob Winston, Email: bob@example.com, Phone: 111-222-3333
- [New data loaded] This is the incoming data:
1. Name: John Doe, Email: john@example.com, Phone: 123-456-7890
2. Name: Jane Smith, Email: jane@example.com, Phone: 987-654-3210
3. Name: Alice Johnson, Email: alice@example.com, Phone: 555-555-5555
4. Name: Bob Winston, Email: bob@example.com, Phone: 111-222-3333
- [Final dataset] Final state of the database after the full refresh deduplication.
1. Name: John Doe, Email: john@example.com, Phone: 123-456-7890
2. Name: Jane Smith, Email: jane@example.com, Phone: 987-654-3210
3. Name: Alice Johnson, Email: alice@example.com, Phone: 555-555-5555
4. Name: Bob Winston, Email: bob@example.com, Phone: 111-222-3333
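As a minimal sketch (not DataStori's internal implementation), a full refresh replaces the destination with a deduplicated copy of the incoming batch. The field names below mirror the example above:

```python
def full_refresh(existing, incoming):
    """Replace the destination entirely with a deduplicated copy of the incoming data."""
    seen = set()
    deduped = []
    for record in incoming:
        key = (record["name"], record["email"], record["phone"])
        if key not in seen:  # retain a single copy of each record
            seen.add(key)
            deduped.append(record)
    return deduped  # the existing data is discarded, not merged

existing = [{"name": "Old Row", "email": "old@example.com", "phone": "000-000-0000"}]
incoming = [
    {"name": "John Doe", "email": "john@example.com", "phone": "123-456-7890"},
    {"name": "Jane Smith", "email": "jane@example.com", "phone": "987-654-3210"},
    {"name": "John Doe", "email": "john@example.com", "phone": "123-456-7890"},  # duplicate
]
result = full_refresh(existing, incoming)
```

Note that the pre-existing rows never reach the output: the entire destination table is rebuilt from the incoming batch on every run.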
Full Refresh and Append
Data load follows the same strategy as Full Refresh.
The main difference between Full Refresh and Full Refresh and Append is in how pre-existing data is handled. Full Refresh replaces the entire dataset, while Full Refresh and Append retains existing data and appends the new data. The choice between the two depends on the user's data management needs and whether historical data must be retained in the dataset.
Example
- [Initial dataset] Consider the following data present in the database before a full refresh and append deduplication.
1. Name: John Doe, Email: john@example.com, Phone: 123-456-7890, ingestion_date: 2022-01-01
2. Name: Jane Smith, Email: jane@example.com, Phone: 987-654-3210, ingestion_date: 2022-01-01
3. Name: Alice Johnson, Email: alice@example.com, Phone: 555-555-5555, ingestion_date: 2022-01-01
4. Name: John Doe, Email: john@example.com, Phone: 123-456-7890 (Duplicate), ingestion_date: 2022-01-01
5. Name: Bob Winston, Email: bob@example.com, Phone: 111-222-3333, ingestion_date: 2022-01-01
- [New data loaded] This is the incoming data:
1. Name: John Doe, Email: john@example.com, Phone: 123-456-7890, ingestion_date: 2022-01-02
2. Name: Jane Smith, Email: jane@example.com, Phone: 987-654-3210, ingestion_date: 2022-01-02
3. Name: Alice Johnson, Email: alice@example.com, Phone: 555-555-5555, ingestion_date: 2022-01-02
4. Name: Bob Winston, Email: bob@example.com, Phone: 111-222-3333, ingestion_date: 2022-01-02
- [Final dataset] Final state of the database after the full refresh and append deduplication.
1. Name: John Doe, Email: john@example.com, Phone: 123-456-7890, ingestion_date: 2022-01-01
2. Name: Jane Smith, Email: jane@example.com, Phone: 987-654-3210, ingestion_date: 2022-01-01
3. Name: Alice Johnson, Email: alice@example.com, Phone: 555-555-5555, ingestion_date: 2022-01-01
4. Name: John Doe, Email: john@example.com, Phone: 123-456-7890 (Duplicate), ingestion_date: 2022-01-01
5. Name: Bob Winston, Email: bob@example.com, Phone: 111-222-3333, ingestion_date: 2022-01-01
6. Name: John Doe, Email: john@example.com, Phone: 123-456-7890, ingestion_date: 2022-01-02
7. Name: Jane Smith, Email: jane@example.com, Phone: 987-654-3210, ingestion_date: 2022-01-02
8. Name: Alice Johnson, Email: alice@example.com, Phone: 555-555-5555, ingestion_date: 2022-01-02
9. Name: Bob Winston, Email: bob@example.com, Phone: 111-222-3333, ingestion_date: 2022-01-02
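A sketch of the append variant, under the same caveat that this is illustrative rather than DataStori's actual code: existing rows are left untouched, and the incoming batch is stamped with its ingestion date and appended.

```python
def full_refresh_and_append(existing, incoming, ingestion_date):
    """Keep all existing rows; stamp the incoming batch with its ingestion date and append it."""
    stamped = [dict(record, ingestion_date=ingestion_date) for record in incoming]
    return existing + stamped  # history is retained, so repeats across runs are expected

existing = [
    {"name": "John Doe", "ingestion_date": "2022-01-01"},
    {"name": "Jane Smith", "ingestion_date": "2022-01-01"},
]
incoming = [{"name": "John Doe"}, {"name": "Jane Smith"}]
history = full_refresh_and_append(existing, incoming, "2022-01-02")
```

Each run adds a full snapshot of the source, so the same record appears once per ingestion date; the ingestion_date column lets consumers pick a specific snapshot.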
Incremental Load
Incremental loading identifies and eliminates duplicate records from a dataset by processing only the new and changed data added since the last deduplication (or pipeline run). This approach minimizes processing overhead and execution time.
To run an incremental dedupe, the following fields are required:
- Unique Key - Identifies whether a record is new or existing, i.e., whether it should be inserted or updated in the database.
- Sort Key / Cursor - When duplicate records arrive, the sort key identifies the latest version.
Example
Suppose you maintain a set of customer records, and you want to add new customers. The initial dataset (with unique records) is as follows:
- Customer ID: 1, Name: John Smith, Created Date: 2023-01-15, Last Modified Date: 2023-02-20
- Customer ID: 2, Name: Jane Doe, Created Date: 2023-02-10, Last Modified Date: 2023-03-25
- Customer ID: 3, Name: Alice Johnson, Created Date: 2023-03-05, Last Modified Date: 2023-04-15
Over time, new customers are added and existing ones updated as follows:
- New Customer: Bob Winston is added with Created Date as 2023-04-01.
- Existing Customer Update: John Smith has some updated information, and the Last Modified Date is changed to 2023-04-05.
- New Customer: David Lee is added with Created Date as 2023-04-10.
The incoming dataset is as follows:
- Customer ID: 4, Name: Bob Winston, Created Date: 2023-04-01, Last Modified Date: 2023-04-01
- Customer ID: 1, Name: John Smith, Created Date: 2023-01-15, Last Modified Date: 2023-04-05
- Customer ID: 5, Name: David Lee, Created Date: 2023-04-10, Last Modified Date: 2023-04-10
Perform an incremental deduplication on Customer ID and Last Modified Date. For each new or updated record in the incoming dataset, compare it to all pre-existing records, and:
- If there is no pre-existing record with the same Customer ID (4, 5), then insert the record in the dataset.
- If a record with the same Customer ID (1) is present from before, then use the Last Modified Date to find the most recent record and update the record with Customer ID = 1 in the dataset.
The final dataset after dedupe is as follows:
- Customer ID: 2, Name: Jane Doe, Created Date: 2023-02-10, Last Modified Date: 2023-03-25 (Retained from the last fetch)
- Customer ID: 3, Name: Alice Johnson, Created Date: 2023-03-05, Last Modified Date: 2023-04-15 (Retained from the last fetch)
- Customer ID: 4, Name: Bob Winston, Created Date: 2023-04-01, Last Modified Date: 2023-04-01 (Inserted record)
- Customer ID: 1, Name: John Smith, Created Date: 2023-01-15, Last Modified Date: 2023-04-05 (Updated record)
- Customer ID: 5, Name: David Lee, Created Date: 2023-04-10, Last Modified Date: 2023-04-10 (Inserted record)
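The upsert above can be sketched with a unique key (Customer ID) and a sort key (Last Modified Date). The function and field names are illustrative, and DataStori's actual merge logic may differ:

```python
def incremental_load(existing, incoming, unique_key, sort_key):
    """Insert records with new keys; for existing keys, keep the latest version."""
    merged = {row[unique_key]: row for row in existing}
    for row in incoming:
        current = merged.get(row[unique_key])
        # ISO-8601 date strings compare correctly as plain strings.
        if current is None or row[sort_key] >= current[sort_key]:
            merged[row[unique_key]] = row
    return list(merged.values())

existing = [
    {"id": 1, "name": "John Smith", "modified": "2023-02-20"},
    {"id": 2, "name": "Jane Doe", "modified": "2023-03-25"},
]
incoming = [
    {"id": 1, "name": "John Smith", "modified": "2023-04-05"},   # updated record
    {"id": 4, "name": "Bob Winston", "modified": "2023-04-01"},  # new record
]
final = incremental_load(existing, incoming, "id", "modified")
```

Only the incoming batch is scanned per run; untouched records (Customer ID 2 here) are carried over as-is.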
Deleted Records
Deletes in incremental mode can be handled only through a soft-delete approach, i.e., the rows carry a marker indicating that the record has been deleted and is no longer in use.
- Customer ID: 1, Name: John Smith, Status: Active
- Customer ID: 2, Name: Jane Doe, Status: Deleted
- Customer ID: 3, Name: Alice Johnson, Status: Active
Hard Delete - If a record is permanently deleted from the source system, it will not appear in the next data load, but it will continue to reside in the database. To account for hard deletes, occasionally run a Full Refresh of the dataset, or consider Incremental Drop and Load.
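With soft deletes in place, downstream consumers simply filter on the marker column. A minimal sketch, assuming a Status column as in the example above:

```python
def active_records(rows, status_field="Status"):
    """Exclude soft-deleted rows when reading the dataset."""
    return [row for row in rows if row.get(status_field) != "Deleted"]

customers = [
    {"id": 1, "name": "John Smith", "Status": "Active"},
    {"id": 2, "name": "Jane Doe", "Status": "Deleted"},
    {"id": 3, "name": "Alice Johnson", "Status": "Active"},
]
active = active_records(customers)
```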
Incremental Drop and Load
Consider the following data scenarios:
- Records are being deleted from your dataset.
- You don't have a unique key or a sort key in your dataset to identify the latest records.
Incremental drop and load works in the following way:
- Create a sort key by combining columns to give a unique value.
- Pick up the sort key values from the incoming data, and drop all rows with those values from the existing dataset.
- Append the new dataset to the existing dataset.
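The three steps above can be sketched as follows; the composite key columns (`region`, `day`) are illustrative, not a DataStori convention:

```python
def incremental_drop_and_load(existing, incoming, key_columns):
    """Drop existing rows whose composite key appears in the incoming batch, then append."""
    def composite_key(row):
        # Step 1: combine several columns into a single sort-key value.
        return tuple(row[col] for col in key_columns)

    # Step 2: collect the keys present in the incoming data and drop matching rows.
    incoming_keys = {composite_key(row) for row in incoming}
    kept = [row for row in existing if composite_key(row) not in incoming_keys]
    # Step 3: append the new data.
    return kept + incoming

existing = [
    {"region": "EU", "day": "2024-05-01", "sales": 10},
    {"region": "EU", "day": "2024-05-02", "sales": 12},
]
incoming = [{"region": "EU", "day": "2024-05-02", "sales": 15}]  # restated partition
refreshed = incremental_drop_and_load(existing, incoming, ["region", "day"])
```

Because every restated partition is dropped wholesale before the append, this handles deletes and updates within a partition without needing a per-record unique key.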