Concepts
Source
Source is the cloud application from which data is ingested. A source may make its data available via API endpoints, emailed CSV files, SQL databases, or files in SFTP folders. Examples: NetSuite, HubSpot, JIRA, BambooHR.
Destination
Destination is the data store where ingested data is written. By default, DataStori writes output data to the customer's cloud storage - AWS S3, Azure Blob Storage or Google Cloud Storage. DataStori can further write the data (all of it or a selected subset) to SQL databases in the customer's cloud, such as PostgreSQL, MySQL, Azure SQL, or any SQLAlchemy-supported database.
Integration
An integration is a connection between a source and a destination, for example NetSuite -> Azure SQL.
Data Pipeline
A data pipeline is a component that copies specific data from a source to a destination using an integration. For example, a data pipeline can be built to copy the GL Details table from NetSuite to Azure SQL using the NetSuite -> Azure SQL integration.
Components and Functions of a Data Pipeline
Data Deduplication
While ingesting data from the source, DataStori ensures that no data is duplicated by applying one of the following dedupe strategies (a sketch of how they differ follows the list):
- Full Refresh - Sync all records from the source, and replace data in the destination by overwriting it.
- Full Refresh And Append - Sync all records from the source, and add them to the destination without deleting any data.
- Incremental Dedupe - Sync new records from the source, and add them to the destination without deleting any data.
- Incremental Drop and Load - Sync new records from the source where unique keys are not available.
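A minimal sketch of how these strategies differ in write behavior, using pandas; the function and column names are illustrative and not DataStori's internal API (unique keys and sort keys are defined below):

import pandas as pd

def apply_strategy(existing, incoming, strategy, unique_keys=None, sort_key=None):
    # Illustrative only: returns what the destination would hold after the sync.
    if strategy == "full_refresh":
        # Replace the destination with the latest full sync.
        return incoming
    if strategy == "full_refresh_and_append":
        # Keep existing rows and append the full sync without deleting anything.
        return pd.concat([existing, incoming], ignore_index=True)
    if strategy == "incremental_dedupe":
        # Append new rows, then keep only the latest row per unique key,
        # ordered by the sort key.
        combined = pd.concat([existing, incoming], ignore_index=True)
        return (combined.sort_values(sort_key)
                        .drop_duplicates(subset=unique_keys, keep="last"))
    if strategy == "incremental_drop_and_load":
        # Used when unique keys are not available; here the new batch is simply
        # appended (the exact drop-and-reload behavior depends on the source).
        return pd.concat([existing, incoming], ignore_index=True)
    raise ValueError(f"Unknown dedupe strategy: {strategy}")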
Unique Keys
The columns in a table that uniquely identify a record. For example, employee_id is a unique key in an 'Employee' table.
Unique keys are used to build relationships within a dataset. For example, employee_id is used to join the Employee table to the Payroll table to obtain all payroll transactions for each employee.
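For instance, the join on the unique key might look like this in pandas (the table contents are illustrative):

import pandas as pd

employees = pd.DataFrame({"employee_id": [101, 102], "name": ["Asha", "Ben"]})
payroll = pd.DataFrame({"employee_id": [101, 101, 102], "amount": [5000, 5200, 4800]})

# Join the Payroll table to the Employee table on the unique key employee_id
# to obtain all payroll transactions for each employee.
payroll_by_employee = payroll.merge(employees, on="employee_id", how="left")
print(payroll_by_employee)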
Sort Keys
The columns in a table that identify the latest record when two or more records have the same unique key. For example, the last_modified_date column is used to dedupe two records with the same employee_id and retain only the latest one.
Sort keys indicate data recency, and can be used as the basis for further action on a record, such as dedupe or delete.
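A minimal sketch of deduping on a unique key with a sort key, using pandas (column names are illustrative):

import pandas as pd

records = pd.DataFrame({
    "employee_id": [101, 101, 102],
    "salary": [5000, 5500, 4800],
    "last_modified_date": ["2024-01-01", "2024-03-01", "2024-02-15"],
})

# Order by the sort key, then keep only the latest record per unique key.
deduped = (records.sort_values("last_modified_date")
                  .drop_duplicates(subset=["employee_id"], keep="last"))
print(deduped)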
Pipeline Variables
The dynamic inputs passed to data pipelines. For example, to run a pipeline with yesterday's date as an input, define a pipeline variable with the following value:
(datetime.date.today() - datetime.timedelta(days=1)).strftime('%m/%d/%Y')
Pipeline variables are Python expressions. Dates are common variables, and the datetime library can be used to compute the required value.
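The value above can be checked in a standalone Python session (this snippet only evaluates the expression; it is not the DataStori runtime):

import datetime

# Yesterday's date formatted as MM/DD/YYYY, matching the pipeline variable above.
yesterday = (datetime.date.today() - datetime.timedelta(days=1)).strftime('%m/%d/%Y')
print(yesterday)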
Pipeline Schedule
The schedule on which a data pipeline is set up to run. Pipelines can be run on a schedule or on demand.
Pipeline schedules are time zone and daylight saving aware, and adjust automatically when clocks change twice a year.
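As an illustration of what daylight saving awareness means, Python's zoneinfo resolves the same local run time to different UTC offsets on either side of a clock change (the time zone and run time are illustrative):

from datetime import datetime
from zoneinfo import ZoneInfo

tz = ZoneInfo("America/New_York")

# A 06:00 local run time before and after the March 2024 daylight saving change.
winter_run = datetime(2024, 3, 8, 6, 0, tzinfo=tz)   # EST, UTC-05:00
summer_run = datetime(2024, 3, 12, 6, 0, tzinfo=tz)  # EDT, UTC-04:00
print(winter_run.utcoffset(), summer_run.utcoffset())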
Data Flattening / Normalization
DataStori flattens the ingested data. If the response is JSON, it is converted into a flat table to make it easy to consume for downstream analytics and reporting.
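A minimal sketch of JSON flattening with pandas (the payload is illustrative):

import pandas as pd

response = [
    {"id": 1, "customer": {"name": "Acme", "country": "US"}, "total": 120.5},
    {"id": 2, "customer": {"name": "Globex", "country": "DE"}, "total": 99.0},
]

# Nested JSON fields become flat columns such as customer.name and customer.country.
flat = pd.json_normalize(response, sep=".")
print(flat)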
Data Schema Management
DataStori automatically manages the schema of the data. If a new field is added to an API response or CSV file, DataStori automatically adds it to the output file and SQL table. Schema changes are tracked in DataStori, capturing the history of schema evolution in the source table.
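The idea behind schema management can be sketched as a simple column comparison (not DataStori's actual implementation; names are illustrative):

import pandas as pd

def detect_new_columns(incoming, known_columns):
    # Columns present in the latest API response or CSV file but not yet in the
    # output table; these would be added to the destination and recorded as a
    # schema change.
    return [c for c in incoming.columns if c not in known_columns]

known = {"employee_id", "name"}
batch = pd.DataFrame({"employee_id": [101], "name": ["Asha"], "department": ["Sales"]})
print(detect_new_columns(batch, known))  # ['department']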
Data Backload / Historical Load
DataStori supports the load of historical data from APIs. Users define the data to be backloaded in a single request, and DataStori splits and runs it as multiple loads to manage API rate limits and optimize throughput and performance. For example, a user may ask for two years of past data, and DataStori could execute the request in 24 batches of one month each.
The amount of data that can be loaded in a single API call is defined and limited by the source application. For example, an API call may not return more than 50,000 records, or the API response size may be capped at 5 MB. DataStori automatically factors in these constraints and splits the data backload request into multiple executions.
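A sketch of splitting a two-year backload into monthly batches (the date range and batch size are illustrative; in practice they would follow from the source's limits):

import datetime

def monthly_batches(start, end):
    # Yield (batch_start, batch_end) pairs, one per calendar month, so that each
    # execution stays within the source's rate and response-size limits.
    current = start
    while current < end:
        if current.month == 12:
            next_month = datetime.date(current.year + 1, 1, 1)
        else:
            next_month = datetime.date(current.year, current.month + 1, 1)
        yield current, min(next_month, end)
        current = next_month

batches = list(monthly_batches(datetime.date(2022, 1, 1), datetime.date(2024, 1, 1)))
print(len(batches))  # 24 monthly batches for two years of history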
Data Quality
DataStori runs automated test cases on input data quality and raises alerts if they fail. The following tests run on all incoming data (a sketch follows the list):
- Uniqueness test: Check if all the primary keys are unique.
- Not null test: Check if primary keys contain a null value, and if so, fail the pipeline and raise an alert.
- Data freshness test: Check if the data is getting refreshed on schedule, or if it is getting stale because of a pipeline run failure.
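A minimal sketch of these three checks with pandas (column names and the freshness threshold are illustrative):

import datetime
import pandas as pd

def run_quality_checks(df, primary_key, loaded_at_column, max_staleness_hours=24):
    now = pd.Timestamp(datetime.datetime.now())
    return {
        # Uniqueness test: every primary key value appears exactly once.
        "uniqueness": bool(df[primary_key].is_unique),
        # Not null test: no primary key value is missing.
        "not_null": bool(df[primary_key].notna().all()),
        # Data freshness test: the latest load is within the expected window.
        "freshness": bool(now - pd.to_datetime(df[loaded_at_column]).max()
                          < pd.Timedelta(hours=max_staleness_hours)),
    }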
Data Documentation
DataStori creates automated documentation for all pipeline schedules and executions, and keeps it in sync as they change over time.
Data Rollback
DataStori has a provision to roll back to a previous state of data. Users can track how the data has evolved, and restore it to a prior state if new data has an error that impacts any downstream processes.
Pipeline Retries and Failure Notifications
In the event of a pipeline execution failure, DataStori makes three attempts to rerun it. If it still fails, the pipeline is marked as failed and the user is notified with an error code and reason.
After three failed retries, the pipeline has to be rerun manually by the user.
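The retry behavior can be pictured as a simple loop (a sketch, not DataStori's scheduler code):

import time

MAX_RETRIES = 3  # rerun attempts after the initial failure

def run_with_retries(run_pipeline, notify):
    # Initial run plus up to MAX_RETRIES reruns; after the final failure the
    # pipeline is marked failed, the user is notified with the error, and any
    # further rerun is manual.
    for attempt in range(MAX_RETRIES + 1):
        try:
            return run_pipeline()
        except Exception as error:
            if attempt == MAX_RETRIES:
                notify(f"Pipeline failed after {MAX_RETRIES} retries: {error}")
                raise
            time.sleep(60 * (attempt + 1))  # back off before the next rerun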
Pipeline Concurrency
The number of concurrent runs of a single data pipeline. By default, pipeline concurrency is set to 1 to avoid any race conditions.