Scheduling and Data Load

Data pipeline setup in DataStori is a three-step process:

Configure Ingestion
Scheduling and Data Load
Destination Setup

This article covers pipeline scheduling and data loading, which is the second step after configuring data ingestion.

Scheduling

In the Schedules tab, set the time schedule on which the data pipeline is to run.

Schedules can be created, edited, or viewed from this tab. The tab also lists all pipelines associated with a given schedule.

Tip: Schedule time zones adjust automatically for daylight saving time.

Data Load

Next, we select the data load strategy, which defines how data is synchronized between source and destination. The objective is to ensure that no source data is omitted or duplicated. In essence, a data load is either an incremental load or a full refresh.
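The difference between the two kinds of load can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DataStori's implementation: the destination is modeled as an in-memory list, and the function names and the `id` and `status` fields are hypothetical.

```python
def full_refresh(destination, source_rows):
    """Replace everything in the destination with the current source rows."""
    destination.clear()
    destination.extend(source_rows)
    return destination


def incremental_load(destination, new_rows, key):
    """Insert new rows and update existing ones, matched on a unique key."""
    # Map each key value to its position in the destination list.
    index = {row[key]: i for i, row in enumerate(destination)}
    for row in new_rows:
        if row[key] in index:
            destination[index[row[key]]] = row  # update the existing record
        else:
            index[row[key]] = len(destination)
            destination.append(row)  # insert a new record
    return destination


# Hypothetical data: record 2 changed at the source, record 3 is new.
dest = [{"id": 1, "status": "open"}, {"id": 2, "status": "open"}]
changes = [{"id": 2, "status": "closed"}, {"id": 3, "status": "open"}]
incremental_load(dest, changes, key="id")
print(dest)
```

After the incremental load, the destination holds three records and record 2 carries the updated status. Nothing is omitted and nothing is duplicated.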

The data load strategies built into DataStori are detailed here.

Incremental dedupe requires the following metadata:

Unique keys, which are defined using JSON Paths.
Sort keys, which decide the most recent record.

JSON Paths

JSON Path is a query language for JSON. DataStori lets users define the JSON Path to specify the required JSON fields.

A JSONPath expression specifies a path to an element (or a set of elements) in a JSON structure. Paths can use the dot notation. For example, $.store.book[0].title

Dots are only used before property names, not in brackets.

Example

{
    "store": {
        "book": [
            {
                "category": "reference",
                "author": "Nigel Rees",
                "title": "Sayings of the Century",
                "price": 8.95
            },
            {
                "category": "fiction",
                "author": "Evelyn Waugh",
                "title": "Sword of Honour",
                "price": 12.99
            },
            {
                "category": "fiction",
                "author": "Herman Melville",
                "title": "Moby Dick",
                "isbn": "0-553-21311-3",
                "price": 8.99
            },
            {
                "category": "fiction",
                "author": "J. R. R. Tolkien",
                "title": "The Lord of the Rings",
                "isbn": "0-395-19395-8",
                "price": 22.99
            }
        ],
        "bicycle": {
            "color": "red",
            "price": 19.95
        }
    },
    "expensive": 10
}

JSONPath	Result
$.store.book[*].author	The authors of all books
$..author	All authors
$.store.*	All things, both books and bicycles
$.store.book[*].price	The price of all the books
$..book[2]	The third book
$..book.length()	The number of books
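To make the dot notation concrete, here is a minimal sketch of a resolver for a small subset of JSONPath. It is an assumption-laden illustration, not the evaluator DataStori uses: the `resolve` function is hypothetical, and it handles only property names, `[index]`, and `[*]`. Recursive descent (`..`), `.*`, and functions such as `length()` are rejected.

```python
import re

# A shortened version of the example document above.
doc = {
    "store": {
        "book": [
            {"author": "Nigel Rees", "title": "Sayings of the Century", "price": 8.95},
            {"author": "Evelyn Waugh", "title": "Sword of Honour", "price": 12.99},
        ]
    }
}


def resolve(path, data):
    """Return the list of values matched by a simple dot-notation path."""
    # Accept only $ followed by .name, [index] or [*] segments.
    if not re.fullmatch(r"\$(\.[A-Za-z_]\w*|\[(\d+|\*)\])*", path):
        raise ValueError(f"unsupported path: {path}")
    tokens = re.findall(r"\.([A-Za-z_]\w*)|\[(\d+|\*)\]", path[1:])
    current = [data]
    for name, index in tokens:
        matched = []
        for node in current:
            if name:  # property access, e.g. .store
                if isinstance(node, dict) and name in node:
                    matched.append(node[name])
            elif index == "*":  # every element of an array
                if isinstance(node, list):
                    matched.extend(node)
            elif isinstance(node, list) and int(index) < len(node):
                matched.append(node[int(index)])  # one array element
        current = matched
    return current


print(resolve("$.store.book[*].author", doc))  # ['Nigel Rees', 'Evelyn Waugh']
print(resolve("$.store.book[0].title", doc))   # ['Sayings of the Century']
```

Each step narrows a list of matched nodes, which is why a path containing `[*]` can return several values while a path with a fixed index returns at most one.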

Dedupe Keys / Unique Keys JSON Paths

Unique keys identify unique records and determine whether an incoming record should be updated or inserted. Unique keys are defined using JSON Paths. If the unique key is a composite key, users can define multiple JSON Paths.

In the above example, if 'category' and 'author' together define a unique key, then the unique key can be defined as a combination of $.store.book[*].category and $.store.book[*].author.
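As a rough sketch of how a composite key drives the update-or-insert decision, the snippet below combines 'category' and 'author' into a single tuple. The `composite_key` helper and the incoming record are hypothetical, and the book records are assumed to have already been extracted from the JSON document.

```python
# Book records taken from the example document (fields trimmed for brevity).
books = [
    {"category": "reference", "author": "Nigel Rees", "title": "Sayings of the Century"},
    {"category": "fiction", "author": "Evelyn Waugh", "title": "Sword of Honour"},
    {"category": "fiction", "author": "Herman Melville", "title": "Moby Dick"},
]


def composite_key(record, fields=("category", "author")):
    """Build one hashable key from several field values."""
    return tuple(record[field] for field in fields)


# Keys already present at the destination.
existing = {composite_key(book) for book in books}

# Hypothetical incoming record with the same category and author as an existing book.
incoming = {"category": "fiction", "author": "Herman Melville", "title": "Moby-Dick; or, The Whale"}
action = "update" if composite_key(incoming) in existing else "insert"
print(action)  # update
```

Note that 'category' alone would not be unique here, since two books share the value "fiction". The combination with 'author' is what distinguishes them.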

DataStori derives unique keys from the JSON Path and uses them to run incremental dedupe.

The UI provides a JSON path select editor as well as a manual input box.

Sort Keys JSON Paths

Sort keys are required to identify the latest record in order to eliminate duplicate records from the data load.

Sort keys are also defined using JSON Paths. Users can define multiple sort keys, and data is sorted by the specified keys.
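The interaction between unique keys and sort keys can be sketched as follows. This is a simplified illustration, not DataStori's actual dedupe logic: the `dedupe_latest` function and the `id`, `updated_at`, and `version` fields are hypothetical, and the sketch assumes that a larger sort key value means a more recent record and that multiple sort keys are compared in the order given.

```python
def dedupe_latest(records, unique_keys, sort_keys):
    """Keep, for each unique key, the record with the highest sort key values."""
    latest = {}
    for record in records:
        key = tuple(record[k] for k in unique_keys)
        rank = tuple(record[k] for k in sort_keys)
        kept = latest.get(key)
        # Replace the kept record when the new one ranks at least as high.
        if kept is None or rank >= tuple(kept[k] for k in sort_keys):
            latest[key] = record
    return list(latest.values())


# Hypothetical rows: each id appears twice with different timestamps or versions.
rows = [
    {"id": 1, "updated_at": "2024-01-01", "version": 1, "price": 8.95},
    {"id": 1, "updated_at": "2024-01-02", "version": 1, "price": 9.50},
    {"id": 2, "updated_at": "2024-01-01", "version": 2, "price": 12.99},
    {"id": 2, "updated_at": "2024-01-01", "version": 1, "price": 11.99},
]

result = dedupe_latest(rows, unique_keys=["id"], sort_keys=["updated_at", "version"])
print([(r["id"], r["price"]) for r in result])  # [(1, 9.5), (2, 12.99)]
```

For id 1 the later `updated_at` wins. For id 2 the timestamps tie, so the second sort key, `version`, decides.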
