
Data organization in DataStori

At DataStori, we organize data using a classic three-layer strategy. This pattern is now referred to as the Medallion architecture.

Medallion Architecture

Three layers for data organization:

Bronze layer: Raw data

This is where raw data lands from APIs, emails and SFTP servers. The data is stored in exactly the form in which it is received from the sources, and can be in any format (JSON, XML, CSV, XLSX). Data in this layer should be used with the most caution, as it can contain duplicates, inconsistencies and errors.

The data in this layer can be used for debugging and traceability in case any errors are reported in the downstream layers.

The technical expertise required to handle this layer is high.

Silver layer: Clean data

This is where data is read from the Bronze layer, cleaned, flattened, schema-checked, deduped and checked for basic quality. The data in this layer is usually in Parquet or Delta file format. The data in this layer is good for consumption by data analysts and data analysis tools.

The technical expertise required to handle the data in this layer is moderate.

Gold layer: Business data

The data in this layer is read from the Silver layer and transformed as per your business requirements, for example, joining data coming from NetSuite and ServiceTitan, or performing a data aggregation. The data format in this layer can be anything your business requires (Excel, CSV, Parquet).

You can have business rules in this layer to test the data for accuracy of KPIs and metrics. The data in this layer is best suited for your reporting tools and reporting requirements.

The technical expertise required to handle the data in this layer is moderate.
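To make the three layers concrete, here is a minimal sketch in Python using pandas; the file paths, column names and the "orders" dataset are illustrative assumptions, not DataStori's actual pipeline.

```python
# Illustrative only: paths, columns and the "orders" dataset are assumptions.
import pandas as pd

# Bronze: raw JSON exactly as received from the source API.
bronze = pd.read_json("bronze/orders_2024-06-01.json")

# Silver: enforce a schema, dedupe, and store as Parquet (requires pyarrow).
silver = (
    bronze.astype({"order_id": "string", "amount": "float64"})
    .drop_duplicates(subset=["order_id"])
)
silver.to_parquet("silver/orders.parquet", index=False)

# Gold: aggregate to a business-level metric (monthly revenue) as CSV.
gold = (
    silver.assign(month=pd.to_datetime(silver["order_date"]).dt.to_period("M"))
    .groupby("month", as_index=False)["amount"]
    .sum()
)
gold.to_csv("gold/monthly_revenue.csv", index=False)
```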

Databases and SQL Query Engines

The data for the above three layers is created and stored in cloud storage (AWS S3, Azure Blob or Google Cloud storage).

With the data available in the Silver and Gold layers for consumption, why do you still need a database? Note that the data in cloud storage is kept in the form of files, and you still need a query engine to run analysis on it. This is why data from the Silver or Gold layer either needs to be pushed to a database, or you need a tool that offers a querying mechanism directly on top of these layers.

Examples of databases: Snowflake, Azure SQL, Postgres. Examples of tools operating directly on your data in storage: the SQL analytics endpoint in Microsoft Fabric (internally, it too is a warehouse!).
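As a rough sketch, this is what querying Silver-layer files directly with a query engine could look like; DuckDB is used here purely as an example engine, and the file path and columns are assumptions carried over from the sketch above.

```python
# Illustrative only: DuckDB is one example of a query engine over files.
import duckdb

con = duckdb.connect()
# With DuckDB's httpfs extension, the same query can run against s3:// or
# other cloud-storage paths; here we point at a local Silver-layer file.
result = con.execute(
    "SELECT order_id, amount "
    "FROM read_parquet('silver/orders.parquet') "
    "WHERE amount > 100"
).df()
print(result.head())
```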

Other ways of organizing the data

Medallion is not the only data organization architecture; there are others too, e.g. Lambda, Kappa, Data Marts and Data Mesh. Your choice of architecture is dictated by your use case.

For DataStori, we find that the Medallion architecture serves our customers' use cases best. In addition, Medallion is easy to understand, defines the ownership of each layer cleanly, makes it easier to establish lineage, and is generic enough to support a broad set of use cases.

API Authentication methods

APIs are used for machine-to-machine data exchange. When a machine tries to access data from an application, it needs to identify itself. Machines can identify themselves in broadly one of the following ways:

  • API Key
  • Basic Authentication
  • OAuth2 Authentication

API Key

The source application provides a static API key, and the caller has to present this key while fetching the information. Leaking an API key is equivalent to leaking credentials.
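A minimal sketch of an API-key call in Python; the URL and header name are placeholders, since each application documents its own key header or query parameter.

```python
# Illustrative only: URL and header name are placeholders.
import requests

resp = requests.get(
    "https://api.example.com/v1/invoices",
    headers={"X-API-Key": "YOUR_API_KEY"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```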

Basic Authentication

In Basic Authentication, the caller needs to present the username and password to access the data.
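A minimal sketch of Basic Authentication with Python's requests library; the URL and credentials are placeholders. The library Base64-encodes the username:password pair into the Authorization header, so always use HTTPS.

```python
# Illustrative only: URL and credentials are placeholders.
import requests

resp = requests.get(
    "https://api.example.com/v1/invoices",
    auth=("my_username", "my_password"),  # sent Base64-encoded over HTTPS
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```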

OAuth2 Authentication

In OAuth2, the caller needs to generate a temporary access token and pass it to the application to fetch the data. This is one of the most secure methods of fetching data.

OAuth2 is a standard spec, and security can be further enhanced with additional properties (refer to High Security OAuth in the OAuth2 spec).

In addition to security, OAuth2 also allows you to have a fine grained access control on your resources.
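As a sketch, the client-credentials grant (one of several OAuth2 flows) looks roughly like this; the token URL, scope and response field names are placeholders and vary by provider.

```python
# Illustrative only: endpoints, scope and field names vary by provider.
import requests

# Step 1: exchange client credentials for a short-lived access token.
token_resp = requests.post(
    "https://auth.example.com/oauth2/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "YOUR_CLIENT_ID",
        "client_secret": "YOUR_CLIENT_SECRET",
        "scope": "invoices:read",
    },
    timeout=30,
)
token_resp.raise_for_status()
access_token = token_resp.json()["access_token"]

# Step 2: call the API with the bearer token.
resp = requests.get(
    "https://api.example.com/v1/invoices",
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```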

Comparison of Authentication types

| Authentication Method | Pros | Cons |
| --- | --- | --- |
| API Key | Simple to implement and use. Minimal overhead, fast performance. Compatible with most HTTP clients and servers. | Less secure; keys can be easily shared or exposed. Difficult to provide fine-grained access control and permissions. Managing and revoking API keys can be cumbersome. |
| Basic Authentication | Simple to implement and understand. Supported by most HTTP clients and servers. | Transmits credentials (username and password) in Base64 encoding, which is not secure unless over HTTPS. Does not provide a way to handle token expiration or revocation. Not suitable for complex, multi-user systems. |
| OAuth2 | Highly secure; supports granular access control with scopes. Provides token expiration and revocation mechanisms. Suitable for third-party integrations and delegated access. | Complex to implement and requires more setup. Additional overhead due to token exchange process. Requires client and server to manage tokens and refresh tokens. |

Which Auth method should I use?

Depending upon your application, one or more auth methods might be allowed. Our preference, in decreasing order of security, is:

  • OAuth2
  • Basic Auth
  • API Key

Custom Authentication

We have often seen applications implement their own authentication methods, or be only partially compliant with the OAuth2 spec. With custom auth, users need to be clear about the security impact and the best practices for such methods.

Overview

In this article we look at all the stages of the data journey from raw input data to analysis and output. We will briefly explore the process flows at each stage.

We classify the data journey from source application to reporting output into the following stages:

  1. Data Ingestion - Get raw data from the source application
  2. Data Transformation - Apply business logic / rules to convert the data into a form that supports business analysis
  3. Data Analysis / Reporting - Analyze and present the transformed data in a manner that it can be consumed by the end-user

Integration options

Source - https://a16z.com/emerging-architectures-for-modern-data-infrastructure/

Data Ingestion

This stage deals with reading data from source applications and moving it to a central storage location. For example, ingest accounting data from NetSuite and store it in AWS S3. While doing this, DataStori ensures that:

  1. Raw data is available in a consistent and reliable manner.
  2. A copy of deduped data is made in addition to the raw data.
  3. Input data meets basic quality parameters, e.g., no null records, fresh data check, primary key check.
  4. The data to be ingested is standardized and documented.

The above steps ensure input data quality and freshness and do not involve business logic. This stage is implemented by DataStori's data engineering team.
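As a sketch, the basic quality checks listed above could look like this in Python; the column names and the 24-hour freshness window are assumptions for illustration.

```python
# Illustrative only: column names and freshness window are assumptions.
import pandas as pd

df = pd.read_parquet("silver/orders.parquet")

# No null records in required columns.
assert df["order_id"].notna().all(), "null order_id found"

# Fresh data check: latest record is no older than 24 hours.
latest = pd.to_datetime(df["ingested_at"]).max()
assert latest >= pd.Timestamp.now() - pd.Timedelta(hours=24), "stale data"

# Primary key check: order_id must be unique.
assert df["order_id"].is_unique, "duplicate order_id found"
```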

Popular data connectors include DataStori, Fivetran and Airbyte.

Data Transformation

In this stage we transform the ingested raw data into business data. We focus on:

  1. Transforming the input data into the form required by the business.
  2. Defining relationships between the entities in the data warehouse.
  3. Writing data checks on business KPIs. E.g. sales this month > $10,000 OR NumofJobPerTechnician > 3
  4. Documenting the business-transformed data and writing it to a warehouse table for further consumption.
info

While running data transformations, it is essential to ensure that the definitions of business KPIs are consistent across pipelines. For example, we may write one pipeline for sales by month and another for sales per employee per month. It is important that we keep the definition of sales consistent across the two transformation phases.

This stage is jointly implemented by the DataStori engineering team (which provides the platform to run transformations) and the client's business analysts (who define the business rules and transformations).

Popular data transformation tools include DBT and Apache Spark.
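As a rough sketch of this stage in Python, a single shared definition of "sales" feeds two pipelines so the KPI stays consistent, and a business rule guards the output; the table, column names and the $10,000 rule are assumptions.

```python
# Illustrative only: table, column names and the $10,000 rule are assumptions.
import pandas as pd

def sales_records(df: pd.DataFrame) -> pd.DataFrame:
    """Single shared definition of 'sales': completed orders only."""
    return df[df["status"] == "completed"]

orders = pd.read_parquet("silver/orders.parquet")
completed = sales_records(orders).assign(
    month=lambda d: pd.to_datetime(d["order_date"]).dt.to_period("M")
)

# Pipeline 1: sales by month.
sales_by_month = completed.groupby("month")["amount"].sum()

# Pipeline 2: sales per employee per month (same 'sales' definition).
sales_by_employee = completed.groupby(["employee_id", "month"])["amount"].sum()

# Business-rule check on a KPI before writing the Gold table.
assert sales_by_month.iloc[-1] > 10_000, "sales this month below $10,000"

sales_by_month.to_csv("gold/sales_by_month.csv")
```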

Data Analysis / Reporting

The transformed data is finally used for business analysis and reporting. This stage involves presenting the business-transformed data to different end-users in forms consumable by them. For example, financial reports for management often vary from the ones required for the operations team, though the underlying data for both reports is the same.

This stage is implemented by BI developers and consultants in the client organization.

Popular analytics and reporting tools include Power BI, Tableau, R and Python.
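As a small illustration in Python (one of the tools listed above), the same Gold-layer sales data can be shaped differently for two audiences; the file and column names follow the earlier sketches and are assumptions.

```python
# Illustrative only: file and column names follow the earlier sketches.
import pandas as pd

sales = pd.read_csv("gold/sales_by_month.csv")  # columns: month, amount

# Management view: quarterly totals.
sales["quarter"] = pd.PeriodIndex(sales["month"], freq="M").asfreq("Q")
management_report = sales.groupby("quarter")["amount"].sum()

# Operations view: month-over-month change, from the same underlying data.
operations_report = sales.set_index("month")["amount"].pct_change()

management_report.to_csv("reports/quarterly_sales.csv")
operations_report.to_csv("reports/monthly_sales_change.csv")
```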

Supporting Stages/Tools

To ensure that data processes including the above three stages run in a consistent, predictable and manageable manner, DataStori uses numerous supporting technologies and stages. In this section we briefly describe the key tools and the process stages in which they are used.

Workflow managers

Workflow managers are critical to managing complex data pipelines. Their primary functions include scheduling, automation, task dependency management, monitoring and logging, scalability, error handling and retries.

Popular workflow management tools include Airflow and Prefect.
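As a sketch, an Airflow DAG wiring the three stages together with a daily schedule and retries might look like this; the task bodies and DAG id are placeholders.

```python
# Illustrative only: DAG id, schedule and task bodies are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest(): ...      # pull raw data into the Bronze layer
def transform(): ...   # build Silver and Gold tables
def publish(): ...     # refresh reports / notify consumers

with DAG(
    dag_id="datastori_daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
    catchup=False,
) as dag:
    t_ingest = PythonOperator(task_id="ingest", python_callable=ingest)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_publish = PythonOperator(task_id="publish", python_callable=publish)

    # Task dependencies: ingest, then transform, then publish.
    t_ingest >> t_transform >> t_publish
```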

Data quality checks

To make sure our data pipelines are running and fetching data as expected, we can add tooling to run data quality checks as data is fetched and transformed.

Popular data quality check tools include Great Expectations and AccelData.

Data discovery

As the data and the number of people working on it grow, keeping track of all the data assets becomes an issue. To keep our datasets fresh and documented, we can add tools that help with data discovery and organization.

Popular tools to aid discovery include Amundsen and Atlan.

Data governance

With increasing data volume and consumption, we may need to define access controls on the data assets. To help us with this, we can add tools around governance.

info

Based on your organization's maturity and needs, you can decide which of the supporting stages and tools to add.