Bundling Vs. UnBundling: The Tale of Disjointed Lineage and Grieving Data Quality
Questioning the purpose of an orchestration engine
In the first part (Bundling Vs. UnBundling: The Tale of Airflow Operator and dbt Ref), we established a case for the importance of having a unified DAG view of the data ecosystem.
Now, what is the consequence of not having a unified view? Let me walk you through the data lifecycle journey.
Data Product == Microservices: Why is this hard?
There have been a few recent comparisons between developing data products and developing microservices. The data contract approach is widely discussed and getting adopted. A data contract typically establishes an agreement between the data producer and the downstream consumers. It looks a lot like the microservices approach. Well, not entirely.
In microservices, the contract is agreed upon through a gRPC or REST interface. The producer takes known input and produces contractually agreed output for the consumer. The QoS (Quality of Service) is bound to the service response, and all the consumers have the same view of QoS. By analogy, we can say a table or a metric is the equivalent of a microservice.
However, a downstream consumer joins multiple tables in the data pipeline to produce a derived table (a fact or a metric). The consumer views data quality through the lens of an aggregated view across multiple tables. A consumer’s definition of data quality often reflects the consumer’s domain, and each consumer will have their own definition of data quality. There is a 1:N relationship between producers and consumers.
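To make that 1:N relationship concrete, here is a minimal sketch of what a data contract might look like in Python. All the names here (ColumnSpec, DataContract, the orders table, the consumer expectations) are hypothetical; the point is that the producer publishes one contract while each consumer layers its own quality definition on top.

```python
from dataclasses import dataclass, field

@dataclass
class ColumnSpec:
    name: str
    dtype: str
    nullable: bool = False

@dataclass
class DataContract:
    table: str
    owner: str                        # the producing team
    columns: list[ColumnSpec] = field(default_factory=list)
    freshness_sla_hours: int = 24     # producer-side QoS

# The producer publishes one contract...
orders_contract = DataContract(
    table="fact_orders",
    owner="checkout-team",
    columns=[ColumnSpec("order_id", "string"), ColumnSpec("amount", "decimal")],
)

# ...but each of the N consumers layers its own quality definition on top.
finance_expectations = {"freshness_hours": 4, "amount_non_negative": True}
marketing_expectations = {"freshness_hours": 24}
```

Finance demands 4-hour freshness and non-negative amounts; marketing is happy with a daily load.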
A table or metric may be of sufficient quality for one consumer but not for another.
Worse, because data computation is pipelined in nature, data quality issues may not be visible to the immediate downstream consumer.
Hence the question: Who owns the data quality? Is it the data producer or the data consumer?
How do we approach data quality now?
We approach data quality with two strategies.
Data testing
Data testing typically involves checking for null values, data distribution, uniqueness, known invariants, etc.
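As a minimal sketch, the same checks can be written as plain Python assertions over a pandas DataFrame. The column names and thresholds here are hypothetical; declarative tools like dbt test or Great Expectations express equivalent checks as configuration.

```python
import pandas as pd

def run_data_tests(df: pd.DataFrame) -> None:
    # Null check: the primary key must always be present.
    assert df["order_id"].notna().all(), "order_id contains nulls"
    # Uniqueness check: order_id is the primary key.
    assert df["order_id"].is_unique, "order_id has duplicates"
    # Known invariant: amounts are never negative.
    assert (df["amount"] >= 0).all(), "negative amounts found"
    # Distribution check: refund rate stays within an expected band.
    refund_rate = (df["status"] == "refunded").mean()
    assert refund_rate < 0.05, f"refund rate {refund_rate:.1%} out of bounds"

run_data_tests(pd.DataFrame({
    "order_id": ["a1", "a2"],
    "amount": [10.0, 25.5],
    "status": ["paid", "paid"],
}))
```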
Data Observability
We can’t predict all the failure modes, and data downtime happens all the time. Data observability is an end-to-end strategy to detect these unknown failure modes in production.
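As a minimal illustration of one observability signal, here is a hypothetical freshness check; real observability tools track freshness, volume, schema, and distribution continuously and alert on anomalies.

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_loaded_at: datetime, sla: timedelta) -> None:
    # Compare the last successful load against the freshness SLA.
    lag = datetime.now(timezone.utc) - last_loaded_at
    if lag > sla:
        # In practice this would page the owning team, not print.
        print(f"DATA DOWNTIME: table is {lag - sla} past its freshness SLA")

check_freshness(
    last_loaded_at=datetime.now(timezone.utc) - timedelta(hours=30),
    sla=timedelta(hours=24),
)
```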
The data quality methods are beyond the scope of this blog; the blog Data Observability vs. data testing: Everything you need to know is a good overview. dbt test, Great Expectations, and AWS Deequ are popular tools for data testing, and there are pretty solid vendors in the data observability space.
Speed, Volume & Correctness: The Trade-off
Similar to the CAP theorem, there is a trade-off among Speed, Volume & Correctness in a data pipeline. You can get only two out of three in your pipeline system design.
Let us walk through how data quality tools integrate with the data pipeline and the trade-off between speed and correctness. There are broadly two data pipeline patterns that support data testing.
Staging-First approach
In the Staging-First approach, the producer writes the table to a staging environment and runs it through all the data quality checks. If any test fails, and it is a blocking failure, an incident is raised for the data producer to fix.
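A minimal, runnable sketch of the Staging-First flow, using an in-memory dict as a stand-in warehouse and a single hypothetical blocking check:

```python
warehouse = {}

def blocking_quality_check(rows) -> bool:
    # Hypothetical blocking check: every row must carry a primary key.
    return all(r.get("order_id") is not None for r in rows)

def staging_first_publish(rows, table: str) -> None:
    warehouse[f"staging.{table}"] = rows        # 1. land in staging
    if not blocking_quality_check(rows):        # 2. gate on data quality
        # Blocking failure: notify the producer, never touch production.
        raise RuntimeError(f"staging.{table} failed checks; incident raised")
    warehouse[f"prod.{table}"] = rows           # 3. promote to production

staging_first_publish([{"order_id": "a1"}], "fact_orders")
```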
Production-First approach
In the Production-First approach, the producer writes the table to the production environment and notifies all the consumers. The consumers may or may not choose to run data quality verification.
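A minimal, runnable sketch of the Production-First flow under the same in-memory-warehouse assumption; here verification is pushed to optional, hypothetical consumer callbacks:

```python
warehouse = {}
consumers = []   # notification callbacks registered by downstream consumers

def production_first_publish(rows, table: str) -> None:
    warehouse[f"prod.{table}"] = rows           # 1. write straight to prod
    for notify in consumers:                    # 2. notify every consumer;
        notify(table)                           #    verification is optional

def finance_consumer(table: str) -> None:
    # This consumer chooses to verify; others may consume blindly.
    rows = warehouse[f"prod.{table}"]
    if any(r.get("amount", 0) < 0 for r in rows):
        print(f"prod.{table}: failed finance's quality check (after the fact)")

consumers.append(finance_consumer)
production_first_publish([{"order_id": "a1", "amount": -5}], "fact_orders")
```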
In both approaches, the same questions about the data pipeline remain:
Should we block the pipeline for all the consumers to finish their data quality check?
How do we measure the downstream impact of a pipeline failure?
How can we identify which upstream caused the pipeline failure?
How do we recover? What is the scheduling story for backfilling & error correction?
With data observability tooling, the time delta between data production and error detection is high, and detection is often non-blocking.
Without the data orchestration engine having a full lineage view, operating the data pipeline becomes more painful. As the number of data assets grows, the complexity of the data ecosystem grows with it, and the pipeline buckles under its own success.
Data Mutation: The Elephant in the Room that we don’t acknowledge
I highlighted the Airflow fallacies in my talk, Operating data pipeline using Airflow @ Slack. Though they are called Airflow fallacies, they are really “data pipeline fallacies.”
A couple of critical points to highlight:
The upstream task success is reliable.
The task remains static after the success state.
Data mutation is inevitable in a data pipeline. It could come from an incomplete view upstream, a data quality & observability issue in downstream models, or GDPR and other compliance requirements.
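One defensive pattern against mutation (not something Airflow provides out of the box) is to fingerprint each upstream partition and trigger a downstream re-run when the fingerprint changes. A minimal sketch, with all names hypothetical:

```python
import hashlib
import json

seen_fingerprints = {}

def fingerprint(rows) -> str:
    # Content hash of a partition; any mutation changes the hash.
    return hashlib.sha256(json.dumps(rows, sort_keys=True).encode()).hexdigest()

def on_partition_write(partition: str, rows, rerun_downstream) -> None:
    fp = fingerprint(rows)
    if seen_fingerprints.get(partition) not in (None, fp):
        # The "static after success" assumption just broke: backfill downstream.
        rerun_downstream(partition)
    seen_fingerprints[partition] = fp

on_partition_write("2023-01-01", [{"id": 1}], rerun_downstream=print)
on_partition_write("2023-01-01", [{"id": 1, "late_fix": True}], rerun_downstream=print)
```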
The impact of the operator & model as disjoint execution units
With the fragmentation between the task & the model, the orchestration engines don’t have a complete lineage view of the data system. The data lineage is at present hidden in a separate data discovery system. Backfilling & data correction often require jumping around multiple tools, and a custom scheduler is often required to manage data retention and lifecycle. The vital piece for all of this to happen is data lineage.
Data Orchestration, Data Quality (testing, observability) & Data Lineage are all part of a task/model lifecycle. If you take one thing away from this blog, this is it. Establishing trust in data will remain a distant dream without a unified view of the data lifecycle.
Answering Tristan’s questions
In a reflection on the previous post in the Analytics Engineering Roundup, Tristan raised some great questions.
Question:
How much of the problem you’re describing here would go away if dbt Core natively supported non-SQL languages? Could we go back to just having a single DAG (now in dbt) or are there other barriers?
Answer:
If dbt Core gains a single DAG view with a pluggable architecture, the complexity of enterprise asset management will be reduced. It is more of a question to throw back to the dbt community: what does “build” stand for in dbt, now and in the future?
Question:
Do you think the “task execution” and the “model” are actually different? Or can the concept of a dbt model be extended to handle what has historically been written as a task with minimal friction?
Answer:
I think task execution is a subset of model execution. The critical distinction is that a task can produce a null model; for example, a task that sends a notification or calls an external API produces no table at all.
Question:
As we think about how to push dbt in this direction, what do you think should be top of mind for us that we may not be considering?
Answer:
It can potentially introduce complexity to the existing simplicity of the incremental and materialized models. Keeping the simple, opinionated abstraction while transforming dbt into a true “data operating system” will be challenging.
How can we fix it?
Well, let me introduce my startup……… LOL, no. 😄😄😄😄
One of the best things to come out of the Modern Data Stack is the momentum and the number of intelligent people solving data engineering challenges.
We thrive on solving data at an industrial scale, much as the cotton industry did in the industrial revolution. How can we produce consistent data insights, similar to consistent cotton production? That is the true challenge ahead in this decade.
We can take many paths to solve this.
Maybe the orchestration engine can have the end-to-end lineage and manage the data lifecycle.
Maybe the data lineage and discovery system can trigger a task/model. Adding a new model is equivalent to adding a new edge in the lineage graph.
Maybe the observability tool provides a robust feedback loop to manage the data lifecycle.
If the problem statement resonates with you and you are looking for collaboration, I’m happy to lend my time in whatever way possible.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
Very excited to see this discussion happening across a wide breadth of the data community, with everyone bringing their respective lenses. The surface area and complexity of the problem most certainly keep any single data persona from grasping it in its entirety. I share some of my own perspectives here as an attempt to shine more light on the aspects that I have borne witness to.
To mono-DAG or not to mono-DAG?
Managing dependencies in even moderately complex software environments is STILL mostly an unsolved problem. And as we have seen, the data space tends to lag the software space on these fronts. I would argue that data dependencies have additional complexities to overcome. Even if we had Maven-level dataset lineage and declarative dependencies and outputs (think a pom.xml for your DAG nodes), we would still be woefully ill-equipped. I think asking the DAG question is similar to deciding whether or not to mono-repo; there will be zealots on both sides, decisions will be highly context-dependent, and there will be right and wrong ways to do both.
Task + Model
I am _mostly_ in agreement with Maxime's call for Functional Data Engineering (https://maximebeauchemin.medium.com/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a). In this world, tasks (i.e. DAG nodes) are idempotent and the model (i.e. data tables) is a type system for describing inputs and outputs. These analogies are far from perfect and we need to modify and extend them in a number of ways for data paradigms but they are a reasonable starting point. While most functional programming constructs treat functions themselves as first-class citizens that can be provided as inputs to or be output from other functions, I have generally found using tasks as dependencies to other tasks an anti-pattern. This is related to the "upstream task is reliable" fallacy.
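A minimal sketch of this idea (names hypothetical): the task is a pure function of its input partition, and the write is an idempotent partition overwrite, so re-running any day is safe.

```python
warehouse = {}

def daily_revenue_task(ds: str, orders: list) -> None:
    # Pure function of the input partition: no hidden state, no side reads.
    total = sum(o["amount"] for o in orders)
    # INSERT OVERWRITE semantics: re-running replaces, never appends.
    warehouse[("daily_revenue", ds)] = total

# Re-running the same partition yields identical state: idempotent.
daily_revenue_task("2023-01-01", [{"amount": 10}, {"amount": 5}])
daily_revenue_task("2023-01-01", [{"amount": 10}, {"amount": 5}])
assert warehouse[("daily_revenue", "2023-01-01")] == 15
```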
Task Scope + Data Quality
Strongly believe that data testing & observability belong as part of every task (just as unit testing and instrumentation are part of any micro-service). Also strongly believe that tasks should do continuous testing (think CI/CD for your data sets) - good example from Sam Redai here (https://tabular.io/blog/integrated-audits/). On the plus side, data is generally designed to be single-producer, multiple-consumer, which makes side-by-side comparisons of data observability straightforward (I expect to see this capability baked in and automated in orchestration tools if it is not already).