Data Engineering Weekly

Bundling Vs. UnBundling: The Tale of Airflow Operator and dbt Ref
A throwback lesson on the history of data orchestration engines

Ananth Packkildurai
Feb 25, 2022

I started working on data pipelines in the early days of Hadoop/Big Data, when Big Data was a buzzword. Apache Oozie (does anyone remember Oozie?) was the go-to tool for orchestrating data pipelines, where you had to hand-code workflows in an XML file (not surprisingly, the file was named workflow.xml).

Apache Airflow improved the data pipeline massively. The ability to programmatically author a data pipeline, the concept of DAGs, and, on top of that, a genuinely usable UI for the first time made it an order of magnitude better than the previous tools.

Airflow - The Task execution

Airflow’s initial release was in June 2015, six months after Snowflake launched its commercial offering. Cloud data warehouses had barely started to take off at that point. The orchestration engine mainly focused on running Hadoop MapReduce, Pig, Crunch & Hive jobs. We saw fragmented big data processing frameworks at that point, before Apache Spark unified the SQL and fluent data processing models.

Airflow adopted the task (the Airflow operator) as its functional computing unit. Airflow provided sensors and task dependencies to capture the dependencies between tasks. The Airflow DAG model is an excellent abstraction for defining task dependencies; however, the visibility of task lineage is limited to a single DAG. If we wanted to run a backfill, we had to develop a dedicated tool to understand the overall dependency.
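To make the task-dependency idea concrete, here is a minimal sketch (not Airflow itself; the task names and dependencies are hypothetical) of what a scheduler or backfill fundamentally does with operator dependencies: resolve them into an execution order where every task runs only after its upstream tasks. Python's standard-library graphlib is enough to illustrate it:

```python
from graphlib import TopologicalSorter

# Hypothetical task dependencies, in the spirit of an Airflow DAG:
# each task (operator) maps to the upstream tasks it waits on.
dag = {
    "extract_events": [],
    "load_to_hive": ["extract_events"],
    "build_user_table": ["load_to_hive"],
    "build_revenue_table": ["load_to_hive"],
    "publish_dashboard": ["build_user_table", "build_revenue_table"],
}

# A scheduler (or a backfill) must run tasks in topological order so
# that a task starts only after all of its upstream tasks finish.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Note that this order is only visible because every task lives in one graph; once the pipeline is split across many DAGs, reconstructing it requires extra tooling, which is exactly the backfilling pain described above.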

A few companies adopted the “One DAG” model to bundle all the tasks into one pipeline; at the Slack data team, we built a DAG parser to construct a unified DAG view [see my talk: Operating Data Pipeline using Airflow @ Slack - 2018].

The challenges with the Task execution

As the raw data was transformed into more structured tables, we saw a pattern where Hive SQL was adopted as the de-facto standard for building data models. SQL won the data processing abstraction. Task dependencies proved hard to manage, since the unit of transformation logic is expressed as a model in SQL, and finding the task dependency for each Hive table is a tedious job. Airflow provides HivePartitionSensor & SQL sensors, yet they don’t solve the underlying problem of task dependencies for a model framework.

dbt - The model execution

By 2019, cloud data warehouses were widely adopted. Task execution did not translate well to cloud data warehouse systems, and enter dbt. The most important function in dbt is ref(). dbt uses these references between models to build the dependency graph automatically.
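As a rough illustration of the idea (a toy sketch, not dbt's actual parser; the model names and SQL are hypothetical), the ref() calls inside each model's SQL can be scanned to recover the upstream models, which is what lets the dependency graph fall out of the models themselves rather than being hand-wired as tasks:

```python
import re

# Hypothetical dbt-style models: each model is a SQL select, and
# ref('other_model') declares a dependency on another model.
models = {
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
    "orders_enriched": """
        select o.*, c.region
        from {{ ref('stg_orders') }} o
        join {{ ref('stg_customers') }} c on o.customer_id = c.id
    """,
}

REF_PATTERN = re.compile(r"ref\(\s*'([^']+)'\s*\)")

# Scan each model's SQL for ref() calls; the matches are that
# model's upstream dependencies, giving a graph for free.
deps = {name: REF_PATTERN.findall(sql) for name, sql in models.items()}
print(deps["orders_enriched"])  # ['stg_orders', 'stg_customers']
```

The contrast with the task model is the point: the dependency is declared where the transformation lives, so nobody has to maintain a separate map of which task feeds which table.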

The reference model dramatically simplifies the data pipeline, and dbt became the de-facto tool for the data pipeline. It is important to note that dbt also supported other popular features like Jinja templates and dbt documentation. I captured my thoughts on dbt here.

Ananth Packkildurai @ananthdurai
Here are the Top 5 reasons why I consider @getdbt groundbreaking. 1. The DBT data model creates a logical separation of tables or the metadata engines like Hive meta store from the data pipeline. @ApacheAirflow tried with the tasks & operators, but not wholly successful.
5:48 AM ∙ Feb 24, 2020

The impact of separate execution units

The task & model execution units result from the diverse data processing frameworks and the need to support various business use cases. SQL is well suited for data analytics/BI, whereas ML & raw data processing use more of a fluent interface.

As a consequence of the separate execution units, neither the task orchestration engine nor the model execution engine can have end-to-end data lineage for an organization. This paved the path for specialized data lineage/metadata systems.

Data Engineering Weekly wrote a special edition capturing the history of metadata systems.

Data Engineering Weekly
Data Engineering Weekly #21: Metadata Edition
Welcome to the 21st edition of the data engineering newsletter. The 21st edition of the newsletter focuses on the recent breakthroughs in metadata management. I believe the next big set of challenges in data engineering is all about efficient data management…

The handshake between the task unit & the model unit became increasingly complicated. We started to see the rise of data observability tools emphasizing data contracts and data certifications for defining specific data quality standards and SLAs.

The beginning of unbundling the data platform

The unbundling of the data platform is a consequence of a separate execution model. The orchestration engines have less and less context on the org-wide data management state. 

Nick Schrock @schrockn
6/ Teams use orchestrators but it knows less and less about the inner workings of the data platform. Therefore the platform has operational silos and invasive, heavyweight integrations for basic features. @ananthdurai put this starkly:

Ananth Packkildurai @ananthdurai
@sarahmk125 MDS is a set of vendor tools that solve niche data problems (lineage, orchestration, quality) with the side effect of creating a disjointed data workflow that makes data folks lives more complicated.
8:21 PM ∙ Feb 17, 2022

At the same time, we should acknowledge that data orchestration, data quality, lineage, and model management are significant problems on their own. Each tool tries to solve a specific problem; however, from an overall architecture perspective, the result is a duct-tape system. It keeps the dream of data as an asset an aspirational/unrealistic goal for many companies.

 

Ananth Packkildurai @ananthdurai
As the size of the organization grows, the data maturity shrink. The complexity of the data outgrown the usability the data. Have anyone seen this pattern? Curious to know data folks' thoughts on it.
6:52 PM ∙ Feb 10, 2022

Bundling: Data Operating System

The data community often compares the modern tech stack with the Unix philosophy. However, we are missing an operating system for the data. We need to merge both the model and task execution units into one unit. Otherwise, any abstraction we build without that unification will further amplify the disorganization of the data. Data as an asset will remain an aspirational goal.


You may ask: what other organizational chaos have I seen in my experience? Let me answer that in my next blog post by answering a simple question.

Who owns data quality? Let’s find out in the next part.


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
