Bundling Vs. UnBundling: The Tale of Airflow Operator and dbt Ref

A throwback lesson on the history of data orchestration engines

Feb 25, 2022

I started working on data pipelines in the early days of Hadoop, when Big Data was a buzzword. Apache Oozie (does anyone remember Oozie?) was the go-to tool for orchestrating data pipelines, and you had to hand-code each workflow in an XML file (not surprisingly, the file was named workflow.xml).

Apache Airflow improved data pipelines massively. The ability to author a data pipeline programmatically, the concept of DAGs, and, for the first time, a fantastic, usable UI made it an order of magnitude better than the previous tools.

Airflow - The Task execution

Airflow’s initial release was in June 2015, six months after Snowflake launched its commercial offering. Cloud data warehouses had barely started to take off at that point. The orchestration engine mainly focused on running Hadoop MapReduce, Pig, Crunch & Hive jobs. Big data processing frameworks were fragmented at that point, before Apache Spark unified SQL and the fluent data processing model.

Airflow adopted the task (the Airflow operator) as its functional computing unit and provided sensors and task dependencies to capture the relationships between tasks. The Airflow DAG model is an excellent abstraction for defining task dependencies; however, the visibility of task lineage is limited to a single DAG. To run a backfill, we had to develop a dedicated tool to understand the overall dependency graph.

A few companies adopted the “One DAG” model to bundle all the tasks into one pipeline, while on the Slack data team, we built a DAG parser to construct a unified DAG view [see my talk: Operating Data Pipeline using Airflow @ Slack - 2018].
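To make the idea of a unified DAG view concrete, here is a minimal sketch in plain Python. It is not the parser we built at Slack and it does not use Airflow's API; the DAG names, task names, and the `sensor_links` mapping are all hypothetical. It only shows the general approach: merge the per-DAG graphs, connect sensors to the tasks they wait on, and derive one global order that a backfill could follow.

```python
from graphlib import TopologicalSorter

# Hypothetical per-DAG task graphs: task -> set of upstream tasks in the same DAG.
dags = {
    "ingest": {"load_events": set(), "clean_events": {"load_events"}},
    "reporting": {
        "wait_for_clean_events": set(),
        "daily_report": {"wait_for_clean_events"},
    },
}

# Hypothetical sensor links: (dag, sensor task) -> (dag, task) the sensor waits on.
sensor_links = {
    ("reporting", "wait_for_clean_events"): ("ingest", "clean_events"),
}


def unified_view(dags, sensor_links):
    """Merge per-DAG graphs into a single graph keyed by 'dag.task'."""
    graph = {}
    for dag_id, tasks in dags.items():
        for task, upstream in tasks.items():
            graph[f"{dag_id}.{task}"] = {f"{dag_id}.{u}" for u in upstream}
    # A sensor is the only place where a cross-DAG dependency is visible.
    for (dag_id, sensor), (up_dag, up_task) in sensor_links.items():
        graph[f"{dag_id}.{sensor}"].add(f"{up_dag}.{up_task}")
    return graph


def backfill_order(graph):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


graph = unified_view(dags, sensor_links)
order = backfill_order(graph)
```

Without the sensor link, the two DAGs look unrelated, which is exactly why lineage stops at the DAG boundary.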

The challenges with the Task execution

As raw data was transformed into more structured tables, we saw a pattern where Hive SQL was adopted as the de facto standard for building data models. SQL won the data processing abstraction. Task dependencies proved hard to manage because the unit of transformation logic is a model expressed in SQL, and finding the task dependencies for each Hive table is a tedious job. Airflow provides HivePartitionSensor & SQL Sensors, yet they don’t solve the underlying problem of task dependency for a model framework.

dbt - The model execution

By 2019, cloud data warehouses were widely adopted. The task execution model did not adapt well to cloud data warehouse systems, and dbt entered the picture. The most important function in dbt is ref(). dbt uses these references between models to build the dependency graph automatically.
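The following sketch illustrates how references inside the SQL itself can yield a dependency graph with no hand-written wiring. It is a toy: the model names and SQL are hypothetical, and the regular expression is a simplification for illustration, not how dbt actually parses or renders models.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical dbt-style models: model name -> SQL text.
models = {
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
    "customer_orders": """
        select c.id, count(o.id) as order_count
        from {{ ref('stg_customers') }} c
        left join {{ ref('stg_orders') }} o on o.customer_id = c.id
        group by c.id
    """,
}

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def build_graph(models):
    """Map each model to the set of models it references."""
    return {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}


def run_order(models):
    """Return an execution order in which upstream models run first."""
    return list(TopologicalSorter(build_graph(models)).static_order())


graph = build_graph(models)
order = run_order(models)
```

The author of `customer_orders` never declares a dependency; writing the query is enough, which is the contrast with task-level wiring.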

The reference model dramatically simplified the data pipeline, and dbt became the de facto tool for building it. It is important to note that dbt also supports other popular features, such as Jinja templating and generated documentation. I captured my thoughts on dbt here.

Ananth Packkildurai @ananthdurai

Here are the Top 5 reasons why I consider @getdbt groundbreaking. 1. The DBT data model creates a logical separation of tables or the metadata engines like Hive meta store from the data pipeline. @ApacheAirflow tried with the tasks & operators, but not wholly successful.

The impact of separate execution units

The task and model execution units result from the diversity of data processing frameworks and the need to support various business use cases. SQL is well suited to data analytics/BI, whereas ML and raw data processing rely more on a fluent interface.

As a consequence of the separate execution units, neither the task orchestration engine nor the model execution engine can have end-to-end data lineage for an organization. This paved the way for specialized data lineage/metadata systems.
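A small sketch shows why the lineage has to be stitched together outside both engines. The metadata below is hypothetical: `task_outputs` stands for what an orchestrator might know (which task writes which table) and `model_inputs` for what a model engine might know (which model reads which table or model). Only by joining the two on table names can we ask what a failing task affects.

```python
from collections import deque

# Hypothetical orchestrator metadata: task -> table it writes.
task_outputs = {
    "load_orders": "raw.orders",
    "load_customers": "raw.customers",
}

# Hypothetical model-engine metadata: model -> tables or models it reads.
model_inputs = {
    "stg_orders": {"raw.orders"},
    "customer_orders": {"stg_orders", "raw.customers"},
}


def downstream(start, task_outputs, model_inputs):
    """Return every node reachable from `start` across both systems."""
    # Build upstream -> downstream edges from both metadata sources.
    edges = {}
    for task, table in task_outputs.items():
        edges.setdefault(task, set()).add(table)
    for model, inputs in model_inputs.items():
        for upstream in inputs:
            edges.setdefault(upstream, set()).add(model)

    # Breadth-first walk over the stitched graph.
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


impacted = downstream("load_orders", task_outputs, model_inputs)
```

Here the table name is the only handshake between the two worlds, so the stitched lineage is only as reliable as that naming convention.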

Data Engineering Weekly published a special edition capturing the history of metadata systems.

Data Engineering Weekly

Data Engineering Weekly #21: Metadata Edition

Welcome to the 21st edition of the data engineering newsletter. The 21st edition of the newsletter focuses on the recent breakthroughs in metadata management. I believe the next big set of challenges in data engineering is all about efficient data management…


The handshake between the task unit & model unit became increasingly complicated. We started to see the rise of data observability tools emphasizing data contracts and data certifications to define specific data quality standards and SLAs.

The beginning of unbundling the data platform

The unbundling of the data platform is a consequence of the separate execution models. The orchestration engines have less and less context on the org-wide data management state.

Nick Schrock @schrockn

6/ Teams use orchestrators but it knows less and less about the inner workings of the data platform. Therefore the platform has operational silos and invasive, heavyweight integrations for basic features. @ananthdurai put this starkly:

Ananth Packkildurai @ananthdurai

@sarahmk125 MDS is a set of vendor tools that solve niche data problems (lineage, orchestration, quality) with the side effect of creating a disjointed data workflow that makes data folks lives more complicated.

At the same time, we should also acknowledge that data orchestration, data quality, lineage, and model management are significant problems on their own. The individual tools are trying to solve specific problems; however, viewed from an overall architecture perspective, the result is a duct-taped system. That leaves the dream of data as an asset an aspirational, even unrealistic, goal for many companies.

Ananth Packkildurai @ananthdurai

As the size of the organization grows, the data maturity shrink. The complexity of the data outgrown the usability the data. Have anyone seen this pattern? Curious to know data folks' thoughts on it.

Bundling: Data Operating System

The data community often compares the modern tech stack with the Unix philosophy. However, we are missing the operating system for data. We need to merge the model and task execution units into one unit. Otherwise, any abstraction we build without that unification will further amplify the disorganization of the data, and data as an asset will remain an aspirational goal.


You may ask: what other organizational chaos have I seen in my experience? Let me answer that in my next blog by answering a simple question.

Who owns data quality? Let’s find out in the next part.

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
