Bundling Vs. UnBundling: The Tale of Airflow Operator and dbt Ref
A throwback lesson on the history of data orchestration engines
I started working on data pipelines in the early days of Hadoop, when Big Data was still a buzzword. Apache Oozie (does anyone remember Oozie?) was the go-to tool to orchestrate the data pipeline, where you had to hand-code the workflow in an XML file (not surprisingly, the file was named workflow.xml).
Apache Airflow improved the data pipeline massively. The ability to programmatically author a data pipeline, the concept of DAGs, and, on top of that, a genuinely usable UI for the first time was an order of magnitude better than the previous tools.
Airflow - The Task execution
Airflow’s initial release was in June 2015, six months after Snowflake launched its commercial offering. Cloud data warehouses had barely started to take off at that point. The orchestration engine mainly focused on running Hadoop MapReduce, Pig, Crunch & Hive jobs. Big data processing frameworks were fragmented at that time, before Apache Spark unified the SQL and fluent data processing models.
Airflow adopted the task (the Airflow operator) as its functional computing unit and provided sensors and task dependencies to capture how tasks relate. The Airflow DAG model is an excellent abstraction for defining task dependencies; however, the visibility of task lineage is limited to a single DAG. If you wanted to backfill, you had to develop a dedicated tool to understand the overall dependency graph.
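The operator-and-dependency model can be sketched in plain Python. This is a simplified illustration with hypothetical task names, not the actual Airflow API: it shows how a scheduler resolves an execution order from declared upstream dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical task names standing in for Airflow operators.
# Each task lists the upstream tasks it depends on, mirroring
# the dependencies you would declare with `upstream >> downstream`.
dag = {
    "extract_events": [],
    "load_to_hdfs": ["extract_events"],
    "hive_aggregate": ["load_to_hdfs"],
    "publish_report": ["hive_aggregate"],
}

# Resolve an execution order that respects every dependency,
# which is what a scheduler does before running the DAG.
sorter = TopologicalSorter({task: set(deps) for task, deps in dag.items()})
order = list(sorter.static_order())
print(order)
# → ['extract_events', 'load_to_hdfs', 'hive_aggregate', 'publish_report']
```

Note that the dependency graph here is hand-written per task, which is exactly the property that becomes painful at scale.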
A few companies adopted the “One DAG” model to bundle all the tasks into one pipeline; at the Slack data team, we built a DAG parser to construct a unified DAG view [see my talk: Operating Data Pipeline using Airflow @ Slack - 2018].
The challenges with the Task execution
As raw data was transformed into more structured tables, a pattern emerged where Hive SQL was adopted as the de-facto standard to build the data model. SQL won the data processing abstraction. Task dependencies proved hard to manage since the unit of transformation logic is expressed as a model in SQL, and finding the task dependency for each Hive table is a tedious job. Airflow provides the HivePartitionSensor and SQL sensors, yet they don’t solve the underlying problem of task dependencies for a model framework.
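A sensor is essentially a task that blocks until an external condition holds. Here is a minimal sketch of that poke-until-ready pattern with hypothetical names and a simulated check; it is not the real HivePartitionSensor, which queries the Hive metastore:

```python
import time

def wait_for_partition(partition_exists, poke_interval=0.01, timeout=1.0):
    """Poll a condition the way an Airflow sensor 'pokes' for a Hive
    partition: succeed when the condition holds, fail on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if partition_exists():
            return True
        time.sleep(poke_interval)
    raise TimeoutError("partition never appeared")

# Simulate a partition that only 'lands' on the third poke.
state = {"pokes": 0}
def fake_check():
    state["pokes"] += 1
    return state["pokes"] >= 3

print(wait_for_partition(fake_check))  # → True
```

The sensor tells you *when* a table is ready, but it still leaves you to wire up *which* tables each transformation depends on by hand.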
dbt - The model execution
By 2019, cloud data warehouses were widely adopted. The task execution model did not translate well to cloud data warehouse systems, and dbt entered the picture. The most important function in dbt is ref(). dbt uses these references between models to build the dependency graph automatically.
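The way ref() lets dbt infer the graph can be illustrated with a toy parser. The model SQL below is hypothetical and the regex is a deliberate simplification of dbt's actual Jinja compilation, but it captures the key idea: the dependency graph falls out of the models themselves instead of being declared separately.

```python
import re

# Hypothetical dbt-style models: each SELECT references upstream
# models via {{ ref('...') }}, so dependencies live inside the SQL.
models = {
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
    "orders_enriched": (
        "select * from {{ ref('stg_orders') }} o "
        "join {{ ref('stg_customers') }} c on o.customer_id = c.id"
    ),
}

REF_PATTERN = re.compile(r"ref\('([^']+)'\)")

# Derive the dependency graph by scanning each model's SQL for ref() calls,
# instead of asking the author to declare task dependencies by hand.
graph = {name: REF_PATTERN.findall(sql) for name, sql in models.items()}
print(graph["orders_enriched"])  # → ['stg_orders', 'stg_customers']
```

Contrast this with the Airflow operator model, where the same graph has to be maintained separately from the transformation logic.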
The reference model dramatically simplifies the data pipeline, and dbt became the de-facto tool for the data pipeline. It is worth noting that dbt’s adoption was also driven by other popular features, such as Jinja templating and auto-generated documentation. I captured my thoughts on dbt here.
The impact of separate execution units
The task and model execution units are a result of diverse data processing frameworks and the need to support various business use cases. SQL is well suited for data analytics/BI, whereas ML and raw data processing use more of a fluent interface.
As a consequence of the separate execution units, neither the task orchestration engine nor the model execution engine can capture an organization’s end-to-end data lineage. It paved the path for specialized data lineage/metadata systems.
Data Engineering Weekly wrote a special edition capturing the history of metadata systems.
The handshake between the task unit and the model unit increasingly became complicated. We started to see the rise of data observability tools emphasizing data contracts and data certifications for defining specific data quality standards and SLAs.
The beginning of unbundling the data platform
The unbundling of the data platform is a consequence of a separate execution model. The orchestration engines have less and less context on the org-wide data management state.
At the same time, we should acknowledge that data orchestration, data quality, lineage, and model management are significant problems in their own right. The individual tools are trying to solve specific problems; however, from an overall architecture perspective, the result is a duct-taped system. It leaves the dream of data as an asset an aspirational, unrealistic goal for many companies.
Bundling: Data Operating System
The data community often compares the modern data stack with the Unix philosophy. However, we are missing the operating system for data. We need to merge the model and task execution units into one. Otherwise, any abstraction we build without this unification will further amplify the disorganization of the data. Data as an asset will remain an aspirational goal.
You may ask: what other organizational chaos have I seen in my experience? Let me answer that in my next blog post by answering a simple question.
Who owns data quality? Let’s find out in the next part.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.