Data Engineering Weekly #75
Weekly Data Engineering Newsletter
Data Council - Austin 2022
Data Council published the Austin 2022 schedule. Data Engineering Weekly readers can get a 20% discount using promo code DataWeekly20. I’m excited to attend Data Council in person and hope to meet you all.
Dagster: Bundling Vs UnBundling the Data Platform
It is an exciting week in data land. Gorkem Yurtseven started the conversation with an excellent write-up, The Unbundling of Airflow. The blog ends with:
A diverse set of tools is unbundling Airflow, and this diversity is causing substantial fragmentation in the modern data stack. Like everyone else, I also predict some consolidation of these tools in the coming years.
Dagster writes about its mission of Rebundling the Data Platform:
Having this many tools without a coherent, centralized control plane is lunacy and a terrible end state for data practitioners and their stakeholders.
I have a few thoughts to share on this, which I plan to write a blog about. I do believe these tools will consolidate soon. dbt is iterating on its metadata management layer (Leveraging dbt metadata in data management), and Atlan is stepping into the data exploration and visualization space.
I often hear the modern data stack compared with the Unix philosophy. Who will be the Unix terminal of the modern data stack? That is the billion-dollar question! I believe the race has already started, which is excellent news for data practitioners.
Prefect: Logs, the Prefect Way
Observability into the orchestration engine is vital for operating data pipelines reliably. Prefect writes about Orion logging, a Pythonic logging system designed to maximize observability with a minimum of effort.
Pinterest: Spinner - Pinterest’s Workflow Platform
Staying with workflow orchestration, Pinterest writes about migrating from its internal workflow orchestration engine, Pinball, to Apache Airflow.
Apache Arrow: Introducing Apache Arrow Flight SQL - Accelerating Database Access
Apache Arrow introduced Flight SQL, a new client-server protocol developed by the Apache Arrow community for interacting with SQL databases.
Kevin Kho: Introducing Fugue — Reducing PySpark Developer Friction
Kevin Kho writes about Fugue, an open-source abstraction layer that provides a seamless transition from a single machine to a distributed computing setting. The article describes the inconsistencies between Pandas and PySpark and how Fugue can help bridge the gap and increase developer productivity.
Mikkel Dengsøe: Data, engineers, and designers - How the US compares to Europe
The median data to engineers ratio for the US companies I looked at is 1:7 compared to 1:4 for the European companies. And the design to engineers ratio is 1:9 for both groups.
There are many surprises in this study. Analytics companies are, in fact, not that analytical! Are US companies automating more data engineering functions, or are EU companies leaping ahead in adopting data practitioners? I tend to believe the latter, but I am curious to know the operating principles of these companies.
Zach Quinn: Why Data Engineers Must Have Domain Knowledge — And How To Gain It
I have switched between backend engineering and data engineering in my career. What excites me about data engineering is the uniqueness of thinking from the business perspective about which data points are required to run a business.
Even a simple question like "What is the count of unique users?" can reveal how the business operates.
The author writes an exciting article narrating why data engineers must have domain knowledge and an approach to acquire it.
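To illustrate why the unique-users question depends on domain knowledge, here is a minimal sketch. The event log, the field layout, and the three candidate definitions are all hypothetical and are not taken from the article; the point is only that each definition gives a different answer on the same data.

```python
from datetime import date

# Hypothetical event log: (user_id, device_id, event_date).
# user_id is None for an anonymous visitor. Illustrative data only.
events = [
    ("u1", "d1", date(2022, 2, 1)),
    ("u1", "d2", date(2022, 2, 1)),
    (None, "d3", date(2022, 2, 1)),
    ("u2", "d4", date(2022, 2, 2)),
    ("u1", "d1", date(2022, 2, 2)),
]


def unique_logged_in_users(rows):
    """Definition A: distinct known user ids, ignoring anonymous traffic."""
    return len({user for user, _, _ in rows if user is not None})


def unique_devices(rows):
    """Definition B: distinct devices, which also counts anonymous visitors."""
    return len({device for _, device, _ in rows})


def daily_unique_users(rows):
    """Definition C: distinct known users per day (daily actives)."""
    per_day = {}
    for user, _, day in rows:
        if user is not None:
            per_day.setdefault(day, set()).add(user)
    return {day: len(users) for day, users in per_day.items()}


if __name__ == "__main__":
    print("logged-in users:", unique_logged_in_users(events))
    print("devices:", unique_devices(events))
    print("daily uniques:", daily_unique_users(events))
```

Summing the daily counts does not give the overall count, because the same user can be active on several days. Which definition is "right" depends on how the business thinks about a user.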
Salesforce: Embracing Mutable Big Data
Salesforce writes about the importance of mutability in the Big Data ecosystem and gives an overview of its Activity Platform. We have seen the rise of LakeHouse table formats like Apache Hudi, Iceberg, and Delta Lake, which support mutability of the data. Modern OLAP engines like Apache Pinot also embrace mutability.
Mutability is not bad as long as the transaction is limited to a bounded context, and it's time to embrace it.
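As a rough sketch of what a mutation limited to a bounded context can look like, the example below applies versioned upserts where each write touches exactly one record key. The `Activity` and `ActivityStore` names and the version rule are assumptions made for illustration; they do not describe Salesforce's Activity Platform or the API of any of the engines mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    # Hypothetical record shape, for illustration only.
    activity_id: str
    status: str
    version: int


class ActivityStore:
    """In-memory stand-in for a mutable table keyed by activity_id."""

    def __init__(self):
        self._rows = {}

    def upsert(self, activity):
        """Insert or update a single key; stale versions are ignored.

        The write never reaches beyond its own key, which keeps the
        mutation inside one bounded context.
        """
        current = self._rows.get(activity.activity_id)
        if current is None or activity.version > current.version:
            self._rows[activity.activity_id] = activity
            return True
        return False

    def get(self, activity_id):
        return self._rows.get(activity_id)


if __name__ == "__main__":
    store = ActivityStore()
    store.upsert(Activity("a1", "sent", 1))
    store.upsert(Activity("a1", "opened", 2))   # newer version wins
    store.upsert(Activity("a1", "sent", 1))     # late duplicate is ignored
    print(store.get("a1"))
```

Comparing versions makes the upsert idempotent, so replayed or out-of-order events do not overwrite newer state.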
Microsoft DataScience: Natural Language Understanding - What’s the purpose of meaning?
Microsoft writes an exciting blog that provides an overview of some critical elements in Natural Language Understanding (NLU). The blog is an excellent overview to get the big picture of Natural Language Processing.
Back Market: From Delta Lake to BigQuery
Last November, we saw the infamous performance benchmark warfare between Snowflake and Databricks. The blog is a curious read because Back Market operates Delta Lake, Snowflake, and Google BigQuery!
Foodpanda: How foodpanda reduced 45% of our BigQuery cost with reservations slots
Foodpanda shares some great insights on Google BigQuery pricing, best practices to monitor the cost, and the utilization of reservation slots to reduce the cost by 45%.
Hifly Labs: Awesome dbt
awesome-dbt is an excellent collection of dbt resources with sample projects. Thank you, Son N. Nguyen, for sharing the repo.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.