Data Engineering Weekly #75

Weekly Data Engineering Newsletter

Feb 21, 2022

Data Council - Austin 2022

Data Council published the Austin 2022 schedule here. Data Engineering Weekly readers can get a 20% discount using promo code DataWeekly20. I’m excited to attend Data Council in person and hope to meet you all.

https://www.datacouncil.ai/austin

Dagster: Bundling Vs UnBundling the Data Platform

It is an exciting week in data land. Gorkem Yurtseven started the conversation with an excellent write-up, The Unbundling of Airflow. The blog ends with:

A diverse set of tools is unbundling Airflow, and this diversity is causing substantial fragmentation in the modern data stack. Like everyone else, I also predict some consolidation of these tools in the coming years.

Dagster writes about its mission of Rebundling the Data Platform.

Having this many tools without a coherent, centralized control plane is lunacy and a terrible end state for data practitioners and their stakeholders.

I have a few thoughts to share on this, which I plan to write a blog about. I do believe the tools will consolidate soon. dbt is iterating on its metadata management layer [Leveraging dbt metadata in data management], and Atlan is stepping into the data exploration and visualization space.

I often hear comparisons of the modern data stack with the Unix philosophy. Who will be the Unix terminal of the modern data stack? That is the billion-dollar question! I believe the race has already started, which is excellent news for data practitioners.

https://blog.fal.ai/the-unbundling-of-airflow-2/

https://dagster.io/blog/rebundling-the-data-platform

Prefect: Logs, the Prefect Way

Observability into the orchestration engine is vital for operating data pipelines reliably. Prefect writes about Orion logging, a Pythonic logging system designed to maximize observability with a minimum of effort.

https://medium.com/the-prefect-blog/logs-the-prefect-way-a9e6923185fb

Pinterest: Spinner - Pinterest’s Workflow Platform

Staying with workflow orchestration, Pinterest writes about its migration from Pinball, its internal workflow orchestration engine, to Apache Airflow.

https://medium.com/pinterest-engineering/spinner-pinterests-workflow-platform-c5bbe190ba5

Apache Arrow: Introducing Apache Arrow Flight SQL - Accelerating Database Access

Apache Arrow introduced Flight SQL, a new client-server protocol developed by the Apache Arrow community for interacting with SQL databases. This tweet summarizes how significant this development is:

Josh Wills @josh_wills

The dumbest piece of the modern data stack: 1. Store columnar data in an RDBMS. 2. Pivot that data to a row-oriented format for transport over the wire. 3. Pivot the row-oriented format back to columnar for viz/analysis. We gotta fix this.
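The round trip described in the tweet can be sketched in a few lines of plain Python. This is only an illustration of the two pivots, not the Arrow Flight SQL API; the table contents and the helper names `columnar_to_rows` and `rows_to_columnar` are hypothetical. A protocol that ships columnar data end to end would make both helpers unnecessary.

```python
# Hypothetical columnar table: one list of values per column.
columns = {"user_id": [1, 2, 3], "country": ["US", "DE", "IN"]}


def columnar_to_rows(cols):
    """Pivot column lists into row tuples, as a row-oriented wire format would carry them."""
    names = list(cols)
    return names, list(zip(*(cols[name] for name in names)))


def rows_to_columnar(names, rows):
    """Pivot row tuples back into column lists on the client side."""
    if not rows:
        return {name: [] for name in names}
    return {name: list(values) for name, values in zip(names, zip(*rows))}


# Server side: columnar -> rows. Client side: rows -> columnar.
names, rows = columnar_to_rows(columns)
round_tripped = rows_to_columnar(names, rows)
```

Both pivots touch every value in the result set, and the client ends up with exactly the layout the server started with.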


https://arrow.apache.org/blog/2022/02/16/introducing-arrow-flight-sql/

Kevin Kho: Introducing Fugue — Reducing PySpark Developer Friction

Kevin Kho writes about Fugue, an open-source abstraction layer that provides a seamless transition from a single machine to a distributed computing setting. The article describes the inconsistencies between Pandas and PySpark and how Fugue can help bridge the gap and increase developer productivity.

https://towardsdatascience.com/introducing-fugue-reducing-pyspark-developer-friction-a702230455de

Mikkel Dengsøe: Data, engineers, and designers - How the US compares to Europe

The median data to engineers ratio for the US companies I looked at is 1:7 compared to 1:4 for the European companies. And the design to engineers ratio is 1:9 for both groups.

There are many surprises in this study. Analytics companies are, in fact, not that analytical! Are US companies automating more data engineering functions, or is the EU leaping ahead in adopting data practitioners? I tend to believe the latter but will be curious to know the operating principles of the companies.

https://towardsdatascience.com/data-engineers-and-designers-how-us-compares-to-europe-e1ce6f0a8908

Zach Quinn: Why Data Engineers Must Have Domain Knowledge — And How To Gain It

I switched between backend engineering and data engineering in my career. What excites me about data engineering is that it forces you to think from the business perspective about which data points are required to run a business.

A simple question like "What is the count of unique users?" can by itself reveal how the business operates.
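As a rough illustration of why that question is less simple than it looks, the sketch below counts "unique users" three ways over the same events. The event data, field names, and function names are all hypothetical; the point is only that each definition encodes a different business decision and produces a different number.

```python
from datetime import date

# Hypothetical event log; user_id is None for logged-out visitors.
events = [
    {"user_id": "u1", "device_id": "d1", "day": date(2022, 2, 1)},
    {"user_id": "u1", "device_id": "d2", "day": date(2022, 2, 2)},
    {"user_id": None, "device_id": "d3", "day": date(2022, 2, 2)},
    {"user_id": "u2", "device_id": "d3", "day": date(2022, 2, 20)},
]


def unique_logged_in_users(rows):
    """Count distinct accounts, ignoring logged-out traffic."""
    return len({r["user_id"] for r in rows if r["user_id"] is not None})


def unique_devices(rows):
    """Count distinct devices, whether or not anyone was logged in."""
    return len({r["device_id"] for r in rows})


def unique_logged_in_users_between(rows, start, end):
    """Count distinct accounts active within an inclusive date window."""
    return unique_logged_in_users([r for r in rows if start <= r["day"] <= end])


by_account = unique_logged_in_users(events)
by_device = unique_devices(events)
first_week = unique_logged_in_users_between(events, date(2022, 2, 1), date(2022, 2, 7))
```

Choosing among these requires knowing whether the business cares about accounts, devices, or activity in a given period, which is exactly the domain knowledge in question.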

The author writes an exciting article explaining why data engineers must have domain knowledge and offering an approach to acquiring it.

https://medium.com/pipeline-a-data-engineering-resource/why-data-engineers-must-have-domain-knowledge-and-how-to-gain-it-e9228ff3350d

Salesforce: Embracing Mutable Big Data

Salesforce writes about the importance of mutability in the Big Data ecosystem and gives an overview of its Activity Platform. We have seen the rise of LakeHouse table formats like Apache Hudi, Iceberg, and Delta Lake, which support mutability of the data. Modern OLAP engines like Apache Pinot embrace mutability as well.

Ananth Packkildurai @ananthdurai

@richardstartin In my personal experience, upsert support( row mutation) is the critical differentiator for Pinot. It gives more flexibility modeling analytics for business state changes.

Mutability is not bad as long as the transaction is limited to a bounded context, and it's time to embrace mutability.
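To make the idea of an upsert concrete, here is a minimal sketch in plain Python. It is not how Pinot, Hudi, Iceberg, or Delta Lake implement row mutation; the `upsert` helper, the `order_id` key, and the `updated_at` version column are assumptions chosen for illustration. Each key holds one current row, and an incoming change replaces it only if it is at least as new.

```python
def upsert(table, record, key="order_id", version="updated_at"):
    """Insert the record, or replace the existing row for its key if the record is not older."""
    current = table.get(record[key])
    if current is None or record[version] >= current[version]:
        table[record[key]] = record
    return table


# Hypothetical change stream for an orders table, including one late arrival.
changes = [
    {"order_id": 1, "status": "placed", "updated_at": 1},
    {"order_id": 2, "status": "placed", "updated_at": 2},
    {"order_id": 1, "status": "delivered", "updated_at": 5},
    {"order_id": 1, "status": "shipped", "updated_at": 3},  # older than current row, ignored
]

orders = {}
for change in changes:
    upsert(orders, change)
```

The table ends up modeling the current business state of each order instead of an append-only history, which is the flexibility the tweet refers to.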

https://engineering.salesforce.com/embracing-mutable-big-data-bf7106c2064d

Microsoft DataScience: Natural Language Understanding - What’s the purpose of meaning?

Microsoft writes an exciting blog that covers some critical elements of Natural Language Understanding (NLU). It is an excellent starting point for getting the big picture of Natural Language Processing.

Part 1: https://medium.com/data-science-at-microsoft/natural-language-understanding-whats-the-purpose-of-meaning-part-1-of-2-18a370a763

Part 2: https://medium.com/data-science-at-microsoft/natural-language-understanding-whats-the-purpose-of-meaning-part-2-of-2-cfac532103d4

Back Market: From Delta Lake to BigQuery

Last November, we saw the infamous performance benchmark warfare between Snowflake and Databricks. The blog is a curious read because Back Market operates Delta Lake, Snowflake, and Google BigQuery!

https://medium.com/back-market-engineering/from-delta-lake-to-bigquery-ac2cee830b24

Foodpanda: How foodpanda reduced 45% of our BigQuery cost with reservations slots

Foodpanda shares some great insights on Google BigQuery pricing, best practices for monitoring cost, and how it used slot reservations to reduce its cost by 45%.

https://medium.com/foodpanda-data/how-foodpanda-reduced-45-of-our-bigquery-cost-with-reservations-slots-2c79e1d37e4

Hifly Labs: Awesome dbt

awesome-dbt is an excellent collection of dbt resources with sample projects. Thank you, Son N. Nguyen, for sharing the repo.

https://github.com/Hiflylabs/awesome-dbt

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
