Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.
Sponsored: Announcing O'Reilly Data Quality Fundamentals Book. Now Available: Exclusive access to the first two chapters!
Thrilled to announce the release of O'Reilly's first-ever book on data quality, Data Quality Fundamentals: A Practitioner's Guide to Building More Trustworthy Data Pipelines! In this book, the Data Observability category creators make the business case for data trust and explain how data leaders can tackle data quality at scale by leveraging best practices and technologies used by some of the world’s most innovative companies.
Benn Stancil: The future of operational analytics
Another great write-up from Benn Stancil on the future of operational analytics, making the case that analytics is the experience. The failing-dashboard analogy is an exciting read that closely resembles the classic second-system syndrome.
https://benn.substack.com/p/the-future-of-operational-analytics
Robert Yi: Signaling a tectonic shift in the transformation layer
Airbnb’s Minerva metrics layer and the recent Looker & Tableau partnership triggered some exciting conversation on the transformation layer vs. the metrics layer. The author narrates the paradigm shift in the transformation layer and how the transformation layer and the metrics layer complement each other.
https://robertyi.substack.com/p/signaling-a-tectonic-shift-in-the
Monte Carlo: The Future of the Data Engineer
The conversation is an excellent recap of the current state of data engineering and what the future holds in a fast-changing data tooling landscape. The narration on scalability & cost optimization and on consensus & change management under distributed ownership is an exciting read.
https://www.montecarlodata.com/the-future-of-the-data-engineer/
Monzo: An introduction to Monzo’s data stack
Monzo gives an overview of its data infrastructure on Google Cloud. The use of dbt, and the wrapper tooling built on top of dbt to speed up execution, is an exciting read. It is evident from the blog that one of the most significant challenges in data engineering is ownership and the contract between producers and consumers.
https://medium.com/data-monzo/an-introduction-to-monzos-data-stack-827ae531bc99
Petrica Leuca: What is data versioning and 3 ways to implement it
Data versioning is essential to building reliable data pipelines. The author explains what data versioning is and walks through three patterns for implementing it; a minimal sketch of the first pattern follows the link below.
Kimball model - SCD pattern
A daily snapshot of the dimension table - functional pattern
CDC pipeline - event sourcing
https://medium.com/@petrica.leuca/what-is-data-versioning-and-3-ways-to-implement-it-4b6377bbdf93
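The Kimball/SCD approach in the first pattern above is the most common of the three. Below is a minimal sketch of an SCD Type 2 merge, written in pandas purely for illustration; the function name and the valid_from / valid_to / is_current columns are assumptions, not taken from the article.

```python
# SCD Type 2 sketch in pandas (illustrative only): close out changed rows
# and insert new versions, so every historical value stays queryable.
import pandas as pd

def scd2_merge(dim, snapshot, key, attrs, as_of):
    """Apply a new source snapshot to a dimension table, keeping history."""
    current = dim[dim["is_current"]]
    merged = current.merge(snapshot, on=key, suffixes=("", "_new"))

    # Keys whose tracked attributes changed need a new version.
    changed = pd.Series(False, index=merged.index)
    for col in attrs:
        changed |= merged[col] != merged[f"{col}_new"]
    changed_keys = set(merged.loc[changed, key])

    # Close out the old versions of the changed rows.
    expire = dim[key].isin(changed_keys) & dim["is_current"]
    dim.loc[expire, "valid_to"] = as_of
    dim.loc[expire, "is_current"] = False

    # Insert new versions for changed keys and brand-new keys.
    new_keys = set(snapshot[key]) - set(current[key])
    inserts = snapshot[snapshot[key].isin(changed_keys | new_keys)].copy()
    inserts["valid_from"] = as_of
    inserts["valid_to"] = pd.NaT
    inserts["is_current"] = True
    return pd.concat([dim, inserts], ignore_index=True)
```

In a warehouse, the same pattern is usually expressed as a MERGE statement or handled by dbt snapshots rather than hand-rolled in application code.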
Sponsored: Live Tech Session - The Modern Data Stack Is Warehouse-First
Join leaders from Snowflake, Mammoth Growth, RudderStack, and Mixpanel to learn why the most sophisticated teams architect their data stacks around the data warehouse.
https://rudderstack.com/video-library/the-modern-data-stack-is-warehouse-first
Twitter: Processing billions of events in real-time at Twitter
Twitter writes about its journey of adopting the Kappa architecture pattern and the reasoning behind moving away from the Lambda architecture. The blog is an exciting read on the scalability challenges of maintaining the same view for both real-time and batch analytics.
Uber: Introducing uGroup: Uber’s Consumer Management Framework
Consumer offset monitoring is critical for operating streaming applications on top of Apache Kafka. Uber writes about uGroup, its Kafka consumer management framework.
https://eng.uber.com/introducing-ugroup-ubers-consumer-management-framework/
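uGroup itself is Uber-internal, but the underlying signal it watches, consumer lag (the gap between the committed offset and the log end offset), can be computed with any Kafka client. A minimal sketch with the kafka-python library follows; the broker address and group id are placeholders.

```python
# Compute per-partition consumer lag with kafka-python (illustrative only;
# this is not Uber's uGroup). Broker address and group id are placeholders.
from kafka import KafkaAdminClient, KafkaConsumer

BOOTSTRAP = "localhost:9092"    # placeholder
GROUP_ID = "my-consumer-group"  # placeholder

admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)

# Committed offsets for every partition the group has consumed.
committed = admin.list_consumer_group_offsets(GROUP_ID)

# Current log end offsets for the same partitions.
end_offsets = consumer.end_offsets(list(committed.keys()))

for tp, meta in committed.items():
    lag = end_offsets[tp] - meta.offset
    print(f"{tp.topic}[{tp.partition}] committed={meta.offset} lag={lag}")
```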
LinkedIn: Project Magnet, providing push-based shuffle, now available in Apache Spark 3.2
LinkedIn writes about Magnet, its push-based shuffle implementation. Push-based shuffle is an implementation of shuffle where the shuffle blocks are pushed from the mapper tasks to remote shuffle services. The blog gives an overview of push-based shuffle, which is now available as part of the Spark 3.2 open-source release.
https://engineering.linkedin.com/blog/2021/push-based-shuffle-in-apache-spark
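For readers who want to try it, push-based shuffle in Spark 3.2 is enabled through configuration. The sketch below follows the Spark 3.2 configuration documentation; it assumes a YARN cluster with the external shuffle service, whose config also needs spark.shuffle.push.server.mergedShuffleFileManagerImpl set to org.apache.spark.network.shuffle.RemoteBlockPushResolver.

```python
# Client-side settings to enable push-based shuffle in Spark 3.2 (PySpark),
# per the Spark configuration docs. Requires YARN plus the external shuffle
# service configured with the merged shuffle file manager on the server side.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("push-based-shuffle-demo")
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.shuffle.push.enabled", "true")
    .getOrCreate()
)
```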
Debezium: Using Debezium to Create a Data Lake with Apache Iceberg.
Apache Iceberg is an open table format for large analytic datasets. Debezium writes about a new Debezium Server sink connector that consumes change data streams and writes them to Apache Iceberg tables, enabling a data lake built on Iceberg.
https://debezium.io/blog/2021/10/20/using-debezium-create-data-lake-with-apache-iceberg/
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.