Data Engineering Weekly #61

Weekly Data Engineering Newsletter

Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.


Sponsored: Announcing O'Reilly Data Quality Fundamentals Book. Now Available: Exclusive access to the first two chapters!

Thrilled to announce the release of O'Reilly's first-ever book on data quality, Data Quality Fundamentals: A Practitioner's Guide to Building More Trustworthy Data Pipelines! In this book, the Data Observability category creators make the business case for data trust and explain how data leaders can tackle data quality at scale by leveraging best practices and technologies used by some of the world’s most innovative companies.

Download Your Free Copy


Benn Stancil: The future of operational analytics

Another great write-up from Benn Stancil on the future of operational analytics, explaining why analytics is the experience. The dashboard-failure analogy is an exciting read that closely resembles the typical second-system syndrome.

https://benn.substack.com/p/the-future-of-operational-analytics


Robert Yi: Signaling a tectonic shift in the transformation layer

Airbnb’s Minerva metrics layer and the recent Looker & Tableau partnership triggered some exciting conversations about the transformation layer vs. the metrics layer. The author describes the paradigm shift in the transformation layer and how the transformation and metrics layers complement each other.

https://robertyi.substack.com/p/signaling-a-tectonic-shift-in-the


Monte Carlo: The Future of the Data Engineer

The conversation is an excellent recap of the current state of data engineering and what the future holds in a fast-changing data tooling landscape. The discussion of scalability & cost optimization, and of consensus & change management under distributed ownership, is an exciting read.

https://www.montecarlodata.com/the-future-of-the-data-engineer/


Monzo: An introduction to Monzo’s data stack

Monzo gives an overview of its data infrastructure on Google Cloud. The use of dbt, and the wrapper tooling on top of dbt to speed up execution, is an exciting read. The blog makes it evident that one of the most significant challenges of data engineering is ownership and the contract between producers & consumers.

https://medium.com/data-monzo/an-introduction-to-monzos-data-stack-827ae531bc99


Petrica Leuca: What is data versioning and 3 ways to implement it

Data versioning is essential to data pipelines. The author explains what data versioning is and three patterns for implementing it:

  1. Kimball model - SCD pattern

  2. A daily snapshot of the dimension table - functional pattern

  3. CDC pipeline - event sourcing

https://medium.com/@petrica.leuca/what-is-data-versioning-and-3-ways-to-implement-it-4b6377bbdf93
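The first pattern above, Kimball-style slowly changing dimensions (SCD Type 2), can be sketched in a few lines. This is a minimal in-memory illustration with a hypothetical customer dimension; a real implementation would run as a SQL MERGE or dbt snapshot against the warehouse.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional

# SCD Type 2 sketch: instead of overwriting a changed attribute, close the
# current row's validity window and append a new row for the new version.

@dataclass
class DimRow:
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date]  # None marks the current row

def apply_scd2(history: list[DimRow], customer_id: int,
               new_city: str, as_of: date) -> list[DimRow]:
    """Expire the current row for customer_id and append the new version."""
    updated = []
    for row in history:
        if row.customer_id == customer_id and row.valid_to is None:
            # Close the old version rather than mutating it in place.
            updated.append(replace(row, valid_to=as_of))
        else:
            updated.append(row)
    updated.append(DimRow(customer_id, new_city, valid_from=as_of, valid_to=None))
    return updated

history = [DimRow(1, "London", date(2020, 1, 1), None)]
history = apply_scd2(history, 1, "Paris", date(2021, 6, 1))
```

Because old versions are preserved with their validity windows, a fact table can always join to the dimension values that were current at event time.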


Sponsored: Live Tech Session - The Modern Data Stack Is Warehouse-First

Join leaders from Snowflake, Mammoth Growth, RudderStack, and Mixpanel to learn why the most sophisticated teams architect their data stacks around the data warehouse.

https://rudderstack.com/video-library/the-modern-data-stack-is-warehouse-first


Twitter: Processing billions of events in real-time at Twitter

Twitter writes about its journey adopting the Kappa architecture pattern and the reasoning for moving away from the Lambda architecture pattern. The blog is an exciting read on the scalability challenges of maintaining a consistent view across real-time and batch analytics.

https://blog.twitter.com/engineering/en_us/topics/infrastructure/2021/processing-billions-of-events-in-real-time-at-twitter-
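The core Kappa idea the post builds on can be shown in a toy sketch: one append-only log is the source of truth, and both the live view and any later recomputation come from replaying that same log through the same processing logic, instead of maintaining a separate batch code path as Lambda does. The event shapes here are illustrative only.

```python
from collections import Counter

event_log = []  # append-only log (stand-in for a Kafka topic)

def process(events):
    """Single processing function shared by live and replayed runs."""
    return Counter(e["user"] for e in events)

# Live path: consume events as they arrive, updating the view incrementally.
live_view = Counter()
for event in [{"user": "a"}, {"user": "b"}, {"user": "a"}]:
    event_log.append(event)
    live_view[event["user"]] += 1

# Reprocessing path: replay the full log through the same logic.
# In Kappa, a code or logic change means replaying the log, not
# maintaining a second batch pipeline that must produce the same answer.
recomputed_view = process(event_log)
```

The two views agree by construction, which is exactly the property that is hard to maintain in Lambda, where batch and speed layers are implemented twice.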


Uber: Introducing uGroup: Uber’s Consumer Management Framework

Consumer offset monitoring is critical for operating streaming applications on top of Apache Kafka. Uber writes about uGroup, its Kafka consumer management framework.

https://eng.uber.com/introducing-ugroup-ubers-consumer-management-framework/
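The lag computation at the heart of consumer offset monitoring is simple to sketch. The topic names and in-memory offset maps below are illustrative; a real monitor such as uGroup fetches committed offsets and log-end offsets from the brokers.

```python
# Per-partition consumer lag = log-end offset minus committed offset.
# Keys are (topic, partition) tuples; values are offsets.

def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Compute lag per partition; an uncommitted partition counts from 0."""
    return {
        tp: log_end_offsets[tp] - committed_offsets.get(tp, 0)
        for tp in log_end_offsets
    }

log_end = {("payments", 0): 1500, ("payments", 1): 900}
committed = {("payments", 0): 1480, ("payments", 1): 900}

lag = consumer_lag(log_end, committed)
# Partitions whose lag keeps growing indicate a stuck or slow consumer.
lagging = {tp for tp, n in lag.items() if n > 0}
```

Alerting on lag that grows monotonically over a window, rather than on any nonzero lag, avoids paging on healthy consumers that are momentarily behind.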


LinkedIn: Project Magnet, providing push-based shuffle, now available in Apache Spark 3.2

LinkedIn writes about Magnet. Push-based shuffle is an implementation of shuffle in which mapper tasks push shuffle blocks to remote shuffle services, rather than reducers fetching them later. The blog gives an overview of push-based shuffle, which is now available as part of the Spark 3.2 open source release.

https://engineering.linkedin.com/blog/2021/push-based-shuffle-in-apache-spark
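In Spark 3.2, push-based shuffle is opt-in via configuration. A sketch of the relevant properties follows; the names are taken from the Spark configuration documentation and should be verified against your release, since push-based shuffle also requires the external shuffle service to be running.

```properties
# Client side: let mapper tasks push shuffle blocks to remote shuffle services.
spark.shuffle.push.enabled=true

# Server side: the external shuffle service must use the merged-file manager
# so pushed blocks can be merged per reduce partition.
spark.shuffle.push.server.mergedShuffleFileManagerImpl=org.apache.spark.network.shuffle.RemoteBlockPushResolver
```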


Debezium: Using Debezium to Create a Data Lake with Apache Iceberg

Apache Iceberg is an open table format for large analytic datasets. Debezium writes about a new Debezium Server sink connector that consumes change data streams and writes them to Apache Iceberg tables.

https://debezium.io/blog/2021/10/20/using-debezium-create-data-lake-with-apache-iceberg/
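Debezium Server sinks are selected through its `application.properties`. The fragment below is a hypothetical sketch of wiring the Iceberg sink described in the post; the property names follow the `debezium.sink.*` / `debezium.source.*` convention, and the warehouse location is illustrative, so check the connector's documentation for the exact keys.

```properties
# Select the Iceberg sink for Debezium Server (name per the post's connector).
debezium.sink.type=iceberg
# Illustrative warehouse location for the Iceberg tables.
debezium.sink.iceberg.warehouse=s3a://my-bucket/warehouse
# Source connector capturing the change data stream.
debezium.source.connector.class=io.debezium.connector.postgresql.PostgresConnector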


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.