Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.
Sponsored: Announcing O'Reilly Data Quality Fundamentals Book. Now Available: Exclusive access to the first two chapters!
Thrilled to announce the release of O'Reilly's first-ever book on data quality, Data Quality Fundamentals: A Practitioner's Guide to Building More Trustworthy Data Pipelines! In this book, the Data Observability category creators make the business case for data trust and explain how data leaders can tackle data quality at scale by leveraging best practices and technologies used by some of the world’s most innovative companies.
Benn Stancil: The future of operational analytics
Another great write-up from Benn Stancil on the future of operational analytics, making the case that analytics is the experience. The failing-dashboard analogy is an exciting read that closely resembles the classic second-system syndrome.
https://benn.substack.com/p/the-future-of-operational-analytics
Robert Yi: Signaling a tectonic shift in the transformation layer
Airbnb’s Minerva metrics layer and the recent Looker & Tableau partnership triggered some exciting conversation on the transformation layer vs. the metrics layer. The author narrates the paradigm shift in the transformation layer and how the transformation layer and the metrics layer complement each other.
https://robertyi.substack.com/p/signaling-a-tectonic-shift-in-the
Monte Carlo: The Future of the Data Engineer
The conversation is an excellent recap of the current state of data engineering and what the future holds in a fast-changing data tooling landscape. The narration on scalability & cost optimization and on consensus & change management under distributed ownership is an exciting read.
https://www.montecarlodata.com/the-future-of-the-data-engineer/
Monzo: An introduction to Monzo’s data stack
Monzo gives an overview of its data infrastructure on Google Cloud. The use of dbt, and the wrapper tooling built on top of dbt to speed up execution, is an exciting read. It is evident from the blog that one of the most significant challenges in data engineering is ownership and the contract between producers and consumers.
https://medium.com/data-monzo/an-introduction-to-monzos-data-stack-827ae531bc99
Petrica Leuca: What is data versioning and 3 ways to implement it
Data versioning is essential to building reliable data pipelines. The author explains what data versioning is and walks through three patterns for implementing it; a minimal sketch of the first pattern follows the link below.
Kimball model - SCD pattern
A daily snapshot of the dimension table - functional pattern
CDC pipeline - event sourcing
https://medium.com/@petrica.leuca/what-is-data-versioning-and-3-ways-to-implement-it-4b6377bbdf93
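The Kimball/SCD approach in the first pattern above is the most common of the three. Below is a minimal sketch of an SCD Type 2 merge, written in pandas purely for illustration; the function name and the valid_from / valid_to / is_current columns are assumptions, not taken from the article.

```python
# SCD Type 2 sketch in pandas (illustrative only): close out changed rows
# and insert new versions, so every historical value stays queryable.
import pandas as pd

def scd2_merge(dim, snapshot, key, attrs, as_of):
    """Apply a new source snapshot to a dimension table, keeping history."""
    current = dim[dim["is_current"]]
    merged = current.merge(snapshot, on=key, suffixes=("", "_new"))

    # Keys whose tracked attributes changed need a new version.
    changed = pd.Series(False, index=merged.index)
    for col in attrs:
        changed |= merged[col] != merged[f"{col}_new"]
    changed_keys = set(merged.loc[changed, key])

    # Close out the old versions of the changed rows.
    expire = dim[key].isin(changed_keys) & dim["is_current"]
    dim.loc[expire, "valid_to"] = as_of
    dim.loc[expire, "is_current"] = False

    # Insert new versions for changed keys and brand-new keys.
    new_keys = set(snapshot[key]) - set(current[key])
    inserts = snapshot[snapshot[key].isin(changed_keys | new_keys)].copy()
    inserts["valid_from"] = as_of
    inserts["valid_to"] = pd.NaT
    inserts["is_current"] = True
    return pd.concat([dim, inserts], ignore_index=True)
```

In a warehouse, the same pattern is usually expressed as a MERGE statement or handled by dbt snapshots rather than hand-rolled in application code.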
Sponsored: Live Tech Session - The Modern Data Stack Is Warehouse-First
Join leaders from Snowflake, Mammoth Growth, RudderStack, and Mixpanel to learn why the most sophisticated teams architect their data stacks around the data warehouse.
https://rudderstack.com/video-library/the-modern-data-stack-is-warehouse-first
Twitter: Processing billions of events in real-time at Twitter
Twitter writes about its journey of adopting the Kappa architecture pattern and the reasoning behind moving away from the Lambda architecture. The blog is an exciting read on the scalability challenges of maintaining the same view for both real-time and batch analytics.
Uber: Introducing uGroup: Uber’s Consumer Management Framework
Consumer offset monitoring is critical for operating streaming applications on top of Apache Kafka. Uber writes about uGroup, its Kafka consumer management framework.
https://eng.uber.com/introducing-ugroup-ubers-consumer-management-framework/
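uGroup itself is Uber-internal, but the underlying signal it watches, consumer lag (the gap between the committed offset and the log end offset), can be computed with any Kafka client. A minimal sketch with the kafka-python library follows; the broker address and group id are placeholders.

```python
# Compute per-partition consumer lag with kafka-python (illustrative only;
# this is not Uber's uGroup). Broker address and group id are placeholders.
from kafka import KafkaAdminClient, KafkaConsumer

BOOTSTRAP = "localhost:9092"    # placeholder
GROUP_ID = "my-consumer-group"  # placeholder

admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)

# Committed offsets for every partition the group has consumed.
committed = admin.list_consumer_group_offsets(GROUP_ID)

# Current log end offsets for the same partitions.
end_offsets = consumer.end_offsets(list(committed.keys()))

for tp, meta in committed.items():
    lag = end_offsets[tp] - meta.offset
    print(f"{tp.topic}[{tp.partition}] committed={meta.offset} lag={lag}")
```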
LinkedIn: Project Magnet, providing push-based shuffle, now available in Apache Spark 3.2
LinkedIn writes about Magnet, its push-based shuffle implementation. Push-based shuffle is an implementation of shuffle where the shuffle blocks are pushed from the mapper tasks to remote shuffle services. The blog gives an overview of push-based shuffle, which is now available as part of the Spark 3.2 open-source release.
https://engineering.linkedin.com/blog/2021/push-based-shuffle-in-apache-spark
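For readers who want to try it, push-based shuffle in Spark 3.2 is enabled through configuration. The sketch below follows the Spark 3.2 configuration documentation; it assumes a YARN cluster with the external shuffle service, whose config also needs spark.shuffle.push.server.mergedShuffleFileManagerImpl set to org.apache.spark.network.shuffle.RemoteBlockPushResolver.

```python
# Client-side settings to enable push-based shuffle in Spark 3.2 (PySpark),
# per the Spark configuration docs. Requires YARN plus the external shuffle
# service configured with the merged shuffle file manager on the server side.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("push-based-shuffle-demo")
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.shuffle.push.enabled", "true")
    .getOrCreate()
)
```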
Debezium: Using Debezium to Create a Data Lake with Apache Iceberg.
Apache Iceberg is an open table format for large analytic datasets. Debezium writes about a new Debezium Server sink connector that consumes change data streams and writes them to Apache Iceberg tables, enabling a data lake built on Iceberg.
https://debezium.io/blog/2021/10/20/using-debezium-create-data-lake-with-apache-iceberg/
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.