Data Engineering Weekly #100
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Benn Stancil: If data is a product, what is production?
What is production? Benn raises the point that if we can’t define it, we can’t meaningfully build it. The best answer to that is,
The thread triggered many interesting discussions around the need for production, and David Jayatillake came up with an excellent question.
Mehdi Ouazza: Data Contracts — From Zero To Hero
Data Production, Data as a Product, and all these higher-level concepts ultimately build on top of "Data Contracts." I wrote my own version of the Data Contract initiative, called "Schemata," a collaborative, decentralized data contract management system.
Data Contracts will be the next big step in the modern data stack to empower data as a strategic advantage.
But what is a data contract? Where do you start? Mehdi writes an excellent guide to data contracts.
BlaBlaCar: Do’s and Don’ts of Data Mesh
Data Mesh builds on four core principles: domain-oriented decentralized data ownership & architecture, data as a product, self-serve data infrastructure as a platform, and federated computational governance. Unsurprisingly, the founding element of Data Mesh is the data contract. BlaBlaCar writes about the Dos & Don'ts of adopting Data Mesh.
Yali Sassoon: Organizations need to deliberately create data
Can an organization deliberately create data to power machine learning and advanced analytical use cases? Today, every ML product goes through a discovery process that uncovers missing data or sources, which in turn triggers a data extraction process. I agree with the author's view, and as an industry, we have a long way to go to build what I call "Analytical Ready Data Asset Creation."
Sponsored: Firebolt - Assembling a Query Engine From Spare Parts
Building a new data warehouse is a daunting challenge. It requires massive investments into both the query engine and surrounding cloud infrastructure. Mosha Pasumansky and Benjamin Wagner wrote a paper about it which explains how Firebolt engineers built a query engine on top of existing projects and invested heavily into differentiating features.
McDonald’s: McDonald’s event-driven architecture - The data journey and how it works
McDonald's writes about how it enforces data contracts between producers and consumers. The domain-based sharding is an exciting approach to bringing multi-tenancy to event streaming while isolating failure modes.
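To make contract enforcement between a producer and a consumer concrete, here is a minimal Python sketch. It is not McDonald's implementation: the contract format (field name mapped to a Python type), the `ORDER_CONTRACT` fields, and the idea of routing failures to a dead-letter list are all assumptions made for illustration.

```python
# Hypothetical contract: field name -> required Python type.
ORDER_CONTRACT = {"order_id": str, "store_id": str, "total_cents": int}


def violations(event: dict, contract: dict) -> list[str]:
    """Return a list of human-readable contract violations for one event."""
    problems = []
    for name, expected in contract.items():
        if name not in event:
            problems.append(f"missing field: {name}")
        elif not isinstance(event[name], expected):
            actual = type(event[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems


def route(events: list[dict], contract: dict) -> tuple[list[dict], list[dict]]:
    """Split events into (valid, dead_letter) based on the contract."""
    valid, dead_letter = [], []
    for event in events:
        problems = violations(event, contract)
        if problems:
            # Keep the offending event next to the reasons it was rejected.
            dead_letter.append({"event": event, "problems": problems})
        else:
            valid.append(event)
    return valid, dead_letter
```

Keeping rejected events together with the reasons for rejection means a bad producer does not block valid traffic, and the owning team can see exactly what broke.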
Flat Pack Tech: IKEA’s Knowledge Graph and Why It Has Three Layers
IKEA writes about its knowledge graph system and the layers of knowledge graphs. The three-layer approach focuses on:
Concepts: represent the business concepts of what the company does.
Categories: represent a controlled vocabulary or terminology used within the business.
Data: represent the products or unique selling units of a business.
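The three layers can be sketched as a small data model. This is an illustration only, not IKEA's schema: the class names, and the assumption that each category points to one concept and each data item is tagged with categories, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """Layer 1: a business concept."""
    name: str


@dataclass(frozen=True)
class Category:
    """Layer 2: a controlled-vocabulary term, linked to a concept (assumed)."""
    term: str
    concept: Concept


@dataclass(frozen=True)
class DataItem:
    """Layer 3: a product or sellable unit, tagged with categories (assumed)."""
    item_id: str
    categories: tuple[Category, ...]


def items_for_concept(items: list[DataItem], concept: Concept) -> list[str]:
    """Walk data -> categories -> concepts to find the items under a concept."""
    return [
        item.item_id
        for item in items
        if any(category.concept == concept for category in item.categories)
    ]
```

Separating the layers this way lets the vocabulary change without touching product records, since items only reference categories.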
Sponsored: Soda - 💟 Data Quality Checks in Airflow DAGs with Astronomer + Soda
For reliable data pipelines-as-code, check out the fresh, easy-to-follow, step-by-step guide to setting up and integrating open-source data quality checks with Apache Airflow.
Instacart: Lessons Learned - The Journey to Real-Time Machine Learning at Instacart
Instacart writes about the lessons learned on its journey to build real-time machine learning applications, highlighting critical challenges with the infrastructure. I think the real-time feature store is a unique problem waiting to be disrupted.
Grab: Automatic rule backtesting with large quantities of data
Simulation analysis is an exciting area of study, and I was delighted to read about how Grab enables automatic rule backtesting with historical data. The replay system, which simulates the effect of a changed or newly proposed rule, is exciting to read about.
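The core idea of backtesting a rule change by replay can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Grab's system: events are plain dicts with an `id` key, and a rule is any function that returns True when an event should be flagged.

```python
from typing import Callable

# A rule returns True when the event should be flagged (assumed convention).
Rule = Callable[[dict], bool]


def backtest(history: list[dict], current: Rule, proposed: Rule) -> dict:
    """Replay historical events through both rules and report the differences."""
    newly_flagged, no_longer_flagged = [], []
    for event in history:
        before, after = current(event), proposed(event)
        if after and not before:
            newly_flagged.append(event["id"])
        elif before and not after:
            no_longer_flagged.append(event["id"])
    return {
        "total": len(history),
        "newly_flagged": newly_flagged,
        "no_longer_flagged": no_longer_flagged,
    }
```

Comparing the proposed rule against the current one on the same history shows the impact of a change before it reaches production traffic.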
Sponsored: RudderStack - Better Customer Data Integration Management For Growing Teams
In this piece, Ben Rogojan outlines your options for solving data integration challenges as your company grows: building a scalable framework or architecting a stack with the right tools. Check it out for some practical advice on which approach to take.
Slim Baltagi: Snowflake Performance Challenges & Solutions
We've seen many thoughts on cloud data warehouse cost, from "Why Snowflake so expensive?" to "How Snowflake fails," alongside customers' love for Snowflake. I think this article does a better job of narrating the gap between the marketing and the reality of operating a cloud data warehouse. I like how the author constructively covers critical aspects of Snowflake and offers suggestions for improvement.
Claimsforce: Lakehouse — The journey unifying Data Lake and Data Warehouse
The emerging LakeHouse systems narrow the gap between data warehouses and data lakes built on top of object stores. Historically, a two-tier system mixing a data lake and a data warehouse acted as a bridge to balance scale and speed. Claimsforce writes an exciting blog that narrates its journey to unify its data lake and data warehouse with Delta Lake.
Presto Tech Talk: Apache Hudi for Presto and how it's used at Bytedance
One of the shortcomings highlighted in Claimsforce’s article is the lack of Delta Lake support in Athena (the AWS version of Presto). Bytedance talks about how Presto works with Apache Hudi, another leading LakeHouse system.
Xinli Shang: Presto Parquet Column Encryption
Access control is a challenging part of data management at scale. Uber has previously written about how it achieved finer-grained encryption with Apache Parquet.
The blog narrates the challenges and performance benchmarks of adopting Parquet encryption in Presto.
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.