Data Engineering Weekly #100

The Weekly Data Engineering Newsletter

Sep 12, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Benn Stancil: If data is a product, what is production?

What is production? Benn raised a question if we can’t define it, we can’t meaningfully build it. The best answer to that is,

Josh Wills @josh_wills

@bennstancil A data artifact is "in production" if a) it has a SLA and data quality checks that alert when they are violated, and b) all of its upstream sources have a data contract (which I define as a validated schema + data quality checks) which *also* alert when they are violated.

The thread triggered many interesting discussions around the need for production, and I found David Jayatillake came up with an excellent question.

David Jayatillake @DSJayatillake

"A financial analyst who’s building an account health model asks for an adjusted set of metrics that they think will be useful predictors of customer churn." How to support this without it entering production unintentionally. 🧵 ⁦@bennstancil⁩ benn.substack.com/p/what-is-prod…

benn.substack.comIf data is a product, what is production?We can’t build something unless we know what it is.

https://benn.substack.com/p/what-is-production

Mehdi Ouazza: Data Contracts — From Zero To Hero

Data Production, Data is a Product, and all these higher level concepts ultimately build on top of "Data Contracts." I wrote my version of the Data Contract initiative called "Schemata," a collaborative, decentralized data contract management system.

Data Contracts will be the next big step in the modern data stack to empower data as a strategic advantage.

But what is Data Contract? Where to start? Mehdi writes an excellent guide about data contracts.

https://towardsdatascience.com/data-contracts-from-zero-to-hero-343717ac4d5e

BlaBlaCar: Do’s and Don’ts of Data Mesh

The Data Mesh principle builds on four core principles, domain-oriented decentralized data ownership & architecture, data as a product, self-serve data infrastructure as a platform, and federated computational governance. The founding element in Data Mesh is no surprise is the Data Contracts. BlaBlaCar writes about Dos & Don'ts while adopting Data Mesh.

https://medium.com/blablacar/dos-and-don-ts-of-data-mesh-e093f1662c2d

Yali Sassoon: Organizations need to deliberately create data

Can an organization deliberately create data to power machine learning and advance analytical use cases? The current state is that every ML product goes through the discovery process, which finds missing data/sources, resulting in a data extraction process. I agree with the author's view, and as an industry, we have a long way to go to build what I call an "Analytical Ready Data Asset Creation."

https://datacreation.substack.com/p/organizations-need-to-deliberately

McDonald’s: McDonald’s event-driven architecture - The data journey and how it works

Mcdonald's writes about how it enforces data contracts between the producer and consumer. The domain-based sharing is an exciting approach to bringing multi-tenant event streaming to isolate failure modes.

Part 1: https://medium.com/mcdonalds-technical-blog/behind-the-scenes-mcdonalds-event-driven-architecture-51a6542c0d86

Part 2: https://medium.com/mcdonalds-technical-blog/mcdonalds-event-driven-architecture-the-data-journey-and-how-it-works-4591d108821f

Flat Pack Tech: IKEA’s Knowledge Graph and Why It Has Three Layers

IKEA writes about its knowledge graph system and the layers of knowledge graphs. The three-layer approach focuses on

Concepts: represent the business concepts of what the company does.
Categories: represents a controlled vocabulary or terminology used within the business.
Data: represent the product or unique selling unit of a business.

https://medium.com/flat-pack-tech/ikeas-knowledge-graph-and-why-it-has-three-layers-a38fca436349

Instacart: Lessons Learned - The Journey to Real-Time Machine Learning at Instacart

Instacart writes about lessons learned from its journey to build real-time machine learning applications highlighting critical challenges with the infrastructure. I think the real-time feature store is a unique problem waiting to be disturbed.

https://tech.instacart.com/lessons-learned-the-journey-to-real-time-machine-learning-at-instacart-942f3a656af3

Grab: Automatic rule backtesting with large quantities of data

Simulation analysis is an exciting area of study, and delighted to read about how Grub enables automatic rule backtesting with historical data. The replay system to simulate the rule changes or any proposed rule change is exciting to read.

https://engineering.grab.com/automatic-rule-backtesting

Slim Baltagi: Snowflake Performance Challenges & Solutions

We've seen thoughts on Cloud Data Warehouse cost, from Why Snowflake so expensive? to How Snowflake fails, the customer's love for Snowflake. I thought this article better narrates the misconception from marketing to the reality of operating the Cloud data warehouse system. I like how the author constructively narrates critical aspects of Snowflake and the suggestion to improve.

https://www.linkedin.com/pulse/snowflake-performance-challenges-solutions-part-1-slim-baltagi/

Claimsforce: Lakehouse — The journey unifying Data Lake and Data Warehouse

The emerging LakeHouse systems narrow the gap between the data warehouses and data lakes systems on top of object stores. Historically the two-state system with a mix of the data lake and data warehouse act as a bridge to balance the scale and speed. Claimforce writes an exciting blog that narrates its journey to unify data lake and data warehouse with delta lake.

https://medium.com/claimsforce/lakehouse-the-journey-unifying-data-lake-and-data-warehouse-bef7629c143a

Presto Tech Talk: Apache Hudi for Presto and how it's used at Bytedance

One of the shortcomings in Claimforce’s article highlights the lack of support from Athena (AWS version of Presto) with Delta Lake. Bytedance talks about how Presto works with Apache Hudi, another leading LakeHouse system.

Xinli Shang: Presto Parquet Column Encryption

Access Control is a challenging part of data management at scale. Uber writes about how it did achieve finer-grained encryption with Apache Parquet in the past.

One Stone, Three Birds: Finer-Grained Encryption @ Apache Parquet™

The blog narrates the challenges and performance benchmarks of adopting parquet encryption in Presto.

https://prestodb.io/blog/2022/07/10/presto-parquet-column-encryption.html

All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.

Data Engineering Weekly