Data Engineering Weekly Is Brought to You by RudderStack
RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles, so you can quickly ship actionable, enriched data to every downstream team. See how it works today.
Editors’ Note: Update on DewCon | Data Eng + AI Conference in Bengaluru - October 12th, 2023
We announced the call for speakers only last week, and we are overwhelmed by the response. The call for speakers will be open until the 21st of July. We will start communicating with the selected speakers early next week, so stay tuned!
Submit Your Conference Talk Proposal Here
A few of our friends asked me and Aswin what motivated us to launch DewCon. Over the last year, many talented data founders contacted me to discuss their startups and challenges. I even angel-invested in a couple of companies. At the same time, I noticed there was no forum connecting data practitioners, data founders, decision makers at leading multinational companies, and venture capitalists. DewCon is an attempt to create a forum that facilitates healthy collaboration across all parts of the data community.
DewCon Sponsorship
We are seeking sponsors for the DewCon 2023 edition to help lower ticket prices for data practitioners, so that as many developers as possible can attend and participate in the forum. If you’re interested in sponsoring DewCon, please get in touch with us by filling out the form. Every premium conference speaker will get exclusive coverage in Data Engineering Weekly, so your reach is not limited to one geographical area.
Hello DewCon Sponsors, Please Click Here and Enter Your Details. We will reach out to you ASAP.
Andrew Jones: Data Contracts - the book. Out now
Congrats to Andrew Jones on the brand-new book on Data Contracts - the topic is very close to my heart. I briefly reviewed the book, and it looks like a solid one. I especially liked the chapters on how to get adoption in your organization, a sample implementation, and the contract-driven architecture. Thank you for the reference mention of Schemata in the book.
https://andrew-jones.medium.com/data-contracts-the-book-out-now-f456f113dfa4
Capital One: Democratizing machine learning
This is an exciting blog post + video interview from Capital One focusing on the people and technology aspects of democratizing machine learning practice across the org. The platform approach to enabling citizen machine learning engineers is a great perspective when building both data and ML platforms.
https://medium.com/capital-one-tech/democratizing-machine-learning-5041f5605b67
Alibaba: The Thinking and Design of a Quasi-Real-Time Data Warehouse with Stream and Batch Integration
Time interval data processing is the foundation of data engineering, regardless of whether it’s batch or real-time. Architectural patterns like Lambda Architecture and Kappa Architecture emerged to bridge the gap between real-time and batch data processing, but each pattern has its limitations. The author discusses the limitations of these architectural patterns and walks us through whether we can establish a quasi-real-time data warehouse.
ByteDance: ByteDance Open Sources Its Cloud-Native Data Warehouse ByConity
It is amazing to see how many companies and open-source databases branch out of ClickHouse. ByteDance open-sourced ByConity, a scalable cloud-native data warehouse built on top of ClickHouse, using FoundationDB for its metadata storage. The blog narrates the optimization and cluster architecture changes ByteDance made on top of ClickHouse.
Sponsored: Atlan AI – The First Copilot for Data Teams
How do you create documentation for 1000s of data assets in minutes? Write SQL queries without learning SQL? Find the right, trusted data by simply asking a question? If you’re searching for an answer, it’s Atlan AI.
Atlan AI leverages metadata that Atlan captures across the data stack to make AI part of your data stack. Now, you can get hours back by letting Atlan AI draft your documentation and write your queries. And you won’t have to ask 3 different people in your team just to find the right, trusted data — you just need your AI copilot.
Watch the Atlan AI demo and join the Atlan AI waitlist →
Razorpay: Reducing Data Platform Cost by $2M
Cost is top of mind for everyone, but achieving $2M in yearly savings is no easy engineering feat. Razorpay writes about its high-level data platform architecture and the optimization strategies it used to reduce platform expenses. I like the simple principle behind the cost-saving mechanism:
Reduce, Remove, Replace and Reuse
https://engineering.razorpay.com/reducing-data-platform-cost-by-2m-d8f82285c4ae
Jack Pullikottil: LLMs, Columnar Lineage & Data Residency
I’ve been thinking deeply about the impact of Large Language Models on the data warehouse. Data warehouses are fundamentally information retrieval systems where humans seek knowledge. So all the data modeling techniques are tuned toward the human understanding of the data and its relationships. LLMs turn out to be better at information retrieval than humans (in theory, as the promise holds bright). It made me wonder: why can’t we start modeling data so that machines understand it better, so we can build a better information retrieval system?
The author narrates a similar thought, examining the role of columnar lineage and data residency in the era of LLMs.
https://medium.com/@moving-the-needle/llms-columnar-lineage-data-residency-ae06f0418170
Sponsored: [New Report] The State of Data Products, 2023 Edition
Data trust is on every data team’s mind, but how do you create and maintain it? To help answer those questions, we surveyed over 200 data teams to benchmark data product adoption rates and how data leaders can improve them. Access the guide today to see how your peers are boosting data adoption and maximizing the return on their data products.
Microsoft: Automating data analytics with ChatGPT
Continuing our conversation on LLMs & data analytics, Microsoft’s Data Science group starts with a perfect question:
What if we could teach ChatGPT to leverage such tools and the thought process behind them to analyze problems within specific domains, particularly business analytics?
The author builds agents acting as a data engineer and a data scientist, each with a prompt template, to demonstrate automating data analytics with ChatGPT.
https://medium.com/data-science-at-microsoft/automating-data-analytics-with-chatgpt-827a51eaa2c
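The agent-with-a-prompt-template idea can be sketched minimally in Python. Note that the role descriptions, template text, and function names below are hypothetical illustrations of the pattern, not Microsoft's actual prompts:

```python
# Minimal sketch of role-based prompt templates for a data-analytics
# "agent" workflow. All template text here is a hypothetical example.

DATA_ENGINEER_TEMPLATE = (
    "You are a data engineer. Given the schema below, write a SQL query "
    "that answers the business question.\n"
    "Schema: {schema}\n"
    "Question: {question}"
)

DATA_SCIENTIST_TEMPLATE = (
    "You are a data scientist. Given the query result below, summarize "
    "the key findings and suggest a follow-up analysis.\n"
    "Result: {result}"
)

def build_prompt(template: str, **fields: str) -> str:
    """Fill a role template with the concrete task details."""
    return template.format(**fields)

prompt = build_prompt(
    DATA_ENGINEER_TEMPLATE,
    schema="orders(order_id, customer_id, amount, created_at)",
    question="What was the total order amount last month?",
)
```

The filled prompt would then be sent to the LLM, and its answer fed into the next role's template, chaining the two "agents" together.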
Square: Lessons Learned From Running Web Experiments: Unveiling key strategies & Frameworks
The Ecosystem Discovery team at Square has shared their insights into effective website testing, focusing on their use of data-driven experimentation for optimization. Key learnings include creating a metric hierarchy and trade-off matrix for clearer decision-making, ensuring accurate segmentation of visitor data, and phasing A/B test traffic to manage risk. The team also emphasized the importance of internal collaboration, investment in automated testing, documentation of best practices, and constructively handling negative test results.
https://developer.squareup.com/blog/lessons-learned-from-running-web-experiments/
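One of the strategies Square highlights, phasing A/B test traffic, is commonly implemented with deterministic hash-based bucketing. The sketch below is a generic illustration of that idea; the function name and ramp percentages are hypothetical, not Square's implementation:

```python
import hashlib

def in_experiment(visitor_id: str, experiment: str, traffic_pct: float) -> bool:
    """Deterministically assign a visitor to an experiment at a given
    traffic percentage. Hashing (experiment, visitor) gives each visitor
    a stable bucket in [0, 1], so raising traffic_pct only adds visitors
    and never flips anyone already admitted."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    return bucket < traffic_pct

# Phase the rollout: 1% -> 10% -> 50% of traffic.
phases = [0.01, 0.10, 0.50]
admitted = [in_experiment("visitor-123", "new-checkout", p) for p in phases]
```

Because assignment is a pure function of the visitor and experiment IDs, no assignment table is needed, and each ramp-up phase is just a config change to `traffic_pct`.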
Sponsored: Introducing RudderStack Profiles: Easily Build Complete Customer Profiles in Your Warehouse
Now every team can build a customer 360 in Snowflake with RudderStack Profiles. The new product is a data unification tool that handles the heavy lifting of identity resolution for you, streamlines user feature development, and automatically builds a customer 360 table so that you can ship high-impact projects faster.
With Profiles, you can create features using pre-defined projects in the user interface or a version-controlled config file, and all of the queries and computations are taken care of automatically.
Ryan Blue: The CDC MERGE pattern
The author explains a similar problem in Tabular/Iceberg and the solution to work around it. TBH, I can’t fully grasp the solution yet, since it links to previous articles with more context.
If anyone understands how “MERGE INTO” works on Iceberg, Hudi & Delta Lake, please ping me; we would love to have you on our podcast to discuss it and spread the understanding.
https://tabular.medium.com/the-cdc-merge-pattern-b6f8b564177a
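The core of the CDC merge pattern — collapse the change log to the latest change per key, then upsert or delete into the target — can be sketched in plain Python. This is a generic illustration of the pattern only, not Tabular's Iceberg implementation:

```python
# Generic sketch of the CDC MERGE pattern: deduplicate a change log to
# the latest change per key (by sequence number), then apply inserts,
# updates, and deletes to a target table represented as a dict.

def merge_cdc(target: dict, changes: list) -> dict:
    """target: {key: row}; changes: [(seq, key, op, row)] where op is
    "I" (insert), "U" (update), or "D" (delete)."""
    latest = {}
    for seq, key, op, row in sorted(changes, key=lambda c: c[0]):
        latest[key] = (op, row)  # later sequence numbers win
    for key, (op, row) in latest.items():
        if op == "D":
            target.pop(key, None)
        else:  # insert and update both collapse into an upsert
            target[key] = row
    return target

target = {1: {"name": "a"}, 2: {"name": "b"}}
changes = [
    (10, 1, "U", {"name": "a2"}),
    (11, 3, "I", {"name": "c"}),
    (12, 2, "D", None),
    (13, 1, "U", {"name": "a3"}),  # supersedes seq 10 for key 1
]
merged = merge_cdc(target, changes)
# merged: {1: {"name": "a3"}, 3: {"name": "c"}}
```

The subtleties the article wrestles with live in the "latest change per key" step: at warehouse scale that deduplication and the upsert itself are what `MERGE INTO` has to do efficiently over immutable files.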
Grab: Zero traffic cost for Kafka consumers
Cross-AZ network traffic is often the most expensive part of operating Kafka infrastructure. Producer cost can be reduced by ensuring the producer writes to brokers in the same AZ, but consumer reads still introduce cross-AZ network traffic. Kafka 2.3 introduced the ability for consumers to fetch from partition replicas, which opens the door to a more cost-efficient design. Grab writes about its Kafka consumer design to overcome cross-AZ network traffic and the impact of the change.
https://engineering.grab.com/zero-traffic-cost
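The feature referenced here is KIP-392 (fetch from the closest replica). The broker and consumer settings below are the real configuration knobs; the rack/AZ names are placeholders:

```properties
# Broker: tag each broker with its AZ and enable rack-aware replica selection
broker.rack=ap-southeast-1a
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

# Consumer: declare which rack/AZ this consumer runs in, so fetches
# are served by an in-AZ replica instead of a cross-AZ leader
client.rack=ap-southeast-1a
```

With both set, a consumer in `ap-southeast-1a` reads from a follower in its own AZ whenever one exists, eliminating the cross-AZ read path.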
Adevinta: Six tried and tested ways to turbocharge Databricks SQL
Regardless of the advancements in lakehouse systems, an expert in the loop in the form of a data engineer remains inevitable. Adevinta writes about tuning Databricks SQL workloads by optimizing the storage layout, adopting the Delta Lake format, using serverless SQL effectively, applying a small-file compaction strategy, and more.
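Small-file compaction, one of the techniques mentioned, is essentially bin-packing many small files into target-sized ones. In Delta Lake this is handled by the `OPTIMIZE` command; the greedy planner below is a simplified, hypothetical sketch of the underlying idea, not Databricks code:

```python
def plan_compaction(file_sizes: list, target_bytes: int) -> list:
    """Greedily group small files into batches of roughly target_bytes,
    so each batch can be rewritten as one larger file. Returns a list
    of batches, each a list of file sizes."""
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes):
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

# 300 one-megabyte files compacted toward a 128 MB target file size
sizes = [1_000_000] * 300
plan = plan_compaction(sizes, 128_000_000)
```

Fewer, larger files mean fewer file-open and listing operations per query, which is where most of the small-file tax comes from.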
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.