Data Engineering Weekly Is Brought to You by RudderStack
RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles, so you can quickly ship actionable, enriched data to every downstream team. See how it works today.
Editors’ Note: Update on DewCon | Data Eng + AI Conference in Bengaluru - October 12th, 2023
We announced the call for speakers only last week, and we are overwhelmed by the response. The call for speakers will be open until the 21st of July. We will start communicating with the selected speakers early next week, so stay tuned!
Submit Your Conference Talk Proposal Here
A few of our friends asked me and Aswin what motivated us to launch DewCon. Over the last year, many talented data founders contacted me to discuss their startups and challenges. I even angel-invested in a couple of companies. At the same time, I noticed there was no forum connecting data practitioners, data founders, decision makers at leading multinational companies, and venture capitalists. DewCon is an attempt to create a forum that facilitates healthy collaboration across all parts of the data community.
DewCon Sponsorship
We are seeking sponsors for the DewCon 2023 edition to help lower ticket prices for data practitioners, so that as many developers as possible can attend and participate in the forum. If you’re interested in sponsoring DewCon, please get in touch with us by filling out the form. Every premium conference speaker will get exclusive coverage in Data Engineering Weekly, so your reach is not limited to one geographical area.
Hello DewCon Sponsors, Please Click Here and Enter Your Details. We will reach out to you ASAP.
Andrew Jones: Data Contracts - the book. Out now
Congrats to Andrew Jones on the brand-new book on Data Contracts - the topic is very close to my heart. I briefly reviewed the book, and it looks like a solid one. I especially liked the chapters on how to get adoption in your organization, a sample implementation, and the contract-driven architecture. Thank you for the reference mention of Schemata in the book.
https://andrew-jones.medium.com/data-contracts-the-book-out-now-f456f113dfa4
Capital One: Democratizing machine learning
This is an exciting blog post + video interview from Capital One focusing on the people and technology aspects of democratizing machine learning practice across the org. The platform approach to enabling citizen machine learning engineers is a great perspective when building both data and ML platforms.
https://medium.com/capital-one-tech/democratizing-machine-learning-5041f5605b67
Alibaba: The Thinking and Design of a Quasi-Real-Time Data Warehouse with Stream and Batch Integration
Time interval data processing is the foundation of data engineering, regardless of whether it’s batch or real-time. Architectural patterns like Lambda Architecture and Kappa Architecture emerged to bridge the gap between real-time and batch data processing, but each pattern has its limitations. The author discusses the limitations of these architectural patterns and walks us through whether we can establish a quasi-real-time data warehouse.
ByteDance: ByteDance Open Sources Its Cloud-Native Data Warehouse ByConity
It is amazing to see how many companies and open-source databases branch out of ClickHouse. ByteDance open-sourced ByConity, a scalable cloud-native data warehouse built on top of ClickHouse, using FoundationDB for its metadata storage. The blog narrates the optimization and cluster architecture changes ByteDance made on top of ClickHouse.
Sponsored: Atlan AI – The First Copilot for Data Teams
How do you create documentation for 1000s of data assets in minutes? Write SQL queries without learning SQL? Find the right, trusted data by simply asking a question? If you’re searching for an answer, it’s Atlan AI.
Atlan AI leverages metadata that Atlan captures across the data stack to make AI part of your data stack. Now, you can get hours back by letting Atlan AI draft your documentation and write your queries. And you won’t have to ask 3 different people in your team just to find the right, trusted data — you just need your AI copilot.
Watch the Atlan AI demo and join the Atlan AI waitlist →
Razorpay: Reducing Data Platform Cost by $2M
Cost is top of mind for everyone, but achieving $2M in yearly savings is no easy engineering feat. Razorpay writes about its high-level data platform architecture and the optimization strategies it used to reduce platform expenses. I like the simple principle behind the cost-saving mechanism:
Reduce, Remove, Replace and Reuse
https://engineering.razorpay.com/reducing-data-platform-cost-by-2m-d8f82285c4ae
Jack Pullikottil: LLMs, Columnar Lineage & Data Residency
I’ve been thinking deeply about the impact of Large Language Models on the data warehouse. Data warehouses are fundamentally information retrieval systems where humans seek knowledge. So all the data modeling techniques are tuned toward the human understanding of the data and its relationships. LLMs turn out to be better at information retrieval than humans (in theory, as the promise holds bright). It made me wonder: why can’t we start modeling data so that machines understand it better, so we can build a better information retrieval system?
The author narrates a similar thought, examining the role of columnar lineage and data residency in the era of LLMs.
https://medium.com/@moving-the-needle/llms-columnar-lineage-data-residency-ae06f0418170
Sponsored: [New Report] The State of Data Products, 2023 Edition
Data trust is on every data team’s mind, but how do you create and maintain it? To help answer those questions, we surveyed over 200 data teams to benchmark data product adoption rates and how data leaders can improve them. Access the guide today to see how your peers are boosting data adoption and maximizing the return on their data products.
Microsoft: Automating data analytics with ChatGPT
Continuing our conversation on LLMs & data analytics, Microsoft’s Data Science group starts with a perfect question:
What if we could teach ChatGPT to leverage such tools and the thought process behind them to analyze problems within specific domains, particularly business analytics?
The author builds agents acting as a data engineer and a data scientist, each with a prompt template, to demonstrate automating data analytics with ChatGPT.
https://medium.com/data-science-at-microsoft/automating-data-analytics-with-chatgpt-827a51eaa2c
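The agent-with-a-prompt-template idea can be sketched minimally in Python. Note that the role descriptions, template text, and function names below are hypothetical illustrations of the pattern, not Microsoft's actual prompts:

```python
# Minimal sketch of role-based prompt templates for a data-analytics
# "agent" workflow. All template text here is a hypothetical example.

DATA_ENGINEER_TEMPLATE = (
    "You are a data engineer. Given the schema below, write a SQL query "
    "that answers the business question.\n"
    "Schema: {schema}\n"
    "Question: {question}"
)

DATA_SCIENTIST_TEMPLATE = (
    "You are a data scientist. Given the query result below, summarize "
    "the key findings and suggest a follow-up analysis.\n"
    "Result: {result}"
)

def build_prompt(template: str, **fields: str) -> str:
    """Fill a role template with the concrete task details."""
    return template.format(**fields)

prompt = build_prompt(
    DATA_ENGINEER_TEMPLATE,
    schema="orders(order_id, customer_id, amount, created_at)",
    question="What was the total order amount last month?",
)
```

The filled prompt would then be sent to the LLM, and its answer fed into the next role's template, chaining the two "agents" together.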
Square: Lessons Learned From Running Web Experiments: Unveiling key strategies & Frameworks
The Ecosystem Discovery team at Square has shared their insights into effective website testing, focusing on their use of data-driven experimentation for optimization. Key learnings include creating a metric hierarchy and trade-off matrix for clearer decision-making, ensuring accurate segmentation of visitor data, and phasing A/B test traffic to manage risk. The team also emphasized the importance of internal collaboration, investment in automated testing, documentation of best practices, and constructively handling negative test results.
https://developer.squareup.com/blog/lessons-learned-from-running-web-experiments/
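One of the strategies Square highlights, phasing A/B test traffic, is commonly implemented with deterministic hash-based bucketing. The sketch below is a generic illustration of that idea; the function name and ramp percentages are hypothetical, not Square's implementation:

```python
import hashlib

def in_experiment(visitor_id: str, experiment: str, traffic_pct: float) -> bool:
    """Deterministically assign a visitor to an experiment at a given
    traffic percentage. Hashing (experiment, visitor) gives each visitor
    a stable bucket in [0, 1], so raising traffic_pct only adds visitors
    and never flips anyone already admitted."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    return bucket < traffic_pct

# Phase the rollout: 1% -> 10% -> 50% of traffic.
phases = [0.01, 0.10, 0.50]
admitted = [in_experiment("visitor-123", "new-checkout", p) for p in phases]
```

Because assignment is a pure function of the visitor and experiment IDs, no assignment table is needed, and each ramp-up phase is just a config change to `traffic_pct`.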
Sponsored: Introducing RudderStack Profiles: Easily Build Complete Customer Profiles in Your Warehouse
Now every team can build a customer 360 in Snowflake with RudderStack Profiles. The new product is a data unification tool that handles the heavy lifting of identity resolution for you, streamlines user feature development, and automatically builds a customer 360 table so that you can ship high-impact projects faster.
With Profiles, you can create features using pre-defined projects in the user interface or a version-controlled config file, and all of the queries and computations are taken care of automatically.
Ryan Blue: The CDC MERGE pattern
The author explains a similar problem in Tabular/Iceberg and the solution to work around it. TBH, I can’t fully grasp the solution yet, since it links to previous articles with more context.
If anyone understands how “MERGE INTO” works on Iceberg, Hudi & Delta Lake, please ping me; we would love to have you on our podcast to discuss it and spread the understanding.
https://tabular.medium.com/the-cdc-merge-pattern-b6f8b564177a
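The core of the CDC merge pattern — collapse the change log to the latest change per key, then upsert or delete into the target — can be sketched in plain Python. This is a generic illustration of the pattern only, not Tabular's Iceberg implementation:

```python
# Generic sketch of the CDC MERGE pattern: deduplicate a change log to
# the latest change per key (by sequence number), then apply inserts,
# updates, and deletes to a target table represented as a dict.

def merge_cdc(target: dict, changes: list) -> dict:
    """target: {key: row}; changes: [(seq, key, op, row)] where op is
    "I" (insert), "U" (update), or "D" (delete)."""
    latest = {}
    for seq, key, op, row in sorted(changes, key=lambda c: c[0]):
        latest[key] = (op, row)  # later sequence numbers win
    for key, (op, row) in latest.items():
        if op == "D":
            target.pop(key, None)
        else:  # insert and update both collapse into an upsert
            target[key] = row
    return target

target = {1: {"name": "a"}, 2: {"name": "b"}}
changes = [
    (10, 1, "U", {"name": "a2"}),
    (11, 3, "I", {"name": "c"}),
    (12, 2, "D", None),
    (13, 1, "U", {"name": "a3"}),  # supersedes seq 10 for key 1
]
merged = merge_cdc(target, changes)
# merged: {1: {"name": "a3"}, 3: {"name": "c"}}
```

The subtleties the article wrestles with live in the "latest change per key" step: at warehouse scale that deduplication and the upsert itself are what `MERGE INTO` has to do efficiently over immutable files.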
Grab: Zero traffic cost for Kafka consumers
Cross-AZ network traffic is often the most expensive part of operating Kafka infrastructure. Producer cost can be reduced by ensuring the producer writes to brokers in the same AZ, but consumer reads still introduce cross-AZ network traffic. Kafka 2.3 introduced the ability for consumers to fetch from partition replicas, which opens the door to a more cost-efficient design. Grab writes about its Kafka consumer design to overcome cross-AZ network traffic and the impact of the change.
https://engineering.grab.com/zero-traffic-cost
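The feature referenced here is KIP-392 (fetch from the closest replica). The broker and consumer settings below are the real configuration knobs; the rack/AZ names are placeholders:

```properties
# Broker: tag each broker with its AZ and enable rack-aware replica selection
broker.rack=ap-southeast-1a
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

# Consumer: declare which rack/AZ this consumer runs in, so fetches
# are served by an in-AZ replica instead of a cross-AZ leader
client.rack=ap-southeast-1a
```

With both set, a consumer in `ap-southeast-1a` reads from a follower in its own AZ whenever one exists, eliminating the cross-AZ read path.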
Adevinta: Six tried and tested ways to turbocharge Databricks SQL
Regardless of the advancements in lakehouse systems, an expert in the loop in the form of a data engineer remains inevitable. Adevinta writes about tuning Databricks SQL workloads by optimizing the storage layout, adopting the Delta Lake format, using serverless SQL effectively, applying a small-file compaction strategy, and more.
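Small-file compaction, one of the techniques mentioned, is essentially bin-packing many small files into target-sized ones. In Delta Lake this is handled by the `OPTIMIZE` command; the greedy planner below is a simplified, hypothetical sketch of the underlying idea, not Databricks code:

```python
def plan_compaction(file_sizes: list, target_bytes: int) -> list:
    """Greedily group small files into batches of roughly target_bytes,
    so each batch can be rewritten as one larger file. Returns a list
    of batches, each a list of file sizes."""
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes):
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

# 300 one-megabyte files compacted toward a 128 MB target file size
sizes = [1_000_000] * 300
plan = plan_compaction(sizes, 128_000_000)
```

Fewer, larger files mean fewer file-open and listing operations per query, which is where most of the small-file tax comes from.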
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.