RudderStack is the Warehouse Native CDP, built to help data teams deliver value across the entire data activation lifecycle, from collection to unification and activation. Visit rudderstack.com to learn more.
Editor’s Note: DEWCon Next?
Aswin and I started DEWCon as a fun experiment. Midway through organizing the conference, fear washed over us. Had we bitten off more than we could chew? Was this simply too ambitious? We have no sponsors. Our hope is only with the amazing community of data practitioners who constantly support us.
One thing I learned while writing Data Engineering Weekly is that persistence and consistency are the keys to success. Eventually, we got some amazing sponsors who supported us to cover the venue cost!!! As soon as we opened the registration, we got oversubscribed. We had to close the conference registration five days before because our venue partner told us they couldn't support more people.
The community gave us overwhelming support, and all the questions we got were: when are you going to do next year? With all the encouragement, we thought, how about taking one step further? How about bringing DEWCon to Europe, too?
Like before, We are scared. We have no idea about running a conference in Europe. So, we want your support in understanding the landscape by sharing your insights in the poll. Please add your thoughts if you live in Europe.
If you have any other choice of the city or would like to share your thoughts, please add them to the comments.
Benn Stancil: The problem was the product - How the modern data stack got lost
The week is full of retrospect for the Modern Data Stack :-) The author shares a unique point of view on how the lack of collaboration among the MDS companies leads to its loss of popularity.
Modern data stack vendors chose speed, and never attempted to truly build something together. But most partnerships were temporary and transactional agreements of mutual convenience that went co-sponsored afterparty deep.
It is true. At the same time, in a growth-driven economy, it is simply not possible. If you don’t grow, you don’t exist. So, it is an eventual reality that one steps into another product.
https://benn.substack.com/p/the-problem-was-the-product
Joe Reis: Everything Ends - My Journey With the Modern Data Stack
Joe writes another excellent retrospect for Modern Data Stack, walking down memory lane of the early and golden days of the Modern Data Stack. One can’t deny the role of Redshift in bringing the cloud data warehouse to the masses, starting the end of the Big Data era with Hadoop.
All the retrospect keeps me wondering, So What is Next? We are so over the Big Data Era to Modern Data Stack. What are we stepping into?
https://joereis.substack.com/p/everything-ends-my-journey-with-the
Sponsored: Data modeling and exploration in Playground 2.0
Learn about Cube, the universal semantic layer, in an upcoming technical webinar. See how recent product updates make it easier for developers and data engineers to debug data models, work in a BI-like UI, and fine-tune deployments for better query performance.
Register for our webinar to explore Cube Cloud and learn about the convenient UI for easier data modeling. Elevate your data skills!
https://bit.ly/playground-2-webinar
Mikkel Dengsøe: Data ownership - A practical guide
Data Ownership is the fundamental construct to build reliable data engineering practices. A recent survey from dbt shows that 43% of the respondents answered that ambiguous data ownership was their biggest challenge. The author writes about the basic construct of data ownership and how to utilize dbt to reduce friction.
I believe the data ownership problem is much deeper than simple metadata management. Data Ownership is the fundamental construct for Data Products. In a recent poll, 75% of data practitioners say either they are implementing or are thinking of implementing Data Products.
Yet, there is no tool for the data producers to Define, Design, Collaborate, Test, & Continuously Monitor the Data Products. The current tools are fragmented, which further complicates the Data Producers, leads them to failure in Data Ownership.
https://medium.com/@mikldd/data-ownership-a-practical-guide-ae306d49866f
Abhijeet Rathod: Rust: The Rising Star in Data Engineering 🦀 Embracing the Power of Performance, Safety, and Elegance
We started to see more and more tools, especially around Python rewritten in Rust, with much better performance. uv, the Python package tool alternative to pip and poetry, is the recent addition to the trend. The author summarizes why Rust is gaining momentum in data engineering by pointing out some of the upcoming tools written in Rust.
Sponsored: Data quality best practices - Bridging the dev data divide
"Just like the data team, development teams are under pressure to work quickly and efficiently to accomplish their goals. It’s not like development teams purposely make life difficult for their data team counterparts. They’re just doing their jobs, and their incentives are, by nature, different from yours."
Here, the team at RudderStack looks at the divide between data producers and consumers. They give a clear explanation for why it exists, and they detail four principles you can follow to bridge the gap. The article concludes with a look at data contracts as a concrete example of these principles in practice.
https://www.rudderstack.com/blog/data-quality-best-practices-bridging-the-dev-data-divide/
Noah Kennedy: Reducing BigQuery Costs by 100–200x with dbt Incremental Models
One of the repeating patterns I heard in many of the data engineers is how to set up incremental data processing across the system efficiently. The author writes about how the dbt incremental model helps them reduce the BigQuery cost 100-200X!!
Financial Times: Turning ideas into AI use cases - the Product Manager's point of view
How do we integrate Gen-AI into the existing workflow? What is the real business impactful use case? It is a burning question for every product manager in the world now. The authors share the top 5 lessons learned in turning ideas into AI use cases.
LinkedIn: Building a Large-Scale Recommendation System: People You May Know
The People You May Know feature is probably one of LinkedIn's best growth hacking techniques to build the network effect. The blog narrates how LinkedIn built and scaled a large-scale recommendation system to handle over a billion items while ensuring high relevance and low serving latency.
DoorDash: Experiment Faster and with Less Effort
DoorDash writes about fractional factorial designs to test multiple hypotheses simultaneously, reducing the number of necessary experiments. Fractional factorial design selects a subset of the possible combinations of factors to run as experiments. The research paper Business Policy Experiments using Fractional Factorial Designs: Consumer Retention on DoorDash is an excellent read to dive deep into.
https://doordash.engineering/2024/02/13/experiment-faster-and-with-less-effort/
Expedia: Powering ML Platform Orchestration and Experimentation
Expedia writes about its ML Platform Orchestrator, its design, and some early successes with applying this approach to accelerate ML model experimentation. The design focuses on reusability, ease, and a low-code approach. Every orchestration engine becomes efficient only if it seamlessly integrates with the existing infrastructure. Expedia shares how integrating with its experimentation platform helps productivity gains.
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employer” opinions.