Data Engineering Weekly #128
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Editor’s Note: 🚀🌐🎉 Calling all Data Engineering Weekly Readers! 🎉🌐🚀
Data Engineering Weekly is joining forces with RudderStack and The Data Stack Show to bring you The State of Data Engineering Survey 2023! 📊
⏱️Got 5 minutes? ⏱️ Lend us a hand and share your valuable insights on:
🎯 Data team priorities for 2023
👥 Team dynamics
🛠️ Data stacks
🔍 Identity resolution strategies
📚 Data team roles
Please help us create a comprehensive report that will be featured in Data Engineering Weekly 🗞️. Plus, I'll hop on a special episode of The Data Stack Show to discuss the results 🎙️.
And the cherry on top? 🍒 We'll send exclusive The Data Stack Show swag just for participating! 🎁
Take the survey now! ➡️ RudderStack.com/survey ⬅️
✍🏼 dbt: The next big step forwards for analytics engineering
A lot of exciting announcements from dbt this week on the next big step forwards for analytics engineering. Yes, dbt brings the native experience of data contracts and domain ownership into the data transformation layer. I applaud the dbt team for this step, since this is the community's first opinionated take on data transformation that I’ve seen in a long time. Considering dbt’s large community and widespread adoption, they are well positioned, and arguably obligated, to show the path toward a productive data transformation process.
I believe we have reached a point in the industry where we no longer need to explain what a data contract is and why it is beneficial. However, data contracts should start with the developers and the data practitioners during the data creation process.
I open-sourced Schemata last year, the industry's first Data Contract as Code (DCC) framework. We are working on some exciting data contract implementations on top of Schemata. I’d love to show you what we are working on and get initial feedback before open-sourcing the solution. Please add your contact details, and I’d love to chat with you all. Together, let's build a state-of-the-art data contract system. 💪🏽
Click the link to connect 🖇️ https://forms.gle/WT4AQsUPFLpMNCyk7
✍🏼Replit: How to train your own Large Language Models
Replit writes about building your own Large Language Model using Databricks, Hugging Face, and MosaicML. The highlight of the blog for me is:
LLMs require an immense amount of data to train. Training them requires building robust data pipelines that are highly optimized yet flexible enough to easily include new public and proprietary data sources.
As LLM adoption increases, so does the need for robust data pipelines. It is an excellent time to be in data engineering.
✍🏼Medium: Building a ChatGPT Plugin for Medium
OpenAI recently announced support for plugins in ChatGPT. Content platforms like Medium can expose their content to ChatGPT; based on the user prompt and installed plugins, ChatGPT can trigger the correct API of your plugin to retrieve a piece of content and manipulate it. Medium writes about how to build a ChatGPT plugin and debug and deploy the application.
✍🏼Dr. Varshita Sher: A Gentle Intro to Chaining LLMs, Agents, and Utils via LangChain
As I continued my quest to learn more about LLMs and their infrastructure, I found that the author explains LLMs, Agents, and Utils via LangChain pretty well. The LangChain documentation on the types of chains is an excellent read to get to know the LLM chain pattern more in-depth.
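At its core, the chain pattern is just function composition over prompt templates: each step formats a prompt from the previous step's output and sends it to the model. A minimal, library-free sketch of the idea (the `fake_llm` function and the prompt templates are hypothetical stand-ins, not LangChain's actual API):

```python
# Minimal sketch of the LLM "chain" pattern: each step formats a prompt
# from the previous step's output and calls the model. `fake_llm` is a
# hypothetical stand-in for a real hosted-model call.
def fake_llm(prompt: str) -> str:
    # A real implementation would call an LLM API here.
    return f"<answer to: {prompt}>"

def make_chain(*prompt_templates):
    """Compose prompt templates into a single callable chain."""
    def chain(user_input: str) -> str:
        text = user_input
        for template in prompt_templates:
            prompt = template.format(input=text)  # inject prior output
            text = fake_llm(prompt)               # becomes next step's input
        return text
    return chain

summarize_then_translate = make_chain(
    "Summarize the following article: {input}",
    "Translate this summary to French: {input}",
)
result = summarize_then_translate("LangChain makes chaining easy.")
```

Frameworks like LangChain add prompt management, memory, and agent tooling on top, but the underlying control flow is this simple.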
Sponsored: [Virtual Data Panel] Measuring Data Team ROI
As data leaders, one of our top priorities is to measure ROI. From tracking the efficacy of marketing campaigns to understanding the root cause of new spikes in user engagement, we’re tasked with keeping tabs on the business's health at all levels. But what about the ROI of our own teams? Watch a panel of data leaders as they discuss how to build strategies for measuring data team ROI.
✍🏼Instacart: Building a Flink Self-Serve Platform on Kubernetes at Scale
One of the significant challenges of running an EMR cluster is that it is often isolated from the rest of the organization's computing infrastructure. Running data engineering workloads on Kubernetes, even in a one-pod-per-node model, makes it easier to share the organization's tools and infrastructure. The author writes about how Instacart built Flink as a Service on top of Kubernetes instead of EMR.
✍🏼Capital One Tech: Data Profiler - Data Drift Model Monitoring Tool
Data drift refers to changes in the statistical characteristics of a metric over time, such as its average, how spread out its values are, and how often they occur. Capital One open-sourced Data Profiler, a standalone library to detect data drift. In this blog, the author expands on Data Profiler and narrates how it integrates with Kubeflow to build Data Drift Detection as a Service.
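To make the idea concrete, here is a minimal, library-free sketch of statistical drift detection: compare a new window's mean against a baseline window and flag a drift when the shift exceeds a threshold measured in baseline standard deviations. This is only an illustration of the concept; Data Profiler performs far richer profiling than this, and the threshold and sample data are assumptions:

```python
# Minimal sketch of mean-shift drift detection. The 2-sigma threshold
# and the sample windows below are illustrative assumptions.
import statistics

def detect_drift(baseline, current, threshold=2.0):
    """Return True if the current window's mean drifted from the baseline."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline) or 1e-9  # guard against zero std
    shift = abs(statistics.mean(current) - base_mean) / base_std
    return shift > threshold

baseline = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]   # historical window
stable   = [10.1, 9.9, 10.4, 10.0]              # similar distribution
drifted  = [15.0, 16.2, 14.8, 15.5]             # clearly shifted
```

A production system would track many such statistics (variance, null rates, cardinality) per column, which is exactly what tools like Data Profiler automate.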
Sponsored: Warehouse-first analytics and experimentation with RudderStack and Eppo
Find out how Phantom transitioned from siloed analytics to a warehouse-first stack that enables A/B experimentation directly on top of the data warehouse. You'll learn from Eppo founder Chetan Sharma, RudderStack DevRel leader Sara Mashfej, and Phantom Senior Data Engineer Ricardo Pinho.
✍🏼Data Engineering at Adyen
For the first time, Adyen reveals essential aspects of data engineering, providing insight into how the role fits seamlessly within the larger data landscape. The blog delves into the diverse components of the job, emphasizing the importance of teamwork and collaboration in attaining success in this domain.
✍🏼Wix: A Comprehensive Approach to Efficient Data Engineering
Wix publishes videos from its Wix Data Engineering meetups (or should we call them low-key conferences?) that focus on optimizing Spark with Iceberg, ensuring data quality with Great Expectations, and elevating code review practices using game theory.
✍🏼Razorpay: Real-Time Denormalized Data Streaming Platform
One challenge of sourcing a change stream from the operational store is that we have to denormalize it at some point. The best place to do the denormalization is at the source itself; otherwise, we will pay for expensive real-time streaming joins to denormalize the structure. Razorpay writes about the challenges of denormalization in real time and the lessons learned along the way.
🔗Part 1: https://engineering.razorpay.com/real-time-denormalized-data-streaming-platform-part-1-9f3c730dd9c6
🔗Part 2: https://engineering.razorpay.com/real-time-denormalized-data-streaming-platform-part-2-97dfff40fd8d
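The "denormalize at the source" idea can be sketched in a few lines: when the producer emits a change event for a row, it embeds the related lookup fields directly, so downstream consumers never need a streaming join. The tables, field names, and data below are hypothetical, not Razorpay's schema:

```python
# Sketch of denormalizing at the source: the change-event producer
# enriches an order row with customer fields before emitting it.
# All table contents and field names here are hypothetical.
customers = {42: {"name": "Asha", "tier": "gold"}}  # source-side lookup table

def emit_order_change(order_row):
    """Produce a denormalized change event from a normalized order row."""
    customer = customers[order_row["customer_id"]]
    return {
        "order_id": order_row["order_id"],
        "amount": order_row["amount"],
        # Embedded at the source -- no downstream streaming join required.
        "customer_name": customer["name"],
        "customer_tier": customer["tier"],
    }

event = emit_order_change({"order_id": 7, "customer_id": 42, "amount": 99.0})
```

The trade-off is that the producer takes on the lookup cost and the event grows larger, in exchange for sparing every consumer a stateful join.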
✍🏼SEEK: 10 Lessons from Building an Experimentation Platform
Experimentation platforms bring their own challenges, since the experimentation pipeline is usually the last pipeline to run and aggregates metrics from everything upstream. :-) The author shares the top 10 lessons from building an experimentation platform.
Ensure data is fit for purpose
Keep complex data transformations outside the experimentation platform
Start with simple statistical methodology
Recognise that methods for understanding anomalies in the data are more important than advanced statistical techniques
Consider the impacts of outliers / extreme observations
Focus on early stopping techniques before variance reduction
Consider scalability right from the start
Use task parallelisation wherever possible
Consider both scheduled and ad hoc analysis use cases — but don’t build one system for both
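The task-parallelisation lesson above is easy to illustrate: per-metric computations in an experiment analysis are typically independent, so they can run concurrently instead of sequentially. A minimal sketch using Python's standard thread pool (the metric names and the toy `compute_metric` function are hypothetical stand-ins for expensive aggregation queries):

```python
# Sketch of task parallelisation for experiment analysis: independent
# per-metric computations run in a thread pool. `compute_metric` is a
# hypothetical stand-in for an expensive per-metric aggregation query.
from concurrent.futures import ThreadPoolExecutor

def compute_metric(name: str) -> tuple[str, float]:
    # A real implementation would run the metric's aggregation here.
    return name, float(len(name))

metrics = ["conversion", "revenue", "retention"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() preserves input order while executing tasks concurrently.
    results = dict(pool.map(compute_metric, metrics))
```

For I/O-bound aggregation queries a thread pool suffices; CPU-bound statistics would call for a process pool or a distributed scheduler instead.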
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.