Data Engineering Weekly #128
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Editor’s Note: 🚀🌐🎉 Calling all Data Engineering Weekly Readers! 🎉🌐🚀
Data Engineering Weekly is joining forces with RudderStack and The Data Stack Show to bring you The State of Data Engineering Survey 2023! 📊
⏱️Got 5 minutes? ⏱️ Lend us a hand and share your valuable insights on:
🎯 Data team priorities for 2023
👥 Team dynamics
🛠️ Data stacks
🔍 Identity resolution strategies
📚 Data team roles
Please help us create a comprehensive report that will be featured in Data Engineering Weekly 🗞️. Plus, I'll hop on a special episode of The Data Stack Show to discuss the results 🎙️.
And the cherry on top? 🍒 We'll send exclusive The Data Stack Show swag just for participating! 🎁
Take the survey now! ➡️ RudderStack.com/survey ⬅️
✍🏼 dbt: The next big step forwards for analytics engineering
A lot of exciting announcements from dbt this week on the next big step forwards for analytics engineering. Yes, dbt brings the native experience of data contracts and domain ownership into the data transformation layer. I applaud the dbt team for this step, since this is the community's first opinionated take on data transformation that I’ve seen in a long time. Considering dbt’s large community and widespread adoption, they are well positioned, and arguably obligated, to show the path toward a productive data transformation process.
I believe we have reached a point in the industry where we no longer need to explain what a data contract is and why it is beneficial. However, data contracts should start with the developers and the data practitioners during the data creation process.
I open-sourced Schemata last year, the industry's first Data Contract as Code (DCC) framework. We are working on some exciting data contract implementations on top of Schemata. I’d love to show you what we are working on and get initial feedback before open-sourcing the solution. Please add your contact details, and I’d love to chat with you all. Together, let's build a state-of-the-art data contract system. 💪🏽
Click the link to connect 🖇️ https://forms.gle/WT4AQsUPFLpMNCyk7
✍🏼Replit: How to train your own Large Language Models
Replit writes about building your own Large Language Model using Databricks, Hugging Face, and MosaicML. The highlight of the blog for me is:
LLMs require an immense amount of data to train. Training them requires building robust data pipelines that are highly optimized yet flexible enough to easily include new public and proprietary data sources.
As LLM adoption increases, so does the need for robust data pipelines. It is an excellent time to be in data engineering.
✍🏼Medium: Building a ChatGPT Plugin for Medium
OpenAI recently announced support for plugins in ChatGPT. Content platforms like Medium can expose their content to ChatGPT; based on the user prompt and installed plugins, ChatGPT can trigger the correct API of your plugin to retrieve a piece of content and manipulate it. Medium writes about how to build a ChatGPT plugin and debug and deploy the application.
✍🏼Dr. Varshita Sher: A Gentle Intro to Chaining LLMs, Agents, and Utils via LangChain
As I continued my quest to learn more about LLMs and their infrastructure, I found that the author explains LLMs, Agents, and Utils via LangChain pretty well. The LangChain documentation on the types of chains is an excellent read to get to know the LLM chain pattern more in-depth.
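At its core, the chain pattern is just function composition over prompt templates: each step formats a prompt from the previous step's output and sends it to the model. A minimal, library-free sketch of the idea (the `fake_llm` function and the prompt templates are hypothetical stand-ins, not LangChain's actual API):

```python
# Minimal sketch of the LLM "chain" pattern: each step formats a prompt
# from the previous step's output and calls the model. `fake_llm` is a
# hypothetical stand-in for a real hosted-model call.
def fake_llm(prompt: str) -> str:
    # A real implementation would call an LLM API here.
    return f"<answer to: {prompt}>"

def make_chain(*prompt_templates):
    """Compose prompt templates into a single callable chain."""
    def chain(user_input: str) -> str:
        text = user_input
        for template in prompt_templates:
            prompt = template.format(input=text)  # inject prior output
            text = fake_llm(prompt)               # becomes next step's input
        return text
    return chain

summarize_then_translate = make_chain(
    "Summarize the following article: {input}",
    "Translate this summary to French: {input}",
)
result = summarize_then_translate("LangChain makes chaining easy.")
```

Frameworks like LangChain add prompt management, memory, and agent tooling on top, but the underlying control flow is this simple.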
Sponsored: [Virtual Data Panel] Measuring Data Team ROI
As data leaders, one of our top priorities is to measure ROI. From tracking the efficacy of marketing campaigns to understanding the root cause of new spikes in user engagement, we’re tasked with keeping tabs on the business's health at all levels. But what about the ROI of our own teams? Watch a panel of data leaders as they discuss how to build strategies for measuring data team ROI.
✍🏼Instacart: Building a Flink Self-Serve Platform on Kubernetes at Scale
One of the significant challenges of running an EMR cluster is that it is often isolated from the rest of the organization's computing infrastructure. Running data engineering workloads on Kubernetes, even in a one-pod-per-node model, makes it easier to share the organization's tools and infrastructure. The author writes about how Instacart built Flink as a Service on top of Kubernetes instead of EMR.
✍🏼Capital One Tech: Data Profiler - Data Drift Model Monitoring Tool
Data drift refers to changes in the statistical characteristics of a metric over time, such as its average, how spread out its values are, and how often they occur. Capital One open-sourced Data Profiler, a standalone library to detect data drift. In this blog, the author expands on Data Profiler and narrates how it integrates with Kubeflow to build Data Drift Detection as a Service.
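To make the idea concrete, here is a minimal, library-free sketch of statistical drift detection: compare a new window's mean against a baseline window and flag a drift when the shift exceeds a threshold measured in baseline standard deviations. This is only an illustration of the concept; Data Profiler performs far richer profiling than this, and the threshold and sample data are assumptions:

```python
# Minimal sketch of mean-shift drift detection. The 2-sigma threshold
# and the sample windows below are illustrative assumptions.
import statistics

def detect_drift(baseline, current, threshold=2.0):
    """Return True if the current window's mean drifted from the baseline."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline) or 1e-9  # guard against zero std
    shift = abs(statistics.mean(current) - base_mean) / base_std
    return shift > threshold

baseline = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]   # historical window
stable   = [10.1, 9.9, 10.4, 10.0]              # similar distribution
drifted  = [15.0, 16.2, 14.8, 15.5]             # clearly shifted
```

A production system would track many such statistics (variance, null rates, cardinality) per column, which is exactly what tools like Data Profiler automate.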
Sponsored: Warehouse-first analytics and experimentation with RudderStack and Eppo
Find out how Phantom transitioned from siloed analytics to a warehouse-first stack that enables A/B experimentation directly on top of the data warehouse. You'll learn from Eppo founder Chetan Sharma, RudderStack DevRel leader Sara Mashfej, and Phantom Senior Data Engineer Ricardo Pinho.
✍🏼Data Engineering at Adyen
For the first time, Adyen reveals essential aspects of data engineering, providing insight into how the role fits seamlessly within the larger data landscape. The blog delves into the diverse components of the job, emphasizing the importance of teamwork and collaboration in attaining success in this domain.
✍🏼Wix: A Comprehensive Approach to Efficient Data Engineering
Wix publishes videos from its Wix Data Engineering meetups (or should we call them low-key conferences?) that focus on optimizing Spark with Iceberg, ensuring data quality with Great Expectations, and elevating code review practices using game theory.
✍🏼Razorpay: Real-Time Denormalized Data Streaming Platform
One challenge of sourcing a change stream from the operational store is that we have to denormalize it at some point. The best place to do the denormalization is at the source itself; otherwise, we will pay for expensive real-time streaming joins to denormalize the structure. Razorpay writes about the challenges of denormalization in real time and the lessons learned along the way.
🔗Part 1: https://engineering.razorpay.com/real-time-denormalized-data-streaming-platform-part-1-9f3c730dd9c6
🔗Part 2: https://engineering.razorpay.com/real-time-denormalized-data-streaming-platform-part-2-97dfff40fd8d
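The "denormalize at the source" idea can be sketched in a few lines: when the producer emits a change event for a row, it embeds the related lookup fields directly, so downstream consumers never need a streaming join. The tables, field names, and data below are hypothetical, not Razorpay's schema:

```python
# Sketch of denormalizing at the source: the change-event producer
# enriches an order row with customer fields before emitting it.
# All table contents and field names here are hypothetical.
customers = {42: {"name": "Asha", "tier": "gold"}}  # source-side lookup table

def emit_order_change(order_row):
    """Produce a denormalized change event from a normalized order row."""
    customer = customers[order_row["customer_id"]]
    return {
        "order_id": order_row["order_id"],
        "amount": order_row["amount"],
        # Embedded at the source -- no downstream streaming join required.
        "customer_name": customer["name"],
        "customer_tier": customer["tier"],
    }

event = emit_order_change({"order_id": 7, "customer_id": 42, "amount": 99.0})
```

The trade-off is that the producer takes on the lookup cost and the event grows larger, in exchange for sparing every consumer a stateful join.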
✍🏼SEEK: 10 Lessons from Building an Experimentation Platform
Experimentation platforms bring their own challenges, since the experimentation pipeline is usually the last pipeline to run and aggregates metrics from everything upstream. :-) The author shares the top 10 lessons from building an experimentation platform.
Ensure data is fit for purpose
Keep complex data transformations outside the experimentation platform
Start with simple statistical methodology
Recognise that methods for understanding anomalies in the data are more important than advanced statistical techniques
Consider the impacts of outliers / extreme observations
Focus on early stopping techniques before variance reduction
Consider scalability right from the start
Use task parallelisation wherever possible
Consider both scheduled and ad hoc analysis use cases — but don’t build one system for both
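The task-parallelisation lesson above is easy to illustrate: per-metric computations in an experiment analysis are typically independent, so they can run concurrently instead of sequentially. A minimal sketch using Python's standard thread pool (the metric names and the toy `compute_metric` function are hypothetical stand-ins for expensive aggregation queries):

```python
# Sketch of task parallelisation for experiment analysis: independent
# per-metric computations run in a thread pool. `compute_metric` is a
# hypothetical stand-in for an expensive per-metric aggregation query.
from concurrent.futures import ThreadPoolExecutor

def compute_metric(name: str) -> tuple[str, float]:
    # A real implementation would run the metric's aggregation here.
    return name, float(len(name))

metrics = ["conversion", "revenue", "retention"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() preserves input order while executing tasks concurrently.
    results = dict(pool.map(compute_metric, metrics))
```

For I/O-bound aggregation queries a thread pool suffices; CPU-bound statistics would call for a process pool or a distributed scheduler instead.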
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.