Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Editor’s Note: 🚀🌐🎉 Calling all Data Engineering Weekly Readers! 🎉🌐🚀
Data Engineering Weekly is joining forces with RudderStack and The Data Stack Show to bring you The State of Data Engineering Survey 2023! 📊
⏱️Got 5 minutes? ⏱️ Lend us a hand and share your valuable insights on:
🎯 Data team priorities for 2023
👥 Team dynamics
🛠️ Data stacks
🔍 Identity resolution strategies
📚 Data team roles
Please help us create a comprehensive report that will be featured in Data Engineering Weekly 🗞️. Plus, I'll hop on a special episode of The Data Stack Show to discuss the results 🎙️.
And the cherry on top? 🍒 We'll send you exclusive swag from The Data Stack Show just for participating! 🎁
Take the survey now! ➡️ RudderStack.com/survey ⬅️
Chip Huyen: Building LLM applications for production
The article is one of the best reads of 2023 for me. I printed it out and read it a couple of times. The blog narrates the difference between deterministic and non-deterministic system design and the challenges of building LLM applications in production. The author explains the difference between prompting and fine-tuning and when to choose which.
https://huyenchip.com/2023/04/11/llm-engineering.html
The section on flow control in LLM applications is an exciting read, and I expect a generalized programming model to emerge soon. The speed of innovation in this space is mind-blowing; case in point, I recently came across Microsoft Semantic Kernel.
Microsoft Semantic Kernel is a lightweight SDK that lets you easily mix conventional programming languages with the latest in Large Language Model (LLM) AI "prompts," with templating, chaining, and planning capabilities out of the box.
https://learn.microsoft.com/en-us/semantic-kernel/whatissk
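To make the templating-and-chaining idea concrete, here is a minimal sketch in plain Python. It illustrates the pattern Semantic Kernel packages up, not Semantic Kernel's actual API; `call_llm` is a hypothetical stand-in for whatever completion endpoint you use.

```python
# A minimal sketch of prompt templating and chaining, the pattern Semantic
# Kernel packages up. `call_llm` is a hypothetical stand-in, not Semantic
# Kernel's actual interface.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in your provider's completion API."""
    raise NotImplementedError

def render(template: str, **variables: str) -> str:
    """Fill a prompt template with named variables."""
    return template.format(**variables)

def chain(text: str, templates: list[str]) -> str:
    """Feed each step's output into the next step's prompt."""
    for template in templates:
        text = call_llm(render(template, input=text))
    return text

summarize_then_extract = [
    "Summarize the following meeting notes:\n{input}",
    "Extract a bulleted list of action items from this summary:\n{input}",
]
# result = chain(raw_notes, summarize_then_extract)
```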
Matt Palmer: Hot Takes on the Modern Data Stack
Wow 🙌🏼 I applaud 👏🏽 the author’s thought process in writing these hot takes, and they are indeed hot 😊 The hot takes are:
dbt lacks some basic functionality expected of a best-in-class tool.
Deploying a “production” data warehouse is unnecessarily hard and gated by tribal knowledge.
Redshift is no longer a true competitor in the warehouse space.
Airflow is obsolete.
Airbyte is not production-grade software.
#1 On dbt: Yes, I completely agree with the author’s take that it lacks basic functionality expected of a best-in-class tool. For me personally, the fact that dbt doesn’t have a backfill mechanism, offers no easy option to build a date-time partitioned table, and always requires relying on Airflow as a scheduler engine is 🤯 (a sketch of this workaround follows after the link below). Benn Stancil recently wrote about the dbt dilemma as peacetime dbt vs. wartime dbt in the article, How dbt succeeds.
#2 On warehouse deployment: I agree; permissions are a mess. However, as data stack consolidation accelerates, we will see much better permission models from Snowflake and Databricks.
#3 Yes, AWS, please #SaveRedshift
#4 On Airflow: hmmm, I’m still trying to figure this one out. Do I like Airflow? No, but my current state of mind is that a known devil is better than an unknown one.
#5 On Airbyte: I’ve not tried it yet, but I’ve heard similar sentiments from other folks.
https://mattpalmer.io/posts/hot-takes/
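To make take #1 concrete, here is a minimal sketch of the workaround in question: because dbt has no native backfill, Airflow becomes the scheduler that replays date partitions. The DAG shape is standard Airflow; the model name and `run_date` variable are hypothetical.

```python
# A minimal sketch: Airflow, not dbt, owns the backfill. `catchup=True`
# replays one dbt run per historical date partition. The `orders_daily`
# model and `run_date` var are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_daily_orders",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=True,  # Airflow replays historical partitions for us
) as dag:
    run_dbt = BashOperator(
        task_id="dbt_run_orders",
        # Airflow templates {{ ds }} to the logical date, which the dbt
        # model then uses as a partition filter.
        bash_command=(
            "dbt run --select orders_daily "
            '--vars \'{run_date: "{{ ds }}"}\''
        ),
    )
```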
DoorDash: Using Metrics Layer to Standardize and Scale Experimentation at DoorDash
It is exciting to see wide adoption of the metrics layer and implementation studies from various companies. DoorDash writes about its journey building a metrics layer for experimentation. One of the things that stands out for me in the article is the debate around pre-aggregation vs. lazy aggregation. While reading Airbnb’s Minerva blog about XRF [eXecutive Reporting Framework], which pre-aggregates metrics, I couldn’t stop wondering how expensive such a system would be.
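To make the pre-aggregation vs. lazy aggregation debate concrete, here is a toy contrast on a hypothetical experiment-events table: pre-aggregation pays compute and storage up front, while lazy aggregation pays at query time but stays flexible.

```python
# Toy contrast of the two aggregation strategies; the table is hypothetical.
import pandas as pd

events = pd.DataFrame({
    "experiment": ["exp_1", "exp_1", "exp_1", "exp_2"],
    "variant": ["control", "treatment", "control", "control"],
    "order_value": [10.0, 12.5, 8.0, 20.0],
})

# Pre-aggregation: materialize one row per (experiment, variant) ahead of
# time; dashboards then read the small rollup instead of raw events.
rollup = (
    events.groupby(["experiment", "variant"])["order_value"]
    .agg(["sum", "count"])
    .reset_index()
)

# Lazy aggregation: keep raw events and compute on demand, so any new cut
# (a different filter or metric) needs no re-materialization.
def metric(df: pd.DataFrame, experiment: str) -> pd.Series:
    subset = df[df["experiment"] == experiment]
    return subset.groupby("variant")["order_value"].mean()

print(rollup)
print(metric(events, "exp_1"))
```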
As someone who contributed early to Apache Pinot, I believe, as the author highlighted in the blog, that Apache Pinot or a Pinot-like system would play a significant role in the metrics layer.
Meta: The Future of the data engineer — Part I
Meta introduces a new term, “Accessible Analytics”: data that is self-describing to the extent that it doesn’t require specialized skills to draw meaningful insights from it.
Meta shares the ever-changing landscape of its data engineering practice. The author narrates how data engineering initially focused on data integration and how it has shifted toward building Accessible Analytics. “Analytics Engineering” is the role often cited for data engineers who do Accessible Analytics. Marketing terms aside, it is fascinating to see how the role changes along with the maturity of an organization.
https://medium.com/@AnalyticsAtMeta/the-future-of-the-data-engineer-part-i-32bd125465be
HelloFresh: Enabling teams access to their data with a low-code ETL tool
HelloFresh writes about a self-service low-code ETL tool that enables analytical teams to import their own data sources. The abstraction builds on a known data model with slowly changing dimensions.
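As a refresher, here is a minimal sketch of the slowly changing dimension (Type 2) pattern such an abstraction builds on: close out the current row when an attribute changes and append a new versioned one. The customer dimension and column names are hypothetical.

```python
# Minimal SCD Type 2 sketch on a hypothetical customer dimension.
import pandas as pd

dim = pd.DataFrame({
    "customer_id": [1],
    "city": ["Berlin"],
    "valid_from": ["2023-01-01"],
    "valid_to": [None],  # None marks the current row
})

def apply_scd2(dim: pd.DataFrame, customer_id: int, city: str, as_of: str) -> pd.DataFrame:
    """Version a changed attribute instead of overwriting it."""
    current = (dim["customer_id"] == customer_id) & dim["valid_to"].isna()
    if current.any() and dim.loc[current, "city"].iloc[0] == city:
        return dim  # no change, nothing to version
    dim = dim.copy()
    dim.loc[current, "valid_to"] = as_of  # close out the old version
    new_row = {"customer_id": customer_id, "city": city,
               "valid_from": as_of, "valid_to": None}
    return pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)

dim = apply_scd2(dim, customer_id=1, city="Hamburg", as_of="2023-05-01")
print(dim)  # two rows: the closed Berlin version and the current Hamburg one
```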
Sponsored: [Virtual Data Panel] Measuring Data Team ROI
As data leaders, one of our top priorities is to measure ROI. From tracking the efficacy of marketing campaigns to understanding the root cause of new spikes in user engagement, we’re tasked with keeping tabs on the business's health at all levels. But what about the ROI of our own teams? Watch a panel of data leaders as they discuss how to build strategies for measuring data team ROI.
Watch On-demand
Niels Claeys: Use dbt and Duckdb instead of Spark in data pipelines
AWS u-24tb1.metal instances offer 24 TiB of memory 💾💾💾, and advances in RDMA are bringing about memory-centric data engineering. DuckDB, a leading in-process columnar database, is an exciting system to watch. The author writes about one such case, using dbt & DuckDB for analytical workloads instead of Apache Spark.
https://medium.com/datamindedbe/use-dbt-and-duckdb-instead-of-spark-in-data-pipelines-9063a31ea2b5
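For a flavor of why this works, here is a minimal sketch: DuckDB runs the aggregation in-process against Parquet, with no cluster to provision. The file path and schema are hypothetical; the dbt-duckdb adapter wires this same engine into dbt models.

```python
# Minimal DuckDB sketch: an analytical aggregation with no cluster.
# 'orders.parquet' and its columns are hypothetical.
import duckdb

con = duckdb.connect()  # in-memory database
result = con.execute(
    """
    SELECT order_date, SUM(amount) AS revenue
    FROM 'orders.parquet'  -- DuckDB scans Parquet files directly
    GROUP BY order_date
    ORDER BY order_date
    """
).df()
print(result)
```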
Jacopo Tagliabue: “Is This You?” Entity Matching in the Modern Data Stack with Large Language Models
Entity resolution is a fundamental challenge in data integration systems. Traditionally, we tried, and still try, to solve it with MDM (Master Data Management) systems. The rise of the CDP (Customer Data Platform) pushed folks away from entity resolution, since use cases like marketing campaigns don’t require strict survivorship rules. The author takes the entity matching problem and describes how GPT-3 works with the Modern Data Stack (MDS).
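For intuition, here is a minimal sketch of LLM-based entity matching in the spirit of the article: serialize two records into a yes/no prompt. The prompt wording and the `call_llm` stand-in are my assumptions, not the author’s exact setup.

```python
# Minimal sketch of pairwise entity matching with an LLM. The prompt and
# `call_llm` are assumptions, not the paper's exact implementation.

def call_llm(prompt: str) -> str:
    """Hypothetical completion call; swap in your LLM provider."""
    raise NotImplementedError

def serialize(record: dict) -> str:
    """Flatten a record into a readable key-value string."""
    return "; ".join(f"{k}: {v}" for k, v in record.items())

def is_match(a: dict, b: dict) -> bool:
    prompt = (
        "Do these two records refer to the same real-world entity? "
        "Answer yes or no.\n"
        f"Record A: {serialize(a)}\n"
        f"Record B: {serialize(b)}\n"
        "Answer:"
    )
    return call_llm(prompt).strip().lower().startswith("yes")

# is_match({"name": "ACME Corp", "city": "NYC"},
#          {"name": "Acme Corporation", "city": "New York"})
```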
I am profoundly interested in Master Data Management, and you can expect more guest articles and interviews in Data Engineering Weekly shortly!!!
Sponsored: Warehouse-first analytics and experimentation with RudderStack and Eppo
Find out how Phantom transitioned from siloed analytics to a warehouse-first stack that enables A/B experimentation directly on top of the data warehouse. You'll learn from Eppo founder Chetan Sharma, RudderStack DevRel leader Sara Mashfej, and Phantom Senior Data Engineer Ricardo Pinho.
Diogo Silva Santos: Why Data Debt Is the Next Technical Debt You Need to Worry About
Data debt is the cost of avoiding or delaying investment in maintaining, updating, or managing data assets, leading to decreased efficiency, increased costs, and potential risks.
An excellent narration of data debt (surprisingly, this is my first time hearing the term). The author walks through some of the root causes of data debt, and the primary one is? You guessed it correctly: “No proper data contracts/collaboration between software and data engineers.”
Shri Salem: Identifying data-driven use cases with a value driver tree
One of the core challenges for a data team is to prove its ROI and measure the team's effectiveness. The author narrates an exciting approach based on the Value Driver Tree (VDT) method. By breaking down a value metric into components and linking data and analytics to these drivers, teams can demonstrate their direct impact on company growth and ensure a focus on high-value initiatives.
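As a toy illustration of the idea (the decomposition and numbers are made up), break revenue into drivers and tie an initiative to the leaf it moves:

```python
# Toy value driver tree: revenue decomposed into drivers, so a data
# initiative's impact can be attributed to the leaf it moves.
# The structure and numbers are hypothetical.

def revenue(visitors: float, conversion_rate: float, avg_order_value: float) -> float:
    # revenue = visitors x conversion rate x average order value
    return visitors * conversion_rate * avg_order_value

baseline = revenue(visitors=100_000, conversion_rate=0.02, avg_order_value=50.0)
# e.g., a recommendation model lifts conversion from 2.0% to 2.2%
with_lift = revenue(visitors=100_000, conversion_rate=0.022, avg_order_value=50.0)

print(f"baseline: ${baseline:,.0f}")                       # $100,000
print(f"with lift: ${with_lift:,.0f}")                     # $110,000
print(f"attributable impact: ${with_lift - baseline:,.0f}")  # $10,000
```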
Spotify: Large-Scale Generation of ML Podcast Previews at Spotify with Google Dataflow
Spotify writes about leveraging Google Dataflow to generate large-scale machine learning-based podcast previews, improving user experience and content discoverability. The system utilizes parallel processing and data partitioning techniques to handle millions of podcast episodes efficiently, enabling scalable and cost-effective podcast preview generation.
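For a sense of the shape of such a pipeline, here is a minimal Apache Beam sketch (Beam is the SDK behind Dataflow). The `generate_preview` transform is a hypothetical stand-in for Spotify’s ML step, not their actual code.

```python
# Minimal Beam sketch: episodes fan out across workers, each producing a
# preview. `generate_preview` is a hypothetical stand-in for the ML model.
import apache_beam as beam

def generate_preview(episode_id: str) -> str:
    """Stand-in for the model that picks a preview clip."""
    return f"{episode_id}: preview at 00:42"

with beam.Pipeline() as pipeline:  # runs on Dataflow with the right options
    (
        pipeline
        | "ReadEpisodes" >> beam.Create(["ep_001", "ep_002", "ep_003"])
        | "GeneratePreview" >> beam.Map(generate_preview)
        | "Print" >> beam.Map(print)
    )
```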
Etsy: Barista: Enabling Greater Flexibility in Machine Learning Model Deployment
Etsy introduces Barista, a system designed to streamline machine learning model deployment, offering improved flexibility, reliability, and efficiency. Barista provides a unified interface for deploying and managing models across various platforms and languages, removing complexities tied to infrastructure and deployment specifics. Barista empowers data scientists and engineers to focus on model development and optimization by automating infrastructure provisioning and deployment workflows. The system also supports versioning, rollback, and monitoring capabilities, ensuring smooth model updates and transitions. With Barista, Etsy aims to reduce time-to-production and enhance the overall Machine Learning lifecycle, ultimately driving business value and innovation.
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.