Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Editor’s Note: 🚀🌐🎉 Calling all Data Engineering Weekly Readers! 🎉🌐🚀
Data Engineering Weekly is joining forces with RudderStack and The Data Stack Show to bring you The State of Data Engineering Survey 2023! 📊
⏱️Got 5 minutes? ⏱️ Lend us a hand and share your valuable insights on:
🎯 Data team priorities for 2023
👥 Team dynamics
🛠️ Data stacks
🔍 Identity resolution strategies
📚 Data team roles
Please help us create a comprehensive report that will be featured in Data Engineering Weekly 🗞️. Plus, I'll hop on a special episode of The Data Stack Show to discuss the results 🎙️.
And the cherry on top? 🍒 We'll send you exclusive swag from The Data Stack Show just for participating! 🎁
Take the survey now! ➡️ RudderStack.com/survey ⬅️
Chip Huyen: Building LLM applications for production
The article is one of the best reads of 2023 for me. I printed it out and read it a couple of times. The blog narrates the difference between deterministic and non-deterministic system design and the challenges of building LLM applications in production. The author explains the difference between prompting and fine-tuning and when to choose which.
https://huyenchip.com/2023/04/11/llm-engineering.html
The section on flow control in LLM applications is an exciting read, and I expect a generalized programming model to emerge soon. The speed of innovation in this space is mind-blowing; case in point, I recently came across Microsoft Semantic Kernel.
Microsoft Semantic Kernel is a lightweight SDK that lets you easily mix conventional programming languages with the latest in Large Language Model (LLM) AI "prompts," with templating, chaining, and planning capabilities out of the box.
https://learn.microsoft.com/en-us/semantic-kernel/whatissk
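To make the templating-and-chaining idea concrete, here is a minimal sketch in plain Python. It illustrates the pattern Semantic Kernel packages up, not Semantic Kernel's actual API; `call_llm` is a hypothetical stand-in for whatever completion endpoint you use.

```python
# A minimal sketch of prompt templating and chaining, the pattern Semantic
# Kernel packages up. `call_llm` is a hypothetical stand-in, not Semantic
# Kernel's actual interface.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in your provider's completion API."""
    raise NotImplementedError

def render(template: str, **variables: str) -> str:
    """Fill a prompt template with named variables."""
    return template.format(**variables)

def chain(text: str, templates: list[str]) -> str:
    """Feed each step's output into the next step's prompt."""
    for template in templates:
        text = call_llm(render(template, input=text))
    return text

summarize_then_extract = [
    "Summarize the following meeting notes:\n{input}",
    "Extract a bulleted list of action items from this summary:\n{input}",
]
# result = chain(raw_notes, summarize_then_extract)
```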
Matt Palmer: Hot Takes on the Modern Data Stack
Wow 🙌🏼 I applaud 👏🏽 the author’s thought process in writing these hot takes, and they are indeed hot 😊 The hot takes are:
dbt lacks some basic functionality expected of a best-in-class tool.
Deploying a “production” data warehouse is unnecessarily hard and gated by tribal knowledge.
Redshift is no longer a true competitor in the warehouse space.
Airflow is obsolete.
Airbyte is not production-grade software.
#1 On dbt: Yes, I completely agree with the author’s take that it lacks basic functionality expected of a best-in-class tool. For me personally, the fact that dbt doesn’t have a backfill mechanism, offers no easy option to build a date-time partitioned table, and always requires relying on Airflow as a scheduler engine is 🤯 (a sketch of this workaround follows after the link below). Benn Stancil recently wrote about the dbt dilemma as peacetime dbt vs. wartime dbt in the article, How dbt succeeds.
#2 On warehouse deployment: I agree; permissions are a mess. However, as data stack consolidation accelerates, we will see much better permission models from Snowflake and Databricks.
#3 Yes, AWS, please #SaveRedshift
#4 On Airflow: hmmm, I’m still trying to figure this one out. Do I like Airflow? No, but my current state of mind is that a known devil is better than an unknown one.
#5 On Airbyte: I’ve not tried it yet, but I’ve heard similar sentiments from other folks.
https://mattpalmer.io/posts/hot-takes/
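To make take #1 concrete, here is a minimal sketch of the workaround in question: because dbt has no native backfill, Airflow becomes the scheduler that replays date partitions. The DAG shape is standard Airflow; the model name and `run_date` variable are hypothetical.

```python
# A minimal sketch: Airflow, not dbt, owns the backfill. `catchup=True`
# replays one dbt run per historical date partition. The `orders_daily`
# model and `run_date` var are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_daily_orders",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=True,  # Airflow replays historical partitions for us
) as dag:
    run_dbt = BashOperator(
        task_id="dbt_run_orders",
        # Airflow templates {{ ds }} to the logical date, which the dbt
        # model then uses as a partition filter.
        bash_command=(
            "dbt run --select orders_daily "
            '--vars \'{run_date: "{{ ds }}"}\''
        ),
    )
```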
DoorDash: Using Metrics Layer to Standardize and Scale Experimentation at DoorDash
It is exciting to see wide adoption of the metrics layer and implementation studies from various companies. DoorDash writes about its journey building a metrics layer for experimentation. One of the things that stands out for me in the article is the debate around pre-aggregation vs. lazy aggregation. While reading Airbnb’s Minerva blog about XRF [eXecutive Reporting Framework], which pre-aggregates metrics, I couldn’t stop wondering how expensive such a system would be.
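To make the pre-aggregation vs. lazy aggregation debate concrete, here is a toy contrast on a hypothetical experiment-events table: pre-aggregation pays compute and storage up front, while lazy aggregation pays at query time but stays flexible.

```python
# Toy contrast of the two aggregation strategies; the table is hypothetical.
import pandas as pd

events = pd.DataFrame({
    "experiment": ["exp_1", "exp_1", "exp_1", "exp_2"],
    "variant": ["control", "treatment", "control", "control"],
    "order_value": [10.0, 12.5, 8.0, 20.0],
})

# Pre-aggregation: materialize one row per (experiment, variant) ahead of
# time; dashboards then read the small rollup instead of raw events.
rollup = (
    events.groupby(["experiment", "variant"])["order_value"]
    .agg(["sum", "count"])
    .reset_index()
)

# Lazy aggregation: keep raw events and compute on demand, so any new cut
# (a different filter or metric) needs no re-materialization.
def metric(df: pd.DataFrame, experiment: str) -> pd.Series:
    subset = df[df["experiment"] == experiment]
    return subset.groupby("variant")["order_value"].mean()

print(rollup)
print(metric(events, "exp_1"))
```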
As someone who contributed early to Apache Pinot, I believe, as the author highlighted in the blog, that Apache Pinot or a Pinot-like system would play a significant role in the metrics layer.
Meta: The Future of the data engineer — Part I
Meta introduces a new term, “Accessible Analytics”: data that is self-describing to the extent that it doesn’t require specialized skills to draw meaningful insights from it.
Meta shares the ever-changing landscape of its data engineering practice. The author narrates how data engineering initially focused on data integration and how it has shifted toward building Accessible Analytics. “Analytics Engineering” is the role often cited for data engineers who do Accessible Analytics. Marketing terms aside, it is fascinating to see how the role changes along with the maturity of an organization.
https://medium.com/@AnalyticsAtMeta/the-future-of-the-data-engineer-part-i-32bd125465be
HelloFresh: Enabling teams access to their data with a low-code ETL tool
HelloFresh writes about a self-service low-code ETL tool that enables analytical teams to import their own data sources. The abstraction builds on a known data model with slowly changing dimensions.
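As a refresher, here is a minimal sketch of the slowly changing dimension (Type 2) pattern such an abstraction builds on: close out the current row when an attribute changes and append a new versioned one. The customer dimension and column names are hypothetical.

```python
# Minimal SCD Type 2 sketch on a hypothetical customer dimension.
import pandas as pd

dim = pd.DataFrame({
    "customer_id": [1],
    "city": ["Berlin"],
    "valid_from": ["2023-01-01"],
    "valid_to": [None],  # None marks the current row
})

def apply_scd2(dim: pd.DataFrame, customer_id: int, city: str, as_of: str) -> pd.DataFrame:
    """Version a changed attribute instead of overwriting it."""
    current = (dim["customer_id"] == customer_id) & dim["valid_to"].isna()
    if current.any() and dim.loc[current, "city"].iloc[0] == city:
        return dim  # no change, nothing to version
    dim = dim.copy()
    dim.loc[current, "valid_to"] = as_of  # close out the old version
    new_row = {"customer_id": customer_id, "city": city,
               "valid_from": as_of, "valid_to": None}
    return pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)

dim = apply_scd2(dim, customer_id=1, city="Hamburg", as_of="2023-05-01")
print(dim)  # two rows: the closed Berlin version and the current Hamburg one
```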
Sponsored: [Virtual Data Panel] Measuring Data Team ROI
As data leaders, one of our top priorities is to measure ROI. From tracking the efficacy of marketing campaigns to understanding the root cause of new spikes in user engagement, we’re tasked with keeping tabs on the business's health at all levels. But what about the ROI of our own teams? Watch a panel of data leaders as they discuss how to build strategies for measuring data team ROI.
Watch On-demand
Niels Claeys: Use dbt and Duckdb instead of Spark in data pipelines
AWS u-24tb1.metal instances offer 24 TiB of memory 💾💾💾, and advances in RDMA are bringing about memory-centric data engineering. DuckDB, a leading in-process columnar database, is an exciting system to watch. The author writes about one such case, using dbt & DuckDB for analytical workloads instead of Apache Spark.
https://medium.com/datamindedbe/use-dbt-and-duckdb-instead-of-spark-in-data-pipelines-9063a31ea2b5
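For a flavor of why this works, here is a minimal sketch: DuckDB runs the aggregation in-process against Parquet, with no cluster to provision. The file path and schema are hypothetical; the dbt-duckdb adapter wires this same engine into dbt models.

```python
# Minimal DuckDB sketch: an analytical aggregation with no cluster.
# 'orders.parquet' and its columns are hypothetical.
import duckdb

con = duckdb.connect()  # in-memory database
result = con.execute(
    """
    SELECT order_date, SUM(amount) AS revenue
    FROM 'orders.parquet'  -- DuckDB scans Parquet files directly
    GROUP BY order_date
    ORDER BY order_date
    """
).df()
print(result)
```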
Jacopo Tagliabue: “Is This You?” Entity Matching in the Modern Data Stack with Large Language Models
Entity resolution is a fundamental challenge in data integration systems. Traditionally, we tried, and still try, to solve it with MDM (Master Data Management) systems. The rise of the CDP (Customer Data Platform) pushed folks away from entity resolution, since use cases like marketing campaigns don’t require strict survivorship rules. The author takes the entity matching problem and describes how GPT-3 works with the Modern Data Stack (MDS).
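For intuition, here is a minimal sketch of LLM-based entity matching in the spirit of the article: serialize two records into a yes/no prompt. The prompt wording and the `call_llm` stand-in are my assumptions, not the author’s exact setup.

```python
# Minimal sketch of pairwise entity matching with an LLM. The prompt and
# `call_llm` are assumptions, not the paper's exact implementation.

def call_llm(prompt: str) -> str:
    """Hypothetical completion call; swap in your LLM provider."""
    raise NotImplementedError

def serialize(record: dict) -> str:
    """Flatten a record into a readable key-value string."""
    return "; ".join(f"{k}: {v}" for k, v in record.items())

def is_match(a: dict, b: dict) -> bool:
    prompt = (
        "Do these two records refer to the same real-world entity? "
        "Answer yes or no.\n"
        f"Record A: {serialize(a)}\n"
        f"Record B: {serialize(b)}\n"
        "Answer:"
    )
    return call_llm(prompt).strip().lower().startswith("yes")

# is_match({"name": "ACME Corp", "city": "NYC"},
#          {"name": "Acme Corporation", "city": "New York"})
```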
I am profoundly interested in Master Data Management, and you can expect more guest articles and interviews in Data Engineering Weekly shortly!!!
Sponsored: Warehouse-first analytics and experimentation with RudderStack and Eppo
Find out how Phantom transitioned from siloed analytics to a warehouse-first stack that enables A/B experimentation directly on top of the data warehouse. You'll learn from Eppo founder Chetan Sharma, RudderStack DevRel leader Sara Mashfej, and Phantom Senior Data Engineer Ricardo Pinho.
Diogo Silva Santos: Why Data Debt Is the Next Technical Debt You Need to Worry About
Data debt is the cost of avoiding or delaying investment in maintaining, updating, or managing data assets, leading to decreased efficiency, increased costs, and potential risks.
An excellent narration of data debt (surprisingly, this is my first time hearing the term). The author walks through some of the root causes of data debt, and the primary one is? You guessed it correctly: “No proper data contracts/collaboration between software and data engineers.”
Shri Salem: Identifying data-driven use cases with a value driver tree
One of the core challenges for a data team is to prove its ROI and measure the team's effectiveness. The author narrates an exciting approach based on the Value Driver Tree (VDT) method. By breaking down a value metric into components and linking data and analytics to these drivers, teams can demonstrate their direct impact on company growth and ensure a focus on high-value initiatives.
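As a toy illustration of the idea (the decomposition and numbers are made up), break revenue into drivers and tie an initiative to the leaf it moves:

```python
# Toy value driver tree: revenue decomposed into drivers, so a data
# initiative's impact can be attributed to the leaf it moves.
# The structure and numbers are hypothetical.

def revenue(visitors: float, conversion_rate: float, avg_order_value: float) -> float:
    # revenue = visitors x conversion rate x average order value
    return visitors * conversion_rate * avg_order_value

baseline = revenue(visitors=100_000, conversion_rate=0.02, avg_order_value=50.0)
# e.g., a recommendation model lifts conversion from 2.0% to 2.2%
with_lift = revenue(visitors=100_000, conversion_rate=0.022, avg_order_value=50.0)

print(f"baseline: ${baseline:,.0f}")                       # $100,000
print(f"with lift: ${with_lift:,.0f}")                     # $110,000
print(f"attributable impact: ${with_lift - baseline:,.0f}")  # $10,000
```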
Spotify: Large-Scale Generation of ML Podcast Previews at Spotify with Google Dataflow
Spotify writes about leveraging Google Dataflow to generate large-scale machine learning-based podcast previews, improving user experience and content discoverability. The system utilizes parallel processing and data partitioning techniques to handle millions of podcast episodes efficiently, enabling scalable and cost-effective podcast preview generation.
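For a sense of the shape of such a pipeline, here is a minimal Apache Beam sketch (Beam is the SDK behind Dataflow). The `generate_preview` transform is a hypothetical stand-in for Spotify’s ML step, not their actual code.

```python
# Minimal Beam sketch: episodes fan out across workers, each producing a
# preview. `generate_preview` is a hypothetical stand-in for the ML model.
import apache_beam as beam

def generate_preview(episode_id: str) -> str:
    """Stand-in for the model that picks a preview clip."""
    return f"{episode_id}: preview at 00:42"

with beam.Pipeline() as pipeline:  # runs on Dataflow with the right options
    (
        pipeline
        | "ReadEpisodes" >> beam.Create(["ep_001", "ep_002", "ep_003"])
        | "GeneratePreview" >> beam.Map(generate_preview)
        | "Print" >> beam.Map(print)
    )
```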
Etsy: Barista: Enabling Greater Flexibility in Machine Learning Model Deployment
Etsy introduces Barista, a system designed to streamline machine learning model deployment, offering improved flexibility, reliability, and efficiency. Barista provides a unified interface for deploying and managing models across various platforms and languages, removing complexities tied to infrastructure and deployment specifics. Barista empowers data scientists and engineers to focus on model development and optimization by automating infrastructure provisioning and deployment workflows. The system also supports versioning, rollback, and monitoring capabilities, ensuring smooth model updates and transitions. With Barista, Etsy aims to reduce time-to-production and enhance the overall Machine Learning lifecycle, ultimately driving business value and innovation.
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.