Data Engineering Weekly #143

The Weekly Data Engineering Newsletter

Aug 21, 2023

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles, so you can quickly ship actionable, enriched data to every downstream team. See how it works today.

Editor’s Note: DewCon.ai - October 12, Bengaluru, India Update

Hey folks! 📣 Exciting news! We've received 30+ talk submissions and have confirmed all our speakers for the conference. 🎤 And guess what? We've given our conference website a fresh look. 🌐

Tickets are selling fast; use the code DATAHERO for a special discount. 🎟️ Oh, and if your company's thinking of bulk booking, drop an email to ananth@dataengineeringweekly.com to get some awesome discounts. 📩

Looking forward to seeing you all! 👋🙂

Register Now →

Chip Huyen: Open challenges in LLM research

2023 is the year the best minds and the most money are going into improving LLMs. Though the promise of LLMs is great, many operational challenges remain open. The author gives an excellent overview of the open challenges in LLM research as of now.

[Figure: Timeline of advances of the three major methods in photonic matrix multiplication]

Reduce and measure hallucinations
Optimize context length and context construction
Incorporate other data modalities
Make LLMs faster and cheaper
Design a new model architecture
Develop GPU alternatives
Make agents usable
Improve learning from human preference
Improve the efficiency of the chat interface
Build LLMs for non-English languages

https://huyenchip.com/2023/08/16/llm-research-open-challenges.html

Alibaba: How Generative AI Can Revolutionize Data Engineering

How generative AI can revolutionize data engineering is the most asked and debated question in the data engineering space. We’ve seen text-to-SQL generators, Gen-AI SDKs, auto-generated documentation, etc. The blog narrates the potential impact of LLMs on each stage of the data warehouse.

https://www.alibabacloud.com/blog/how-generative-ai-can-revolutionize-data-engineering_600290

Microsoft: Fundamentals of building with LLMs: Question & answer on any document with ChatGPT in 30 lines of code!

Though LLMs are hotly debated, many companies are still trying to figure out how LLMs can improve their product experience. We see a flood of how-tos on LLMs, LangChain, and other related technologies. I found the article from Microsoft to be a good summary of building a Q&A system for any document.

https://medium.com/data-science-at-microsoft/fundamentals-of-building-with-llms-question-answer-on-any-document-with-chatgpt-in-30-lines-of-9f0d436baff1

Instacart: Supercharging ML/AI Foundations at Instacart

Instacart wrote two back-to-back blogs this week about its vision for ML/AI and its usage of dbt.

Most of these ML models were trained either on laptops or custom infrastructure developed within each team with no common patterns, and sometimes it took more than a month to put a model into production. In early 2021, we built an in-house ML platform to enable our teams to construct, deploy, serve, and manage ML models and features at scale, ensuring their efficacy, dependability, and security throughout the ML lifecycle.

The blog is an excellent reminder to start somewhere small, then standardize as a platform and improve.

https://tech.instacart.com/supercharging-ml-ai-foundations-at-instacart-d48214a2b511

Instacart: Adopting dbt as the Data Transformation Tool at Instacart

Instacart writes about its adoption of dbt as the data transformation tool. The integration story follows a recurring design pattern in which adapter code parses the manifest.json file and constructs dynamic DAGs in Airflow.

I’m curious to know whether the DAGs are generated dynamically at runtime or generated at compile time and checked in as code. I’ve seen dynamic DAG generation make a system pretty unstable and switched to generating DAGs at compile time.
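To make the manifest-parsing pattern concrete, here is a minimal sketch in Python. It is not Instacart's implementation. It assumes only the documented manifest.json layout, where each entry under "nodes" carries a "resource_type" and a "depends_on.nodes" list. The sample manifest, the project name "shop", and the function names are hypothetical. It stops at computing a run order, which a compile-time generator could then write out as a checked-in DAG file.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def model_dependencies(manifest: dict) -> dict[str, set[str]]:
    """Map each dbt model's unique_id to the set of models it depends on."""
    nodes = manifest.get("nodes", {})
    models = {
        uid for uid, node in nodes.items()
        if node.get("resource_type") == "model"
    }
    deps = {}
    for uid in models:
        upstream = nodes[uid].get("depends_on", {}).get("nodes", [])
        # Keep only model-to-model edges; sources and seeds are not tasks here.
        deps[uid] = {u for u in upstream if u in models}
    return deps


def run_order(deps: dict[str, set[str]]) -> list[str]:
    """Return the models in an order where upstream models come first."""
    return list(TopologicalSorter(deps).static_order())


# Hypothetical, heavily trimmed manifest for illustration only.
SAMPLE_MANIFEST = {
    "nodes": {
        "model.shop.stg_orders": {
            "resource_type": "model",
            "depends_on": {"nodes": ["source.shop.raw_orders"]},
        },
        "model.shop.orders": {
            "resource_type": "model",
            "depends_on": {"nodes": ["model.shop.stg_orders"]},
        },
        "model.shop.daily_revenue": {
            "resource_type": "model",
            "depends_on": {"nodes": ["model.shop.orders"]},
        },
    }
}

if __name__ == "__main__":
    for task in run_order(model_dependencies(SAMPLE_MANIFEST)):
        print(task)
```

Running this step in CI and committing the result is one way to get the compile-time behavior described above, since the scheduler then never parses the manifest itself.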

https://tech.instacart.com/adopting-dbt-as-the-data-transformation-tool-at-instacart-36c74bc407df

LanceDB: Vector similarity search with duckdb

DuckDB is truly becoming the de facto federated query engine for analytics. The recent addition of the Postgres Scanner looks very promising. I’m looking forward to DuckDB integration with Lakehouse systems like Hudi, Iceberg, and Delta Lake. The blog narrates how to use Postgres’s pgvector to run a vector similarity search using DuckDB.
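For readers new to the idea, a vector similarity search ranks stored embeddings by how close they are to a query embedding. The sketch below is a brute-force version in plain Python using cosine similarity. It does not use DuckDB, pgvector, or LanceDB, and the function names and toy vectors are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query: list[float], items: dict[str, list[float]], k: int = 2):
    """Score every stored vector against the query and keep the k best."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in items.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy two-dimensional "embeddings"; real ones have hundreds of dimensions.
    embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    for name, score in top_k([1.0, 0.0], embeddings):
        print(name, round(score, 3))
```

A dedicated vector index avoids scoring every row, which is where purpose-built extensions and engines come in.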

https://blog.lancedb.com/vector-similarity-search-with-duckdb-44dec043532a

Spotify: Experimentation at Spotify: Three Lessons for Maximizing Impact in Innovation

In the digital realm, experimentation isn't just a strategy; it's the lifeblood of innovation and the compass that guides success.

Spotify shares three guiding principles for using experimentation more effectively and achieving a more tangible business impact in a mature product context:

Start with the decision that needs to be made.
Utilize localization to innovate for homogeneous populations.
Break the feature apart into its most critical pieces.

https://engineering.atspotify.com/2023/08/experimentation-at-spotify-three-lessons-for-maximizing-impact-in-innovation/

Grab: Streamlining Grab's Segmentation Platform with faster creation and lower latency.

User segmentation divides a company's audience into groups based on shared characteristics, behaviors, or needs. These groups, or segments, enable businesses to deliver more tailored marketing and product experiences. Companies can achieve greater engagement and conversion rates by understanding and targeting specific segments.

Grab writes about its user segmentation platform architecture, focusing on segment creation, segment serving, and the challenges of running at scale.

https://engineering.grab.com/streamlining-grabs-segmentation-platform

Sarah Krasnik Bedell: Guide to Anonymous Identity Resolution

Speaking of user segmentation, one of the key challenges in segment creation is identity resolution, especially with anonymous user visits.

The author narrates the current challenges with identity resolution and potential steps to build robust systems to help the user segmentation process.
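As a rough illustration of what stitching anonymous visits to known users can look like, here is a minimal sketch in Python. It is not the method from the article. It assumes each event is a hypothetical (anonymous_id, user_id) pair, where user_id is None until the visitor logs in, and it groups identifiers with a simple union-find.

```python
def resolve_identities(events):
    """Group anonymous and known IDs that were ever seen together on an event."""
    parent = {}

    def find(x):
        # Register unseen nodes, then walk to the root with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for anonymous_id, user_id in events:
        find(("anon", anonymous_id))  # make sure the anonymous ID is tracked
        if user_id is not None:
            # A logged-in event links the anonymous ID to the known user.
            union(("user", user_id), ("anon", anonymous_id))

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


if __name__ == "__main__":
    sample_events = [
        ("a1", None),   # anonymous visit
        ("a1", "u1"),   # same browser later logs in as u1
        ("a2", "u1"),   # u1 logs in from a second device
        ("a3", None),   # never identified
    ]
    for group in resolve_identities(sample_events):
        print(sorted(group))
```

Real systems also have to handle shared devices and conflicting links, which a plain merge like this does not address.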

https://sarahsnewsletter.substack.com/p/guide-to-anonymous-identity-resolution

Lyft: Where’s My Data — A Unique Encounter with Flink Streaming’s Kinesis Connector

Everything breaks at scale, which makes scale a unique place to learn how to operate a system. Lyft shares one such case of scaling challenges and uncertainty with Flink Streaming’s Kinesis connector. The chapter-by-chapter explanation of the system's behavior is an exciting read.

[Figure: The Flink Kinesis consumer reading from Kinesis, writing to a downstream buffer through an emit queue, and emitting watermarks]

https://eng.lyft.com/wheres-my-data-a-unique-encounter-with-flink-streaming-s-kinesis-connector-6da3b11b164a

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
