Data Engineering Weekly Is Brought to You by RudderStack
RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles, so you can quickly ship actionable, enriched data to every downstream team. See how it works today.
Editor’s Note: DewCon.ai - October 12, Bengaluru, India Update
Hey folks! 📣 Exciting news! We've received 30+ talk submissions and have confirmed all our speakers for the conference. 🎤 And guess what? We've given our conference website a fresh look. 🌐
Tickets are selling fast; use the code DATAHERO for a special discount. 🎟️ Oh, and if your company's thinking of bulk booking, drop an email to ananth@dataengineeringweekly.com to get some awesome discounts. 📩
Looking forward to seeing you all! 👋🙂
Chip Huyen: Open challenges in LLM research
2023 is the year we see the best minds and the money being spent on improving LLMs. Though the promise of LLMs is great, many operational challenges remain open. The author gives an excellent overview of the open challenges in LLM research as of now.
Reduce and measure hallucinations
Optimize context length and context construction
Incorporate other data modalities
Make LLMs faster and cheaper
Design a new model architecture
Develop GPU alternatives
Make agents usable
Improve learning from human preference
Improve the efficiency of the chat interface
Build LLMs for non-English languages
https://huyenchip.com/2023/08/16/llm-research-open-challenges.html
Alibaba: How Generative AI Can Revolutionize Data Engineering
How Gen-AI can revolutionize data engineering is one of the most asked and debated questions in the data engineering space. We’ve seen text-to-SQL generators, Gen-AI SDKs, auto-generated documentation, etc. The blog narrates the potential impact of LLMs on each stage of the data warehouse.
https://www.alibabacloud.com/blog/how-generative-ai-can-revolutionize-data-engineering_600290
Microsoft: Fundamentals of building with LLMs: Question & answer on any document with ChatGPT in 30 lines of code!
Though LLMs are hotly debated, many companies are still trying to figure out how LLMs can help improve their product experience. We see a flood of how-tos on LLMs, LangChain, and other related technologies. I found the article from Microsoft to be a good summary of building a Q&A system for any document.
Sponsored: Great Data Debate–The State of Data Mesh
Since 2019, the data mesh has woven itself into every blog post, event presentation, and webinar. But 4 years later, in 2023 — where has the data mesh gotten us? Does its promise of a decentralized dreamland hold true?
Atlan is bringing together data leaders like Abhinav Sivasailam (CEO, Levers Labs), Barr Moses (Co-founder & CEO, Monte Carlo), Scott Hirleman (Founder & CEO, Data Mesh Understanding), Teresa Tung (Cloud First Chief Technologist, Accenture), Tristan Handy (Founder & CEO, dbt Labs), Prukalpa Sankar (Co-founder, Atlan), and more at the next edition of the Great Data Debate to discuss the state of data mesh – tech toolkit and cultural shift required to implement data mesh.
Watch the Recording of the Great Data Debate →
Instacart: Supercharging ML/AI Foundations at Instacart
Instacart published two back-to-back blogs this week about its vision for ML/AI and its usage of dbt.
Most of these ML models were trained either on laptops or custom infrastructure developed within each team with no common patterns, and sometimes it took more than a month to put a model into production. In early 2021, we built an in-house ML platform to enable our teams to construct, deploy, serve, and manage ML models and features at scale, ensuring their efficacy, dependability, and security throughout the ML lifecycle.
The blog is an excellent reminder to start somewhere small, then standardize as a platform and improve.
https://tech.instacart.com/supercharging-ml-ai-foundations-at-instacart-d48214a2b511
Instacart: Adopting dbt as the Data Transformation Tool at Instacart
Instacart writes about its adoption of dbt as the data transformation tool. The integration story follows a recurring design pattern in which adapter code parses the manifest.json file and constructs dynamic DAGs in Airflow.
I’m curious to know if the DAGs are generated dynamically at runtime or generated at compile time and checked in as code. I’ve seen pretty unstable systems with dynamic DAG generation and switched to DAG generation at compile time.
https://tech.instacart.com/adopting-dbt-as-the-data-transformation-tool-at-instacart-36c74bc407df
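To make the manifest-parsing pattern concrete, here is a minimal sketch of the compile-time half: it reads the `nodes` section of a dbt manifest, keeps only model-to-model edges, and produces a run order that a generator script could turn into a checked-in Airflow DAG file. This is not Instacart's implementation; the `SAMPLE_MANIFEST` fragment and the function names are hypothetical, and the Airflow rendering step is deliberately left out.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def model_dependencies(manifest: dict) -> dict:
    """Map each dbt model ID to the set of upstream model IDs it depends on."""
    nodes = manifest.get("nodes", {})
    models = {
        node_id
        for node_id, node in nodes.items()
        if node.get("resource_type") == "model"
    }
    deps = {}
    for node_id in models:
        upstream = nodes[node_id].get("depends_on", {}).get("nodes", [])
        # Keep only model-to-model edges; sources, seeds, and tests are ignored here.
        deps[node_id] = {u for u in upstream if u in models}
    return deps


def execution_order(manifest: dict) -> list:
    """Return the models in an order where every model follows its upstream models."""
    return list(TopologicalSorter(model_dependencies(manifest)).static_order())


# Hypothetical manifest fragment, shaped like the "nodes" section of manifest.json.
SAMPLE_MANIFEST = {
    "nodes": {
        "model.shop.stg_orders": {
            "resource_type": "model",
            "depends_on": {"nodes": ["source.shop.raw.orders"]},
        },
        "model.shop.stg_users": {
            "resource_type": "model",
            "depends_on": {"nodes": []},
        },
        "model.shop.orders_by_user": {
            "resource_type": "model",
            "depends_on": {"nodes": ["model.shop.stg_orders", "model.shop.stg_users"]},
        },
        "test.shop.not_null_orders_id": {
            "resource_type": "test",
            "depends_on": {"nodes": ["model.shop.stg_orders"]},
        },
    }
}

for model_id in execution_order(SAMPLE_MANIFEST):
    print(model_id)
```

In practice the manifest would be loaded from dbt's `target/manifest.json` with `json.load`. Running this step in CI and committing the generated DAG file is one way to get the compile-time behavior described above, since the scheduler then never parses the manifest itself.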
Sponsored: You're invited to IMPACT - The Data Observability Summit | November 8, 2023
Interested in learning how some of the best teams achieve data & AI reliability at scale? Learn from today's top data leaders and architects at The Data Observability Summit on how to build more trustworthy and reliable data & AI products with the latest technologies, processes, and strategies shaping our industry (yes, LLMs will be on the table).
LanceDB: Vector similarity search with duckdb
DuckDB is truly becoming the de facto federated query engine for analytics. The recent addition of the Postgres Scanner looks very promising. I’m looking forward to DuckDB integrations with lakehouse systems like Hudi, Iceberg, and Delta Lake. The blog narrates how to use Postgres’s pgvector to run a vector similarity search using DuckDB.
https://blog.lancedb.com/vector-similarity-search-with-duckdb-44dec043532a
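For readers new to the idea, a vector similarity search ranks stored embeddings by how close they are to a query embedding. The sketch below is a brute-force, plain-Python illustration of that computation using cosine similarity. It does not use the DuckDB, pgvector, or LanceDB APIs, and the embeddings and names (`EMBEDDINGS`, `top_k`) are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, embeddings, k=2):
    """Return the k (key, score) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in embeddings.items()]
    scored.sort(reverse=True)
    return [(key, score) for score, key in scored[:k]]


# Hypothetical two-dimensional embeddings; real ones have hundreds of dimensions.
EMBEDDINGS = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [1.0, 1.0],
}

for key, score in top_k([1.0, 0.1], EMBEDDINGS):
    print(key, round(score, 3))
```

A database-backed search does the same ranking, usually with an index so that it does not have to score every stored vector.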
Spotify: Experimentation at Spotify: Three Lessons for Maximizing Impact in Innovation
In the digital realm, experimentation isn't just a strategy; it's the lifeblood of innovation and the compass that guides success.
Spotify shares three basic guiding principles for using experimentation more effectively and achieving a more tangible business impact in a mature context:
Start with the decision that needs to be made.
Utilize localization to innovate for homogeneous populations.
Break the feature apart into its most critical pieces.
Sponsored: Webinar: Unlock AI-driven personalization with RudderStack & Snowflake
August 30, Join Wyze’s Director of Data Engineering, Wei Zhou, and Senior Data Scientist, Pei Guo, to learn how they’re using RudderStack and Snowflake to collect clean, comprehensive data, quickly model it into an identity graph and customer 360 tables, then make that data available to their AI team for modeling directly inside of Snowflake’s Data Cloud.
Grab: Streamlining Grab's Segmentation Platform with faster creation and lower latency.
User segmentation divides a company's audience into groups based on shared characteristics, behaviors, or needs. These groups, or segments, enable businesses to deliver more tailored marketing and product experiences. Companies can achieve greater engagement and conversion rates by understanding and targeting specific segments.
Grab writes about its user segmentation platform architecture, focusing on segment creation & segment serving and challenges in running at scale.
https://engineering.grab.com/streamlining-grabs-segmentation-platform
Sarah Krasnik Bedell: Guide to Anonymous Identity Resolution
Speaking of user segmentation, one of the key challenges in segment creation is identity resolution, especially with anonymous user visits.
The author narrates the current challenges with identity resolution and potential steps to build robust systems to help the user segmentation process.
https://sarahsnewsletter.substack.com/p/guide-to-anonymous-identity-resolution
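As a rough illustration of the problem, the sketch below stitches anonymous visitor IDs to known user IDs with a union-find structure: whenever an event carries both an anonymous ID and a user ID, the two are merged into one profile. This is one common deterministic approach and is not taken from the article. The `IdentityGraph` class, the ID prefixes, and the sample events are all hypothetical.

```python
class IdentityGraph:
    """Union-find over identifiers; each connected group is treated as one person."""

    def __init__(self):
        self._parent = {}

    def _find(self, identifier):
        # Register unseen identifiers as their own root.
        self._parent.setdefault(identifier, identifier)
        root = identifier
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the walked path at the root.
        while self._parent[identifier] != root:
            self._parent[identifier], identifier = root, self._parent[identifier]
        return root

    def observe(self, identifier):
        """Record an identifier even if it is never linked to anything."""
        self._find(identifier)

    def link(self, id_a, id_b):
        """Declare that two identifiers belong to the same person."""
        root_a, root_b = self._find(id_a), self._find(id_b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_person(self, id_a, id_b):
        return self._find(id_a) == self._find(id_b)

    def profiles(self):
        """Return the resolved groups of identifiers, one set per person."""
        groups = {}
        for identifier in list(self._parent):
            groups.setdefault(self._find(identifier), set()).add(identifier)
        return list(groups.values())


# Hypothetical event stream of (anonymous_id, user_id); user_id is None before login.
EVENTS = [
    ("anon-1", None),
    ("anon-1", "user-42"),  # anon-1 logs in
    ("anon-2", None),       # never logs in, stays its own profile
    ("anon-3", "user-42"),  # same user on another device
]

graph = IdentityGraph()
for anonymous_id, user_id in EVENTS:
    graph.observe("anon:" + anonymous_id)
    if user_id is not None:
        graph.link("anon:" + anonymous_id, "user:" + user_id)

print(graph.profiles())
```

Real systems also have to handle shared devices, logouts, and probabilistic matches, none of which this sketch attempts.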
Lyft: Where’s My Data — A Unique Encounter with Flink Streaming’s Kinesis Connector
Everything breaks at scale, and scale is a unique place to learn how to operate a system. Lyft shares one such case of scaling challenges and uncertainty with Flink Streaming’s Kinesis connector. The chapter-by-chapter explanation of the system's behavior is an exciting read.
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.