Data Engineering Weekly

Data Engineering Weekly #143

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Aug 21, 2023

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles, so you can quickly ship actionable, enriched data to every downstream team. See how it works today.


Editor’s Note: DewCon.ai - October 12, Bengaluru, India Update

Hey folks! 📣 Exciting news! We've received 30+ talk submissions and have confirmed all our speakers for the conference. 🎤 And guess what? We've given our conference website a fresh look. 🌐

Tickets are selling fast; use the code DATAHERO for a special discount. 🎟️ Oh, and if your company's thinking of bulk booking, drop an email to ananth@dataengineeringweekly.com to get some awesome discounts. 📩

Looking forward to seeing you all! 👋🙂

Register Now →

Chip Huyen: Open challenges in LLM research

2023 is the year the best minds and the most money are going into improving LLMs. Though the promise of LLMs is great, many operational challenges remain open. The author gives an excellent overview of the open challenges in LLM research as of now.

[Figure: Timeline of advances of the three major methods in photonic matrix multiplication]

  1. Reduce and measure hallucinations

  2. Optimize context length and context construction

  3. Incorporate other data modalities

  4. Make LLMs faster and cheaper

  5. Design a new model architecture

  6. Develop GPU alternatives

  7. Make agents usable

  8. Improve learning from human preference

  9. Improve the efficiency of the chat interface

  10. Build LLMs for non-English languages

https://huyenchip.com/2023/08/16/llm-research-open-challenges.html


Alibaba: How Generative AI Can Revolutionize Data Engineering

How generative AI can revolutionize data engineering is one of the most asked and debated questions in the data engineering space. We’ve seen text-to-SQL generators, Gen-AI SDKs, auto-generated documentation, etc. The blog narrates the potential impact of LLMs at each stage of the data warehouse.

https://www.alibabacloud.com/blog/how-generative-ai-can-revolutionize-data-engineering_600290


Microsoft: Fundamentals of building with LLMs: Question & answer on any document with ChatGPT in 30 lines of code!

Though LLMs are hotly debated, many companies are still trying to figure out how LLMs can improve their product experience. We see a flood of how-tos on LLMs, LangChain, and other related technologies. I found the article from Microsoft to be a good summary of building a Q&A system for any document.

https://medium.com/data-science-at-microsoft/fundamentals-of-building-with-llms-question-answer-on-any-document-with-chatgpt-in-30-lines-of-9f0d436baff1


Sponsored: Great Data Debate–The State of Data Mesh

Since 2019, the data mesh has woven itself into every blog post, event presentation, and webinar. But 4 years later, in 2023 — where has the data mesh gotten us? Does its promise of a decentralized dreamland hold true?

Atlan is bringing together data leaders like Abhinav Sivasailam (CEO, Levers Labs), Barr Moses (Co-founder & CEO, Monte Carlo), Scott Hirleman (Founder & CEO, Data Mesh Understanding), Teresa Tung (Cloud First Chief Technologist, Accenture), Tristan Handy (Founder & CEO, dbt Labs), Prukalpa Sankar (Co-founder, Atlan), and more at the next edition of the Great Data Debate to discuss the state of data mesh – tech toolkit and cultural shift required to implement data mesh.

Watch the Recording of the Great Data Debate →


Instacart: Supercharging ML/AI Foundations at Instacart

Instacart published two back-to-back blogs this week about its vision for ML/AI and its usage of dbt.

Most of these ML models were trained either on laptops or custom infrastructure developed within each team with no common patterns, and sometimes it took more than a month to put a model into production. In early 2021, we built an in-house ML platform to enable our teams to construct, deploy, serve, and manage ML models and features at scale, ensuring their efficacy, dependability, and security throughout the ML lifecycle.

The blog is an excellent reminder to start somewhere small, then standardize as a platform and improve.

https://tech.instacart.com/supercharging-ml-ai-foundations-at-instacart-d48214a2b511


Instacart: Adopting dbt as the Data Transformation Tool at Instacart

Instacart writes about its adoption of dbt as the data transformation tool. The integration story follows a recurring design pattern: adapter code parses the manifest.json file and constructs dynamic DAGs in Airflow.

I’m curious to know whether the DAGs are generated dynamically at runtime or generated at compile time and checked in as code. I’ve seen dynamic DAG generation lead to a pretty unstable system and switched to generating DAGs at compile time.
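As a rough illustration of the compile-time approach, here is a minimal sketch that reads a manifest and emits a static, ordered task plan that could be checked in alongside the code. This is not Instacart's implementation. The manifest below is a hypothetical, heavily trimmed example (the `shop` project and its model names are made up), and the `model_dependencies` and `build_plan` helpers are illustrative names. The sketch relies only on the `nodes`, `resource_type`, and `depends_on.nodes` fields of dbt's manifest.json.

```python
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical, trimmed manifest; a real manifest.json carries many more fields.
MANIFEST_JSON = """
{
  "nodes": {
    "model.shop.stg_orders": {
      "resource_type": "model",
      "depends_on": {"nodes": ["source.shop.raw.orders"]}
    },
    "model.shop.stg_users": {
      "resource_type": "model",
      "depends_on": {"nodes": []}
    },
    "model.shop.orders_by_user": {
      "resource_type": "model",
      "depends_on": {"nodes": ["model.shop.stg_orders", "model.shop.stg_users"]}
    }
  }
}
"""


def model_dependencies(manifest: dict) -> dict:
    """Map each model to the set of upstream models it depends on."""
    nodes = manifest["nodes"]
    models = {uid for uid, node in nodes.items() if node["resource_type"] == "model"}
    # Keep only model-to-model edges; sources and seeds are not scheduled as tasks here.
    return {
        uid: {dep for dep in nodes[uid]["depends_on"]["nodes"] if dep in models}
        for uid in sorted(models)
    }


def build_plan(manifest: dict) -> list:
    """Return tasks in an order where every upstream task comes first."""
    deps = model_dependencies(manifest)
    order = TopologicalSorter(deps).static_order()
    return [{"task": uid, "upstream": sorted(deps[uid])} for uid in order]


plan = build_plan(json.loads(MANIFEST_JSON))
for step in plan:
    print(step["task"], "<-", step["upstream"])
```

Writing `plan` to a file during CI and loading that file from the scheduler keeps DAG parsing cheap and makes every change to the graph visible in code review, which is the stability argument for generating DAGs at compile time.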

https://tech.instacart.com/adopting-dbt-as-the-data-transformation-tool-at-instacart-36c74bc407df


Sponsored: You're invited to IMPACT - The Data Observability Summit | November 8, 2023

Interested in learning how some of the best teams achieve data & AI reliability at scale? Learn from today's top data leaders and architects at The Data Observability Summit on how to build more trustworthy and reliable data & AI products with the latest technologies, processes, and strategies shaping our industry (yes, LLMs will be on the table).

RSVP NOW


LanceDB: Vector similarity search with duckdb

DuckDB is truly becoming the de facto federated query engine for analytics. The recent addition of the Postgres Scanner looks very promising. I’m looking forward to DuckDB integration with lakehouse systems like Hudi, Iceberg, and Delta Lake. The blog narrates how to use Postgres’s pgvector to run a vector similarity search using DuckDB.
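For readers new to the idea, the sketch below shows what a vector similarity search computes, as a brute-force scan in plain Python. It deliberately uses no DuckDB, pgvector, or LanceDB API. The `cosine_similarity` and `top_k` helpers and the toy embeddings are assumptions made up for illustration, and real systems typically replace the full scan with an approximate index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query, rows, k=2):
    """Score every stored vector against the query and keep the k best."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in rows.items()]
    scored.sort(reverse=True)
    return [(key, score) for score, key in scored[:k]]


# Toy two-dimensional "embeddings"; real ones have hundreds of dimensions.
rows = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
results = top_k([1.0, 0.0], rows, k=2)
print(results)
```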

https://blog.lancedb.com/vector-similarity-search-with-duckdb-44dec043532a


Spotify: Experimentation at Spotify: Three Lessons for Maximizing Impact in Innovation

In the digital realm, experimentation isn't just a strategy; it's the lifeblood of innovation and the compass that guides success.

Spotify shares three guiding principles for using experimentation more effectively and achieving a more tangible business impact in a mature context.

  1. Start with the decision that needs to be made.

  2. Utilize localization to innovate for homogeneous populations.

  3. Break the feature apart into its most critical pieces.

https://engineering.atspotify.com/2023/08/experimentation-at-spotify-three-lessons-for-maximizing-impact-in-innovation/


Sponsored: Webinar: Unlock AI-driven personalization with RudderStack & Snowflake

On August 30, join Wyze’s Director of Data Engineering, Wei Zhou, and Senior Data Scientist, Pei Guo, to learn how they’re using RudderStack and Snowflake to collect clean, comprehensive data, quickly model it into an identity graph and customer 360 tables, then make that data available to their AI team for modeling directly inside of Snowflake’s Data Cloud.

Register now


Grab: Streamlining Grab's Segmentation Platform with faster creation and lower latency

User segmentation divides a company's audience into groups based on shared characteristics, behaviors, or needs. These groups, or segments, enable businesses to deliver more tailored marketing and product experiences. Companies can achieve greater engagement and conversion rates by understanding and targeting specific segments.

Grab writes about its user segmentation platform architecture, focusing on segment creation, segment serving, and the challenges of running at scale.

https://engineering.grab.com/streamlining-grabs-segmentation-platform


Sarah Krasnik Bedell: Guide to Anonymous Identity Resolution

Speaking of user segmentation, one of the key challenges in segment creation is identity resolution, especially with anonymous user visits.

The author narrates the current challenges with identity resolution and potential steps to build robust systems to help the user segmentation process.
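One common way to stitch anonymous visits to known users is to treat identifiers as nodes in a graph and merge them whenever a single event carries both. The sketch below uses a union-find structure for that. It is an assumption about one possible approach, not the method described in the linked guide, and the `IdentityGraph` class, event tuples, and IDs are all hypothetical.

```python
class IdentityGraph:
    """Union-find over identifiers; linked identifiers share one root."""

    def __init__(self):
        self._parent = {}

    def _find(self, identifier):
        # Register unseen identifiers as their own root.
        self._parent.setdefault(identifier, identifier)
        while self._parent[identifier] != identifier:
            # Path halving keeps later lookups short.
            self._parent[identifier] = self._parent[self._parent[identifier]]
            identifier = self._parent[identifier]
        return identifier

    def add(self, identifier):
        self._find(identifier)

    def link(self, a, b):
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_person(self, a, b):
        return self._find(a) == self._find(b)


# Hypothetical events as (anonymous_id, user_id); user_id is None before login.
events = [
    ("anon-1", None),
    ("anon-1", "user-42"),  # login on the first device
    ("anon-2", "user-42"),  # login on a second device
    ("anon-3", None),       # a visitor who never logged in
]

graph = IdentityGraph()
for anonymous_id, user_id in events:
    graph.add(anonymous_id)
    if user_id is not None:
        graph.link(user_id, anonymous_id)

print(graph.same_person("anon-1", "anon-2"))
print(graph.same_person("anon-1", "anon-3"))
```

Merges in this structure are permanent, so a production system would also need rules for shared devices and for undoing a bad link, which a plain union-find does not provide.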

https://sarahsnewsletter.substack.com/p/guide-to-anonymous-identity-resolution


Lyft: Where’s My Data — A Unique Encounter with Flink Streaming’s Kinesis Connector

Everything breaks at scale, which makes it a unique place to learn how to operate a system. Lyft shares one such case of scaling challenges and uncertainty with Flink Streaming’s Kinesis connector. The chapter-by-chapter explanation of the system's behavior is an exciting read.

[Figure: The Flink Kinesis consumer reading from Kinesis and writing to a downstream buffer through an emit queue and emitting watermarks]

https://eng.lyft.com/wheres-my-data-a-unique-encounter-with-flink-streaming-s-kinesis-connector-6da3b11b164a


All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
