Editor’s Note: Data Council 2025, Apr 22-24, Oakland, CA
Data Council has always been one of my favorite events to connect with and learn from the data engineering community. Data Council 2025 is set for April 22-24 in Oakland, CA. As a special perk for Data Engineering Weekly subscribers, you can use the code dataeng20 for an exclusive 20% discount on tickets!
https://www.datacouncil.ai/bay-2025
Tristan Handy: How AI will disrupt data engineering as we know it
I think it will be hard to compare data engineering in 2024 and data engineering in 2028 and say those are the same things.
Interestingly, I recently shared a similar sentiment with a data team I advise in my spare time. I’m curious to observe how the industry raises its level of abstraction as teams integrate AI tooling into their workflows.
https://www.getdbt.com/blog/how-ai-will-disrupt-data-engineering
Georg Heiler: Upskilling data engineers
What should I learn for 2028? How can I break into data engineering? These are common LinkedIn questions. I honestly don’t have a solid answer, but this blog is an excellent overview of upskilling. The author emphasizes mastering state management, understanding "local first" data processing (prioritizing single-node solutions before reaching for distributed systems), and leveraging an asset-graph approach for data pipelines, while stressing clear communication with stakeholders, continuous learning, and practical experience.
https://georgheiler.com/post/learning-data-engineering
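As a small illustration of the "local first" point, here is a sketch of a single-node DuckDB aggregation that covers a surprising share of pipeline work before a distributed engine is warranted. The input file and columns are hypothetical, not from the article.

```python
# A minimal "local first" sketch: a single-node DuckDB aggregation.
# The file name and columns are hypothetical placeholders.
import duckdb

con = duckdb.connect()  # in-memory database, no cluster required
daily_revenue = con.execute(
    """
    SELECT order_date, SUM(amount) AS revenue
    FROM read_csv_auto('orders.csv')   -- hypothetical input file
    GROUP BY order_date
    ORDER BY order_date
    """
).df()  # returns a pandas DataFrame
print(daily_revenue.head())
```

The same query scales from a laptop to a beefy single node; only when the data genuinely outgrows one machine does the distributed-systems complexity pay for itself.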
Thane Ruthenis: A Bear Case - My Predictions Regarding AI Progress
Are we truly making progress towards AGI? The author presents a "bear case" regarding AI advancement, predicting that current methods, such as scaling large language models (LLMs) and employing techniques like reinforcement learning (RL) and chain-of-thought (CoT), will not lead to Artificial General Intelligence (AGI). The author concludes that while LLMs will become useful tools, a different approach will likely be necessary for AGI, potentially in the 2030s.
https://www.lesswrong.com/posts/oKAFFvaouKKEhbBPm/a-bear-case-my-predictions-regarding-ai-progress
Sebastian Raschka: The State of LLM Reasoning Models
This is an interesting series about reasoning models. The author surveys how reasoning ability is built and categorizes approaches into inference-time compute scaling, pure reinforcement learning, reinforcement learning with supervised fine-tuning, and supervised fine-tuning with distillation. The article highlights recent research papers that explore techniques like "wait" tokens, test-time preference optimization, thought-switching penalties, adversarial robustness, and various search strategies.
https://magazine.sebastianraschka.com/p/state-of-llm-reasoning-and-inference-scaling
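To make inference-time compute scaling concrete, here is a minimal sketch of one of the simplest techniques in that family, self-consistency: sample several reasoning chains and majority-vote the final answer. The `generate_answer` stub is a hypothetical stand-in for an LLM call, not any specific API.

```python
# Self-consistency sketch: spend more inference-time compute by sampling
# N answers and taking a majority vote. `generate_answer` is a hypothetical
# stub; a real system would sample from an LLM with temperature > 0.
from collections import Counter
import random

def generate_answer(prompt: str) -> str:
    # Hypothetical stand-in for a stochastic LLM call.
    return random.choice(["42", "42", "41"])

def self_consistency(prompt: str, n_samples: int = 8) -> str:
    answers = [generate_answer(prompt) for _ in range(n_samples)]
    # More samples -> more compute -> (often) higher answer accuracy.
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 6 * 7?"))
```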
Collin Prather: The Fallacy of Data-Driven Strategy
Recently, I had an intriguing conversation with a friend who explained how surveys systematically undermine society, so the article resonated with me when I read it. The author contends that depending solely on data to shape business strategy is a fallacy, akin to how rote memorization in mathematics fails to lead to the discovery of new theorems. Data professionals can better serve strategy by engaging deeply with organizational realities and puzzling facts, supplying the context and understanding that complement creative, context-sensitive thinking.
https://locallyoptimistic.com/post/the-fallacy-of-data-driven-strategy/
Netflix: Foundation Model for Personalized Recommendation
Netflix discusses developing a foundation model for personalized recommendations, inspired by large language models (LLMs), to centralize member preference learning and streamline their recommender system. The article covers tokenizing user interactions, incorporating both request-time and post-action features, adapting model objectives (multi-token prediction, auxiliary objectives), addressing unique challenges like entity cold-starting (using incremental training, combining ID-based and metadata-based embeddings), and outlining downstream applications (direct prediction, embedding utilization, fine-tuning).
https://netflixtechblog.medium.com/foundation-model-for-personalized-recommendation-1a0bd8e02d39
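As a rough illustration of the tokenization step, here is a hedged sketch of turning interaction events into a chronological token sequence suitable for next-token prediction. The event schema and vocabulary are toy assumptions, not Netflix’s actual format.

```python
# Hypothetical sketch: each (timestamp, title, action) event becomes one
# token in a chronological sequence that a sequence model can train on
# with next-token prediction. Schema and vocabulary are invented toys.
from dataclasses import dataclass

@dataclass
class Interaction:
    timestamp: int
    title_id: int
    action: str  # e.g. "play", "thumbs_up"

def tokenize(events: list[Interaction], vocab: dict[str, int]) -> list[int]:
    # Sort chronologically, then map each event to a single token id.
    events = sorted(events, key=lambda e: e.timestamp)
    return [vocab[f"{e.action}:{e.title_id}"] for e in events]

vocab = {"play:101": 0, "play:202": 1, "thumbs_up:101": 2}  # toy vocabulary
history = [Interaction(2, 202, "play"), Interaction(1, 101, "play"),
           Interaction(3, 101, "thumbs_up")]
print(tokenize(history, vocab))  # -> [0, 1, 2]
```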
LinkedIn: Journey of next-generation control plane for data systems
LinkedIn writes about the evolution of Nuage, its internal control plane framework for managing data infrastructure resources. Initially a self-service platform (Nuage 1.0), it transitioned to a decentralized model (Nuage 2.0) and then to Nuage 3.0, which features centralized management, decoupled logic, enhanced security, improved performance, and simplified onboarding. The article highlights Nuage 3.0's architecture, key capabilities (discoverability, access control, resource management, monitoring), client interfaces (UI, APIs, CLIs), benefits (agility, ownership, performance, security), and future considerations like self-serve onboarding, infrastructure as code, and an AI assistant.
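For intuition about what centralized management buys you, here is a hypothetical sketch of a single declarative provisioning API fronting multiple data systems, so ownership and access-control checks live in one place. All names and the interface itself are assumptions for illustration, not Nuage’s actual design.

```python
# Hypothetical control-plane sketch: one declarative request to provision a
# resource in any backing data system, validated centrally rather than with
# per-system bespoke code paths. All names here are invented.
from dataclasses import dataclass, field

@dataclass
class ResourceSpec:
    system: str            # e.g. "kafka", "pinot"
    name: str
    owner_team: str
    acls: list[str] = field(default_factory=list)

class ControlPlane:
    def __init__(self) -> None:
        self._resources: dict[str, ResourceSpec] = {}

    def provision(self, spec: ResourceSpec) -> str:
        # Centralized logic: validate ownership and ACLs once, for every
        # backing system, then record the resource for discoverability.
        key = f"{spec.system}/{spec.name}"
        self._resources[key] = spec
        return key

plane = ControlPlane()
print(plane.provision(ResourceSpec("kafka", "orders-events", "payments",
                                   acls=["payments:write", "analytics:read"])))
```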
Airbnb: Embedding-Based Retrieval for Airbnb Search
Airbnb writes about building an Embedding-Based Retrieval (EBR) system for its search, designed to efficiently narrow a large pool of potential listings into a smaller, more relevant set for further ranking. The article details constructing training data using contrastive learning with positive and negative listing pairs based on user trips, a two-tower model architecture that separates listing and query features for offline and online processing, and an online serving strategy using an inverted file index (IVF) with Euclidean distance for efficient retrieval and balanced clustering. The system delivered a significant improvement in bookings.
https://medium.com/airbnb-engineering/embedding-based-retrieval-for-airbnb-search-aabebfc85839
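Here is a minimal sketch of the two-tower separation the article describes: the listing tower is computed offline and indexed, while the query tower runs online per request. The random linear towers and brute-force Euclidean search below are stand-ins for Airbnb’s trained networks and IVF index.

```python
# Two-tower sketch: listing embeddings precomputed offline, query embedded
# online, nearest neighbors retrieved by Euclidean distance. Weights and
# features are random toys; brute force stands in for an IVF index.
import numpy as np

rng = np.random.default_rng(0)
W_listing = rng.normal(size=(16, 8))  # listing tower (one linear layer here)
W_query = rng.normal(size=(12, 8))    # query tower

def embed(features: np.ndarray, W: np.ndarray) -> np.ndarray:
    v = features @ W
    return v / np.linalg.norm(v, axis=-1, keepdims=True)  # unit-normalize rows

# Offline: embed all listings once and store them in an ANN index.
listings = embed(rng.normal(size=(1000, 16)), W_listing)

# Online: embed the incoming query and retrieve the closest listings.
query = embed(rng.normal(size=(1, 12)), W_query)
dists = np.linalg.norm(listings - query, axis=1)
top_k = np.argsort(dists)[:10]
print(top_k)
```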
Grab: Improving Hugo’s stability and addressing on-call challenges through automation
Grab writes about adding pipeline monitoring, diagnosis, and automated resolution to Hugo, its data ingestion platform, to improve stability and address on-call challenges. Grab narrates how it built a system with modules for signal collection (failures, SLA misses, data quality issues), diagnosis (identifying root causes and assignees), an RCA table, auto-resolution (using custom handlers and retry mechanisms), a data health API (for external access), and a Data Health Workbench (a dashboard for visualization and manual intervention), leading to improved data visibility, reduced downtime, and a lighter on-call workload.
https://engineering.grab.com/improving-hugo-stability
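As a rough sketch of the auto-resolution idea, the snippet below routes each diagnosed signal to a custom handler keyed by root cause and escalates to on-call when no safe automation exists. The signal taxonomy and handlers are hypothetical, not Grab’s actual modules.

```python
# Hypothetical auto-resolution sketch: dispatch a diagnosed failure signal
# to a root-cause-specific handler, falling back to paging on-call.
from typing import Callable

def retry_pipeline(signal: dict) -> str:
    return f"retried {signal['pipeline']}"

def backfill_partition(signal: dict) -> str:
    return f"backfilled {signal['pipeline']} for {signal['partition']}"

HANDLERS: dict[str, Callable[[dict], str]] = {
    "transient_failure": retry_pipeline,
    "missing_partition": backfill_partition,
}

def auto_resolve(signal: dict) -> str:
    handler = HANDLERS.get(signal["root_cause"])
    if handler is None:
        # No safe automation for this root cause: a human decides.
        return f"escalated {signal['pipeline']} to on-call"
    return handler(signal)

print(auto_resolve({"pipeline": "orders_ingest",
                    "root_cause": "transient_failure"}))
```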
Salesforce: How to ETL at Petabyte-Scale with Trino
Can you use Trino to run your ETL pipelines, or only for ad-hoc analytics? The author concludes that Trino performs best when the ETL is designed around some of Trino’s limitations (such as keeping ETL queries short so that failures are cheap to recover from) and when a reliable external system like Apache Airflow manages retries and state.
https://engineering.salesforce.com/how-to-etl-at-petabyte-scale-with-trino-5fe8ac134e36/
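A hedged sketch of that pattern, assuming Airflow with the common SQL provider and a hypothetical Trino connection: keep each Trino statement short and idempotent, and let Airflow own retries and scheduling state.

```python
# Sketch: short, idempotent Trino statements orchestrated by Airflow, which
# owns retries and state. Connection id, schema, and SQL are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="trino_etl_sketch",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    # One short INSERT per daily partition: cheap to rerun from scratch if
    # Trino fails mid-query, since the unit of work is small and idempotent.
    load_partition = SQLExecuteQueryOperator(
        task_id="load_daily_partition",
        conn_id="trino_default",  # hypothetical Trino connection
        sql="""
            INSERT INTO analytics.daily_orders
            SELECT * FROM raw.orders WHERE ds = '{{ ds }}'
        """,
    )
```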
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes; they do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.