Data Engineering Weekly #220

The Weekly Data Engineering Newsletter

May 11, 2025

Dagster Running Dagster: Our Open Platform

We’re pulling back the curtain. Join us on May 13 for a live deep dive into how Dagster Labs runs Dagster in production. One of our lead data engineers will walk through our real-world implementation, architecture decisions, and the lessons we've learned scaling the platform.

Register now

Editor’s Note: OpenXData Conference - 2025 - A Free Virtual Event

A free virtual event on open data architectures - Iceberg, Hudi, lakehouses, query engines, and more. Talks from Netflix, dbt Labs, Databricks, Microsoft, Google, Meta, Peloton, and other open data geeks.

May 21st, 9 am-3 pm PDT. There will be no fluff. You will experience solid content, good vibes, and a live giveaway!

Register Now

Rasmus Holm: A Critical Look at MCP

There is growing support for better interoperability of business assets with LLMs. MCP is the first protocol trying to address this. The author examines the MCP protocol, its adoption of HTTP with Server-Sent Events (SSE) rather than WebSockets, and the security implications.

https://raz.sh/blog/2025-05-02_a_critical_look_at_mcp

Alibaba: A Comprehensive Analysis and Practical Implementation of the New Features in the MCP Specification

When I delved further into the MCP specification, Alibaba's blog was a handy guide to understanding how the spec has evolved over the last four months. The six security principles of MCP are worth reading to understand the upcoming improvements around authentication and authorization.

https://www.alibabacloud.com/blog/a-comprehensive-analysis-and-practical-implementation-of-the-new-features-in-the-mcp-specification_602206

Thoughtworks: Function calling using LLMs - Building AI Agents that interact with the external world

MCP focuses on an LLM’s ability to discover and interact with the external world by invoking the external functions and tools that are critical to extending the capabilities of agents. The author gives an excellent overview of function calling, including restricting agent actions and adding guardrails against prompt injection.

The blog raised a critical question, which I believe the industry is highly divided on: Can this pattern replace traditional rule-engine SaaS products?
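To make the idea of restricting agent actions concrete, here is a minimal sketch of an allowlist-based dispatcher for tool calls. It is not taken from the article: the tool name `get_order_status`, the JSON shape of the tool call, and the `dispatch` helper are all hypothetical, and a real agent framework would supply its own schema validation.

```python
import json


def get_order_status(order_id: str) -> str:
    # Hypothetical tool; a real one would query an order service.
    return f"Order {order_id} is in transit"


# Allowlist: tool name -> (callable, exact set of argument names it accepts).
# Anything the model asks for outside this table is refused.
ALLOWED_TOOLS = {
    "get_order_status": (get_order_status, {"order_id"}),
}


def dispatch(tool_call_json: str) -> str:
    """Run a model-proposed tool call only if it passes the allowlist."""
    call = json.loads(tool_call_json)
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in ALLOWED_TOOLS:
        return f"refused: tool '{name}' is not allowed"
    func, expected_args = ALLOWED_TOOLS[name]
    if not isinstance(args, dict) or set(args) != expected_args:
        return f"refused: unexpected arguments for '{name}'"
    return func(**args)


# The tool call would normally come from the LLM's response.
print(dispatch('{"name": "get_order_status", "arguments": {"order_id": "42"}}'))
print(dispatch('{"name": "delete_all_orders", "arguments": {}}'))
```

The design choice is that the model only proposes an action; the surrounding code decides whether it runs, which is one way to limit the damage from a prompt injection.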

https://martinfowler.com/articles/function-call-LLM.html

DoorDash: How DoorDash leverages LLMs to evaluate search result pages

DoorDash describes AutoEval, their human-in-the-loop, LLM-powered automated search quality evaluation system, designed to overcome traditional human annotation's scalability, latency, and consistency challenges. AutoEval utilizes LLMs to assess search relevance at scale by sampling user queries, constructing detailed prompts based on internal rating guidelines and structured context, performing LLM inference (using base or fine-tuned models), and aggregating judgments using their custom whole-page relevance (WPR) metric.
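As a rough illustration of aggregating per-result judgments into a page-level score, the sketch below uses a simple position-weighted average. This is an assumption for illustration only: the summary does not give DoorDash's WPR formula, and the `whole_page_score` function and its 1/(position + 1) weights are hypothetical.

```python
def whole_page_score(judgments: list[float]) -> float:
    """Position-weighted average of per-result relevance labels in [0, 1].

    `judgments` is in display order; results higher on the page get
    larger weights (hypothetical weighting, not DoorDash's WPR).
    """
    if not judgments:
        return 0.0
    weights = [1 / (position + 1) for position in range(len(judgments))]
    weighted_sum = sum(w * j for w, j in zip(weights, judgments))
    return weighted_sum / sum(weights)


# The same two labels score higher when the relevant result is on top.
relevant_first = whole_page_score([1.0, 0.0])
relevant_last = whole_page_score([0.0, 1.0])
print(relevant_first, relevant_last)
```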

https://careersatdoordash.com/blog/doordash-llms-to-evaluate-search-result-pages/

Shopify: Evolution of Product Classification at Shopify - From Categories to Comprehensive Product Understanding

Shopify details their journey in product understanding, evolving from basic classification to a sophisticated system built on Vision Language Models (VLMs) and the Shopify Product Taxonomy (over 10,000 categories, 1,000+ attributes). The blog narrates the adoption of VLMs (like Qwen2VL 7B with FP8 quantization and in-flight batching) for multi-modal understanding, zero-shot learning, and natural language reasoning to classify products and extract attributes within their taxonomy.

https://shopify.engineering/evolution-product-classification

Netflix: Behind the Scenes - Building a Robust Ads Event Processing Pipeline

Netflix writes about the evolution of its ad event processing pipeline from third-party providers to in-house systems. The system design consists of a centralized ad event collection system (Ads Event Publisher) that consolidates common operations (decryption, enrichment, hashing) and provides a unified, extensible data contract for downstream real-time and batch consumers such as frequency capping, ads metrics (using Flink and Druid), ad sessionization (Flink), the original Ads Event Handler, and billing/reporting workflows.

https://netflixtechblog.com/behind-the-scenes-building-a-robust-ads-event-processing-pipeline-e4e86caf9249

Lyft: Real-Time Spatial Temporal Forecasting at Lyft

Lyft writes about its real-time spatial-temporal forecasting system. The blog narrates its forecasting architecture using Apache Beam/Flink, Kafka, Kinesis, DynamoDB, Lyft's ML Platform (Airflow, SageMaker), and ClickHouse for feature generation, online inference (with potential for real-time refitting), and performance monitoring, emphasizing an asynchronous design for scalability.

https://eng.lyft.com/real-time-spatial-temporal-forecasting-lyft-fa90b3f3ec24

Meta: Collective Wisdom of Models - Advanced Feature Importance Techniques at Meta

Meta writes about the "Global Feature Importance" framework to address challenges in feature exploration and selection for machine learning models, particularly when dealing with thousands of features across numerous models. Their approach involves logging feature importance runs from various models, normalizing these scores (using percentiles to make them comparable), and then aggregating them to generate a global importance score for each feature.
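The normalize-then-aggregate idea can be sketched in a few lines. This is a minimal illustration, not Meta's implementation: the function names, the sample runs, and the choice of a plain mean as the aggregate are all assumptions.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean


def percentile_ranks(scores: dict[str, float]) -> dict[str, float]:
    """Convert one run's raw importances to percentile ranks in (0, 1].

    A feature's rank is the fraction of features in the same run whose
    score is less than or equal to its own, so ties share a rank.
    """
    ordered = sorted(scores.values())
    return {f: bisect_right(ordered, s) / len(ordered) for f, s in scores.items()}


def global_importance(runs: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average each feature's percentile rank across all runs it appears in."""
    per_feature = defaultdict(list)
    for run_scores in runs.values():
        for feature, pct in percentile_ranks(run_scores).items():
            per_feature[feature].append(pct)
    return {feature: mean(pcts) for feature, pcts in per_feature.items()}


# Two hypothetical models whose raw importance scales are not comparable.
runs = {
    "model_a": {"age": 0.9, "country": 0.1, "device": 0.5},
    "model_b": {"age": 120.0, "country": 300.0, "device": 10.0},
}
scores = global_importance(runs)
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

Percentiles discard the raw scale of each model's scores, which is what makes a gain-based importance from one model comparable with a permutation-based importance from another.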

https://medium.com/@AnalyticsAtMeta/collective-wisdom-of-models-advanced-feature-importance-techniques-at-meta-1a7a8d2f9e27

Flipkart: Transforming Data Analytics at Flipkart - Self Serve Insights on Petabytes scale data

Flipkart writes about building Plato, an internal analytics platform, to address the challenges of performing data analysis and enabling self-service BI. The architecture includes a user-friendly front-end (Plato Explorer, Model Builder, Data Copilot for NLQ) and a back-end that handles query redirection, optimization, intelligent materialization (cubing), and execution across various storage and processing engines (Spark, Flink, BigQuery, Druid).

https://blog.flipkart.tech/transforming-data-analytics-at-flipkart-self-serve-insights-on-petabytes-scale-data-fa59caf2bc54

All rights reserved, ProtoGrowth Inc., India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
