Dagster Running Dagster: Our Open Platform
We’re pulling back the curtain. Join us on May 13 for a live deep dive into how Dagster Labs runs Dagster in production. One of our lead data engineers will walk through our real-world implementation, architecture decisions, and the lessons we've learned scaling the platform.
Editor’s Note: OpenXData Conference - 2025 - A Free Virtual Event
A free virtual event on open data architectures - Iceberg, Hudi, lakehouses, query engines, and more. Talks from Netflix, dbt Labs, Databricks, Microsoft, Google, Meta, Peloton, and other open data geeks.
May 21st, 9 am-3 pm PDT. There will be no fluff. You will experience solid content, good vibes, and a live giveaway!
Rasmus Holm: A Critical Look at MCP
There is growing support for better interoperability between business assets and LLMs. MCP is the first protocol trying to address this. The author examines the MCP protocol, its adoption of HTTP Server-Sent Events (SSE) rather than WebSockets, and the security implications.
https://raz.sh/blog/2025-05-02_a_critical_look_at_mcp
Alibaba: A Comprehensive Analysis and Practical Implementation of the New Features in the MCP Specification
As I dug further into the MCP specification, Alibaba's blog was a handy guide to how the protocol spec has evolved over the last four months. The six security principles of MCP are worth reading to understand the upcoming improvements around authentication and authorization.
Thoughtworks: Function calling using LLMs - Building AI Agents that interact with the external world
MCP focuses on an LLM's ability to discover and interact with the external world; invoking external functions and tools is critical to extending the capabilities of agents. The author gives an excellent overview of function calling, including how to restrict agent actions and add guardrails against prompt injection.
The blog raised a critical question, which I believe the industry is highly divided on: Can this pattern replace traditional rule engine SaaS products?
https://martinfowler.com/articles/function-call-LLM.html
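To make the idea of restricting agent actions concrete, here is a minimal sketch, not taken from the Thoughtworks article. It assumes the model emits a tool call as a JSON string with `name` and `arguments` fields; `get_order_status`, `ALLOWED_TOOLS`, and `dispatch_tool_call` are hypothetical names used only for illustration.

```python
import json
from typing import Any, Callable, Dict


def get_order_status(order_id: str) -> str:
    """Hypothetical tool: a stand-in for a real external lookup."""
    return f"order {order_id}: shipped"


# Allowlist: the agent can only invoke functions registered here.
ALLOWED_TOOLS: Dict[str, Callable[..., Any]] = {
    "get_order_status": get_order_status,
}


def dispatch_tool_call(raw_call: str) -> str:
    """Validate a model-proposed tool call before executing it."""
    call = json.loads(raw_call)
    name = call.get("name")
    args = call.get("arguments", {})
    if name not in ALLOWED_TOOLS:
        # Guardrail: never execute a function the model invented or was
        # tricked into requesting.
        return f"rejected: tool '{name}' is not allowed"
    if not isinstance(args, dict):
        return "rejected: arguments must be a JSON object"
    return str(ALLOWED_TOOLS[name](**args))
```

The design choice is that the model only proposes a call; the application decides whether to run it, so a prompt injection that asks for an unregistered tool is rejected before any side effect happens.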
Sponsored: The Data Platform Fundamentals Guide
Learn the fundamental concepts to build a data platform in your organization.
- Tips and tricks for data modeling and data ingestion patterns
- Explore the benefits of an observation layer across your data pipelines
- Learn the key strategies for ensuring data quality for your organization
DoorDash: How DoorDash leverages LLMs to evaluate search result pages
DoorDash describes AutoEval, their human-in-the-loop, LLM-powered automated search quality evaluation system, designed to overcome traditional human annotation's scalability, latency, and consistency challenges. AutoEval utilizes LLMs to assess search relevance at scale by sampling user queries, constructing detailed prompts based on internal rating guidelines and structured context, performing LLM inference (using base or fine-tuned models), and aggregating judgments using their custom whole-page relevance (WPR) metric.
https://careersatdoordash.com/blog/doordash-llms-to-evaluate-search-result-pages/
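The prompt-then-aggregate flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not DoorDash's implementation: the real WPR formula is not given here, so the sketch substitutes a simple position-weighted average, and `build_prompt`, `page_score`, and the `judge` callable (a stand-in for LLM inference) are hypothetical names.

```python
from typing import Callable, List


def build_prompt(query: str, result: str, guidelines: str) -> str:
    """Combine rating guidelines and structured context into one prompt."""
    return (
        f"{guidelines}\n"
        f"Query: {query}\n"
        f"Result: {result}\n"
        "Answer 'relevant' or 'irrelevant'."
    )


def page_score(
    query: str,
    results: List[str],
    judge: Callable[[str], str],
    guidelines: str,
) -> float:
    """Aggregate per-result judgments into one page-level score.

    Assumption: higher-ranked results matter more, so rank r (0-based)
    gets weight 1 / (r + 1). This is NOT DoorDash's WPR definition.
    """
    if not results:
        return 0.0
    weights = [1.0 / (rank + 1) for rank in range(len(results))]
    labels = [
        1.0 if judge(build_prompt(query, r, guidelines)) == "relevant" else 0.0
        for r in results
    ]
    return sum(w * l for w, l in zip(weights, labels)) / sum(weights)
```

In practice `judge` would wrap a call to a base or fine-tuned model; keeping it as a plain callable makes the aggregation step testable without any model access.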
Shopify: Evolution of Product Classification at Shopify - From Categories to Comprehensive Product Understanding
Shopify details their journey in product understanding, evolving from basic classification to a sophisticated system built on Vision Language Models (VLMs) and the Shopify Product Taxonomy (over 10,000 categories, 1,000+ attributes). The blog narrates the adoption of VLMs (like Qwen2VL 7B with FP8 quantization and in-flight batching) for multi-modal understanding, zero-shot learning, and natural language reasoning to classify products and extract attributes within their taxonomy.
https://shopify.engineering/evolution-product-classification
Netflix: Behind the Scenes - Building a Robust Ads Event Processing Pipeline
Netflix writes about the evolution of its ad event processing pipeline from third-party providers to in-house systems. The system design consists of a centralized ad event collection system (Ads Event Publisher) that consolidates common operations (decryption, enrichment, hashing) and provides a unified, extensible data contract for downstream real-time and batch consumers such as frequency capping, ads metrics (using Flink and Druid), ad sessionization (Flink), the original Ads Event Handler, and billing/reporting workflows.
Lyft: Real-Time Spatial Temporal Forecasting at Lyft
Lyft writes about its real-time spatial-temporal forecasting system. The blog narrates its forecasting architecture using Apache Beam/Flink, Kafka, Kinesis, DynamoDB, Lyft's ML Platform (Airflow, SageMaker), and ClickHouse for feature generation, online inference (with potential for real-time refitting), and performance monitoring, emphasizing an asynchronous design for scalability.
https://eng.lyft.com/real-time-spatial-temporal-forecasting-lyft-fa90b3f3ec24
Meta: Collective Wisdom of Models - Advanced Feature Importance Techniques at Meta
Meta writes about the "Global Feature Importance" framework to address challenges in feature exploration and selection for machine learning models, particularly when dealing with thousands of features across numerous models. Their approach involves logging feature importance runs from various models, normalizing these scores (using percentiles to make them comparable), and then aggregating them to generate a global importance score for each feature.
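The normalize-then-aggregate step can be sketched in a few lines. This is a minimal illustration, not Meta's framework: it assumes each logged run is a mapping from feature name to raw importance score, and it aggregates with a plain mean, which is an assumption since the summary above does not say how scores are combined. The function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def percentile_ranks(scores: Dict[str, float]) -> Dict[str, float]:
    """Convert one run's raw scores to percentile ranks in (0, 1].

    A feature's percentile is the fraction of features in the same run
    scoring at or below it, which makes runs with different score scales
    comparable.
    """
    n = len(scores)
    return {
        feature: sum(1 for other in scores.values() if other <= score) / n
        for feature, score in scores.items()
    }


def global_importance(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each feature's percentile over the runs where it appears."""
    collected: Dict[str, List[float]] = defaultdict(list)
    for run in runs:
        for feature, pct in percentile_ranks(run).items():
            collected[feature].append(pct)
    return {feature: mean(pcts) for feature, pcts in collected.items()}
```

For example, a feature scored 0.9 in one model and 100.0 in another ranks at the top of both runs after normalization, even though the raw values are on very different scales.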
Flipkart: Transforming Data Analytics at Flipkart - Self Serve Insights on Petabytes scale data
Flipkart writes about building Plato, an internal analytics platform, to address the challenges of performing data analysis and enabling self-service BI. The architecture includes a user-friendly front-end (Plato Explorer, Model Builder, Data Copilot for NLQ) and a back-end that handles query redirection, optimization, intelligent materialization (cubing), and execution across various storage and processing engines (Spark, Flink, BigQuery, Druid).
All rights reserved, ProtoGrowth Inc., India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.