Data Engineering Weekly #272

The Weekly Data Engineering Newsletter

Jun 01, 2026

How to Build a Data Platform

We wrote an eBook on Data Platform Fundamentals to help you be like the happy data teams, operating under a single platform.

In this book, you’ll learn:

- How composable architectures allow teams to ship faster
- Why data quality matters and how you can catch issues before they reach users
- What observability means, and how it will help you solve problems more quickly

Download your free copy now

Netflix: High-Throughput Graph Abstraction at Netflix

One emerging data model envisions the organization model as a property graph abstraction, though the underlying data may be stored in a relational model, key/value store, or document store. GraphQL is one abstraction, but it leads to many complications while integrating the api. Netflix writes about its high-throughput graph abstraction and explains how it handles read-aside cache and write-aside cache.

https://netflixtechblog.com/high-throughput-graph-abstraction-at-netflix-part-i-e88063e6f6d5

Slack: Slack AI - The Path to Multi-Cloud

Building AI Infrastructure for a multi-tenant stack with enterprise security requirements is challenging in itself. Slack writes about evolving from SageMaker-managed endpoints to Bedrock provisioned throughput, Bedrock on-demand spillover, and a multi-cloud Vertex AI architecture with normalized APIs, model hierarchies, circuit breakers, health-aware routing, and feature-level model selection. The platform focuses on reducing the infrastructure lock-in, enabling same-day model migration, improving the quality of complex reasoning by about 10%, cutting latency for low-token workloads by about 67%, and increasing resilience against regional and provider-wide failures.

https://slack.engineering/slack-ai-the-path-to-multi-cloud/

Uber: Solving the Identity Crisis for AI Agents

The article seems so timely, as I recently finished a similar design for AI Identity. The conventional way to start the design is to assign the user role to the agent to work on their behalf, but the model soon fails. The services-centric IAM role also fell apart, since the same service can potentially assume multiple roles. Uber provides an in-depth account of how it handles these challenges.

https://www.uber.com/us/en/blog/solving-the-agent-identity-crisis/

Mehul Batra: A field journal on Ray Data and Daft for multimodal data lake

Multimodal data lake pipelines on Kubernetes require engines that can handle mixed CPU, GPU, and async inference workloads, along with cataloging and checkpointing, across text, PDF, image, audio, video, and LLM metadata. The author compares Ray Data 2.55.1 and Daft 0.7.13 on the same KubeRay, DigitalOcean, Gravitino, Iceberg, Lance, and H100 setup, measuring wall time, GPU occupancy, memory, code size, native primitives, concurrency behavior, and completion reliability across eight use cases. Ray Data scored 56 versus Daft’s 47 out of 70 because both engines tied on most multimodal batch workloads, Daft delivered stronger native media ergonomics, and Ray Data won the decisive async LLM inference workload by completing the 50k-email job with enforced concurrency while Daft failed to finish.

https://mehulbatra.medium.com/a-field-journal-on-ray-data-and-daft-for-multimodal-data-lake-e0c26839f5b5

LakeOps: Routing Multiple Query Engines with Iceberg

Multi-engine Iceberg deployments let Spark, Trino, Flink, DuckDB, Athena, Snowflake, and StarRocks share the same tables. Still, they leave query placement, cost control, dialect compatibility, and workload isolation to each client team. LakeOps writes about QueryFlux, a Rust-based SQL routing proxy that supports existing database protocols, selects engine groups via routing rules, translates SQL dialects with sqlglot, enforces concurrency limits, and dispatches queries based on cost, latency, throughput, or health.

https://lakeops.dev/blog/routing-multiple-query-engines-with-iceberg

Rion Williams: Prepare for Launch: Enrichment Strategies for Apache Flink

Flink enrichment jobs become fragile when they depend on referential data that is not already present in state, forcing teams to choose between slow per-record lookups, incomplete early results, or complex state preloading. The article compares external enrichment, CDC-backed gradual enrichment, State Processor API–based two-phase bootstrapping, and gated enrichment as ways to manage the launch-time pain of warming the state before live traffic depends on it.

https://rion.io/2026/01/27/prepare-for-launch-enrichment-strategies-for-apache-flink/

Agoda: How Agoda Simulates Booking Flows to Test Flight Integrations

Agents perform relatively well when there is a faster feedback loop to tune, but unfortunately, in many business cases, the transactions are sensitive. Simulation is the most critical aspect of engineering components, not only for testing but also as an RL environment for Agents. Digital twins are emerging as a critical component of simulation. Agoda describes a simulation system for booking flows to test flight integrations.

https://medium.com/agoda-engineering/how-agoda-simulates-booking-flows-to-test-flight-integrations-204ec4f2e128

Databricks: Introducing Arrow UDFs in PySpark: A Faster, Leaner Replacement for Pandas UDFs

PySpark Pandas UDFs improved Python execution with Arrow serialization and batching, but still incur Pandas/Arrow conversion costs, add memory copies, and struggle with complex data types. Databricks writes about the introduction of native Arrow UDFs, Arrow aggregate functions, Arrow UDTFs, and DataFrame-level mapInArrow and applyInArrow APIs that operate directly on PyArrow arrays, record batches, and tables. Arrow UDFs preserve columnar execution end-to-end, run about 10% faster than Pandas UDFs, use about 40% less memory, and support complex datatype and table-in/table-out transformations with familiar Python and SQL interfaces.

https://www.databricks.com/blog/introducing-arrow-udfs-pyspark-faster-leaner-replacement-pandas-udfs

Giannis Polyzos: When Tables Became the Language of Time

Batch and streaming architectures create duplicated logic, fragmented corrections, and inconsistent state when systems treat logs as the source of truth instead of modeling how data changes over time. The author reframes streams and tables as two views of the same history, using Flink for streaming-first compute, lakehouse commits for durable memory, and Apache Fluss as table-first streaming storage with native updates, deletes, schemas, snapshots, and columnar access. Table-centric streaming collapses batch jobs, replays, late corrections, and real-time dashboards into different temporal views of evolving state, making unification an architectural consequence rather than a glue layer.

https://ipolyzos.substack.com/p/when-tables-became-the-language-of

All rights reserved, Dewpeche Private Limited. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.

Data Engineering Weekly

Discussion about this post

Ready for more?