The data platform playbook everyone’s using
We wrote an eBook on Data Platform Fundamentals to help you work like the happy data teams operating under a single platform.
In this book, you’ll learn:
- How composable architectures allow teams to ship faster
- Why data quality matters and how you can catch issues before they reach users
- What observability means, and how it will help you solve problems more quickly
Editor’s Note: MLOps World | GenAI Summit - 2025
The MLOps World | GenAI Summit will be hosted on October 8-9, featuring over 60 sessions from prominent companies, including OpenAI and HuggingFace. Members can redeem $150 off tickets. The sessions are the real deal, featuring practical workshops, use cases, food, drink, and parties throughout Austin.
You can join to see more here: DataEngineeringWeekly ($150 OFF)
InfoQ: InfoQ AI, ML, and Data Engineering Trends Report - 2025
InfoQ’s 2025 Trends Report highlights that the center of gravity is shifting toward Physical AI and robust agent ecosystems, powered by multimodal and on-device models, interoperable protocols like MCP/A2A, and AI-driven DevOps. The report moves vector databases, MLOps, and synthetic data into the mainstream-adoption stage, highlights emerging areas (reasoning models, AI DevOps), and predicts that the near term will favor agentic co-creation of software, real-time video RAG, and quietly embedded, context-aware AI experiences.
https://www.infoq.com/articles/ai-ml-data-engineering-trends-2025/
StreamNative: Latency Numbers Every Data Streaming Engineer Should Know
One of the most informative articles I read this week. Teams misuse “real-time” because they don’t budget for physics: disks, networks, replication, and table commit cadences each add hard milliseconds to minutes of delay. The article classifies latency tiers (<5 ms, 5–100 ms, >100 ms), quantifies common costs (fsync, cross-AZ/region hops, and Iceberg commit intervals), and contrasts synchronous vs. asynchronous paths with tuning levers such as batching, acknowledgments, and polling.
https://streamnative.io/blog/latency-numbers-every-data-streaming-engineer-should-know
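To make the budgeting concrete, here is a minimal back-of-the-envelope sketch in Python. The stage names and numbers are illustrative placeholders, not the article's measurements, and would be swapped for figures observed in your own pipeline.

```python
# Back-of-the-envelope latency budget for a "real-time" path.
# All numbers are illustrative placeholders, not the article's measurements;
# replace them with figures observed in your own environment.
budget_ms = {
    "producer batching (linger)":   10,
    "broker fsync per write":        5,
    "cross-AZ replication hop":      2,
    "consumer poll interval":       50,
    "Iceberg commit interval":  60_000,  # table commits are often seconds to minutes
}

total_ms = sum(budget_ms.values())
for stage, ms in budget_ms.items():
    print(f"{stage:<30} {ms:>8} ms")
print(f"{'end-to-end (worst case)':<30} {total_ms:>8} ms  (~{total_ms / 1000:.0f} s)")
```

The point of writing the budget down is that a single slow stage, usually the table commit cadence, dominates everything else, no matter how aggressively the streaming layers are tuned.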
Meta: Behind the Numbers: How Meta’s Data Engineers Bring Transparency to Life
Enterprises face increasing regulatory scrutiny and must transform ambiguous policies into precise, auditable metrics at a massive scale. Meta builds an upstream governance layer, a canonical “Integrity Core Data” model of the enforcement funnel, and snapshot-versioned reporting pipelines with monitoring and re-runs to centralize logging, unify taxonomy, and enforce reportable behavior by design. The design shifts transparency from retrospective reporting to product guardrails, enabling consistent CSER/DSA metrics, faster onboarding of new surfaces, and a roadmap toward automated, self-serve, canonical metrics.
Sponsored: How to build your first Dagster Component
Components take your Python code and make it reusable and configurable, opening up a powerful new layer of flexibility. We’re only just getting started with Components at Dagster, and we’re excited to see the creative ways our users will apply them to their data platforms.
If you’re beginning your journey with components, check out these tips to give you a strong foundation.
Netflix: Scaling Muse: How Netflix Powers Data-Driven Creative Insights at Trillion-Row Scale
As data volumes grow, analytics applications struggle to balance advanced features, low latency, and accuracy at scale. Netflix writes about evolving its Muse architecture with HyperLogLog sketches for approximate distinct counts, Hollow in-memory feeds for precomputed aggregates, and deep Druid optimizations to cut query latency by ~50% while offloading concurrency pressure.
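For readers new to the approximate-counting piece, below is a toy HyperLogLog in plain Python. It is a concept demo only; Muse and Druid rely on hardened sketch implementations, and nothing here is Netflix's code.

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: approximate distinct counts in a few KB of registers."""

    def __init__(self, b: int = 12):
        self.b = b                      # bits used to pick a register
        self.m = 1 << b                 # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias correction (m >= 128)

    def add(self, item) -> None:
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h & (self.m - 1)          # low b bits select the register
        w = h >> self.b                 # remaining (64 - b) bits
        rank = (64 - self.b) - w.bit_length() + 1    # position of leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> int:
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:            # small-range correction
            est = self.m * math.log(self.m / zeros)
        return round(est)

hll = HyperLogLog()
for i in range(1_000_000):
    hll.add(f"profile-{i % 250_000}")   # 250,000 distinct values, each seen 4 times
print(hll.count())                      # ~250,000, typically within a few percent
```

The trade-off is exactly the one the Muse post leans on: a bounded, tiny memory footprint and mergeable sketches in exchange for a small, predictable error on distinct counts.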
Uber: Uber’s Strategy to Upgrading 2M+ Spark Jobs
Upgrading large-scale data infrastructure often stalls on dependency complexity, code rewrites, and the risk of breaking production workloads. Uber writes about migrating over 2 million daily Spark jobs from version 2.4 to 3.3 by rebasing its Spark fork, automating code changes with Polyglot Piranha, and validating behavior with a custom shadow testing framework that safely reroutes outputs for comparison. The result is impressive: a full migration in six months, 50% reductions in runtime and resource usage, millions in cost savings, and reusable automation that now accelerates future upgrades across Uber’s data platform.
https://www.uber.com/en-IN/blog/ubers-strategy-to-upgrading-2m-spark-jobs/
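The shadow-testing idea generalizes beyond Uber's framework. A minimal version of the comparison step could look like the PySpark sketch below; the paths are hypothetical, and a real framework would also handle sampling, nondeterministic columns, and metric reporting.

```python
# Illustrative shadow-output comparison (not Uber's internal framework):
# the production job writes to its normal path while a shadow run on the
# upgraded Spark writes to a side path; a checker then diffs the two outputs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shadow-output-diff").getOrCreate()

# Hypothetical paths for the same partition produced by both Spark versions.
prod_df = spark.read.parquet("s3://warehouse/orders/dt=2025-09-01/")
shadow_df = spark.read.parquet("s3://warehouse-shadow/orders/dt=2025-09-01/")

# Cheap checks first: schema equality and row counts.
assert prod_df.schema == shadow_df.schema, "schema drift between Spark versions"
print("prod rows:", prod_df.count(), "shadow rows:", shadow_df.count())

# Full symmetric difference; both empty means row-level parity.
print("only in prod:  ", prod_df.exceptAll(shadow_df).count())
print("only in shadow:", shadow_df.exceptAll(prod_df).count())
```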
Sponsored: Join the industry’s first survey on data platform migrations.
Most data teams have faced code or logic migrations—rewriting SQL, translating stored procedures, or replatforming ETL workloads—and they’re rarely straightforward. But there’s no clear benchmark for how long these projects take, what they cost, or how teams actually validate success.
Datafold is running a 5-minute survey to collect ground truth from practitioners on timelines, validation, budgets, and how AI is introducing new ways of tackling code translation and validation. Qualified Data Engineering Weekly subscribers (with a valid work email) get a $25 gift card, and all participants receive early access to the results.
Grab: Powering Partner Gateway metrics with Apache Pinot
Real-time partner integrations require low-latency analytics at scale; however, large datasets and complex aggregations often push query times beyond SLA thresholds. Grab writes about addressing this in its Partner Gateway by adopting Apache Pinot with Kafka–Flink ingestion, partitioning by metric, rounded time columns, and Star-tree indexing to cut query latency from tens of seconds to under 300 ms, even across billions of rows. This design trades modestly higher storage costs for sub-second insights, enabling partners to monitor APIs, resolve issues quickly, and operate with greater efficiency and reliability.
https://engineering.grab.com/pinot-partnergateway-tech-blog
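One of the cheaper levers the post mentions, the rounded time column, is easy to illustrate. The sketch below is not Grab's code; it only shows how bucketing millisecond timestamps reduces the distinct time values a query aggregates over, which is what lets pre-aggregation structures like the star-tree index do most of the work.

```python
# Illustrative sketch of the "rounded time column" idea (not Grab's code):
# collapsing millisecond event times into minute buckets at ingestion cuts the
# number of distinct time values a query must aggregate over.
from datetime import datetime, timezone

def round_to_minute(epoch_ms: int) -> int:
    """Round an epoch-millisecond timestamp down to the start of its minute."""
    return epoch_ms - (epoch_ms % 60_000)

events_ms = [1_727_000_123_456, 1_727_000_145_789, 1_727_000_190_001]
for bucket in sorted({round_to_minute(ts) for ts in events_ms}):
    print(datetime.fromtimestamp(bucket / 1000, tz=timezone.utc).isoformat())
```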
Airbnb: Building a Next-Generation Key-Value Store at Airbnb
A serving layer that supports both real-time ingestion and bulk upload is critical for feature stores, fraud detection, and other applications. Airbnb writes about building its next-generation key-value store to support both ingestion paths while serving reads at sub-second latency. The blog details why the previous-generation system could not meet the new requirements and how the newer system was designed to address them.
Scribd: Building a Scalable Data Warehouse Backup System with AWS
Backing up petabyte-scale S3 data warehouses is costly and error-prone if every job copies full datasets and deletes files aggressively. Scribd writes about a re-architected backup with a hybrid Lambda/Fargate pipeline that performs incremental copies of only new Parquet files (while always preserving delta logs), validates manifests, and stores results in Glacier. A safe deletion workflow tags stale files for lifecycle cleanup instead of immediate removal, yielding automated, resilient backups that cut costs, shrink runtimes, and scale seamlessly from small to massive datasets.
https://tech.scribd.com/blog/2025/building-scalable-data-warehouse-backup-system.html
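A rough sketch of the incremental-copy rule described above might look like the following; bucket names and prefixes are hypothetical, and it omits the manifest validation, Glacier storage, and deletion-tagging stages of Scribd's actual pipeline.

```python
# Rough incremental-copy sketch (hypothetical bucket names/prefixes, not Scribd's code):
# copy only Parquet objects newer than the last backup, but always include
# everything under _delta_log/ so the Delta table stays reconstructable.
from datetime import datetime, timezone
import boto3

s3 = boto3.client("s3")
SRC, DST = "prod-warehouse", "warehouse-backup"
last_backup = datetime(2025, 9, 1, tzinfo=timezone.utc)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC, Prefix="tables/orders/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        is_delta_log = "/_delta_log/" in key
        is_new_parquet = key.endswith(".parquet") and obj["LastModified"] > last_backup
        if is_delta_log or is_new_parquet:
            s3.copy_object(Bucket=DST, Key=key, CopySource={"Bucket": SRC, "Key": key})
```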
Dipankar Mazumdar: Apache Parquet vs. Newer File Formats (BtrBlocks, FastLanes, Lance, Vortex)
Parquet remains the backbone of data lakes, but its batch-first design struggles with today’s AI and vector search workloads that demand low-latency random access and hardware-aware execution. The author walks through how newer formats are rethinking storage: FastLanes brings SIMD/GPU-friendly lightweight compression chains, Lance optimizes NVMe-backed random lookups, Nimble (from Meta) focuses on ML feature stores with extremely wide schemas, and Vortex extends Arrow with compressed arrays for orders-of-magnitude faster access.
All rights reserved, Dewpeche Pvt Ltd, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.