How Supplyco Powers Real-Time Manufacturing Intelligence with Dagster
On Tuesday, Supplyco’s CTO Claudia Richoux will reveal how they built a pipeline in Dagster that processes 100,000+ data streams in real time — while ensuring 99.99% uptime.
You’ll see their DAG architecture in action, learn how they built observability into every layer, and see how they treat “data as code” to ship fast and scale smart.
Editor’s Note: Re:Govern - The Data & AI Context Summit
Data engineering is evolving without a clear playbook for the AI era. Yet some AI-forward data leaders are years ahead, iterating fast and learning even faster.
I’m looking forward to Re:Govern 2025 on November 5, where leaders from Mastercard, GitLab, General Motors, Elastic, and others will share what it really takes to build for this next phase, from investing in new technologies like semantic layers to reimagining operating models, skills, and roles for the AI era.
Register here. This is one event you don’t want to miss.
Shiyan Xu: Apache Kafka® (Kafka Connect) vs. Apache Flink® vs. Apache Spark™: Choosing the Right Ingestion Framework
The article contrasts Kafka Connect, Flink (including Flink CDC), and Spark: Kafka Connect excels at connector-driven CDC and fan-out to sinks; Flink delivers low-latency, stateful stream processing and direct CDC-to-lakehouse writes; and Spark dominates batch jobs, backfills, and complex transformations, with Structured Streaming covering micro-batch use cases. The author’s takeaway: use Kafka Connect (with Kafka) for broad CDC integrations, Flink for real-time and event-time pipelines, and Spark for batch and heavy transforms, while watching Spark’s idle and scaling inefficiencies to control spend.
https://www.onehouse.ai/blog/kafka-connect-vs-flink-vs-spark-choosing-the-right-ingestion-framework
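To ground the Spark side of the comparison, here’s a minimal PySpark sketch of a micro-batch Kafka backfill with Structured Streaming; the broker, topic, and paths are placeholders rather than anything from the article.

```python
# Micro-batch Kafka ingestion with Spark Structured Streaming.
# All connection details below are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-backfill").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "orders")                     # placeholder topic
    .option("startingOffsets", "earliest")             # backfill from the start
    .load()
    .select(col("value").cast("string").alias("payload"))
)

# availableNow drains everything currently in the topic as micro-batches,
# then stops -- a common pattern for Spark-driven backfills.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3://bucket/orders/")             # placeholder sink
    .option("checkpointLocation", "s3://bucket/_chk/")
    .trigger(availableNow=True)
    .start()
)
query.awaitTermination()
```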
Jack Vanlightly: A Fork in the Road: Deciding Kafka’s Diskless Future
Rising cross-AZ replication costs have pushed the Kafka community to debate a shift toward S3-backed, diskless architectures. The article compares competing KIPs—KIP-1176, Aiven’s Inkless fork, and KIP-1150’s revisions—framing a choice between evolutionary designs that retrofit S3 into Kafka’s leader-follower model and revolutionary designs that rebuild Kafka around stateless, direct-to-S3 brokers. The outcome will define Kafka’s next decade, deciding whether it remains a stateful streaming system or evolves into a cloud-native, elastic platform built on object storage.
https://jack-vanlightly.com/blog/2025/10/22/a-fork-in-the-road-deciding-kafkas-diskless-future
Sponsored: The data platform playbook everyone’s using
We wrote an eBook on Data Platform Fundamentals to help you be like the happy data teams operating under a single platform. In this book, you’ll learn:
- How composable architectures allow teams to ship faster
- Why data quality matters and how you can catch issues before they reach users
- What observability means, and how it will help you solve problems more quickly
Robin Moffatt: flink-watermarks
In stream processing, watermarks play a crucial role in handling out-of-order events: by tracking the progress of event time, they determine when windows can be closed. The key challenge is striking the right balance between minimizing processing latency and ensuring data completeness, two objectives that are fundamentally at odds. I found the microlearning-style explanation to be an excellent way to learn about Flink’s watermark implementation.
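For intuition, here’s a framework-free Python sketch of the bounded-out-of-orderness idea behind Flink’s watermarks; the five-second bound and the event times are illustrative.

```python
# Toy sketch of bounded-out-of-orderness watermarks: the watermark trails the
# max event time seen so far, and a window may close once the watermark
# passes its end. All names and values are illustrative.
MAX_OUT_OF_ORDERNESS = 5  # seconds we are willing to wait for late events

max_event_time = float("-inf")

def on_event(event_time: float) -> float:
    """Advance the watermark as events arrive (possibly out of order)."""
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    return max_event_time - MAX_OUT_OF_ORDERNESS  # current watermark

def window_can_close(window_end: float, watermark: float) -> bool:
    # Larger MAX_OUT_OF_ORDERNESS => more completeness, more latency.
    return watermark >= window_end

wm = on_event(100.0)                # event at t=100 -> watermark 95
print(window_can_close(90.0, wm))   # True: window ending at 90 is safe to close
print(window_can_close(100.0, wm))  # False: still waiting for stragglers
```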
Datology: Datology’s distributed pipelines for handling PBs of image data
As AI advances outpace the availability of fresh internet-scale training data, it is critical to rethink how massive datasets are curated rather than simply expanded. Datology writes about building a distributed, PB-scale pipeline for deduplication, filtering, and clustering of image-text pairs using custom Spark-based joins, a forked Flyte orchestrator, and a Postgres-backed data catalog to support reproducible, modular research across billions of records.
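As a hedged illustration of one stage in such a pipeline, here’s a plain PySpark sketch of exact-duplicate removal over image-text pairs; the column names and hashing choice are my assumptions, not Datology’s code.

```python
# Exact-duplicate removal over image-text pairs with a plain PySpark
# aggregation. Paths and columns are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pair-dedup").getOrCreate()

pairs = spark.read.parquet("s3://bucket/image_text_pairs/")  # placeholder

deduped = (
    pairs
    # Hash the caption plus an image digest into one key; at PB scale you
    # would pre-partition on this key to keep the shuffle balanced.
    .withColumn("dup_key", F.sha2(F.concat_ws("|", "caption", "image_sha256"), 256))
    .dropDuplicates(["dup_key"])
)
deduped.write.mode("overwrite").parquet("s3://bucket/deduped_pairs/")
```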
Sponsored: Trust isn’t a metric. It’s a mindset 💬
That’s the theme of Sifflet Signals 2025: Trust by Design—a four-day digital summit bringing together the people redefining how data teams build reliable systems. No fluff. No vendor-speak. Just real conversations about what works (and what doesn’t) when you scale trust in data.
🎙️ Speakers from dbt Labs, Shopify, Penguin Random House, and more.
📅 Nov 17–21
📍 Virtual. Free to attend.
Register here 👉 https://www.siffletdata.com/signals-2025
DV Engineering: Replaying massive data in a non-production environment using Pekko Streams and Kubernetes Pekko Cluster
Replaying real production traffic at scale is critical for building a realistic testing environment, and it’s a challenge I’ve failed at more than once. DV writes about building a Traffic Replay Tool that uses Pekko Streams to throttle and stabilize replay rates for functional tests and Pekko Cluster to distribute and scale replay workloads for load testing across Kubernetes. By combining backpressure-aware streaming with distributed data replication, DoubleVerify created a consistent, scalable replay framework that mirrors production behavior while safely supporting both performance and integration testing.
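The post implements throttling with Pekko Streams’ throttle operator; as a language-neutral stand-in, here’s an asyncio sketch of the same backpressure-aware rate limiting, with all names illustrative.

```python
# Replay recorded events at a bounded rate. Backpressure falls out naturally
# because the producer awaits the consumer before sending the next event.
import asyncio
import time

async def replay(events, rate_per_sec: float, send):
    interval = 1.0 / rate_per_sec
    next_slot = time.monotonic()
    for event in events:
        await send(event)                 # awaiting the sink = backpressure
        next_slot += interval
        delay = next_slot - time.monotonic()
        if delay > 0:                     # stay at or below the target rate
            await asyncio.sleep(delay)

async def fake_sink(event):               # stand-in for the system under test
    await asyncio.sleep(0.001)

asyncio.run(replay(range(100), rate_per_sec=50, send=fake_sink))
```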
Pinterest: Identify User Journeys at Pinterest
Understanding user intent over time is difficult when recommendation systems focus only on short-term interests. Pinterest writes about building user journeys—clusters of user interactions that capture evolving goals such as “planning a wedding” or “remodeling a kitchen”—using dynamic keyword extraction, clustering, and ranking models, enriched by LLM-based naming and evaluation. This journey-aware system improved click rates by up to 88%, marking a shift from static interest recommendations to long-term, goal-driven personalization.
https://medium.com/pinterest-engineering/identify-user-journeys-at-pinterest-b517f6275b42
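As a toy illustration of the clustering idea (only one piece of Pinterest’s system, which also involves dynamic keyword extraction, ranking models, and LLM-based naming), here’s a sketch that groups a user’s recent queries into candidate journeys by text similarity.

```python
# Group a user's recent queries so "wedding venue" and "wedding dresses"
# land in one candidate journey. Queries and thresholds are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

queries = [
    "wedding venue ideas", "outdoor wedding dresses", "wedding invitations",
    "kitchen remodel cost", "modern kitchen cabinets",
]

X = TfidfVectorizer().fit_transform(queries).toarray()
labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.9, linkage="average", metric="cosine"
).fit_predict(X)

for label, query in sorted(zip(labels, queries)):
    print(label, query)  # each label ~ one candidate "journey"
```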
Uber: Requirement Adherence: Boosting Data Labeling Quality Using LLMs
Ensuring high-quality labeled data at scale is challenging when quality checks occur only after annotation. Uber writes about building Requirement Adherence, an LLM-powered in-tool validation system in uLabel that extracts labeling rules from client SOPs and enforces them in real time through atomic, parallelized checks. This framework reduced post-labeling audits by 80%, improving turnaround time, cost efficiency, and data quality while setting a foundation for adaptive, feedback-driven prompt optimization.
https://www.uber.com/en-IN/blog/requirement-adherence-boosting-data-labeling-quality-using-llms/
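Here’s a minimal sketch of the atomic, parallelized-checks pattern as I read it; llm_judge is a hypothetical stand-in for a model call, not Uber’s API.

```python
# Each labeling rule becomes one small, independent LLM check, and all rules
# run concurrently per annotation. Rules and helper names are illustrative.
import asyncio

RULES = [  # as if extracted from a client SOP
    "Bounding boxes must not overlap more than 10%.",
    "Every vehicle must be labeled with a make and model.",
]

async def llm_judge(rule: str, annotation: dict) -> bool:
    """Hypothetical: ask an LLM whether `annotation` satisfies `rule`."""
    await asyncio.sleep(0.01)  # stand-in for the actual model call
    return True

async def check_annotation(annotation: dict) -> list[tuple[str, bool]]:
    # One atomic check per rule, fanned out in parallel.
    verdicts = await asyncio.gather(*(llm_judge(r, annotation) for r in RULES))
    return list(zip(RULES, verdicts))

results = asyncio.run(check_annotation({"labels": []}))
for rule, ok in results:
    print("PASS" if ok else "FLAG", "-", rule)
```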
Meta: Scaling Privacy Infrastructure for GenAI Product Innovation
GenAI introduces complex data flows that demand privacy-by-design infrastructure at a massive scale. Meta writes about its Privacy-Aware Infrastructure (PAI), which embeds automated data lineage, observability, and policy enforcement directly into GenAI product development—illustrated through its AI glasses use case. By mapping data movement end-to-end and enforcing policy zones across training and inference pipelines, PAI enables rapid GenAI innovation while maintaining verifiable privacy, transforming compliance from a constraint into an accelerator for responsible product development.
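As a toy reading of the policy-zone idea, here’s a sketch where data carries an allowed-purpose annotation and each pipeline stage declares its purpose before it may read the data; PAI’s real enforcement lives in Meta’s infrastructure, not in a Python decorator.

```python
# Data carries its allowed purposes; a stage declares its purpose and is
# checked before it can read the data. Entirely illustrative.
from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str
    allowed_purposes: set[str] = field(default_factory=set)

def requires_purpose(purpose: str):
    def wrap(stage):
        def run(dataset: Dataset, *args, **kwargs):
            if purpose not in dataset.allowed_purposes:
                raise PermissionError(
                    f"{stage.__name__} ({purpose}) cannot read {dataset.name}"
                )
            return stage(dataset, *args, **kwargs)
        return run
    return wrap

@requires_purpose("model_training")
def train(dataset: Dataset):
    print(f"training on {dataset.name}")

train(Dataset("glasses_telemetry", {"model_training"}))      # allowed
try:
    train(Dataset("support_tickets", {"customer_support"}))  # blocked
except PermissionError as e:
    print("blocked:", e)
```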
Han Fang, Karthik Abinav Sankararaman: Post-training 101
Transforming a pre-trained LLM into an aligned, instruction-following assistant requires careful post-training to balance capability with safety and usefulness. This guide walks through the full lifecycle—from Supervised Fine-Tuning (SFT) for teaching models to follow instructions, to Reinforcement Learning methods like RLHF, RLAIF, and RLVR that optimize for preferences or verifiable outcomes, and finally to robust evaluation frameworks combining ground-truth, LLM-judge, and human assessments.
https://tokens-for-thoughts.notion.site/post-training-101
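To make the SFT step concrete, here’s a small PyTorch sketch of the standard loss: cross-entropy over response tokens only, with prompt tokens masked out so the model learns to produce answers rather than re-predict the prompt. Shapes and the ignore_index convention follow common practice, not the guide’s code.

```python
# SFT loss: next-token cross-entropy restricted to the response span.
import torch
import torch.nn.functional as F

IGNORE = -100  # F.cross_entropy skips targets with this value

def sft_loss(logits: torch.Tensor, input_ids: torch.Tensor,
             prompt_len: int) -> torch.Tensor:
    # Next-token prediction: position t predicts token t+1.
    targets = input_ids[:, 1:].clone()
    targets[:, : prompt_len - 1] = IGNORE  # mask the prompt span
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=IGNORE,
    )

vocab, seq = 50_000, 16
logits = torch.randn(2, seq, vocab)       # stand-in for model output
input_ids = torch.randint(0, vocab, (2, seq))
print(sft_loss(logits, input_ids, prompt_len=6))
```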
All rights reserved, Dewpeche Pvt Ltd, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.