Data Engineering Weekly #268
The Weekly Data Engineering Newsletter
Free Course: AI-Driven Data Engineering
AI coding agents are changing how data engineers work. This Dagster University course shows how to build a production-ready ELT pipeline from prompts while learning practical patterns for reliable AI-assisted development.
This course is designed for engineers exploring agentic coding workflows, as well as for those who want to learn Dagster or become Dagster power users.
Event Highlight: Don't Miss AI Council - The technical conference for humans who ship
Join the people actually building AI & data infrastructure - and hear them share what’s working, what broke in prod, and what’s coming next. May 12–14 in San Francisco.
Speakers include the co-inventor of ChatGPT, the creator of DuckDB, the creator of Codex, and engineers from ClickHouse, Databricks, Datadog, and LangChain.
→ Save 20% on your ticket with code DATAEW20 through 5/5
Grab: Data Mesh at Grab Part II: The Foundational Tools behind Certification
Scaling a data mesh across hundreds of thousands of assets demands enforceable trust, consistent governance, and reliable quality without centralized bottlenecks. Grab operationalizes certification through three integrated platforms — Hubble for metadata and event-driven certification, Genchi for continuous quality validation, and a Data Contract Registry enforcing producer-consumer agreements. The system converts data mesh principles into a metadata-driven workflow, reducing dataset sprawl and driving convergence toward certified, reusable assets anchored to analytics and AI workloads.
https://engineering.grab.com/data-mesh-2
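To make the certification workflow concrete, here is a minimal Python sketch of an event-driven certification check, assuming hypothetical rule names and a simplified metadata model; Grab's actual Hubble and Genchi APIs are internal and not shown in the post.

```python
from dataclasses import dataclass

# Hypothetical sketch in the spirit of Grab's Hubble workflow: a metadata
# event triggers rule evaluation, and the asset is certified only if every
# governance rule passes. All names here are illustrative, not Grab's.

@dataclass
class DatasetMetadata:
    name: str
    owner: str | None
    has_data_contract: bool
    quality_checks_passed: bool

def certification_rules(meta: DatasetMetadata) -> list[tuple[str, bool]]:
    """Each rule maps to one pillar from the post: ownership, contracts
    (Data Contract Registry), and quality (Genchi)."""
    return [
        ("has_owner", meta.owner is not None),
        ("contract_registered", meta.has_data_contract),
        ("quality_validated", meta.quality_checks_passed),
    ]

def on_metadata_event(meta: DatasetMetadata) -> str:
    failed = [name for name, ok in certification_rules(meta) if not ok]
    return "CERTIFIED" if not failed else f"UNCERTIFIED: {', '.join(failed)}"

if __name__ == "__main__":
    meta = DatasetMetadata("orders_daily", "data-platform", True, True)
    print(on_metadata_event(meta))  # CERTIFIED
```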
Doug Turnbull: Can agents replace the search stack?
Search APIs depend on layered pipelines for query understanding, retrieval, and reranking, creating complexity that struggles to adapt across heterogeneous user intents. The author writes about testing GPT-5 and GPT-5-mini agents equipped with basic BM25 and e5 embedding tools on Amazon ESCI, lifting NDCG from 0.289 to 0.453; exploration prompts and duplicate-query rejection further closed the gap for smaller models. The findings reframe retrieval for product-style “finding things” workloads as an agent-driven loop. However, knowledge-gap tasks like MS MARCO still favor traditional embedding stacks anchored in dense-retrieval quality.
https://softwaredoug.com/blog/2026/04/28/search-apis-replaced-by-agents
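As a rough illustration of the agent-driven retrieval loop, the sketch below pairs a toy BM25 scorer with duplicate-query rejection over a three-document corpus; the agent's query reformulations are stubbed in as a plain list, standing in for the GPT-5 tool calls in the post.

```python
import math
from collections import Counter

# Toy agent-as-retriever loop: the "agent" issues search-tool calls,
# rejects duplicate queries, and accumulates candidates. Corpus, queries,
# and the budget are invented for illustration.

DOCS = {
    "d1": "red running shoes for women",
    "d2": "blue trail running shoes",
    "d3": "leather dress shoes men",
}

def bm25_scores(query: str, k1: float = 1.5, b: float = 0.75) -> dict[str, float]:
    """Minimal BM25 over the toy corpus (whitespace tokens, no stemming)."""
    tokenized = {d: text.split() for d, text in DOCS.items()}
    avgdl = sum(len(t) for t in tokenized.values()) / len(tokenized)
    n = len(tokenized)
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.split():
            df = sum(1 for t in tokenized.values() if term in t)
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            f = tf[term]
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(toks) / avgdl))
        scores[doc_id] = score
    return scores

def agent_search(user_intent: str, reformulations: list[str], budget: int = 5):
    seen: set[str] = set()
    candidates: dict[str, float] = {}
    for query in [user_intent, *reformulations][:budget]:
        if query in seen:  # duplicate-query rejection, as in the post
            continue
        seen.add(query)
        for doc_id, s in bm25_scores(query).items():
            candidates[doc_id] = max(candidates.get(doc_id, 0.0), s)
    return sorted(candidates.items(), key=lambda kv: -kv[1])

print(agent_search("running shoes", ["women running shoes", "running shoes"]))
```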
Pinterest: Optimizing ML Workload Network Efficiency (Part I): Feature Trimmer
Root-leaf ML serving architectures unlock GPU specialization but bottleneck on network bandwidth when fanning out feature payloads to partitioned model inference. Pinterest writes about building Feature Trimmer, a “Send What You Use” system that treats the model signature as the source of truth — version-aware allowlists sync with bundle deployments through the same staged rollout, fallback, and concurrency safeguards as model configs.
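A minimal sketch of the "Send What You Use" idea, assuming hypothetical version and feature names: the root trims each payload to the allowlist derived from the deployed model signature, falling back to the full payload for unknown versions, mirroring the fallback safeguard the post describes.

```python
# Version-aware allowlists derived from model signatures, kept in sync with
# bundle deployments. Names and features here are invented, not Pinterest's.
SIGNATURE_ALLOWLISTS: dict[str, set[str]] = {
    "ranker_v41": {"user_id", "pin_embedding", "recent_clicks"},
    "ranker_v42": {"user_id", "pin_embedding", "recent_clicks", "save_rate"},
}

def trim_features(payload: dict, model_version: str) -> dict:
    """Drop every feature the deployed model signature does not consume;
    send everything if the version is unknown (safe fallback)."""
    allowlist = SIGNATURE_ALLOWLISTS.get(model_version)
    if allowlist is None:
        return payload  # fallback: unknown version, send the full payload
    return {k: v for k, v in payload.items() if k in allowlist}

full = {"user_id": 7, "pin_embedding": [0.1, 0.2], "recent_clicks": 3,
        "save_rate": 0.4, "legacy_feature": "unused"}
print(trim_features(full, "ranker_v41"))  # save_rate and legacy_feature dropped
```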
Sponsored: The AI Modernization Guide
Will your data platform accelerate your AI initiatives or become their biggest bottleneck? Learn how to build a data platform that's ready for AI:
- Transform from Big Complexity to AI-ready architecture
- Real metrics from organizations achieving 50% cost reductions
- Introduction to Components: YAML-first pipelines that AI can build
Pinterest: From Clicks to Conversions: Architecting Shopping Conversion Candidate Generation at Pinterest
Optimizing ad retrieval for offsite conversions requires modeling sparse, noisy, delayed signals that engagement-based candidate generators were never designed to surface. Pinterest writes about building a two-tower shopping conversion retrieval model that unifies conversions and click-duration-weighted engagement under a single multi-task head, paired with a parallel DCN v2 and MLP cross-layer architecture and an advertiser-level loss that stabilizes sparse Pin-level supervision.
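For intuition, here is a deliberately simplified PyTorch sketch of the two-tower shape with a single multi-task objective; the DCN v2 cross layers and advertiser-level loss are omitted, and all dimensions, labels, and loss weights are invented for illustration.

```python
import torch
import torch.nn as nn

# Two towers produce normalized embeddings; their dot product is a retrieval
# logit supervised jointly by a sparse conversion label and a dense
# click-duration-weighted engagement target.

class Tower(nn.Module):
    def __init__(self, in_dim: int, emb_dim: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim)
        )

    def forward(self, x):
        return nn.functional.normalize(self.mlp(x), dim=-1)

user_tower, item_tower = Tower(16), Tower(24)
bce = nn.BCEWithLogitsLoss(reduction="none")

users, items = torch.randn(8, 16), torch.randn(8, 24)
conversion = torch.randint(0, 2, (8,)).float()  # sparse conversion label
engagement = torch.rand(8)                      # duration weight in [0, 1]

logits = (user_tower(users) * item_tower(items)).sum(-1) * 5.0  # scaled similarity
# One logit, two supervision signals: the dense engagement target acts as an
# auxiliary task that stabilizes the sparse conversion signal.
loss = (bce(logits, conversion) + 0.3 * bce(logits, engagement)).mean()
loss.backward()
print(float(loss))
```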
Fivetran: How we accelerated transpilation by compiling SQLGlot with mypyc
Fivetran writes about compiling SQLGlot with mypyc, transpiling type-annotated Python into C extensions, contributing five upstream string primitives, and inlining hot paths such as sentinel tokens, native i64 integers, and pre-built dispatch dictionaries, all while preserving the pure-Python install path. The compiled sqlglot[c] package accelerates parsing by 5x, SQL generation by 2.5x, and optimizer passes by 2-2.5x, anchored to a dual-distribution model that keeps SQL transpilation portable across multi-engine data lake architectures.
https://www.fivetran.com/blog/how-we-accelerated-transpilation-by-compiling-sqlglot-with-mypyc
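The user-facing API is unchanged by compilation, so the same code runs against either distribution; here is a quick transpilation example (the compiled extra's install name follows the post's sqlglot[c], which is worth confirming against the project docs).

```python
import sqlglot

# Transpile a Spark SQL expression to DuckDB. The post's speedups apply to
# exactly these phases: parsing (~5x), generation (~2.5x), optimizer (~2-2.5x).

spark_sql = "SELECT DATE_ADD(order_date, 7) AS due_date FROM orders"

# One-shot transpile, e.g. rewriting Spark's DATE_ADD into DuckDB date arithmetic:
print(sqlglot.transpile(spark_sql, read="spark", write="duckdb")[0])

# Or as separate parse and generate steps:
parsed = sqlglot.parse_one(spark_sql, read="spark")  # parse phase
print(parsed.sql(dialect="duckdb"))                  # generation phase
```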
Robin Moffatt: Materialized Tables in Apache Flink
Streaming SQL frameworks split table definitions from the population logic, leaving INSERT jobs orphaned across restarts and forcing operators to manage schema evolution and lifecycle as separate concerns. The author walks through Flink 2.2’s Materialized Tables, which bind the refresh query to the table definition and support CONTINUOUS or FULL refresh modes, partition-scoped reloads, suspend/resume via savepoints, and unified batch-streaming semantics through a single FRESHNESS parameter. The construct collapses three legacy patterns — CREATE/INSERT, CTAS, and external schedulers — into a single durable object. However, catalog support beyond Paimon and the embedded scheduler remains anchored in early-stage maturity gaps.
https://rmoff.net/2026/04/28/materialized-tables-in-apache-flink/
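A rough sketch of the DDL shape the post walks through, submitted here via PyFlink; actually running it requires a Flink 2.2 deployment with a catalog that supports materialized tables (such as Paimon) and the SQL gateway's scheduler, and the table and column names below are invented.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# The refresh query is bound to the table definition itself; FRESHNESS is
# the single knob from which Flink derives CONTINUOUS vs FULL refresh.

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE MATERIALIZED TABLE order_stats
    FRESHNESS = INTERVAL '1' MINUTE
    AS SELECT customer_id, COUNT(*) AS orders, SUM(amount) AS revenue
       FROM orders
       GROUP BY customer_id
""")

# Lifecycle is managed on the same durable object rather than on an
# orphaned INSERT job:
# t_env.execute_sql("ALTER MATERIALIZED TABLE order_stats SUSPEND")
# t_env.execute_sql("ALTER MATERIALIZED TABLE order_stats RESUME")
```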
Alexey Makhotkin: 5NF and Database Design
Traditional database normalization tutorials present 5NF through contrived table-splitting exercises that obscure the underlying business logic and leave practitioners unable to apply it in practice. The author reframes 5NF design around two logical patterns: the AB-BC-AC triangle for independent M:N relationships across three anchors, and the ABC+D star pattern, where a fourth entity binds three 1:N links, thereby driving table construction directly from business requirements rather than from post hoc decomposition. The approach replaces 5NF reasoning with a deterministic logical-to-physical workflow built on anchor-link modeling, producing normalized schemas without invoking decomposition theorems.
https://kb.databasedesignbook.com/posts/5nf/
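The triangle pattern is easiest to see in the classic supplier/part/project example; the sqlite3 sketch below (illustrative names, not the author's) records each of the three independent M:N relationships in its own pairwise link table instead of one three-way table.

```python
import sqlite3

# AB-BC-AC triangle: three anchors, three pairwise link tables. A single
# supplier_part_project table would encode a join dependency; keeping the
# relationships independent is precisely what 5NF requires here.

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier (id INTEGER PRIMARY KEY);
    CREATE TABLE part     (id INTEGER PRIMARY KEY);
    CREATE TABLE project  (id INTEGER PRIMARY KEY);

    CREATE TABLE supplier_part (
        supplier_id INTEGER REFERENCES supplier(id),
        part_id     INTEGER REFERENCES part(id),
        PRIMARY KEY (supplier_id, part_id)
    );
    CREATE TABLE part_project (
        part_id    INTEGER REFERENCES part(id),
        project_id INTEGER REFERENCES project(id),
        PRIMARY KEY (part_id, project_id)
    );
    CREATE TABLE supplier_project (
        supplier_id INTEGER REFERENCES supplier(id),
        project_id  INTEGER REFERENCES project(id),
        PRIMARY KEY (supplier_id, project_id)
    );
""")
print("triangle schema created")
```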
ultrathink: SQLite in Production: Lessons from Running a Store on a Single File
Single-file embedded databases promise operational simplicity, but their filesystem-level locking model breaks down when modern container orchestration introduces concurrent writers across overlapping deploys. Ultrathink writes about running a production e-commerce store on Rails 8 with four SQLite databases on a shared Docker volume, diagnosing lost orders through sqlite_sequence after eleven rapid Kamal blue-green deploys caused overlapping containers to corrupt WAL writes despite successful Stripe charges.
https://ultrathink.art/blog/sqlite-in-production-lessons
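For context, a typical defensive connection setup looks like the sketch below; note that WAL mode and busy timeouts only serialize writers within one host, and, as the post demonstrates, cannot protect against two overlapping containers writing to the same volume. The single-writer guarantee has to come from the deploy orchestration itself.

```python
import sqlite3

# Defensive single-writer SQLite setup: WAL lets readers proceed while one
# writer holds the lock, and busy_timeout retries instead of erroring.
# None of this helps if a blue-green deploy briefly runs two containers
# against the same database file.

def connect(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path, timeout=5.0)    # wait on the file lock
    conn.execute("PRAGMA journal_mode=WAL")      # readers don't block the writer
    conn.execute("PRAGMA busy_timeout=5000")     # retry for 5s before failing
    conn.execute("PRAGMA synchronous=NORMAL")    # common WAL-mode pairing
    return conn

conn = connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("INSERT INTO orders (total) VALUES (?)", (42.0,))
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone())
```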
Capital One: Spark tuning: executor optimization for performance
Spark applications often underperform when executors are left at default configuration, leaving CPU cores and memory underutilized while introducing fault-tolerance risks and network overhead across worker nodes. Capital One Tech walks through executor sizing trade-offs (fat executors maximize data locality but concentrate failure risk; thin executors improve parallelism but flood the network) and codifies an optimal configuration recipe: reserve cores and memory for OS overhead, cap executors at 3-5 cores, and account for the max(384 MB, 10%) memory overhead. The framework converts executor tuning from guesswork into a deterministic sizing exercise, anchored to balanced parallelism, fault tolerance, and resource utilization across distributed Spark clusters.
https://medium.com/capital-one-tech/spark-tuning-executor-optimization-for-performance-c757b39f0efe
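The recipe reduces to arithmetic, so it fits in a few lines; the sketch below implements the common form of it, where the one-core/one-GB OS reservation and the single-executor subtraction for the driver are the standard rule-of-thumb values rather than figures quoted from the article.

```python
# Deterministic executor sizing per the recipe: reserve resources for the OS,
# cap executors at 5 cores, then back out per-executor heap after Spark's
# max(384 MB, 10%) memory overhead.

def size_executors(node_cores: int, node_mem_gb: float, num_nodes: int,
                   cores_per_executor: int = 5):
    usable_cores = node_cores - 1        # reserve 1 core for OS/daemons
    usable_mem = node_mem_gb - 1.0       # reserve ~1 GB for the OS
    execs_per_node = usable_cores // cores_per_executor
    mem_per_exec = usable_mem / execs_per_node
    overhead = max(0.384, 0.10 * mem_per_exec)   # max(384 MB, 10%)
    heap = mem_per_exec - overhead
    return {
        "num_executors": execs_per_node * num_nodes - 1,  # -1 for the driver
        "executor_cores": cores_per_executor,
        "executor_memory_gb": round(heap, 1),
    }

# e.g. 10 nodes with 16 cores / 64 GB each:
print(size_executors(node_cores=16, node_mem_gb=64, num_nodes=10))
# {'num_executors': 29, 'executor_cores': 5, 'executor_memory_gb': 18.9}
```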
All rights reserved, Dewpeche Private Limited. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.