Data Engineering Weekly #210

The Weekly Data Engineering Newsletter

Mar 03, 2025

Annual Report: The State of Apache Airflow® 2025

DataOps on Apache Airflow® is powering the future of business – this report reviews responses from 5,000+ data practitioners to reveal how and what’s coming next.


Editor’s Note: Data Council 2025, Apr 22-24, Oakland, CA

Data Council has always been one of my favorite events to connect with and learn from the data engineering community. Data Council 2025 is set for April 22-24 in Oakland, CA. As a special perk for Data Engineering Weekly subscribers, you can use the code dataeng20 for an exclusive 20% discount on tickets!

https://www.datacouncil.ai/bay-2025

Kristina Nikolova: Tech Layoffs Analysis: Which Skills Are Still in High Demand

There is growing concern about AI's impact on the knowledge workforce. As software abstractions change, it is critical to understand which skills are in growing demand and where upskilling is needed. I found the blog to be a fresh take on reading in-demand skills from layoff datasets.

https://semaphore.io/blog/tech-layoffs

Mehdio: DuckDB goes distributed? DeepSeek’s smallpond Takes on Big Data.

DeepSeek continues to impact the Data and AI landscape with its recent open-source tools, such as Fire-Flyer File System (3FS) and smallpond. The blog provides an excellent analysis of smallpond compared to Spark and Daft.

Our internal benchmark of the NYC dataset shows a 48% performance gain of smallpond over Spark!!

https://mehdio.substack.com/p/duckdb-goes-distributed-deepseeks

Sneha Ghantasala: Slow Reads for S3 Files in Pandas & How to Optimize it

DeepSeek’s Fire-Flyer File System (3FS) renews attention on the importance of an optimized file system for efficient data processing. The industry relies more or less on S3 as its de facto data storage, and I found the experiments on optimizing S3 reads to be an excellent reference.

https://medium.com/tr-labs-ml-engineering-blog/slow-reads-for-s3-files-in-pandas-how-to-optimize-it-c3bfdb947a70

Hien Luu & Robert Krzaczyński: Prompt Engineering: Challenges, Strengths, and Its Place in Software Development's Future

The blog explores prompt engineering as a bridge between natural language and AI tasks, contrasting it with traditional programming. While prompt engineering’s lower learning curve and accessibility make it a valuable complement, it falls short in precision, reliability, and scalability. The conclusion is that prompt engineering will enhance rather than replace traditional programming long-term.

https://www.infoq.com/articles/prompt-engineering/

Miles Cole: Mastering Spark: The Art and Science of Table Compaction

The blog explores various strategies for table compaction in data engineering, focusing on Delta Lake, Hudi, and Iceberg. It evaluates methods including no compaction, pre-write optimized writes, scheduled compaction, and automatic compaction, ultimately recommending automatic compaction for its simplicity and consistent performance. The author highlights that while scheduled or manual compaction may occasionally still be necessary for larger datasets, enabling automatic compaction reduces complexity and ensures stable read and write performance over time.
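
To make the idea of compaction concrete, here is a toy sketch of how a compaction job might group small files into rewrite batches close to a target size. It is a plain greedy bin-packing illustration, not the algorithm used by Delta Lake, Hudi, or Iceberg, and the function name, file names, and thresholds are all hypothetical.

```python
def plan_compaction(file_sizes_mb, target_mb=128, small_threshold_mb=32):
    """Group files smaller than small_threshold_mb into bins of at most target_mb.

    file_sizes_mb maps a (hypothetical) file name to its size in MB.
    Returns a list of bins; each bin is a list of file names to rewrite as one file.
    """
    # Only small files are candidates; large files are left untouched.
    small = sorted(
        (size, name)
        for name, size in file_sizes_mb.items()
        if size < small_threshold_mb
    )

    bins, current, current_size = [], [], 0
    for size, name in small:
        # Close the current bin when adding this file would exceed the target.
        if current and current_size + size > target_mb:
            if len(current) > 1:  # rewriting a single file gains nothing
                bins.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size

    if len(current) > 1:
        bins.append(current)
    return bins


# Made-up example: four small files and one large file.
files = {"a": 10, "b": 20, "c": 200, "d": 5, "e": 30}
plan = plan_compaction(files, target_mb=40, small_threshold_mb=32)
print(plan)
```

In this sketch the large file is never rewritten, and a leftover bin holding a single file is skipped, since rewriting it would not reduce the file count.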

https://milescole.dev/data-engineering/2025/02/26/The-Art-and-Science-of-Table-Compaction.html

Netflix: Cloud Efficiency at Netflix

Data is the Key

Optimization starts with collecting data and asking the right questions. Netflix writes an excellent article describing its approach to cloud efficiency, from data collection to questioning the business process.

https://netflixtechblog.com/cloud-efficiency-at-netflix-f2a142955f83

Adevinta: From Lakehouse architecture to data mesh

One of DEW’s 2025 predictions is that we will see increased adoption of data mesh principles. Adevinta writes about transforming its data infrastructure from a lakehouse architecture to a data mesh, leveraging Databricks and initiatives like data contracts and data product frameworks. Key highlights include:

Using data contracts for source-aligned data products (bronze layer).
Creating "one big table" for domain-aggregated data products (silver layer).
Implementing a "consuming suite" for consumer-aligned data products and prototyping (gold layer).

The approach automates governance and enables scalable, decentralized data product creation.

Layers in the lakehouse, following the medallion architecture. The bronze, silver and gold layers all control different aspects of the architecture and data products

https://medium.com/adevinta-tech-blog/from-lakehouse-architecture-to-data-mesh-c532c91f7b61

Cloudflare: Over 700 million events/second: How we make sense of too much data

Cloudflare writes about how it manages and extracts value from a massive data pipeline ingesting over 700 million events per second, using controlled downsampling (via techniques like "bottomless buffers" and adaptive sampling) to handle potential data loss. The blog explains how the Horvitz-Thompson estimator is used to derive accurate analytics and confidence intervals from sampled data, illustrates a real-world example of how incorrect sampling can lead to biased results, and describes how these techniques are exposed in its analytics APIs.
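
As a rough illustration of the estimator mentioned above, the sketch below applies Horvitz-Thompson weighting to a synthetic event stream. It assumes every event is kept independently with a known probability; the data, the 1% sampling rate, and the function names are made up for the example and are not Cloudflare's implementation.

```python
import math
import random


def ht_total(values, probs):
    """Horvitz-Thompson estimate of the population total: sum of y_i / pi_i."""
    return sum(y / p for y, p in zip(values, probs))


def ht_variance(values, probs):
    """Variance estimate for the total, valid when items are sampled independently."""
    return sum((1 - p) / (p * p) * y * y for y, p in zip(values, probs))


# Synthetic stand-in for an event stream: bytes per event (made-up data).
random.seed(0)
events = [random.randint(100, 1500) for _ in range(100_000)]

# Keep each event independently with probability 1%.
rate = 0.01
sample = [e for e in events if random.random() < rate]
probs = [rate] * len(sample)

estimate = ht_total(sample, probs)
# Approximate 95% interval half-width using a normal approximation.
margin = 1.96 * math.sqrt(ht_variance(sample, probs))
true_total = sum(events)
print(f"true={true_total} estimate={estimate:.0f} +/- {margin:.0f}")
```

Each sampled value is divided by its inclusion probability, so an event kept with probability 1% stands in for roughly 100 events. That weighting is what keeps the estimate unbiased even when different events are sampled at different rates.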

https://blog.cloudflare.com/how-we-make-sense-of-too-much-data/

State Farm: When the Spark Execution Plan Gets Too Big

State Farm writes about strategies for handling large Apache Spark execution plans caused by extensive data transformations. The blog compares caching, checkpointing, local checkpointing, temporary writes, and rebuilding from RDDs, emphasizing their impact on performance, fault tolerance, and memory usage. The analysis shows that while local checkpointing is often the most efficient, checkpointing or temporary writes offer more reliability.
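
One common way to apply these ideas is to truncate the plan every few steps in a long chain of transformations. The helper below is a minimal sketch under that assumption: the function name and the `every` parameter are hypothetical, and it relies only on the PySpark `DataFrame.checkpoint` and `DataFrame.localCheckpoint` methods. It is not the code from the State Farm article.

```python
def apply_with_plan_truncation(df, transforms, every=10, mode="local_checkpoint"):
    """Apply transforms in order, truncating the logical plan every `every` steps.

    df         -- a PySpark DataFrame (anything exposing checkpoint/localCheckpoint)
    transforms -- list of functions, each taking and returning a DataFrame
    mode       -- "checkpoint" (reliable storage) or "local_checkpoint" (executor storage)
    """
    if mode not in ("checkpoint", "local_checkpoint"):
        raise ValueError(f"unknown mode: {mode}")

    for step, transform in enumerate(transforms, start=1):
        df = transform(df)
        if step % every == 0:
            if mode == "checkpoint":
                # Requires spark.sparkContext.setCheckpointDir(...) to be set first.
                df = df.checkpoint(eager=True)
            else:
                # Faster, but the data is lost if an executor fails.
                df = df.localCheckpoint(eager=True)
    return df
```

With a real session this might be called as `apply_with_plan_truncation(df, steps, every=20, mode="checkpoint")` after setting a checkpoint directory. The right interval depends on the workload, so treat `every` as a value to tune.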

https://engineering.statefarm.com/when-the-spark-execution-plan-gets-too-big-eb658872d603

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
