Annual Report: The State of Apache Airflow® 2025
DataOps on Apache Airflow® is powering the future of business – this report reviews responses from 5,000+ data practitioners to reveal how and what’s coming next.
Editor’s Note: Data Council 2025, Apr 22-24, Oakland, CA
Data Council has always been one of my favorite events to connect with and learn from the data engineering community. Data Council 2025 is set for April 22-24 in Oakland, CA. As a special perk for Data Engineering Weekly subscribers, you can use the code dataeng20 for an exclusive 20% discount on tickets!
https://www.datacouncil.ai/bay-2025
Kristina Nikolova: Tech Layoffs Analysis: Which Skills Are Still in High Demand
There is growing concern about AI's impact on the knowledge workforce. As the level of software abstraction changes, it is critical to understand which skills are in growing demand and where upskilling is needed. I found the blog to be a fresh take, using layoff datasets to identify the skills still in demand.
https://semaphore.io/blog/tech-layoffs
Mehdio: DuckDB goes distributed? DeepSeek’s smallpond Takes on Big Data.
DeepSeek continues to impact the Data and AI landscape with its recent open-source tools, such as Fire-Flyer File System (3FS) and smallpond. The blog provides an excellent analysis of smallpond compared to Spark and Daft.
Our internal benchmark of the NYC dataset shows a 48% performance gain of smallpond over Spark!!
https://mehdio.substack.com/p/duckdb-goes-distributed-deepseeks
Sneha Ghantasala: Slow Reads for S3 Files in Pandas & How to Optimize it
DeepSeek’s Fire-Flyer File System (3FS) is a reminder of how much an optimized file system matters for efficient data processing. The industry largely relies on S3 as its de facto data storage, and I found the experiments on optimizing S3 reads to be an excellent reference.
Sponsored: Datasets & Data-Aware Scheduling in Airflow
Datasets and data-driven scheduling are among the features most adopted by Airflow users: 48% of respondents in the recent Airflow survey said they already use them, and nearly 30% are asking for them to be expanded! Whether you use Datasets already or want to get started, we've got you covered!
Hien Luu & Robert Krzaczyński: Prompt Engineering: Challenges, Strengths, and Its Place in Software Development's Future
The blog explores prompt engineering as a bridge between natural language and AI tasks, contrasting it with traditional programming. While prompt engineering’s lower learning curve and accessibility make it a valuable complement, it falls short in precision, reliability, and scalability. The conclusion is that prompt engineering will enhance rather than replace traditional programming long-term.
https://www.infoq.com/articles/prompt-engineering/
Miles Cole: Mastering Spark: The Art and Science of Table Compaction
The blog explores various strategies for table compaction in data engineering, focusing on Delta Lake, Hudi, and Iceberg. It evaluates methods including no compaction, pre-write optimized writes, scheduled compaction, and automatic compaction, ultimately recommending automatic compaction for its simplicity and consistent performance. The author highlights that while scheduled or manual compaction may occasionally still be necessary for larger datasets, enabling automatic compaction reduces complexity and ensures stable read and write performance over time.
https://milescole.dev/data-engineering/2025/02/26/The-Art-and-Science-of-Table-Compaction.html
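To make the scheduled-compaction idea above concrete, here is a minimal pure-Python sketch of the planning step: grouping small files into batches that each rewrite to roughly one target-sized file. It is an illustration only, not how Delta Lake, Hudi, or Iceberg implement compaction; the function name `plan_compaction`, the 128 MB target, and the `min_files` threshold are all assumptions made for the example.

```python
def plan_compaction(file_sizes_mb, target_mb=128, min_files=2):
    """Group small files into batches of at most target_mb each.

    Hypothetical helper for illustration; real table formats use their
    own planners and thresholds.
    """
    # Files already at or above the target size are left alone.
    small = sorted((s for s in file_sizes_mb if s < target_mb), reverse=True)

    batches = []
    current, current_size = [], 0
    for size in small:
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)

    # Rewriting a single file gains nothing, so drop one-file batches.
    return [b for b in batches if len(b) >= min_files]


if __name__ == "__main__":
    sizes = [200, 60, 50, 30, 10, 5]  # file sizes in MB (made-up values)
    for batch in plan_compaction(sizes):
        print(f"rewrite {len(batch)} files ({sum(batch)} MB) into one file")
```

Automatic compaction, as described in the blog, amounts to running a check like this as part of the write path rather than on a separate schedule.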
Netflix: Cloud Efficiency at Netflix
Data is the Key
Optimization starts with collecting data and asking the right questions. Netflix writes an excellent article describing its approach to cloud efficiency, from collecting data to questioning the business process.
https://netflixtechblog.com/cloud-efficiency-at-netflix-f2a142955f83
Adevinta: From Lakehouse architecture to data mesh
One of DEW’s 2025 predictions is that we will see increased adoption of data mesh principles. Adevinta writes about transforming its data infrastructure from a lakehouse architecture to a data mesh, leveraging Databricks and initiatives like data contracts and data product frameworks. Key highlights include:
Using data contracts for source-aligned data products (bronze layer).
Creating "one big table" for domain-aggregated data products (silver layer).
Implementing a "consuming suite" for consumer-aligned data products and prototyping (gold layer).
The approach automates governance and enables scalable, decentralized data product creation.
https://medium.com/adevinta-tech-blog/from-lakehouse-architecture-to-data-mesh-c532c91f7b61
Cloudflare: Over 700 million events/second: How we make sense of too much data
Cloudflare writes about how it manages and extracts value from a massive data pipeline ingesting over 700 million events per second, using controlled downsampling (via techniques like "bottomless buffers" and adaptive sampling) to handle potential data loss. The blog explains how the Horvitz-Thompson estimator is used to derive accurate analytics and confidence intervals from sampled data, illustrates a real-world example of how incorrect sampling can lead to biased results, and describes how these techniques are exposed in its analytics APIs.
https://blog.cloudflare.com/how-we-make-sense-of-too-much-data/
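For readers unfamiliar with the Horvitz-Thompson estimator mentioned above, a minimal sketch follows. It assumes each retained event carries the probability with which it was sampled, and that events are sampled independently of one another (Poisson sampling); the function names and example values are hypothetical and are not taken from Cloudflare's code.

```python
import math


def ht_total(sampled):
    """Horvitz-Thompson estimate of a population total.

    `sampled` is an iterable of (value, inclusion_probability) pairs for
    the events that survived sampling. Each value is weighted by 1/p.
    """
    return sum(value / p for value, p in sampled)


def ht_variance_poisson(sampled):
    """Variance estimate for ht_total, valid only under the assumption
    that each event is kept independently with its own probability."""
    return sum((1 - p) / (p * p) * value * value for value, p in sampled)


def ht_confidence_interval(sampled, z=1.96):
    """Approximate 95% interval using a normal approximation."""
    total = ht_total(sampled)
    half_width = z * math.sqrt(ht_variance_poisson(sampled))
    return total - half_width, total + half_width


if __name__ == "__main__":
    # (bytes, sampling probability) for three retained events; made-up values.
    events = [(10, 0.5), (4, 0.1), (7, 1.0)]
    print("estimated total:", ht_total(events))
    print("approx. 95% interval:", ht_confidence_interval(events))
```

An event kept with probability 0.1 stands in for ten events like it, which is why heavily downsampled data contributes both more weight and more uncertainty to the estimate.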
State Farm: When the Spark Execution Plan Gets Too Big
State Farm writes about strategies for handling large Apache Spark execution plans caused by extensive data transformations. The blog compares caching, checkpointing, local checkpointing, temporary writes, and rebuilding from RDDs, emphasizing their impact on performance, fault tolerance, and memory usage. The analysis shows that while local checkpointing is often the most efficient, checkpointing or temporary writes offer more reliability.
https://engineering.statefarm.com/when-the-spark-execution-plan-gets-too-big-eb658872d603
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.