Try Fully Managed Apache Airflow for FREE
Run Airflow without the hassle and management complexity. Take Astro (the fully managed Airflow solution) for a test drive today and unlock a suite of features designed to simplify, optimize, and scale your data pipelines. For a limited time, new signups will receive a complimentary Airflow Fundamentals Certification exam (normally $150).
Editor’s Note: Fireside chat with Vinoth Chandar (Founder - CEO of OneHouse & chair of Apache Hudi)
Happy New Year 2025 to all our readers. We are back after a short end-of-year break. We published a series of blogs about the state of data engineering in 2024 and our predictions for 2025 in data here and here.
One of DEW’s predictions is LakeDB, where we predicted the LakeHouse formats would move from mere standards to fully functional databases. I’m excited to start this year with a conversation with Vinoth Chandar - Founder and CEO of OneHouse and PMC chair of Apache Hudi. I’m curious to know his thought process in an environment where everyone has declared that Apache Iceberg will win the future.
Join me in a conversation with Vinoth on the future of LakeHouses.
https://www.onehouse.ai/webinar/bridging-the-gap-a-database-experience-on-the-data-lake
Andy Pavlo: Databases in 2024: A Year in Review
But VCs want that money back and their trap full, so these companies turn out a hosted service for their DBMSs on the cloud. But the cloud makes open-source DBMSs a tricky business. If a system becomes too popular, then a cloud vendor (like Amazon) slaps it up as a service and makes more money than the company paying for the development of the software.
This is an excellent summary of the innovator’s dilemma facing every system in the tech industry that is built on top of cloud providers. S3 Tables is the latest example of a cloud vendor offering a popular technology as its own service.
https://www.cs.cmu.edu/~pavlo/blog/2025/01/2024-databases-retrospective.html
Latent Space: The 2025 AI Engineer Reading List
If you are looking to explore AI Engineering in 2025, the author compiled an excellent list of papers to read through to understand Frontier LLMs, Benchmarks and Evals, Prompting, ICL & Chain of Thought, Retrieval Augmented Generation, Agents, Code Generation, Vision, Voice, Image/Video Diffusion and Finetuning.
https://www.latent.space/p/2025-papers
Anthropic: Building effective agents
Agents are among the hottest predictions for 2025. Anthropic provides an excellent overview of building effective agents and has published a cookbook of agent-building patterns on GitHub.
https://www.anthropic.com/research/building-effective-agents
Sponsored: Apache Airflow® ETL/ELT Patterns: 7 DAG Code Examples
Download this eBook for 7 full example DAG patterns for different ETL and ELT scenarios.
Including:
→ ETL pattern DAG using Dynamic Task Mapping
→ ELT pattern DAG using explicit external storage
→ ETL DAG using XCom
🔗 Access the GitHub repository containing a fully functional Airflow project with all 7 DAGs configured to run out-of-the-box
Microsoft Research: PromptWizard: The future of prompt optimization through feedback-driven self-evolving prompts
Crafting effective prompts is often labor-intensive and domain-specific, requiring significant expertise and time. PromptWizard aims to automate and simplify this process with a framework built on a self-evolving, self-adaptive mechanism. Microsoft Research describes how PromptWizard optimizes both instructions and in-context learning examples, improving them continuously through feedback and synthesis so that prompts are tailored to the task.
GitHub: https://github.com/microsoft/PromptWizard
LinkedIn: Automated GenAI-driven search quality evaluation
Traditionally, assessing search relevance required extensive human judgment, which was both time-consuming and resource-intensive. LinkedIn attempts to solve this by curating a golden dataset for an LLM to train on search relevance assessment using typeahead functionality. The case study is a classic example of AI Engineering becoming a basic industry skill, much like SQL or data structures.
https://www.linkedin.com/blog/engineering/ai/automated-genai-driven-search-quality-evaluation
Netflix: A Survey of Analytics Engineering Work at Netflix
Netflix wrote a two-part summary of its internal analytics engineering summit. The blogs give a sneak peek into Netflix's internal analytics engineering practices, including creating its semantic layer (DataJunction), democratizing analytics with a Slack Bot analytical framework, its foundational data platform, and analytical models to measure signups, user journey mapping, and user acquisition.
Part 1: https://netflixtechblog.com/part-1-a-survey-of-analytics-engineering-work-at-netflix-d761cfd551ee
Part 2: https://netflixtechblog.com/part-2-a-survey-of-analytics-engineering-work-at-netflix-4f1f53b4ab0f
Zillow: Zkafka by Zillow: Go Library for Simplifying Kafka Consumption
Zillow introduces Zkafka, a Go library designed to simplify and enhance Kafka message consumption. Traditional Kafka processing faces challenges like head-of-line blocking and inefficient scaling tied to physical partitions. Zkafka addresses these with virtual partitions, which enable concurrent processing within a single Kafka partition while preserving message ordering. By leveraging lightweight goroutines for virtual partitions and implementing key-based routing and sequential offset commits, Zkafka ensures both scalability and reliability.
https://www.zillow.com/tech/zkafka-by-zillow-go-library-for-simplifying-kafka-consumption/
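To make the virtual-partition idea concrete, here is a minimal Python sketch of the three ingredients described above: key-based routing, one worker per virtual partition, and sequential offset commits. It is not Zkafka's API. Every name in it (`virtual_partition_for`, `OffsetTracker`, `process_partition`) is hypothetical, threads stand in for goroutines, a plain list stands in for a Kafka partition, and offsets are assumed to be contiguous.

```python
import queue
import threading
import zlib


def virtual_partition_for(key: bytes, num_virtual: int) -> int:
    """Key-based routing: the same key always maps to the same virtual partition."""
    return zlib.crc32(key) % num_virtual


class OffsetTracker:
    """Tracks finished offsets and exposes the next offset that is safe to commit.

    Assumption: offsets within the physical partition are contiguous integers.
    """

    def __init__(self, start_offset: int = 0):
        self._lock = threading.Lock()
        self._done = set()
        self._next = start_offset  # lowest offset that has not finished yet

    def mark_done(self, offset: int) -> None:
        with self._lock:
            self._done.add(offset)
            # Advance only across a gap-free run of finished offsets.
            while self._next in self._done:
                self._done.remove(self._next)
                self._next += 1

    def committable(self) -> int:
        with self._lock:
            return self._next


def process_partition(messages, handler, num_virtual=4):
    """Process (offset, key, value) tuples from one physical partition concurrently.

    Returns the offset that could be committed once all workers have finished.
    """
    queues = [queue.Queue() for _ in range(num_virtual)]
    tracker = OffsetTracker(start_offset=messages[0][0] if messages else 0)

    def worker(q):
        # One worker per virtual partition drains its queue in FIFO order,
        # which preserves ordering for every key routed to it.
        while True:
            item = q.get()
            if item is None:  # sentinel: no more messages
                return
            offset, key, value = item
            handler(key, value)
            tracker.mark_done(offset)

    threads = [threading.Thread(target=worker, args=(q,)) for q in queues]
    for t in threads:
        t.start()
    for offset, key, value in messages:
        queues[virtual_partition_for(key, num_virtual)].put((offset, key, value))
    for q in queues:
        q.put(None)
    for t in threads:
        t.join()
    return tracker.committable()
```

A slow message now delays only the keys that share its virtual partition rather than the whole physical partition. The tracker never advances past an unfinished offset, so a restart would replay work instead of skipping it.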
Alibaba: Why Fluss? Top 4 Challenges of Using Kafka for Real-Time Analytics
Kafka is undeniably one of the most robust open-source systems. Its wide adoption powers many mission-critical systems. In many system design interviews, candidates mention Kafka as a default component.
Is Kafka the appropriate component for an analytical engineer? Real-time analytics often requires an OLAP system or a stream processing engine like Apache Flink. Fluss is an exciting open-source system I’m looking forward to in 2025, and it is trying to address these challenges of using Apache Kafka for real-time analytics.
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.