Try Fully Managed Apache Airflow for FREE
Astro is the fully-managed DataOps platform powered by Apache Airflow. With Astro, you can build, run, and observe your data pipelines in one place, ensuring your mission critical data is delivered on time.
Julia Wiesinger, Patrick Marlow, and Vladimir Vuskovic: Agents
The combination of reasoning, logic, and access to external information that are all connected to a Generative AI model invokes the concept of an agent.
The white paper explores the definition of agents, their main components, and example code.
https://www.kaggle.com/whitepaper-agents
Jack Vanlightly: AI Agents in 2025
AI agents are a fast-moving discipline, and as with any rapidly developing field, it is hard to keep track of the progress. The author does an excellent job summarizing the state of AI agents from various recent blogs.
https://jack-vanlightly.com/blog/2025/1/16/ai-agents-in-2025
Eric Flaningam: The Unstructured Data Landscape
With structured data, we try to understand the business and predict its trajectory. LLMs and the growing focus on processing unstructured data now allow businesses to automate operations with that data.
The author captures the current landscape of unstructured data processing.
https://www.generativevalue.com/p/the-unstructured-data-landscape
Sponsored: The Ultimate Guide to Apache Airflow® DAGs
Download this free 130+ page eBook for everything a data engineer needs to know to take their DAG writing skills to the next level (+ plenty of example code).
→ Understand the building blocks of DAGs, combine them in complex pipelines, and schedule your DAG to run exactly when you want it to
→ Write DAGs that adapt to your data at runtime and set up alerts and notifications
→ Scale your Airflow environment
→ Systematically test and debug Airflow DAGs
By the end of the DAG guide, you'll know how to create and manage reliable, complex DAGs using advanced Airflow features such as those in the screenshot 📸.
https://www.astronomer.io/ebooks/dags-definitive-guide/
Ramp: From RAG to Richness - How Ramp Revamped Industry Classification
One significant effect of foundation models is that they make several traditional machine learning tasks much simpler. Ramp writes about adopting RAG to map its internal industry classification to NAICS codes.
https://engineering.ramp.com/industry_classification
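To make the retrieve-then-choose idea concrete, here is a minimal sketch of RAG-style classification. It is not Ramp's implementation: the bag-of-words similarity stands in for a real embedding model, `pick_first` stands in for an LLM call, and every function name is hypothetical. The three code/title pairs are illustrative and should be checked against the official NAICS list.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (code, title)

# Illustrative code/title pairs; verify against the official NAICS list before use.
CANDIDATES: List[Candidate] = [
    ("722511", "Full-Service Restaurants"),
    ("541511", "Custom Computer Programming Services"),
    ("522110", "Commercial Banking"),
]


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(description: str, k: int = 2) -> List[Candidate]:
    """Retrieval step: shortlist the k candidates most similar to the description."""
    query = bag_of_words(description)
    ranked = sorted(
        CANDIDATES,
        key=lambda cand: cosine(query, bag_of_words(cand[1])),
        reverse=True,
    )
    return ranked[:k]


def classify(
    description: str, choose: Callable[[str, List[Candidate]], str]
) -> str:
    """Generation step: `choose` (an LLM in practice) picks only from the shortlist."""
    return choose(description, retrieve(description))


def pick_first(description: str, shortlist: List[Candidate]) -> str:
    """Hypothetical stand-in for the LLM: take the top retrieved code."""
    return shortlist[0][0]


result = classify("custom software programming services for startups", pick_first)
```

Restricting the final choice to the retrieved shortlist is what keeps the model from inventing a code that does not exist.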
eBay: Scaling Large Language Models for e-Commerce: The Development of a Llama-Based Customized LLM
eBay writes about its first hybrid foundation model use case, built on an in-house hosted Llama model. eBay gives two reasons for choosing Llama over third-party LLM services, and both are more or less true for many infrastructure components at scale.
These services come with considerable costs, making them impractical for businesses like eBay that need fine-tuned, scalable, and cost-effective solutions.
Additionally, relying on third-party models introduces data security risks and limits fine-tuning capabilities based on proprietary data.
Wix: The Art of Secure Search: How Wix Mastered PII Data in Vespa Search Engine
Searching over encrypted data is vital when handling sensitive data. I previously worked on a search engine built around the Bring-Your-Own-Key model, so the system design for encrypted search is new and exciting to read.
LinkedIn: Improving Recruiting Efficiency with a Hybrid Bulk Data Processing Framework
LinkedIn writes about its unique use case, ownership changes in asset management, and the challenges of keeping the pipeline consistent and visible. One additional detail that would be helpful is whether the pipeline moves the data in bulk for these ownership changes or applies them through a mapping table.
Ankur Tyagi: ZenML vs Flyte vs Metaflow
The article discusses the challenges of managing increasingly complex machine learning (ML) pipelines and compares three popular orchestration tools: ZenML, Flyte, and Metaflow. It highlights the benefits of ML workflow and pipeline orchestration tools, such as task automation, scalability, dependency management, pipeline monitoring, reproducibility, and consistency across environments.
Each tool is analyzed based on its architecture, key features, and community feedback. The author recommends ZenML for its modularity and extensibility, Flyte for its scalability and reliability, and Metaflow for its simplicity and ease of use.
https://mlops.community/zenml-vs-flyte-vs-metaflow/
Ian Cook, David Li, Matt Topol: How the Apache Arrow Format Accelerates Query Result Transfer
The article discusses the often-overlooked bottleneck in query processing: the inefficient transfer of results from the source to the client, particularly due to serialization and deserialization (ser/de) overheads. It highlights how the Apache Arrow format addresses this issue through five key attributes:
columnar nature
self-describing and type-safe design
zero-copy capability
streaming support
universality across various programming languages and platforms
https://arrow.apache.org/blog/2025/01/10/arrow-result-transfer/
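A small sketch can show why a columnar binary layout avoids ser/de work. It uses only the Python standard library and is not Arrow's actual wire format or API. The sample rows and variable names are made up for illustration. It contrasts a row-oriented JSON payload, which must be parsed value by value, with contiguous typed column buffers that the receiver reinterprets in place.

```python
import json
from array import array

# Hypothetical query result: three rows of (id, price).
rows = [(1, 9.5), (2, 20.0), (3, 3.25)]

# Row-oriented text transfer: every value is formatted on send
# and parsed again on receive.
row_payload = json.dumps(rows).encode("utf-8")
decoded_rows = [tuple(r) for r in json.loads(row_payload)]

# Column-oriented binary transfer: each column is one contiguous typed buffer.
ids = array("q", [r[0] for r in rows])     # signed 64-bit integers
prices = array("d", [r[1] for r in rows])  # 64-bit floats
id_bytes, price_bytes = ids.tobytes(), prices.tobytes()

# The receiver reinterprets the raw bytes in place; memoryview.cast
# creates a typed view over the same memory without copying or parsing.
id_view = memoryview(id_bytes).cast("q")
price_view = memoryview(price_bytes).cast("d")

# Column-wise operations read straight from the buffer.
total = sum(price_view)
```

In a real system the Arrow libraries also carry a schema alongside the buffers, which is what makes the format self-describing. This sketch hard-codes the types instead.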
Murat Demirbas: Use of Time in Distributed Databases (part 5): Lessons learned
Understanding time (handling time from a technical standpoint 😀) is critical in distributed system design and data pipeline management. The author's five-part series highlights time's evolution from a simple ordering mechanism to a tool for coordination and performance optimization.
The author emphasizes that synchronized time serves as an alignment mechanism, enabling consistent snapshots, conflict detection, and fencing to prevent stale operations. The article also discusses the importance of monotonic clocks and hybrid logical clocks (HLCs) for correctness, and emerging trends such as the growing adoption of synchronized clocks and time-based speculation for performance gains.
https://muratbuffalo.blogspot.com/2025/01/use-of-time-in-distributed-databases_14.html
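For readers new to HLCs, here is a minimal sketch of the update rules as they are commonly described. It is not code from the series or from any particular database. The class and method names are hypothetical, and the physical clock is injected as a function so the behavior is easy to follow.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Timestamp = Tuple[int, int]  # (logical time l, counter c)


@dataclass
class HybridLogicalClock:
    physical_now: Callable[[], int]  # injected wall-clock reading
    l: int = 0  # highest physical time seen so far
    c: int = 0  # counter that orders events sharing the same l

    def tick(self) -> Timestamp:
        """Timestamp a local or send event."""
        prev = self.l
        self.l = max(prev, self.physical_now())
        # If the wall clock did not advance, the counter breaks the tie.
        self.c = self.c + 1 if self.l == prev else 0
        return (self.l, self.c)

    def receive(self, remote: Timestamp) -> Timestamp:
        """Merge a timestamp carried on an incoming message."""
        remote_l, remote_c = remote
        prev = self.l
        self.l = max(prev, remote_l, self.physical_now())
        if self.l == prev and self.l == remote_l:
            self.c = max(self.c, remote_c) + 1
        elif self.l == prev:
            self.c += 1
        elif self.l == remote_l:
            self.c = remote_c + 1
        else:
            self.c = 0
        return (self.l, self.c)


# A node whose wall clock is stuck at 100 hears from a node that is ahead.
clock = HybridLogicalClock(physical_now=lambda: 100)
first = clock.tick()
second = clock.tick()
merged = clock.receive((105, 2))
after = clock.tick()
```

Timestamps compare as ordinary tuples, so `first < second < merged < after` holds even though the local wall clock never moved. That property is what makes HLCs useful for correctness when clocks drift.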
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.