Data Engineering Weekly #204

The Weekly Data Engineering Newsletter

Jan 20, 2025

Julia Wiesinger, Patrick Marlow, and Vladimir Vuskovic: Agents

The combination of reasoning, logic, and access to external information that are all connected to a Generative AI model invokes the concept of an agent.

The white paper explores the definition of agents and their various components, and includes example code.
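To make the quoted definition concrete, here is a minimal sketch of an agent loop. It is not the white paper's code: `stub_model`, `row_count`, `TABLE_ROWS`, and `run_agent` are hypothetical names, and the "model" is a hard-coded stand-in for a generative model so the example runs without any external service.

```python
from typing import Callable, Dict, List, Tuple

# Tools: the agent's access to external information.
# TABLE_ROWS is a toy in-memory catalog used only for illustration.
TABLE_ROWS = {"orders": 120, "users": 45}


def row_count(table: str) -> str:
    """Hypothetical tool: return the row count of a table as text."""
    return str(TABLE_ROWS.get(table, 0))


TOOLS: Dict[str, Callable[[str], str]] = {"row_count": row_count}


def stub_model(question: str, observations: List[str]) -> Tuple[str, str, str]:
    """Stand-in for a generative model: returns (action, tool_name, argument)."""
    if not observations:
        # Reasoning step: no facts gathered yet, so request a tool call.
        table = "orders" if "orders" in question else "users"
        return ("call_tool", "row_count", table)
    # Enough information gathered: produce the final answer.
    return ("finish", "", f"The table has {observations[-1]} rows.")


def run_agent(question: str, max_steps: int = 5) -> str:
    """Orchestration loop: ask the model, run the tool it picks, feed back the result."""
    observations: List[str] = []
    for _ in range(max_steps):
        action, tool_name, argument = stub_model(question, observations)
        if action == "finish":
            return argument
        observations.append(TOOLS[tool_name](argument))
    return "Stopped: step limit reached."


print(run_agent("How many rows are in orders?"))
```

The loop is where reasoning, logic, and external information meet: the model decides, the tool fetches, and the orchestration code feeds the observation back until the model finishes.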

https://www.kaggle.com/whitepaper-agents

Jack Vanlightly: AI Agents in 2025

AI agents are a fast-moving discipline, and as with any rapidly developing field, it is hard to keep track of the progress. The author does an excellent job of summarizing the state of AI agents from various recent blogs.

https://jack-vanlightly.com/blog/2025/1/16/ai-agents-in-2025

Eric Flaningam: The Unstructured Data Landscape

With structured data, we try to understand the business and predict its trajectory. LLMs and the growing focus on processing unstructured data allow businesses to automate operations with data.

The author captures the current landscape of unstructured data processing.

https://www.generativevalue.com/p/the-unstructured-data-landscape

Ramp: From RAG to Richness - How Ramp Revamped Industry Classification

One significant effect of foundation models is that they make a few traditional machine learning tasks much simpler. Ramp writes about adopting RAG to map its internal industry classification to NAICS codes.
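As a rough illustration of the RAG pattern for classification (not Ramp's implementation), the sketch below retrieves the closest candidate codes for a business description and builds a prompt that a model would then choose from. Everything here is an assumption for illustration: the four-row `NAICS` table is a tiny sample, `embed` is a toy bag-of-words stand-in for a learned embedding model, and the LLM call itself is left out.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

# Tiny illustrative knowledge base of (NAICS code, title) pairs.
NAICS: List[Tuple[str, str]] = [
    ("722511", "Full-Service Restaurants"),
    ("722513", "Limited-Service Restaurants"),
    ("541511", "Custom Computer Programming Services"),
    ("524210", "Insurance Agencies and Brokerages"),
]


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a learned model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(description: str, k: int = 2) -> List[Tuple[str, str]]:
    """Retrieval step: return the k codes whose titles best match the description."""
    query = embed(description)
    ranked = sorted(NAICS, key=lambda row: cosine(query, embed(row[1])), reverse=True)
    return ranked[:k]


def build_prompt(description: str, candidates: List[Tuple[str, str]]) -> str:
    """Generation step input: constrain the model to the retrieved candidates."""
    options = "\n".join(f"- {code}: {title}" for code, title in candidates)
    return (
        f"Business description: {description}\n"
        f"Pick the single best NAICS code from these candidates:\n{options}"
    )


description = "We build custom software and programming services for clients"
print(build_prompt(description, retrieve(description)))
```

Narrowing the choice to a handful of retrieved codes is what keeps the model's answer inside the valid code set instead of letting it invent one.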

https://engineering.ramp.com/industry_classification

eBay: Scaling Large Language Models for e-Commerce: The Development of a Llama-Based Customized LLM

eBay writes about its first hybrid foundation model use case, built on an in-house hosted Llama model. eBay gives two reasons for choosing Llama over third-party LLM services, and both hold more or less true for many infrastructure components at scale:

These services come with considerable costs, making them impractical for businesses like eBay that need fine-tuned, scalable, and cost-effective solutions.
Additionally, relying on third-party models introduces data security risks and limits fine-tuning capabilities based on proprietary data.

https://innovation.ebayinc.com/tech/features/scaling-large-language-models-for-e-commerce-the-development-of-a-llama-based-customized-llm-for-e-commerce/

Wix: The Art of Secure Search: How Wix Mastered PII Data in Vespa Search Engine

Searching over encrypted data is vital when handling sensitive data. I previously worked on a search engine built around the Bring-Your-Own-Key model, and the system design for encrypted search here is new and exciting to read.

https://www.wix.engineering/post/the-art-of-secure-search-how-wix-mastered-pii-data-in-vespa-search-engine

LinkedIn: Improving Recruiting Efficiency with a Hybrid Bulk Data Processing Framework

LinkedIn writes about its unique use case for ownership change in asset management and the challenges with the pipeline's consistency and visibility. One additional detail that would be helpful is whether the pipeline performs a bulk data movement for these ownership changes or applies them via a mapping table.

https://www.linkedin.com/blog/engineering/data-streaming-processing/improving-recruiting-efficiency-with-hybrid-bulk-data-processing-framework

Ankur Tyagi: ZenML vs Flyte vs Metaflow

The article discusses the challenges of managing increasingly complex machine learning (ML) pipelines and compares three popular orchestration tools: ZenML, Flyte, and Metaflow. The article highlights the benefits of using ML workflow and pipeline orchestration tools, such as task automation, scalability, dependency management, pipeline monitoring, reproducibility, and consistency across environments.

Each tool is analyzed based on its architecture, key features, and community feedback. The author recommends ZenML for its modularity and extensibility, Flyte for its scalability and reliability, and Metaflow for its simplicity and ease of use.

https://mlops.community/zenml-vs-flyte-vs-metaflow/

Ian Cook, David Li, Matt Topol: How the Apache Arrow Format Accelerates Query Result Transfer

The article discusses the often-overlooked bottleneck in query processing: the inefficient transfer of results from the source to the client, particularly due to serialization and deserialization (ser/de) overheads. It highlights how the Apache Arrow format addresses this issue through five key attributes:

columnar nature
self-describing and type-safe design
zero-copy capability
streaming support
universality across various programming languages and platforms
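The columnar and zero-copy attributes above can be illustrated with nothing but the Python standard library. This is a conceptual sketch, not Arrow's API or wire format (Arrow adds schemas, validity bitmaps, and more); the `rows` data and variable names are made up for illustration.

```python
import struct
from array import array

rows = [(1, 9.5), (2, 3.25), (3, 7.0)]  # illustrative (id, price) records

# Row-oriented transfer: every record is packed on the sender and
# unpacked one by one on the receiver (the ser/de overhead).
row_bytes = b"".join(struct.pack("<qd", i, p) for i, p in rows)
decoded_rows = [
    struct.unpack_from("<qd", row_bytes, offset)
    for offset in range(0, len(row_bytes), struct.calcsize("<qd"))
]

# Columnar transfer: one contiguous, fixed-width buffer per column.
ids = array("q", (i for i, _ in rows))
prices = array("d", (p for _, p in rows))
price_buffer = prices.tobytes()  # the bytes that would travel over the wire

# Zero-copy read: reinterpret the received bytes as doubles in place,
# with no per-value parsing and no second copy of the data.
price_view = memoryview(price_buffer).cast("d")
total = sum(price_view)

print(decoded_rows)
print(total)
```

Because the in-memory layout and the transferred layout are the same bytes, the receiver can start computing on a column as soon as its buffer arrives.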

https://arrow.apache.org/blog/2025/01/10/arrow-result-transfer/

Murat Demirbas: Use of Time in Distributed Databases (part 5): Lessons learned

Understanding time (handling time from a technical standpoint 😀) is critical in distributed system design and data pipeline management. The author's five-part series highlights time's evolution from a simple ordering mechanism to a tool for coordination and performance optimization.

The author emphasizes that synchronized time serves as an alignment mechanism, enabling consistent snapshots, conflict detection, and fencing to prevent stale operations. The article also discusses the increasing use of time-based speculation for performance gains, the importance of monotonic clocks and hybrid logical clocks (HLCs) for correctness, and emerging trends such as the growing adoption of synchronized clocks and time-based speculation in distributed databases.
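For readers new to hybrid logical clocks, here is a minimal sketch following the commonly published HLC update rules. It is not taken from the author's series; the class name, the millisecond resolution, and the injectable `physical_clock` parameter are assumptions made so the behavior is easy to test.

```python
import time
from typing import Callable, Tuple

Timestamp = Tuple[int, int]  # (logical wall-clock component, counter)


def _wall_clock_ms() -> int:
    return time.time_ns() // 1_000_000


class HybridLogicalClock:
    def __init__(self, physical_clock: Callable[[], int] = _wall_clock_ms) -> None:
        self._now = physical_clock  # injectable so tests can simulate clock skew
        self.l = 0  # highest physical time seen so far
        self.c = 0  # counter that orders events sharing the same l

    def tick(self) -> Timestamp:
        """Timestamp a local or send event."""
        prev = self.l
        self.l = max(prev, self._now())
        self.c = self.c + 1 if self.l == prev else 0
        return (self.l, self.c)

    def update(self, remote: Timestamp) -> Timestamp:
        """Merge a timestamp received from another node."""
        prev = self.l
        remote_l, remote_c = remote
        self.l = max(prev, remote_l, self._now())
        if self.l == prev and self.l == remote_l:
            self.c = max(self.c, remote_c) + 1
        elif self.l == prev:
            self.c += 1
        elif self.l == remote_l:
            self.c = remote_c + 1
        else:
            self.c = 0  # physical time moved ahead of both; reset the counter
        return (self.l, self.c)


# Node b's physical clock lags node a's, yet causality is preserved.
a = HybridLogicalClock(lambda: 100)
b = HybridLogicalClock(lambda: 90)
sent = a.tick()            # (100, 0)
received = b.update(sent)  # (100, 1)
after = b.tick()           # (100, 2)
print(sent < received < after)  # True
```

Timestamps compare as plain tuples, so an event that causally follows another always gets a larger timestamp even when the receiving node's physical clock is behind.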

https://muratbuffalo.blogspot.com/2025/01/use-of-time-in-distributed-databases_14.html

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
