Data Engineering Weekly #208

The Weekly Data Engineering Newsletter

Feb 16, 2025

Sebastian Raschka: Understanding Reasoning LLMs

The reasoning capabilities of LLMs open the door to building learning agents. This article discusses reasoning models, a specialization of LLMs for complex tasks that require multi-step generation. The author outlines four key approaches to building these models: inference-time scaling, pure reinforcement learning, supervised finetuning with reinforcement learning, and distillation via supervised finetuning. The article also highlights DeepSeek R1 as a milestone in open-weight reasoning models and emphasizes that effective, budget-friendly strategies, like distillation and journey learning, put this research within reach of smaller teams.

https://magazine.sebastianraschka.com/p/understanding-reasoning-llms

Maarten Grootendorst: A Visual Guide to Reasoning LLMs

This article provides another exciting explanation of reasoning capabilities in LLMs. It explores reasoning LLMs and highlights the shift from scaling train-time compute to scaling test-time compute for improved performance. The author visually explains techniques like Chain-of-Thought, search against verifiers, and modifying proposal distributions, using DeepSeek-R1 as a key example. The article also walks through DeepSeek-R1's training pipeline, which centers on reinforcement learning, and touches on distilling reasoning into smaller models and on the attempts that did not succeed.

https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-reasoning-llms
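
To make "search against verifiers" concrete, here is a minimal best-of-N sketch: sample several candidate answers and keep the one a verifier scores highest. It is an illustration only and is not taken from the article. `best_of_n`, `toy_propose`, and `toy_verify` are hypothetical names, and the toy verifier knows the right answer, which a real reward model or verifier would not.

```python
import random
from typing import Callable, List


def best_of_n(
    propose: Callable[[random.Random], str],
    verify: Callable[[str], float],
    n: int,
    seed: int = 0,
) -> str:
    """Sample n candidate answers and return the one the verifier scores highest."""
    rng = random.Random(seed)
    candidates: List[str] = [propose(rng) for _ in range(n)]
    return max(candidates, key=verify)


# Hypothetical stand-ins: a noisy "model" answering 17 + 25, and a toy
# verifier that scores an answer by its distance from the true sum.
def toy_propose(rng: random.Random) -> str:
    return str(42 + rng.choice([-2, -1, 0, 0, 1, 2]))


def toy_verify(answer: str) -> float:
    return -abs(int(answer) - 42)


if __name__ == "__main__":
    # Spending more test-time compute (larger n) can only keep or improve
    # the verifier score of the selected answer for a fixed seed.
    for n in (1, 4, 16):
        print(n, best_of_n(toy_propose, toy_verify, n=n))
```

The only knob here is `n`: more samples mean more compute at inference time, with no change to the underlying model.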

Chip Huyen: Common pitfalls when building generative AI applications

Enterprises are increasingly building applications that leverage generative AI. The author outlines common pitfalls in building these applications, including using generative AI when it is not needed, mistaking product issues for AI flaws, starting with overly complex solutions, and overestimating early success. The blog also highlights over-reliance on AI for evaluation at the expense of human input, and crowdsourcing use cases without a comprehensive strategy.

https://huyenchip.com/2025/01/16/ai-engineering-pitfalls.html

LinkedIn: Building collaborative prompt engineering playgrounds using Jupyter Notebook

Prompt engineering is a fundamental aspect of leveraging LLMs, representing a significant shift in how we interact with technology. However, developing customer-ready features requires a custom setup that integrates smoothly with the development environment and its requirements. LinkedIn writes about how it built its prompt playground on Jupyter Notebooks to set the baseline model.

https://www.linkedin.com/blog/engineering/product-design/building-collaborative-prompt-engineering-playgrounds-using-jupyter-notebook

Alex Milowski: A Survey of Workflow Orchestration Systems

Workflow orchestration is a core component of a business, spanning business process automation, data pipelines, and AI/ML workloads. It is interesting to see a strong trend toward YAML as the syntax for describing the task graph in workflow DSLs.

https://mlops.community/a-survey-of-workflow-orchestration-systems/

Netflix: Introducing Impressions at Netflix

High-quality activity tracking is vital for a data-driven organization. Netflix writes about its impression tracking system, which captures user interactions with content previews to enhance personalization. The blog describes the system's architecture, including collecting and processing raw events via Apache Kafka and Apache Flink, enriching them, and storing them in Apache Iceberg. The article also highlights their data quality measures.

https://netflixtechblog.com/introducing-impressions-at-netflix-e2b67c88c9fb

PayPal: Estimating Incremental Lift in Customer Value (Delta CV) using Synthetic Control

PayPal writes about using "Delta CV" (Delta Customer Value) to measure the incremental lift in customer profit margin after adopting a new product or completing an action. The blog discusses causal inference and synthetic control methodology, comparing adopters (treatment group) to a matched group of non-adopters (control group) based on pre-adoption features. The article also highlights the interpretations, caveats, and non-additive nature of Delta CV while emphasizing its role in decision-making at PayPal.

https://medium.com/paypal-tech/estimating-incremental-lift-in-customer-value-delta-cv-using-synthetic-control-522be5e3da3a
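
The core idea of comparing adopters with similar non-adopters can be sketched in a few lines. Note the swap: this uses one-to-one nearest-neighbor matching on pre-adoption features, which is simpler than a full synthetic control (that would build a weighted blend of non-adopters), and it is not PayPal's implementation. `Customer`, `estimate_lift`, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Sequence


@dataclass
class Customer:
    pre_features: Sequence[float]  # behaviour measured before the adoption date
    post_value: float              # customer value measured after the adoption date


def _distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def estimate_lift(adopters: List[Customer], non_adopters: List[Customer]) -> float:
    """Average post-period gap between each adopter and its closest non-adopter."""
    lifts = []
    for adopter in adopters:
        # Match on pre-adoption features only, so the outcome cannot leak in.
        match = min(
            non_adopters,
            key=lambda c: _distance(adopter.pre_features, c.pre_features),
        )
        lifts.append(adopter.post_value - match.post_value)
    return mean(lifts)


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    adopters = [Customer([1.0, 10.0], 15.0), Customer([5.0, 50.0], 60.0)]
    non_adopters = [Customer([1.1, 10.5], 12.0), Customer([5.2, 49.0], 55.0)]
    print(estimate_lift(adopters, non_adopters))
```

In practice the features would need scaling before computing distances, and the estimate is only as credible as the match quality, which is one reason the caveats in the article matter.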

Dipankar Mazumdar: Concurrency Control in Open Data Lakehouse

One of the core features of lakehouse table formats is support for concurrency and ACID guarantees. The author discusses the differences between pessimistic concurrency control, optimistic concurrency control, and multi-version concurrency control, and compares how the three table formats (Hudi, Delta Lake, and Iceberg) implement concurrency.

https://hudi.apache.org/blog/2025/01/28/concurrency-control/
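
For readers new to optimistic concurrency control, the pattern can be sketched as: read a snapshot, do the work without holding a lock, and commit only if nobody else committed in the meantime. This is a toy in-memory model under that assumption, not how Hudi, Delta Lake, or Iceberg implement it. `Table`, `CommitConflict`, and `write_with_retry` are hypothetical names.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


class CommitConflict(Exception):
    """Raised when another writer committed after our snapshot was taken."""


@dataclass
class Table:
    version: int = 0
    rows: List[str] = field(default_factory=list)
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def snapshot(self) -> Tuple[int, List[str]]:
        """Return the current version and a copy of the rows."""
        with self._lock:
            return self.version, list(self.rows)

    def commit(self, base_version: int, new_rows: List[str]) -> int:
        """Publish new_rows only if the table is still at base_version."""
        # The lock guards only the short check-and-swap, not the writer's work.
        with self._lock:
            if base_version != self.version:
                raise CommitConflict(
                    f"expected version {base_version}, found {self.version}"
                )
            self.rows = new_rows
            self.version += 1
            return self.version


def write_with_retry(
    table: Table,
    transform: Callable[[List[str]], List[str]],
    max_attempts: int = 3,
) -> int:
    """Re-read and retry when a conflicting commit wins the race."""
    for _ in range(max_attempts):
        base_version, rows = table.snapshot()
        try:
            return table.commit(base_version, transform(rows))
        except CommitConflict:
            continue
    raise CommitConflict("gave up after retries")
```

A pessimistic design would instead hold a lock for the whole read-transform-write cycle, trading throughput for never having to retry.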

Thomas F McGeehan V: Redefining Data Engineering with Go and Apache Arrow

The continuing impact of Apache Arrow on data engineering is undeniable. The author makes the case by demonstrating how streaming Arrow RecordBatches can build a zero-copy streaming pipeline, eliminating serialization overhead and enabling direct, columnar, high-throughput data movement between databases and processing engines.

https://medium.com/@mcgeehan/redefining-data-engineering-with-go-and-apache-arrow-df9059ddf55c

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
