Automate Airflow deploys with built-in CI/CD.
Streamline code deployment, enhance collaboration, and ensure DevOps best practices with Astro's robust CI/CD capabilities.
Sebastian Raschka: Understanding Reasoning LLMs
The reasoning capabilities of LLMs open the door to building learning agents. This article discusses reasoning models, LLMs specialized for complex tasks that require multi-step generation. The author outlines four key approaches to building these models: inference-time scaling, pure reinforcement learning, supervised finetuning combined with reinforcement learning, and distillation via supervised finetuning. The article also highlights DeepSeek R1 as a milestone in open-weight reasoning models and emphasizes that effective, budget-friendly strategies, like distillation and journey learning, put this research within reach of smaller teams.
https://magazine.sebastianraschka.com/p/understanding-reasoning-llms
Maarten Grootendorst: A Visual Guide to Reasoning LLMs
This article offers another engaging explanation of reasoning capabilities in LLMs. It highlights the shift from scaling train-time compute to scaling test-time compute for improved performance. The author visually explains techniques like Chain-of-Thought, search against verifiers, and modifying proposal distributions, using DeepSeek-R1 as a key example. The article also walks through DeepSeek-R1's training pipeline, which centers on reinforcement learning, and touches on the distillation of smaller models and even on unsuccessful attempts.
https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-reasoning-llms
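To make "search against verifiers" concrete, here is a minimal best-of-N sketch: sample several candidate answers and keep the one a verifier scores highest. It is only an illustration of the idea, not code from the article. The `toy_propose` and `toy_verify` functions are hypothetical stand-ins for an LLM sampler and a learned verifier.

```python
import random
from typing import Callable


def best_of_n(propose: Callable[[random.Random], int],
              verify: Callable[[int], float],
              n: int,
              seed: int = 0) -> int:
    """Sample n candidate answers and return the one the verifier scores highest."""
    rng = random.Random(seed)
    candidates = [propose(rng) for _ in range(n)]
    return max(candidates, key=verify)


# Hypothetical toy task: find the integer x with x * x == 1369.
def toy_propose(rng: random.Random) -> int:
    # Stand-in for sampling one answer from a model: a noisy guess.
    return rng.randint(30, 44)


def toy_verify(answer: int) -> float:
    # Stand-in for a verifier: checking a guess is cheaper than producing it.
    return -abs(answer * answer - 1369)


if __name__ == "__main__":
    for n in (1, 4, 64):
        print(n, best_of_n(toy_propose, toy_verify, n))
```

Spending more samples at inference time can only keep or improve the verifier score of the selected answer, which is the basic trade behind test-time compute.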
Chip Huyen: Common pitfalls when building generative AI applications
Enterprises are increasingly building applications that leverage generative AI. The author outlines common pitfalls in building these applications, including using generative AI when it isn't needed, mistaking product issues for AI flaws, starting with overly complex solutions, and overestimating early success. The blog also highlights over-reliance on AI for evaluation instead of human input, and crowdsourcing use cases without a comprehensive strategy.
https://huyenchip.com/2025/01/16/ai-engineering-pitfalls.html
Sponsored: Webinar - Implementing CI/CD workflows for your Airflow pipelines
Want to automate key parts of your Apache Airflow pipeline development lifecycle?
In this session, Marc Lamberti and Kenten Danas will cover everything you need to know about using CI/CD to manage your Airflow DAGs, including:
→ The basics of using CI/CD with Airflow
→ How to leverage Astro’s built-in GitHub integration and other CI/CD features
→ Strategies for choosing and implementing the best deployment options
LinkedIn: Building collaborative prompt engineering playgrounds using Jupyter Notebook
Prompt engineering is a fundamental aspect of leveraging LLMs, representing a significant shift in how we interact with technology. However, developing customer-ready features requires a custom setup that integrates smoothly with the development environment and its requirements. LinkedIn writes about how it built a collaborative prompt playground using Jupyter Notebook to set the baseline model.
Alex Milowski: A Survey of Workflow Orchestration Systems
Workflow orchestration is a core component of a business, spanning business process automation, data pipelines, and AI/ML workloads. It is interesting to see a strong trend toward using YAML as the syntax for describing the graph of tasks in a workflow DSL.
https://mlops.community/a-survey-of-workflow-orchestration-systems/
Netflix: Introducing Impressions at Netflix
High-quality activity tracking is vital for a data-driven organization. Netflix writes about its impression tracking system, which captures user interactions with content previews to enhance personalization. The blog describes the system's architecture, including collecting and processing raw events via Apache Kafka and Apache Flink, enriching them, and storing them in Apache Iceberg. The article also highlights their data quality measures.
https://netflixtechblog.com/introducing-impressions-at-netflix-e2b67c88c9fb
PayPal: Estimating Incremental Lift in Customer Value (Delta CV) using Synthetic Control
PayPal writes about using "Delta CV" (Delta Customer Value) to measure the incremental lift in customer profit margin after adopting a new product or completing an action. The blog discusses causal inference and synthetic control methodology, comparing adopters (treatment group) to a matched group of non-adopters (control group) based on pre-adoption features. The article also highlights the interpretations, caveats, and non-additive nature of Delta CV while emphasizing its role in decision-making at PayPal.
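The adopter-versus-matched-control comparison can be sketched in a few lines. This is a simplified illustration, not PayPal's method: it swaps in one-nearest-neighbour matching on pre-adoption features where synthetic control would build a weighted combination of non-adopters. The `Customer` type, the feature values, and `estimate_lift` are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Sequence


@dataclass
class Customer:
    pre_features: Sequence[float]  # behaviour observed before the adoption date
    post_value: float              # customer value observed after the adoption date


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def estimate_lift(adopters: List[Customer], non_adopters: List[Customer]) -> float:
    """Average post-period value gap between each adopter and its closest non-adopter."""
    gaps = []
    for adopter in adopters:
        # Assumption: the nearest non-adopter on pre-adoption features is the control.
        match = min(
            non_adopters,
            key=lambda c: squared_distance(c.pre_features, adopter.pre_features),
        )
        gaps.append(adopter.post_value - match.post_value)
    return mean(gaps)


# Made-up numbers, purely for illustration.
adopters = [Customer((10.0, 2.0), 15.0), Customer((50.0, 5.0), 58.0)]
non_adopters = [
    Customer((11.0, 2.0), 12.0),
    Customer((49.0, 5.0), 52.0),
    Customer((30.0, 3.0), 30.0),
]

if __name__ == "__main__":
    print(estimate_lift(adopters, non_adopters))  # prints 4.5
```

The estimate is only as good as the match: if adopters differ from non-adopters in ways the pre-adoption features miss, the gap mixes selection effects with true lift, which is one reason the caveats in the article matter.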
Dipankar Mazumdar: Concurrency Control in Open Data Lakehouse
One of the core features of lakehouse table formats is support for concurrent writers with ACID guarantees. The author discusses the differences between pessimistic concurrency control, optimistic concurrency control, and multi-version concurrency control by comparing the concurrency implementations of all three table formats (Hudi, Delta Lake, and Iceberg).
https://hudi.apache.org/blog/2025/01/28/concurrency-control/
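As a rough illustration of optimistic concurrency control, the toy table below lets writers work without holding a lock and validates only at commit time: a commit succeeds if the table version is unchanged since the writer's snapshot, otherwise the writer retries. This is a minimal sketch under that assumption. It is not how Hudi, Delta Lake, or Iceberg implement commits, and every name in it is hypothetical.

```python
import threading
from typing import Callable, Dict, Tuple


class CommitConflict(Exception):
    """Raised when the table changed after the writer took its snapshot."""


class Table:
    """Toy table: a version counter plus a dict of rows."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # guards only the short validate-and-swap step
        self.version = 0
        self.rows: Dict[str, int] = {}

    def snapshot(self) -> Tuple[int, Dict[str, int]]:
        with self._lock:
            return self.version, dict(self.rows)

    def commit(self, base_version: int, changes: Dict[str, int]) -> int:
        with self._lock:
            if self.version != base_version:
                raise CommitConflict(
                    f"expected version {base_version}, found {self.version}"
                )
            self.rows.update(changes)
            self.version += 1
            return self.version


def write_with_retry(table: Table,
                     make_changes: Callable[[Dict[str, int]], Dict[str, int]],
                     max_attempts: int = 3) -> int:
    """Re-read the latest snapshot and retry whenever validation fails."""
    for _ in range(max_attempts):
        base_version, rows = table.snapshot()
        try:
            return table.commit(base_version, make_changes(rows))
        except CommitConflict:
            continue
    raise CommitConflict(f"gave up after {max_attempts} attempts")
```

A pessimistic design would instead hold a lock for the whole write, and a multi-version design keeps older snapshots readable while new versions are written. Here the lock covers only the final version check, so conflicts surface as retries rather than waiting.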
Thomas F McGeehan V: Redefining Data Engineering with Go and Apache Arrow
The continued impact of Apache Arrow on data engineering is undeniable. The author illustrates this by showing how streaming Arrow RecordBatches can form a zero-copy pipeline that eliminates serialization overhead and enables direct, columnar, high-throughput data movement between databases and processing engines.
https://medium.com/@mcgeehan/redefining-data-engineering-with-go-and-apache-arrow-df9059ddf55c
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.