Data Engineering Weekly #41

Weekly Data Engineering Newsletter

Welcome to the 41st edition of the data engineering newsletter. This week's edition covers Airbnb’s approach to tracking and measuring growth marketing, Dagster’s take on Airflow vs. Dagster, The New York Times’ data privacy tooling, Lyft’s ML model infrastructure on Kubernetes, Uber’s Orbit time series forecasting library, LinkedIn’s Greykite time series forecasting library, GameChanger’s pipeline for moving data out of Kafka, Groupon’s SCD framework Pinion, eBay’s migration from vendor to in-house analytical infrastructure, and Pinterest’s optimized Kafka MirrorMaker.

Airbnb: How does Airbnb track and measure growth marketing?

Airbnb writes about its unified tracking and measurement system for marketing campaigns, which introduces C-parameter tracking along with a system for analytics and growth evaluation. The blog also covers some of the drawbacks of UTM tracking and why Airbnb chose a custom tracking system.

Dagster: Moving past Airflow - Why Dagster is the next-generation data orchestrator

Dagster writes an exciting blog comparing Dagster with Airflow across the stages of data pipeline development: develop & test, deploy & execute, and monitor & observe. Features worth exploring in Dagster include metadata-rich, parameterizable functions called solids, separation of compute and IO, support for ad hoc executions, process isolation with a clear separation of user and system processes, and flexible event-based scheduling.

NewYorkTimes: How We Manage New York Times Readers’ Data Privacy

Privacy policy and GDPR compliance can be challenging for consumer applications, given that there are more than 100 privacy laws across various countries. NYT writes an exciting blog on how its homegrown system, PURR (Privacy, Users, Rules, and Regulations), handles these laws dynamically.

Lyft: LyftLearn - ML Model Training Infrastructure built on Kubernetes

Lyft writes about its ML model infrastructure on Kubernetes, which covers several ML development functions: model development, running training & batch prediction jobs, and a model user dashboard for previous model versions & job performance. The design's focus on fast iterations, no restrictions on supported modeling libraries and their versions, and programmatic access to the system makes it an exciting system design read.

Uber: Introducing Orbit, An Open Source Package for Time Series Inference and Forecasting

For a marketplace business like Uber, forecasting is vital to solving business problems. Uber writes about Orbit, its open-source Python package for Bayesian time series forecasting and inference, which provides an intuitive initialize-fit-predict interface for time series tasks and uses probabilistic programming languages under the hood.

Paper: Orbit - Probabilistic Forecast with Exponential Smoothing

LinkedIn: Greykite - A flexible, intuitive, and fast forecasting library

In a similar vein to Uber, LinkedIn developed and open-sourced the Greykite Python library to support its forecasting needs. Greykite provides a simple modeling interface that facilitates data exploration and model tuning. Silverkite, the flagship algorithm of the Greykite library, works well on time series with (potentially time-varying) trends and seasonality, repeated events/holidays, and short-range effects.

Paper: A flexible forecasting model for production systems

GameChanger: From pipeline to beyond - Moving data out of Kafka to wherever else it's needed

GameChanger writes about Tangent, its Kafka to S3 pipeline, and some of the lessons learned while trying to adopt open-source systems such as Kafka Connect, Secor & Gobblin. The focus on monitoring approaches and the integration of generic autoscaling policies through Terraform are exciting to read.

Groupon: Pinion — The Load Framework

Groupon writes about Pinion, an abstraction over the Delta Lake APIs for S3 and the spark-snowflake connector for Snowflake, which performs SCD type 1, 2 & 3 operations in the respective target system. The configuration-driven, plug & play approach to handling slowly changing dimensions aims to increase developer productivity, and it is an exciting read on improving data pipeline efficiency.
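
For readers new to slowly changing dimensions, here is a minimal sketch of the difference between SCD type 1 and type 2, in plain Python. It is not Pinion's API and does not touch Delta Lake or Snowflake; the DimRow class, the scd_type1 and scd_type2 functions, and the city attribute are all hypothetical names chosen for illustration. Type 3, which keeps the previous value in an extra column, is left out for brevity.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class DimRow:
    """One version of a dimension record (hypothetical schema)."""
    key: str                          # business key, e.g. a customer id
    city: str                         # the attribute whose changes we track
    valid_from: date
    valid_to: Optional[date] = None   # None means this version is current

    @property
    def is_current(self) -> bool:
        return self.valid_to is None


def scd_type1(rows: List[DimRow], key: str, city: str) -> List[DimRow]:
    """Type 1: overwrite the attribute in place, keeping no history."""
    return [replace(r, city=city) if r.key == key else r for r in rows]


def scd_type2(rows: List[DimRow], key: str, city: str, as_of: date) -> List[DimRow]:
    """Type 2: close the current version and append a new one."""
    out: List[DimRow] = []
    seen = False      # did we find a current row for this key?
    changed = False   # did the attribute actually change?
    for r in rows:
        if r.key == key and r.is_current:
            seen = True
            if r.city != city:
                # Expire the old version instead of overwriting it.
                out.append(replace(r, valid_to=as_of))
                changed = True
                continue
        out.append(r)
    if changed or not seen:
        # New version for a changed key, or first version for a new key.
        out.append(DimRow(key, city, valid_from=as_of))
    return out


dim = [DimRow("c1", "Austin", date(2020, 1, 1))]
overwritten = scd_type1(dim, "c1", "Denver")
versioned = scd_type2(dim, "c1", "Denver", date(2021, 6, 1))
```

With type 1 the table still holds a single row for c1, now with the new city. With type 2 it holds two rows: the original one closed as of the change date, and a new current one. A configuration-driven framework would presumably select between such behaviors per table or per column rather than by calling different functions.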

eBay: From Vendor to In-house - How eBay Reimagined Its Analytics Landscape

At its core, a data infrastructure must support two primary functions: scalable batch & real-time computation, and fast, interactive query & analytics. eBay writes about the challenges it faced with vendor solutions amid a growing need for data governance & reliability, and the various customizations it made to open-source systems to move from the vendor solution to an open ecosystem.

Pinterest: Shallow Mirror - Enhancement to Kafka MirrorMaker to reduce CPU/memory pressure

Kafka MirrorMaker is widely used to replicate traffic among Kafka clusters spread across multiple regions. Pinterest writes about Shallow Mirror, its optimized Kafka MirrorMaker, the scalability challenges it faced as adoption grew, and some of the optimizations that improve MirrorMaker performance.

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.