Data Engineering Weekly #46
Weekly Data Engineering Newsletter
Welcome to the 46th edition of the data engineering newsletter. This week's release is a new set of articles that focus on Pinterest's metrics trust framework, Uber's CI/CD for MLOps, Julien Kervizic's take on leveraging DBT as a data modeling tool, Facebook's time-series analysis library, Pinterest's partial deserialization of Thrift, DoorDash's supply-demand ML platform, Swiggy's two-wheeler travel distance prediction, and the story of how dbt met Materialize.
Pinterest: Trusting Metrics at Pinterest
Data accuracy is strategically fundamental for a business to make data-driven decisions. Certified data metrics are a standard practice in many companies to build trust in data. Pinterest writes an exciting post explaining how simple counting can become a complicated task and how the metrics certification process works at Pinterest.
Uber: Continuous Integration and Deployment for Machine Learning Online Serving and Models
The rapid adoption of ML as a core part of feature development also brings significant operational challenges, collectively known as MLOps. Uber writes an exciting blog on the evolution of its CI/CD system, including dynamic model reloading and auto-shading & auto-expiration of models for an efficient MLOps continuous integration pipeline.
Julien Kervizic: Leveraging DBT as a Data Modeling tool
The blog reflects on one year with DBT, answering the question of whether DBT can be used as a data modeling tool. The author narrates the pros and cons of DBT, from model features & documentation to testing strategy.
Facebook: Meet Kats — a one-stop shop for time series analysis
Facebook open-sources Kats, a Python library for generic time-series analysis. Kats supports forecasting, time-series pattern detection, feature extraction & embedding, and time-series event simulation.
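To illustrate the kind of forecasting workflow a library like Kats automates, here is a minimal sketch in plain Python. The moving-average forecaster and the sample data are illustrative assumptions, not Kats' actual API.

```python
# Minimal sketch of a time-series forecasting step, similar in spirit to
# what a library like Kats provides. Illustrative only, not Kats' API.

def moving_average_forecast(series, window=3, horizon=2):
    """Forecast `horizon` future points by repeatedly averaging
    the last `window` observations and rolling forward."""
    history = list(series)
    forecasts = []
    for _ in range(horizon):
        avg = sum(history[-window:]) / window
        forecasts.append(avg)
        history.append(avg)  # feed the forecast back into the window
    return forecasts

daily_events = [10, 12, 11, 13, 12, 14]
print(moving_average_forecast(daily_events))  # → [13.0, 13.0]
```

In practice, a library like Kats layers model selection, backtesting, and anomaly detection on top of this kind of loop.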
Pinterest: Improving data processing efficiency using partial deserialization of Thrift
Structured event stream processing brings challenges to data modeling. Often the event schema ends up as a complex nested structure, while consumers need to process only a subset of the fields most of the time. Serialization & deserialization are compute-intensive for the downstream consumers. Pinterest writes an exciting blog on how it implemented partial deserialization of Thrift to process events efficiently.
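The core idea of partial deserialization can be sketched with a toy tagged wire format: the decoder reads each field's id and length, parses only the fields the consumer asked for, and skips over the rest without copying or decoding them. The format and helper names below are illustrative assumptions, not Pinterest's implementation or Thrift's real binary protocol.

```python
import struct

# Toy wire format loosely inspired by Thrift's binary protocol:
# each field is (field_id: uint8, length: uint8, payload: bytes).
# Illustrative only, not the real Thrift encoding.

def encode(fields):
    out = b""
    for field_id, payload in fields.items():
        out += struct.pack("BB", field_id, len(payload)) + payload
    return out

def partial_decode(buf, wanted):
    """Decode only the fields in `wanted`; jump over the payload of
    every other field without parsing it."""
    result, offset = {}, 0
    while offset < len(buf):
        field_id, length = struct.unpack_from("BB", buf, offset)
        offset += 2
        if field_id in wanted:
            result[field_id] = buf[offset:offset + length]
        offset += length  # skip the payload either way
    return result

msg = encode({1: b"user42", 2: b"a" * 100, 3: b"click"})
print(partial_decode(msg, wanted={1, 3}))  # → {1: b'user42', 3: b'click'}
```

The savings come from the skip step: a consumer interested in two small fields never pays to materialize the large nested ones.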
DoorDash: Managing Supply and Demand Balance Through Machine Learning
DoorDash writes about the ML-driven approach behind its supply-demand system that reduces cancellations and delivery times. The blog is a classic reference design for matching product requirements to system capabilities for efficient operations.
Swiggy: Learning to Predict Two-Wheeler Travel Distance
Swiggy's data science team shares the system design for predicting two-wheeler travel distance, which uses synthesized ground-truth distances as labels and historical features to build an ML model. The blog's comparison of the existing distance computation model with the ML approach is an exciting read.
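A common baseline that such an ML distance predictor is benchmarked against is straight-line (haversine) distance scaled by an average road-detour factor. The sketch below shows that baseline; the coordinates and the 1.3 detour multiplier are illustrative assumptions, not Swiggy's numbers.

```python
import math

# Baseline travel-distance estimate: great-circle (haversine) distance
# scaled by an average detour factor. Illustrative sketch only.

def haversine_km(lat1, lon1, lat2, lon2):
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def baseline_travel_km(lat1, lon1, lat2, lon2, detour_factor=1.3):
    # Scale straight-line distance by an assumed average detour factor.
    return detour_factor * haversine_km(lat1, lon1, lat2, lon2)

# Example with illustrative Bengaluru-area coordinates.
print(baseline_travel_km(12.9716, 77.5946, 12.9352, 77.6245))
```

An ML model can beat this kind of baseline by learning route-specific detours from historical trips instead of applying one global multiplier.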
DBT: dbt + Materialize: Streaming to a dbt project near you
As data volume increases, the processing pattern tends to move toward real-time processing rather than batch processing. DBT writes about the new adapter for Materialize, a SQL platform for processing streaming data. The streaming data warehouse is an exciting space to watch.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.