Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers
RudderStack
Provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.
Airbnb: Automating Data Protection at Scale
Airbnb writes a three-part series on automating data protection at scale, focusing on CDC pipelines. Automated data privacy management is critical for complying with GDPR & the California Consumer Privacy Act. Airbnb walks through the automation & alerting built around its data protection service. TIL: both Thrift & Protobuf support custom annotations (a small sketch of the idea follows the links below).
https://medium.com/airbnb-engineering/automating-data-protection-at-scale-part-3-34e592c45d46
Part 1: https://medium.com/airbnb-engineering/automating-data-protection-at-scale-part-1-c74909328e08
Part 2: https://medium.com/airbnb-engineering/automating-data-protection-at-scale-part-2-c2b8d2068216
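On that TIL: both IDLs let you attach custom annotations to schema fields, which a data protection service can read to locate PII columns. Showing real Thrift/Protobuf IDL plus generated code wouldn't be self-contained here, so below is a minimal Python analogue in which dataclass field metadata stands in for the schema annotations; the field names and the "privacy" key are invented for illustration, not Airbnb's actual scheme.

```python
# Illustrative analogue of annotation-driven PII tagging: dataclass field
# metadata plays the role of Thrift/Protobuf custom annotations.
from dataclasses import dataclass, field, fields

@dataclass
class GuestProfile:
    guest_id: int
    email: str = field(default="", metadata={"privacy": "PII"})
    phone: str = field(default="", metadata={"privacy": "PII"})
    country: str = field(default="", metadata={"privacy": "public"})

def pii_fields(schema) -> list[str]:
    """Return field names annotated as PII, e.g. to drive masking in a CDC pipeline."""
    return [f.name for f in fields(schema) if f.metadata.get("privacy") == "PII"]

print(pii_fields(GuestProfile))  # ['email', 'phone']
```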
Google AI: Interpretable Deep Learning for Time Series Forecasting
Most real-world datasets have a time component, and forecasting the future can unlock significant value. Google writes about the Temporal Fusion Transformer (TFT), an attention-based DNN model for interpretable multi-horizon time series forecasting.
https://ai.googleblog.com/2021/12/interpretable-deep-learning-for-time.html
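"Multi-horizon" simply means predicting several future steps in one shot rather than one step at a time. The toy sketch below frames that task with a seasonal-naive baseline; it is not TFT, just the problem setup the model addresses.

```python
# Toy illustration of the multi-horizon forecasting task (not TFT itself):
# given a history window, emit predictions for several future steps at once.
import numpy as np

rng = np.random.default_rng(0)
# Hourly series with daily seasonality plus noise.
history = 10 + np.sin(np.arange(96) * 2 * np.pi / 24) + rng.normal(0, 0.1, 96)

def seasonal_naive(y: np.ndarray, horizon: int, season: int = 24) -> np.ndarray:
    """Forecast `horizon` steps ahead by repeating the last observed season."""
    last_season = y[-season:]
    reps = int(np.ceil(horizon / season))
    return np.tile(last_season, reps)[:horizon]

forecast = seasonal_naive(history, horizon=12)  # 12 predictions in one shot
print(forecast.round(2))
```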
Nick Handel: A brief history of the metrics store
The increasing specialization of data engineering opens the door to a lot of innovation in how we utilize data at scale. The author traces a timeline of data engineering practice from Kimball-style data warehouses to the metrics store model. If the metrics store gains mass adoption, I presume we will see a new family of specialized metrics databases, similar to Prometheus or InfluxDB.
https://towardsdatascience.com/a-brief-history-of-the-metrics-store-28208ec8f6f1
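To make the idea concrete, a metrics store centralizes definitions like the hypothetical one below, written once and compiled to SQL on demand. The field names here are invented and do not follow any particular product's API.

```python
# Hypothetical sketch of a centrally defined metric compiled to SQL on demand.
from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    name: str
    table: str
    expression: str           # aggregation over a column
    allowed_dimensions: tuple

    def to_sql(self, dimensions: tuple = ()) -> str:
        unknown = set(dimensions) - set(self.allowed_dimensions)
        if unknown:
            raise ValueError(f"unsupported dimensions: {unknown}")
        select = ", ".join(dimensions + (f"{self.expression} AS {self.name}",))
        group_by = f" GROUP BY {', '.join(dimensions)}" if dimensions else ""
        return f"SELECT {select} FROM {self.table}{group_by}"

revenue = Metric("revenue", "fct_orders", "SUM(amount)", ("country", "order_date"))
print(revenue.to_sql(("country",)))
# SELECT country, SUM(amount) AS revenue FROM fct_orders GROUP BY country
```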
Data Science @ Microsoft: Anatomy of a chart
Data visualization is the interface between insight producers & consumers, and human perception heavily influences how a visualization is interpreted. Too often, the insight producer dumps every chart in front of the audience and leaves the interpretation entirely to them. The author recommends a curated process for telling data stories meaningfully.
https://medium.com/data-science-at-microsoft/anatomy-of-a-chart-9e420dc8495b
Elijah Meeks: Viz Palette for Data Visualization Color
Staying with data visualization, colors significantly shape perception. The author writes about best practices for choosing color combinations and introduces Viz Palette, a tool to pick and optimize colors in and out of JavaScript.
https://medium.com/@Elijah_Meeks/viz-palette-for-data-visualization-color-8e678d996077
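Viz Palette itself is a JavaScript tool, but the underlying practice carries over anywhere: pick a palette once and bind colors to categories so they stay consistent across charts. A small matplotlib sketch of that idea (the hex values and category names are just examples):

```python
# Choose a palette once and bind colors to categories so they stay
# consistent across every chart that shows the same categories.
import matplotlib.pyplot as plt

palette = {"web": "#1b9e77", "mobile": "#d95f02", "api": "#7570b3"}

daily_events = {"web": [120, 135, 150], "mobile": [200, 210, 190], "api": [80, 95, 110]}
for channel, counts in daily_events.items():
    plt.plot(counts, label=channel, color=palette[channel])  # same color for this channel everywhere
plt.legend()
plt.title("Events per day by channel")
plt.savefig("events_by_channel.png")
```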
Sponsored: The Data Stack Show Live - What is the Modern Data Stack?
Join The Data Stack Show Live for a special panel with experts from Databricks, dbt, Fivetran, Essence VC, and Hinge. The panel will look at the modern stack from all angles and discuss the future of data tooling.
https://rudderstack.com/video-library/the-data-stack-show-live-what-is-the-modern-data-stack/
Confluent: How to Survive an Apache Kafka Outage
Confluent writes about what can go wrong during a Kafka outage and best practices for handling the failures. The blog contains some interesting techniques for how Kafka producers can gracefully handle failure, and the discussion of fsync vs. async disk APIs is an exciting read.
https://www.confluent.io/blog/how-to-survive-a-kafka-outage/
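As a companion to the producer-side advice, here is a minimal sketch of a tolerant producer using the confluent-kafka Python client: idempotence, a bounded delivery timeout, and a delivery callback instead of blocking on every send. The broker addresses, topic, and timeout values are placeholders to tune for your cluster.

```python
# Sketch of a producer that tolerates broker unavailability: idempotence,
# bounded delivery timeout, and a delivery callback to handle failures.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker1:9092,broker2:9092",
    "enable.idempotence": True,     # avoid duplicates when retries kick in
    "acks": "all",                  # wait for in-sync replicas
    "delivery.timeout.ms": 120000,  # give up (and surface an error) after 2 minutes
    "retry.backoff.ms": 500,
})

def on_delivery(err, msg):
    if err is not None:
        # Don't lose the record: log it, spill to local storage, or re-enqueue.
        print(f"delivery failed for key={msg.key()}: {err}")
    else:
        print(f"delivered to {msg.topic()}[{msg.partition()}] offset {msg.offset()}")

producer.produce("orders", key=b"order-42", value=b'{"total": 99.5}', on_delivery=on_delivery)
producer.flush(10)  # bounded wait; during an outage this returns with messages still queued
```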
Yelp: Kafka on PaaSTA - Running Kafka on Kubernetes at Yelp
Continuing the Kafka infrastructure story, Yelp writes about its Kafka architecture on Kubernetes. The blog gives an overview of Yelp's use of Cruise Control to automate Kafka operations, and I highly recommend using it in production to reduce operational toil.
https://engineeringblog.yelp.com/2021/12/kafka-on-paasta-part-one.html
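If you do adopt Cruise Control, most interactions go through its REST API. A minimal status check might look like the sketch below; the host, port, and the exact shape of the JSON response are assumptions to verify against your own deployment.

```python
# Minimal status check against Cruise Control's REST API.
# The host/port and the response fields are assumptions; confirm the
# endpoint details against your Cruise Control deployment's docs.
import requests

CRUISE_CONTROL = "http://cruise-control.example.internal:9090"

resp = requests.get(
    f"{CRUISE_CONTROL}/kafkacruisecontrol/state",
    params={"json": "true"},
    timeout=10,
)
resp.raise_for_status()
state = resp.json()
print(state.get("ExecutorState", {}))  # e.g. whether a rebalance is currently executing
```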
KeepTruckin: How Standardized Tooling and Metadata Saved Our Data Organization
Modern data warehouses are built on many data sources and serve diverse data producers and consumers. As complexity grows, standardization of ownership, alerting, testing & quality plays a significant role in establishing trust in the data platform. KeepTruckin shares its experience of how standardized tooling & metadata saved its data org.
https://medium.com/keeptruckin-eng/how-metadata-saved-our-data-organization-cab3335eb4ae
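As a flavor of what "standardized metadata" can mean in practice, here is a hypothetical spec plus a check that every dataset declares an owner, an alert channel, and a freshness SLA. The field names are invented for illustration, not KeepTruckin's actual tooling.

```python
# Hypothetical standardized dataset metadata and a check that every dataset
# declares an owner, an alert channel, and a freshness SLA.
REQUIRED_KEYS = {"owner", "alert_channel", "freshness_sla_hours"}

datasets = {
    "fct_trips": {"owner": "logistics-data", "alert_channel": "#logistics-alerts", "freshness_sla_hours": 6},
    "dim_drivers": {"owner": "logistics-data"},  # missing alerting and SLA
}

def validate(metadata: dict) -> dict:
    """Return datasets that are missing required metadata keys."""
    return {name: sorted(REQUIRED_KEYS - spec.keys())
            for name, spec in metadata.items()
            if REQUIRED_KEYS - spec.keys()}

print(validate(datasets))  # {'dim_drivers': ['alert_channel', 'freshness_sla_hours']}
```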
Emily Thompson: Thinking of Analytics Tools as Products
It is evident that standardizing data asset management tooling greatly helps a data organization, but how does one start to think about it? The author makes the case for treating analytics tools as products to bring integrity to the data platform.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.