Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers
RudderStack
Provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.
Airbnb: Automating Data Protection at Scale
Airbnb writes a three-part series on automating data protection at scale, focusing on CDC pipelines. Automated data privacy management is critical for complying with GDPR & the California Consumer Privacy Act. Airbnb walks through the automation & alerting built around its data protection service. TIL: both Thrift & Protobuf support custom annotations (a small sketch of the idea follows the links below).
https://medium.com/airbnb-engineering/automating-data-protection-at-scale-part-3-34e592c45d46
Part 1: https://medium.com/airbnb-engineering/automating-data-protection-at-scale-part-1-c74909328e08
Part 2: https://medium.com/airbnb-engineering/automating-data-protection-at-scale-part-2-c2b8d2068216
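On that TIL: both IDLs let you attach custom annotations to schema fields, which a data protection service can read to locate PII columns. Showing real Thrift/Protobuf IDL plus generated code wouldn't be self-contained here, so below is a minimal Python analogue in which dataclass field metadata stands in for the schema annotations; the field names and the "privacy" key are invented for illustration, not Airbnb's actual scheme.

```python
# Illustrative analogue of annotation-driven PII tagging: dataclass field
# metadata plays the role of Thrift/Protobuf custom annotations.
from dataclasses import dataclass, field, fields

@dataclass
class GuestProfile:
    guest_id: int
    email: str = field(default="", metadata={"privacy": "PII"})
    phone: str = field(default="", metadata={"privacy": "PII"})
    country: str = field(default="", metadata={"privacy": "public"})

def pii_fields(schema) -> list[str]:
    """Return field names annotated as PII, e.g. to drive masking in a CDC pipeline."""
    return [f.name for f in fields(schema) if f.metadata.get("privacy") == "PII"]

print(pii_fields(GuestProfile))  # ['email', 'phone']
```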
Google AI: Interpretable Deep Learning for Time Series Forecasting
Most real-world datasets have a time component, and forecasting the future can unlock significant value. Google writes about the Temporal Fusion Transformer (TFT), an attention-based DNN model for interpretable multi-horizon time series forecasting.
https://ai.googleblog.com/2021/12/interpretable-deep-learning-for-time.html
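"Multi-horizon" simply means predicting several future steps in one shot rather than one step at a time. The toy sketch below frames that task with a seasonal-naive baseline; it is not TFT, just the problem setup the model addresses.

```python
# Toy illustration of the multi-horizon forecasting task (not TFT itself):
# given a history window, emit predictions for several future steps at once.
import numpy as np

rng = np.random.default_rng(0)
# Hourly series with daily seasonality plus noise.
history = 10 + np.sin(np.arange(96) * 2 * np.pi / 24) + rng.normal(0, 0.1, 96)

def seasonal_naive(y: np.ndarray, horizon: int, season: int = 24) -> np.ndarray:
    """Forecast `horizon` steps ahead by repeating the last observed season."""
    last_season = y[-season:]
    reps = int(np.ceil(horizon / season))
    return np.tile(last_season, reps)[:horizon]

forecast = seasonal_naive(history, horizon=12)  # 12 predictions in one shot
print(forecast.round(2))
```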
Nick Handel: A brief history of the metrics store
The increasing specialization of data engineering opens the door to a lot of innovation in how we utilize data at scale. The author traces a timeline of data engineering practice from Kimball-style data warehouses to the metrics store model. If the metrics store gains mass adoption, I presume we will see a new family of specialized metrics databases, similar to Prometheus or InfluxDB.
https://towardsdatascience.com/a-brief-history-of-the-metrics-store-28208ec8f6f1
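To make the idea concrete, a metrics store centralizes definitions like the hypothetical one below, written once and compiled to SQL on demand. The field names here are invented and do not follow any particular product's API.

```python
# Hypothetical sketch of a centrally defined metric compiled to SQL on demand.
from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    name: str
    table: str
    expression: str           # aggregation over a column
    allowed_dimensions: tuple

    def to_sql(self, dimensions: tuple = ()) -> str:
        unknown = set(dimensions) - set(self.allowed_dimensions)
        if unknown:
            raise ValueError(f"unsupported dimensions: {unknown}")
        select = ", ".join(dimensions + (f"{self.expression} AS {self.name}",))
        group_by = f" GROUP BY {', '.join(dimensions)}" if dimensions else ""
        return f"SELECT {select} FROM {self.table}{group_by}"

revenue = Metric("revenue", "fct_orders", "SUM(amount)", ("country", "order_date"))
print(revenue.to_sql(("country",)))
# SELECT country, SUM(amount) AS revenue FROM fct_orders GROUP BY country
```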
Data Science @ Microsoft: Anatomy of a chart
Data visualization is the interface between insight producers & consumers, and human perception heavily influences how a visualization is interpreted. Too often, the insight producer dumps every chart in front of the audience and leaves the interpretation entirely to them. The author recommends a curated process for telling data stories meaningfully.
https://medium.com/data-science-at-microsoft/anatomy-of-a-chart-9e420dc8495b
Elijah Meeks: Viz Palette for Data Visualization Color
Staying with data visualization, colors significantly shape perception. The author writes about best practices for choosing color combinations and introduces Viz Palette, a tool to pick and optimize colors in and out of JavaScript.
https://medium.com/@Elijah_Meeks/viz-palette-for-data-visualization-color-8e678d996077
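Viz Palette itself is a JavaScript tool, but the underlying practice carries over anywhere: pick a palette once and bind colors to categories so they stay consistent across charts. A small matplotlib sketch of that idea (the hex values and category names are just examples):

```python
# Choose a palette once and bind colors to categories so they stay
# consistent across every chart that shows the same categories.
import matplotlib.pyplot as plt

palette = {"web": "#1b9e77", "mobile": "#d95f02", "api": "#7570b3"}

daily_events = {"web": [120, 135, 150], "mobile": [200, 210, 190], "api": [80, 95, 110]}
for channel, counts in daily_events.items():
    plt.plot(counts, label=channel, color=palette[channel])  # same color for this channel everywhere
plt.legend()
plt.title("Events per day by channel")
plt.savefig("events_by_channel.png")
```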
Sponsored: The Data Stack Show Live - What is the Modern Data Stack?
Join The Data Stack Show Live for a special panel with experts from Databricks, dbt, Fivetran, Essence VC, and Hinge. The panel will look at the modern stack from all angles and discuss the future of data tooling.
https://rudderstack.com/video-library/the-data-stack-show-live-what-is-the-modern-data-stack/
Confluent: How to Survive an Apache Kafka Outage
Confluent writes about what can go wrong during a Kafka outage and best practices for handling the failures. The blog contains some interesting techniques for how Kafka producers can gracefully handle failure, and the discussion of fsync vs. async disk APIs is an exciting read.
https://www.confluent.io/blog/how-to-survive-a-kafka-outage/
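As a companion to the producer-side advice, here is a minimal sketch of a tolerant producer using the confluent-kafka Python client: idempotence, a bounded delivery timeout, and a delivery callback instead of blocking on every send. The broker addresses, topic, and timeout values are placeholders to tune for your cluster.

```python
# Sketch of a producer that tolerates broker unavailability: idempotence,
# bounded delivery timeout, and a delivery callback to handle failures.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker1:9092,broker2:9092",
    "enable.idempotence": True,     # avoid duplicates when retries kick in
    "acks": "all",                  # wait for in-sync replicas
    "delivery.timeout.ms": 120000,  # give up (and surface an error) after 2 minutes
    "retry.backoff.ms": 500,
})

def on_delivery(err, msg):
    if err is not None:
        # Don't lose the record: log it, spill to local storage, or re-enqueue.
        print(f"delivery failed for key={msg.key()}: {err}")
    else:
        print(f"delivered to {msg.topic()}[{msg.partition()}] offset {msg.offset()}")

producer.produce("orders", key=b"order-42", value=b'{"total": 99.5}', on_delivery=on_delivery)
producer.flush(10)  # bounded wait; during an outage this returns with messages still queued
```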
Yelp: Kafka on PaaSTA - Running Kafka on Kubernetes at Yelp
Continuing the Kafka infrastructure story, Yelp writes about its Kafka architecture on Kubernetes. The blog gives an overview of Yelp's use of Cruise Control to automate Kafka operations, and I highly recommend using it in production to reduce operational toil.
https://engineeringblog.yelp.com/2021/12/kafka-on-paasta-part-one.html
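If you do adopt Cruise Control, most interactions go through its REST API. A minimal status check might look like the sketch below; the host, port, and the exact shape of the JSON response are assumptions to verify against your own deployment.

```python
# Minimal status check against Cruise Control's REST API.
# The host/port and the response fields are assumptions; confirm the
# endpoint details against your Cruise Control deployment's docs.
import requests

CRUISE_CONTROL = "http://cruise-control.example.internal:9090"

resp = requests.get(
    f"{CRUISE_CONTROL}/kafkacruisecontrol/state",
    params={"json": "true"},
    timeout=10,
)
resp.raise_for_status()
state = resp.json()
print(state.get("ExecutorState", {}))  # e.g. whether a rebalance is currently executing
```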
KeepTruckin: How Standardized Tooling and Metadata Saved Our Data Organization
Modern data warehouses are built on many data sources and serve diverse data producers and consumers. As complexity grows, standardization of ownership, alerting, testing & quality plays a significant role in establishing trust in the data platform. KeepTruckin shares its experience of how standardized tooling & metadata saved its data org.
https://medium.com/keeptruckin-eng/how-metadata-saved-our-data-organization-cab3335eb4ae
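As a flavor of what "standardized metadata" can mean in practice, here is a hypothetical spec plus a check that every dataset declares an owner, an alert channel, and a freshness SLA. The field names are invented for illustration, not KeepTruckin's actual tooling.

```python
# Hypothetical standardized dataset metadata and a check that every dataset
# declares an owner, an alert channel, and a freshness SLA.
REQUIRED_KEYS = {"owner", "alert_channel", "freshness_sla_hours"}

datasets = {
    "fct_trips": {"owner": "logistics-data", "alert_channel": "#logistics-alerts", "freshness_sla_hours": 6},
    "dim_drivers": {"owner": "logistics-data"},  # missing alerting and SLA
}

def validate(metadata: dict) -> dict:
    """Return datasets that are missing required metadata keys."""
    return {name: sorted(REQUIRED_KEYS - spec.keys())
            for name, spec in metadata.items()
            if REQUIRED_KEYS - spec.keys()}

print(validate(datasets))  # {'dim_drivers': ['alert_channel', 'freshness_sla_hours']}
```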
Emily Thompson: Thinking of Analytics Tools as Products
It is evident that standardizing data asset management tooling greatly helps a data organization, but how does one start to think about it? The author makes the case for treating analytics tools as products to bring integrity to the data platform.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.