Data Engineering Weekly #6

Weekly data engineering newsletter

Welcome to the sixth edition of the data engineering newsletter. This week's edition covers a Kafka Summit recap, Type 2 dimensional modeling, securing Presto, handling bias in AI, and ML applications from Shopify, DoorDash, LinkedIn & Confluent.


It's Kafka Summit week, and more than 33,000 registered users participated in the virtual conference. Confluent writes about the highlights of the summit in a couple of blog posts. The technical deep dive on Kafka tiered storage and the ZooKeeper replacement is the major highlight of Day 1.

https://www.confluent.io/blog/kafka-summit-2020-day-1-recap/

https://www.confluent.io/blog/kafka-summit-2020-session-highlights/


Slowly changing dimensions are a challenge for data models, especially in the big data era. In business analytics, we care not only about the current state but also about the historical state. Shopify writes an exciting article about Type 2 dimensional modeling, building these data models with modern ETL tooling like PySpark and dbt (data build tool), and the lessons learned.

https://engineering.shopify.com/blogs/engineering/track-state-type-2-dimensional-models
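
To make the Type 2 idea concrete, here is a minimal sketch in plain Python. It is my own illustration, not code from the Shopify article, and it does not use PySpark or dbt. The row layout (key, attrs, valid_from, valid_to, is_current) and the function names apply_scd2 and state_as_of are assumptions made for the example. When an attribute changes, the current row is closed and a new version is appended, so both the current and the historical state stay queryable.

```python
from datetime import date


def apply_scd2(history, snapshot, as_of):
    """Apply a snapshot of current source state to a Type 2 history.

    history:  list of dicts with keys key, attrs, valid_from, valid_to, is_current
    snapshot: dict mapping key -> attrs (hypothetical source-system state)
    as_of:    date on which the snapshot was taken
    """
    result = []
    unchanged_keys = set()
    for row in history:
        row = dict(row)  # copy so the caller's rows are not mutated
        if row["is_current"]:
            new_attrs = snapshot.get(row["key"])
            if new_attrs is not None and new_attrs != row["attrs"]:
                # Attributes changed: close the old version instead of overwriting it.
                row["valid_to"] = as_of
                row["is_current"] = False
            else:
                unchanged_keys.add(row["key"])
        result.append(row)
    for key, attrs in snapshot.items():
        if key not in unchanged_keys:
            # New key, or a changed key whose old version was just closed.
            result.append({"key": key, "attrs": attrs, "valid_from": as_of,
                           "valid_to": None, "is_current": True})
    return result


def state_as_of(history, key, when):
    """Return the attrs that were valid for key on the given date, or None."""
    for row in history:
        if row["key"] != key or row["valid_from"] > when:
            continue
        if row["valid_to"] is None or when < row["valid_to"]:
            return row["attrs"]
    return None


history = apply_scd2([], {"p1": {"price": 10}}, date(2020, 1, 1))
history = apply_scd2(history, {"p1": {"price": 12}}, date(2020, 2, 1))
```

After the two calls above, history holds two versions of p1, and state_as_of(history, "p1", date(2020, 1, 15)) still returns the old price.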


Presto has become an essential component of the data infrastructure. Grab's data team writes about DataGateway, a Presto gateway service that intercepts Presto queries and authenticates users against their Access Control List (ACL). The interceptor model is an exciting read when compared with Apache Ranger's plugin model.

https://engineering.grab.com/data-gateway
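
The interceptor idea can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not Grab's DataGateway code: the names make_gateway and AccessDenied are hypothetical, the ACL is a plain dict of user to allowed tables, and the tables a query touches are passed in explicitly, whereas a real gateway would have to derive them from the query itself.

```python
class AccessDenied(Exception):
    """Raised when a user queries a table outside their ACL."""


def make_gateway(acl, backend):
    """Wrap a backend so every query is checked against the ACL first.

    acl:     dict mapping user -> set of table names the user may read
    backend: callable that takes the SQL text and runs it (a stand-in for Presto)
    """
    def submit(user, tables, sql):
        allowed = acl.get(user, set())
        denied = sorted(set(tables) - allowed)
        if denied:
            # Reject before the query ever reaches the backend.
            raise AccessDenied(f"{user} may not read: {', '.join(denied)}")
        return backend(sql)
    return submit


# Hypothetical users and tables, with a fake backend that just echoes the SQL.
acl = {"alice": {"orders", "drivers"}, "bob": {"orders"}}
gateway = make_gateway(acl, backend=lambda sql: f"ran: {sql}")
ok_result = gateway("alice", ["orders"], "SELECT count(*) FROM orders")
```

The point of the pattern is that the check lives in front of the query engine, so the engine itself needs no plugin.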


DoorDash writes about using a Human-in-the-Loop to Overcome the Cold Start Problem in Menu Item Tagging. Any ML-model-based solution faces the cold start problem, where we don’t have enough labeled samples for each class to build a performant model. The blog post is an exciting read that narrates the lifecycle of building human-in-the-loop ML models and why they decided to go with ML-generated tagging rather than an embedding model.

https://doordash.engineering/2020/08/28/overcome-the-cold-start-problem-in-menu-item-tagging/


A primary concern with the growth of AI is widespread societal injustice based on human biases reflected both in the data used to train AI models and in the models themselves. LinkedIn writes about its broader initiative to bring fairness to AI applications. As part of that effort, LinkedIn open-sourced the LinkedIn Fairness Toolkit (LiFT), a Scala/Spark library that enables the measurement of fairness in large-scale machine learning workflows. LiFT can be deployed in training and scoring workflows to measure biases in training data, evaluate different fairness notions for ML models, and detect statistically significant differences in their performance across different subgroups.

https://engineering.linkedin.com/blog/2020/lift-addressing-bias-in-large-scale-ai-applications
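
For a feel of what measuring fairness across subgroups can mean, here is a small Python sketch of one common notion, demographic parity: compare the rate of positive predictions between groups. This is my own illustration and not LiFT's API (LiFT is a Scala/Spark library); the function names and the toy data are assumptions, and it does no statistical significance testing.

```python
from collections import defaultdict


def positive_rate_by_group(predictions, groups):
    """Share of positive (== 1) predictions within each subgroup."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for pred, group in zip(predictions, groups):
        counts[group][0] += int(pred == 1)
        counts[group][1] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}


def demographic_parity_gap(predictions, groups):
    """Largest difference in positive rate between any two subgroups."""
    rates = positive_rate_by_group(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Toy model outputs for two hypothetical subgroups "a" and "b".
preds = [1, 1, 0, 1, 0, 0, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
rates = positive_rate_by_group(preds, groups)
gap = demographic_parity_gap(preds, groups)
```

A gap near zero means the model hands out positive predictions at similar rates across groups; a large gap is a signal to investigate, not a verdict on its own.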


Databricks writes about how we can adopt the Databricks platform to quantify the likelihood of customer churn. Though the blog is skewed towards the Databricks platform, the lifecycle walkthrough of data preparation, feature engineering, model selection, and hyperparameter tuning is an exciting read.

https://databricks.com/blog/2020/08/24/profit-driven-retention-management-with-machine-learning.html


I came across this GitHub repo recently, which contains a rich collection of research papers relevant to data science and data engineering.

https://github.com/jarikoi/interesting-papers


GPU-accelerated, AI-powered applications are breaking new ground. NVIDIA, a leader in GPU research, writes about new research and enhanced tools for creators. OpenVDB is the industry-standard library used by VFX studios for simulating water, fire, smoke, clouds, and other effects. It's interesting to read how NVIDIA's NanoVDB adds GPU support through an OpenVDB-compatible data structure, so users can leverage GPUs to accelerate workflows such as ray tracing, filtering, and collision detection while maintaining compatibility with OpenVDB.

https://blogs.nvidia.com/blog/2020/08/25/nvidia-siggraph/


Data preparation for geographic data is always a challenging task. The bigger the data, the more problematic the process, especially when pruning noisy outliers and anomalies. ArcGIS is a geographic information system for working with maps and geographic information. The article walks through a notebook example of cleaning geographic data and integrating it with ArcGIS.

http://thunderheadxpler.blogspot.com/2020/08/on-machine-learning-in-arcgis-and-data.html


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.