Data Engineering Weekly #10

Weekly data engineering newsletter

Welcome to the 10th edition of the data engineering newsletter. This week's release is a new set of articles that focus on scaling the data platform, ClickHouse vs. Druid, Apache Kafka vs. Pulsar, Apache Spark performance tuning, and TensorFlow Recommenders, from Google, Twitter, LinkedIn, eBay, DoorDash, Zendesk & Criteo.


DoorDash writes an excellent blog post on its journey to build a data platform that delights its customers. The article is a brilliant reference model for implementing data engineering that impacts an enterprise.

https://doordash.engineering/2020/09/25/how-doordash-is-scaling-its-data-platform/


LinkedIn writes about the evolution of its experimentation platform, T-REX. It is an excellent read to understand the prehistory of one of the largest experimentation platforms and how it gradually evolved from an experiment management and delivery system with a UI application into a platform that comprises targeting, dynamic configuration, experiment infrastructure, insight and reporting pipelines, a notification system, and a seamless UI experience.

https://engineering.linkedin.com/blog/2020/our-evolution-towards-t-rex--the-prehistory-of-experimentation-i


Twitter recently built a streaming data logging pipeline for its home timeline prediction system using Apache Kafka and Kafka Streams, replacing the existing offline batch pipeline at a massive scale. The blog post describes a customized Kafka Streams join DSL that supports the ML-specific logging pipeline at Twitter's scale.

https://www.confluent.io/blog/how-twitter-built-a-machine-learning-pipeline-with-kafka/


eBay's OLAP engine processes more than 1 billion OLAP events per second. The legacy system, built on top of Druid, proved expensive to run. eBay writes about its journey towards migrating to ClickHouse on Kubernetes.

https://tech.ebayinc.com/engineering/ou-online-analytical-processing/


Zendesk writes an excellent post comparing Apache Kafka with Pulsar. Tiered storage, dynamic scaling, and the growing number of partitions are essential considerations. The evaluation concluded that though Pulsar's features are exciting, the system's stability still requires attention.

https://medium.com/zendesk-engineering/evaluating-apache-pulsar-92e6ed3fc792


Airbnb open-sources its React visualization library, Visx. The primary advantage of Visx is that it reduces context switching for front-end engineers familiar with React and lets them build their own custom charting library.

https://medium.com/airbnb-engineering/introducing-visx-from-airbnb-fd6155ac4658


Clickstreams and user activities are at the center stage of our data product lines, yet detailed event data processing, especially around timestamps and event order, is challenging. Expedia writes an excellent blog post that makes a strong case for being vigilant about time ordering in event processing.

https://medium.com/expedia-group-tech/be-vigilant-about-time-order-in-event-based-data-processing-cbfde600dd7d
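
As a small illustration of why time order matters, here is a minimal sketch, not taken from the Expedia post, of events that arrive out of order and are re-sorted by the time they actually happened. The event fields (`action`, `event_time`, `arrival_order`) and the sample data are hypothetical.

```python
from datetime import datetime, timezone


def ts(second):
    """Build a UTC timestamp on a fixed (made-up) day for the sample data."""
    return datetime(2020, 9, 1, 12, 0, second, tzinfo=timezone.utc)


# Hypothetical clickstream, listed in the order the pipeline received it.
# The "click" arrived before the "search" that preceded it for the user.
events = [
    {"action": "click", "event_time": ts(5), "arrival_order": 1},
    {"action": "search", "event_time": ts(0), "arrival_order": 2},
    {"action": "book", "event_time": ts(9), "arrival_order": 3},
]


def order_by_event_time(events):
    """Sort by when the event happened; break ties by arrival order."""
    return sorted(events, key=lambda e: (e["event_time"], e["arrival_order"]))


arrival_sequence = [e["action"] for e in events]
event_sequence = [e["action"] for e in order_by_event_time(events)]

print("as received:  ", arrival_sequence)
print("by event time:", event_sequence)
```

Sorting a finished batch is the easy case; in a streaming job the same idea usually needs a bounded wait for late events, which this sketch does not attempt.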


Artificial neural networks offer significant performance benefits compared to other methodologies, but often at the expense of interpretability. The blog post makes the case for explainable AI (XAI) to provide more transparency.

https://www.infoq.com/articles/explainable-ai-xai/


Criteo writes about Apache Spark performance tuning, focused on query compilation. The blog post explains the difference between the RDD's volcano model and Spark SQL's whole-stage code generation, with sample code to validate the performance.

https://medium.com/criteo-labs/under-the-hood-of-spark-performance-or-why-query-compilation-matters-c084e749be87
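
To make the contrast concrete, here is a rough Python analogy, not Spark code and not from the Criteo post. The first version chains operators that each pull rows from their child, in the spirit of the volcano (iterator) model. The second collapses the same filter, projection, and sum into one loop, which is roughly the shape of code that whole-stage code generation aims to produce. All class and function names are hypothetical.

```python
class Scan:
    """Leaf operator: hands out rows from an in-memory source."""

    def __init__(self, rows):
        self.rows = rows

    def __iter__(self):
        return iter(self.rows)


class Filter:
    """Pulls rows from its child and keeps those matching the predicate."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def __iter__(self):
        for row in self.child:
            if self.predicate(row):
                yield row


class Project:
    """Pulls rows from its child and applies a function to each one."""

    def __init__(self, child, fn):
        self.child = child
        self.fn = fn

    def __iter__(self):
        for row in self.child:
            yield self.fn(row)


def volcano_sum(rows):
    # Every row crosses several operator boundaries (one call per operator).
    plan = Project(Filter(Scan(rows), lambda r: r % 2 == 0), lambda r: r * 10)
    return sum(plan)


def fused_sum(rows):
    # The same query collapsed by hand into a single tight loop.
    total = 0
    for r in rows:
        if r % 2 == 0:
            total += r * 10
    return total


print(volcano_sum(range(10)), fused_sum(range(10)))
```

Both functions compute the same answer; the difference is how many function calls and intermediate hand-offs each row goes through on the way.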


Can We Build a 100% Serverless ETL Following CI/CD Principles? The blog post is an excellent walkthrough of building a data pipeline using dbt, Google BigQuery, and GitHub Actions. I'm excited about the direction of commoditizing the data infrastructure.

https://medium.com/swlh/dawn-of-dataops-can-we-build-a-100-serverless-etl-following-ci-cd-principles-3ca587ba1ec0


The blog post is an excellent reference on building scalable Airflow infrastructure on top of Kubernetes, covering data volumes, metrics collection, and secrets storage.

https://www.infoq.com/articles/distributed-data-pipelines-apache-airflow/


The recommender system, once the flagship area of interest in the ML world, is getting more commoditized. From recommending movies or restaurants to coordinating fashion accessories and highlighting blog posts and news articles, recommender systems are essential in machine learning. Google introduces TensorFlow Recommenders (TFRS), an open-source TensorFlow package that makes building, evaluating, and serving sophisticated recommender models easy.

https://blog.tensorflow.org/2020/09/introducing-tensorflow-recommenders.html


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.