Data Engineering Weekly

Data Engineering Weekly #10

Weekly data engineering newsletter

Ananth Packkildurai
Sep 28, 2020

Welcome to the 10th edition of the data engineering newsletter. This week's release is a new set of articles from Google, Twitter, LinkedIn, eBay, DoorDash, Zendesk, and Criteo that focus on scaling the data platform, ClickHouse vs. Druid, Apache Kafka vs. Pulsar, Apache Spark performance tuning, and TensorFlow Recommenders.


DoorDash writes an excellent blog post on its journey of building a data platform to delight its customers. The article is a brilliant reference model for implementing data engineering that makes an impact across an enterprise.

https://doordash.engineering/2020/09/25/how-doordash-is-scaling-its-data-platform/


LinkedIn writes about the evolution of its experimentation platform, T-REX. It is an excellent read for understanding the prehistory of one of the largest experimentation platforms. It started as an experiment management and delivery system with a UI application and gradually evolved into a platform that comprises targeting, dynamic configuration and experiment infrastructure, insight and reporting pipelines, a notification system, and a seamless UI experience.

https://engineering.linkedin.com/blog/2020/our-evolution-towards-t-rex--the-prehistory-of-experimentation-i


Twitter recently built a streaming data logging pipeline for its home timeline prediction system using Apache Kafka and Kafka Streams, replacing the existing offline batch pipeline at a massive scale. The blog post describes a customized Kafka Streams join DSL that supports the ML-specific logging pipeline at Twitter scale.

https://www.confluent.io/blog/how-twitter-built-a-machine-learning-pipeline-with-kafka/


eBay's OLAP engine processes more than 1 billion OLAP events per second. The legacy system, built on top of Druid, was found to be expensive to run. eBay writes about its journey of migrating to ClickHouse on Kubernetes.

https://tech.ebayinc.com/engineering/ou-online-analytical-processing/


Zendesk writes an excellent post comparing Apache Kafka with Pulsar. Tiered storage, dynamic scaling, and the growing number of partitions are essential considerations. The evaluation concluded that although Pulsar's features are exciting, the system's stability still requires attention.

https://medium.com/zendesk-engineering/evaluating-apache-pulsar-92e6ed3fc792


Airbnb open-sources its React visualization library, Visx. The primary advantage of Visx is that it reduces context switching for front-end engineers familiar with React and lets them build custom charting libraries.

https://medium.com/airbnb-engineering/introducing-visx-from-airbnb-fd6155ac4658


Clickstreams and user activities are at the center stage of our data product lines, yet processing detailed event data, especially timestamps and event order, is challenging. Expedia writes an excellent blog post that makes a strong case for being vigilant about time ordering in event processing.

https://medium.com/expedia-group-tech/be-vigilant-about-time-order-in-event-based-data-processing-cbfde600dd7d
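
To make the time-ordering concern concrete, here is a minimal Python sketch. It is not taken from the Expedia post: the `Event` fields and the `order_by_event_time` helper are hypothetical names, and the sketch assumes each event carries both the time it occurred and the time the pipeline received it. It shows that sorting by event time can recover the true sequence when events arrive out of order.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    user_id: str
    action: str
    event_time: int    # when the action happened (epoch millis, assumed)
    arrival_time: int  # when the pipeline received it (epoch millis, assumed)


def order_by_event_time(events: Iterable[Event]) -> List[Event]:
    """Sort by event time, breaking ties by arrival time for a stable result."""
    return sorted(events, key=lambda e: (e.event_time, e.arrival_time))


# The "search" happened first but arrived last, for example after a client retry.
received = [
    Event("u1", "view_hotel", event_time=2_000, arrival_time=2_050),
    Event("u1", "book", event_time=3_000, arrival_time=3_020),
    Event("u1", "search", event_time=1_000, arrival_time=3_500),
]

arrival_order = [e.action for e in received]
event_order = [e.action for e in order_by_event_time(received)]
```

Processing in arrival order would see the booking before the search. In a streaming setting, the same idea usually also needs a bounded wait for late events, which this sketch leaves out.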


Artificial neural networks offer significant performance benefits compared to other methodologies, but often at the expense of interpretability. The blog post makes the case for explainable AI (XAI) to provide more transparency.

https://www.infoq.com/articles/explainable-ai-xai/


Criteo writes about Apache Spark performance tuning, focusing on query compilation. The blog post explains the difference between RDD's volcano model and Spark SQL's whole-stage code generation, with sample code to validate the performance.

https://medium.com/criteo-labs/under-the-hood-of-spark-performance-or-why-query-compilation-matters-c084e749be87
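
The contrast between the two execution styles can be sketched in plain Python. This is an illustration only, not Spark's implementation and not code from the Criteo post: the `Scan`, `Filter`, and `Project` classes and both pipeline functions are hypothetical names. The first half pulls rows one at a time through chained operators, as in a volcano-style iterator model. The second half collapses the same query into a single loop, which is the idea behind whole-stage code generation.

```python
from typing import Callable, Iterator, List


class Scan:
    """Leaf operator: yields rows from an in-memory list."""

    def __init__(self, rows: List[int]) -> None:
        self.rows = rows

    def __iter__(self) -> Iterator[int]:
        return iter(self.rows)


class Filter:
    """Pulls rows from its child and keeps those matching the predicate."""

    def __init__(self, child, predicate: Callable[[int], bool]) -> None:
        self.child = child
        self.predicate = predicate

    def __iter__(self) -> Iterator[int]:
        for row in self.child:
            if self.predicate(row):
                yield row


class Project:
    """Pulls rows from its child and applies a function to each one."""

    def __init__(self, child, fn: Callable[[int], int]) -> None:
        self.child = child
        self.fn = fn

    def __iter__(self) -> Iterator[int]:
        for row in self.child:
            yield self.fn(row)


def volcano_pipeline(rows: List[int]) -> int:
    # Each row crosses three operator boundaries before it is summed.
    plan = Project(Filter(Scan(rows), lambda x: x % 2 == 0), lambda x: x * 10)
    return sum(plan)


def fused_pipeline(rows: List[int]) -> int:
    # The same query written as one tight loop with no per-row operator calls.
    total = 0
    for x in rows:
        if x % 2 == 0:
            total += x * 10
    return total


rows = list(range(10))
volcano_result = volcano_pipeline(rows)
fused_result = fused_pipeline(rows)
```

Both functions compute the same answer. The fused version avoids the per-row call overhead between operators, which is the cost that query compilation aims to remove.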


Can We Build a 100% Serverless ETL Following CI/CD Principles? The blog post is an excellent narration of building a data pipeline using dbt, Google BigQuery, and GitHub Actions. I'm excited about the direction of commoditizing the data infrastructure.

https://medium.com/swlh/dawn-of-dataops-can-we-build-a-100-serverless-etl-following-ci-cd-principles-3ca587ba1ec0


The blog post is an excellent reference on building scalable Apache Airflow infrastructure on top of Kubernetes, covering data volumes, metrics collection, and secrets storage.

https://www.infoq.com/articles/distributed-data-pipelines-apache-airflow/


The recommender system, once the flagship area of interest in the ML world, is getting more commoditized. From recommending movies or restaurants to coordinating fashion accessories and highlighting blog posts and news articles, recommender systems are essential in machine learning. Google introduces TensorFlow Recommenders (TFRS), an open-source TensorFlow package that makes building, evaluating, and serving sophisticated recommender models easy.

https://blog.tensorflow.org/2020/09/introducing-tensorflow-recommenders.html


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.

