Data Engineering Weekly #8

Weekly data engineering newsletter

Sep 14, 2020

Welcome to the 8th edition of the data engineering newsletter. This week's release is a new set of articles that focus on Dagster, Kafka, and experimentation platforms, with posts from Airbnb, DoorDash, Confluent, Eventbrite, and Expedia.

Data validation is an integral part of the data pipeline. Dagster writes about its integration with Great Expectations, the fastest-growing open-source data validation and documentation framework.

https://medium.com/dagster-io/great-expectations-for-dagster-b58d4f45c342

Confluent writes a post on implementing message prioritization in Apache Kafka, an important characteristic of job scheduler systems. The bucket priority pattern, with its Bucket Priority Assigner, is an exciting pattern to read about.

https://www.confluent.io/blog/prioritize-messages-in-kafka/
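To make the bucket priority idea concrete, here is a minimal sketch in plain Python with no Kafka client involved. It assumes that a topic's partitions are split into buckets by percentage share and that the prefix of the message key names the bucket. The class name, the bucket names ("platinum", "gold"), the percentages, and the key format are illustrative assumptions, not Confluent's implementation.

```python
from itertools import count


def build_buckets(num_partitions, allocations):
    """Split partition ids 0..num_partitions-1 into contiguous buckets.

    `allocations` maps a bucket name to its percentage share (hypothetical
    values). The last bucket takes whatever partitions remain, so every
    partition belongs to exactly one bucket.
    """
    buckets = {}
    start = 0
    names = list(allocations)
    for i, name in enumerate(names):
        if i == len(names) - 1:
            size = num_partitions - start
        else:
            size = round(num_partitions * allocations[name] / 100)
        buckets[name] = list(range(start, start + size))
        start += size
    return buckets


class BucketPriorityAssigner:
    """Illustrative stand-in for a producer-side partitioner."""

    def __init__(self, num_partitions, allocations):
        self.buckets = build_buckets(num_partitions, allocations)
        # One round-robin counter per bucket.
        self._counters = {name: count() for name in self.buckets}

    def partition_for(self, key):
        # Assumed key format: "<bucket>-<anything>", e.g. "platinum-order-42".
        bucket_name = key.split("-", 1)[0]
        partitions = self.buckets[bucket_name]
        # Round-robin across the partitions that belong to this bucket.
        return partitions[next(self._counters[bucket_name]) % len(partitions)]


assigner = BucketPriorityAssigner(10, {"platinum": 70, "gold": 30})
print(assigner.buckets)
print(assigner.partition_for("platinum-order-42"))
print(assigner.partition_for("gold-order-7"))
```

With more partitions in the higher-priority bucket, more consumers can work on that bucket in parallel, which is where the prioritization comes from. A key with an unknown prefix raises a KeyError in this sketch; a real partitioner would need a fallback.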

SeatGeek open-sources Druzhba, its data pipeline framework for extracting and loading data from various sources.

https://chairnerd.seatgeek.com/druzhba-open-source-release/

Airbnb writes the second part of its Project Lighthouse series, on measuring discrepancies in Airbnb guest acceptance rates using anonymized demographic data.

https://medium.com/airbnb-engineering/project-lighthouse-part-2-measurement-with-anonymized-data-69fb01eac88

An experimentation platform is an essential part of rapid product development. DoorDash writes about Curie, its experimentation platform, and the journey from ad hoc manual analysis to automating the experimentation lifecycle.

https://doordash.engineering/2020/09/09/experimentation-analysis-platform-mvp/

Eventbrite writes about building its new protest map feature. It’s an exciting read that covers the executive buy-in, the difficulties in data collection, and the challenges of keeping the data recent.

https://www.eventbrite.com/engineering/building-a-protest-map-a-behind-the-scenes-look/

Expedia writes about Hyperspace by Microsoft, an indexing subsystem built on top of Apache Spark that lets you create indexes to support ad hoc queries, just like a traditional database. It’s an exciting read on how Hyperspace offers index optimization on top of Apache Spark.

https://medium.com/expedia-group-tech/indexing-spark-data-with-microsofts-hyperspace-ec4de4b93ba3

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.
