Welcome to the 19th edition of the data engineering newsletter. This week's release is a new set of articles covering part 2 of Data Quality at Airbnb, dynamic data testing, Medium's story on why counting is a hard problem, an opinionated view on AWS managed Apache Airflow, challenges in deploying ML applications from the University of Cambridge, Tableau's Snowflake migration, Grab's fare storage built on event sourcing, the data platform at Bukalapak, and T3GO's experience with Apache Hudi and Alluxio.
Airbnb writes the second part of its data quality series, which deals with certified data products. The post describes data quality as a multi-dimensional problem spanning accuracy, availability, consistency, cost efficiency, usability, and timeliness.
https://medium.com/airbnb-engineering/data-quality-at-airbnb-870d03080469
Part 1 of Data Quality at Airbnb
Data quality is essential for building data infrastructure, but how do you build a data quality framework? Dynamic Data Testing is an excellent article that lays out the key design constraints and a maturity framework that progresses from static data testing to dynamic data testing.
https://medium.com/anomalo-hq/dynamic-data-testing-f831435dba90
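To make the static-versus-dynamic distinction concrete, here is a minimal sketch in Python. It is my own illustration and not the article's method: the function names, the fixed threshold, the z-score cutoff, and the sample row counts are all hypothetical.

```python
from statistics import mean, stdev

def static_row_count_test(row_count, minimum=1000):
    """Static test: pass if the count clears a fixed, hand-picked threshold.

    The default of 1000 is a hypothetical value chosen for illustration.
    """
    return row_count >= minimum

def dynamic_row_count_test(row_count, history, max_z=3.0):
    """Dynamic test: pass if the count is close to its own recent history.

    Uses a simple z-score against past daily counts; max_z is an assumed cutoff.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # No variation in history: only an identical value passes.
        return row_count == mu
    return abs(row_count - mu) / sigma <= max_z

# Hypothetical daily row counts for the past week, then a sudden drop today.
history = [10_050, 9_980, 10_120, 10_010, 9_940, 10_070, 10_030]
today = 6_000

static_ok = static_row_count_test(today)             # fixed threshold still passes
dynamic_ok = dynamic_row_count_test(today, history)  # history-based check fails
```

In this made-up scenario the static check stays green because 6,000 rows is still above the fixed minimum, while the history-based check flags the drop. A production system would likely need to account for seasonality and trends, which this sketch ignores.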
Counting is a surprisingly hard problem in computer science. Twitter famously admitted a user-count error in one of its earnings calls. Medium went through a similar problem when a recent wave of users questioned the validity of the follower counts displayed on their profiles. Medium describes its experience dealing with the counting error and how a backfill changed the narrative.
https://medium.engineering/counting-your-followers-facbfafe45d9
Challenges in Deploying Machine Learning: a Survey of Case Studies is an excellent paper from the University of Cambridge that walks through the challenges of deploying ML models, from data management to model deployment.
https://arxiv.org/abs/2011.09926
AWS announced managed Apache Airflow as a service this week. I’m not entirely sure how I feel about it. AWS keeps packaging open-source solutions without contributing anything back, and at the same time there is not much innovation from AWS in this space. Redshift’s failure to innovate led to Snowflake’s successful IPO. Should AWS focus on innovating its own offerings rather than simply packaging open-source solutions?
https://aws.amazon.com/blogs/aws/introducing-amazon-managed-workflows-for-apache-airflow-mwaa/
Tableau writes about its migration from an on-prem database to the Snowflake cloud data warehouse and some of the key constraints it faced before the migration.
https://www.tableau.com/blog/2020/11/our-prem-cloud-database-migration-collaborative-effort
Grab writes an excellent blog about its experience building the fare calculator system. Its use of the CQRS architectural pattern brings up the ongoing argument over domain events vs. CDC for event sourcing, which is an exciting space to watch in data engineering.
https://engineering.grab.com/democratizing-fare-storage-at-scale-using-event-sourcing
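For readers new to event sourcing, here is a minimal sketch of the idea in Python. It is not Grab's implementation: the event types, field names, and fare rules are hypothetical and chosen only to show state being derived by replaying an append-only log of domain events.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FareEvent:
    """An immutable domain event (hypothetical schema)."""
    booking_id: str
    kind: str      # assumed types: "base_fare_set", "surcharge_added", "discount_applied"
    amount: float

def current_fare(events, booking_id):
    """Rebuild the current fare for one booking by replaying its events in order."""
    total = 0.0
    for event in events:
        if event.booking_id != booking_id:
            continue
        if event.kind == "base_fare_set":
            total = event.amount
        elif event.kind == "surcharge_added":
            total += event.amount
        elif event.kind == "discount_applied":
            total -= event.amount
    return total

# The write side only appends events; nothing is updated in place.
log = [
    FareEvent("b1", "base_fare_set", 10.0),
    FareEvent("b1", "surcharge_added", 2.5),
    FareEvent("b2", "base_fare_set", 7.0),
    FareEvent("b1", "discount_applied", 1.0),
]

fare_b1 = current_fare(log, "b1")  # 10.0 + 2.5 - 1.0
fare_b2 = current_fare(log, "b2")
```

The contrast with CDC is that a change-data-capture stream would typically surface only the row-level changes to a fares table, whereas domain events like the ones above also record why the fare changed.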
Operating data infrastructure requires significant engineering effort. Bukalapak, an Indonesian e-commerce company, wrote about its experience running an in-house data infrastructure, the operational complexity, and the migration to the cloud. I enjoyed reading the blog, which openly admits the engineering challenges and the pivot to a much simpler solution.
https://medium.com/bukalapak-data/data-platform-transformation-at-bukalapak-1085865a5c86
T3GO, a Chinese smart travel platform, writes about its experience using Apache Hudi, Alluxio, and Alibaba OSS. Apache Hudi is becoming a popular data lake solution, and combining it with Alluxio accelerates cloud workloads when accessing the data. The benchmark comparing HDFS with OSS and Alluxio looks impressive.
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.