Data Engineering Weekly #51

Weekly Data Engineering Newsletter

Welcome to the 51st edition of the data engineering newsletter. This week's release is a new set of articles that focus on Uber's operational excellence in the data quality experience, Airbnb's "Wall Framework" to prevent data bugs, Tiffany Jachja's first three weeks as a data engineering manager, Hurb.com's data platform architecture, RudderStack's churn prediction with BigQueryML, Disney Streaming's Voidbox-Docker on YARN, AWS's approach to expiring S3 objects based on last accessed date, and High Scalability's evolution of search engine architecture.

Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.


Uber: How Uber Achieves Operational Excellence in the Data Quality Experience

Poor data quality not only leads to degraded machine learning models but also requires a lot of laborious manual effort to investigate and backfill. Uber writes about its Unified Data Quality (UDQ) platform, which automatically detects data quality issues. The approach of generating test cases automatically from past learnings and metadata fields emphasizes the significant role of data lineage and metadata-driven workflows.

https://eng.uber.com/operational-excellence-data-quality/


Airbnb: How Airbnb Built “Wall” to prevent data bugs

On a data quality journey similar to Uber's, Airbnb writes about the Wall framework, its abstraction on top of Airflow that lets users add data quality checks as part of an Airflow DAG. The Wall framework takes a config-driven approach and provides the most common DQ checks and anomaly detection as a service.

https://medium.com/airbnb-engineering/how-airbnb-built-wall-to-prevent-data-bugs-ad1b081d6e8f
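
To make the "config-driven" idea concrete, here is a minimal sketch in plain Python. It is not Wall's actual config format or API; the check names (`null_check`, `row_count_check`), the config keys, and the `run_checks` helper are all hypothetical and only illustrate how checks can be declared as data rather than written as code in each DAG.

```python
from typing import Any, Callable

Row = dict[str, Any]


def null_check(rows: list[Row], column: str) -> bool:
    """Pass when no row has a missing value in the given column."""
    return all(row.get(column) is not None for row in rows)


def row_count_check(rows: list[Row], min_rows: int) -> bool:
    """Pass when the dataset has at least min_rows rows."""
    return len(rows) >= min_rows


# Hypothetical registry mapping a check type in the config to its implementation.
CHECKS: dict[str, Callable[..., bool]] = {
    "null_check": null_check,
    "row_count_check": row_count_check,
}


def run_checks(rows: list[Row], config: list[dict[str, Any]]) -> dict[str, bool]:
    """Run every check declared in the config and return pass/fail per check name."""
    results = {}
    for check in config:
        check_fn = CHECKS[check["type"]]
        results[check["name"]] = check_fn(rows, **check["params"])
    return results


# Made-up sample data and a made-up config, for illustration only.
rows = [{"id": 1, "price": 10.0}, {"id": 2, "price": None}]
config = [
    {"name": "id_not_null", "type": "null_check", "params": {"column": "id"}},
    {"name": "price_not_null", "type": "null_check", "params": {"column": "price"}},
    {"name": "at_least_two_rows", "type": "row_count_check", "params": {"min_rows": 2}},
]
results = run_checks(rows, config)
```

With this shape, adding a check to a table means adding one config entry, which is the main appeal of offering DQ checks as a service.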


Tiffany Jachja: My First Three Weeks as a Data Engineering Manager

The author shares lessons from the first three weeks as a data engineering manager. It is a good read for any new data engineering manager, covering topics from aligning the team on a joint mission to drawing a clear distinction between roles and responsibilities.

https://tiffanyjachja.medium.com/my-first-three-weeks-a-data-engineering-manager-8b0be08da7a5


Hurb.com: Data Platform Architecture at Hurb.com

Hurb.com, one of the major OTAs in Latin America, gives an overview of its data infrastructure. The article is a great reference architecture for Google Cloud Platform, built on Google Dataflow and BigQuery. The exciting part of the article is where the author discusses the choice of data visualization engine: how per-user billing prevented them from democratizing the data, and how choosing Metabase addressed the issue.

https://medium.com/hurb-engineering/data-platform-architecture-at-hurb-com-8c472c051fa2


Sponsored: RudderStack - Churn Prediction With BigQueryML to Increase Mobile Game Revenue

Here’s an interesting case study on how machine learning can directly impact the bottom line. RudderStack outlines how the app developer Torpedo Labs uses BigQuery ML to identify high-value mobile game players who are dangerously close to churning.

https://rudderstack.com/blog/churn-prediction-with-bigqueryml


Disney Streaming: Voidbox-Docker on YARN

Disney Streaming writes about Voidbox, which enables any application encapsulated in a Docker image to run on a YARN cluster alongside MapReduce and Spark. Voidbox's support for executing Docker container-based DAG (Directed Acyclic Graph) tasks is an exciting approach, since it can encapsulate each step of the data pipeline as a Docker run.

https://medium.com/disney-streaming/voidbox-docker-on-yarn-e1b9f3a789ec
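
As a rough illustration of "each step of the pipeline as a Docker run", the sketch below orders a small DAG of steps and builds one `docker run` command per step. It does not reflect Voidbox's API or how it schedules containers on YARN; the step names, image names, and the `plan` helper are hypothetical, and the commands are only built, never executed.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each step names a Docker image and the steps it depends on.
steps = {
    "extract": {"image": "example/extract:latest", "depends_on": []},
    "transform": {"image": "example/transform:latest", "depends_on": ["extract"]},
    "load": {"image": "example/load:latest", "depends_on": ["transform"]},
}


def plan(steps: dict) -> list[list[str]]:
    """Return one `docker run` command per step, with dependencies ordered first."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {name: set(spec["depends_on"]) for name, spec in steps.items()}
    order = TopologicalSorter(graph).static_order()
    return [["docker", "run", "--rm", steps[name]["image"]] for name in order]


commands = plan(steps)
for command in commands:
    print(" ".join(command))
```

A real scheduler would also run independent steps in parallel and handle failures; the sketch only shows the dependency ordering.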


AWS: Expiring Amazon S3 Objects Based on Last Accessed Date to Decrease Costs

S3 is a widely used system for building data lakes, websites, mobile applications, and enterprise applications. S3 tiered storage can bring down the storage cost, but not without a performance hit when accessing the tiered storage. AWS writes a reference architecture to delete S3 objects based on their last accessed date, using S3 server access logs and S3 Inventory.

https://aws.amazon.com/blogs/architecture/expiring-amazon-s3-objects-based-on-last-accessed-date-to-decrease-costs/
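
The selection logic behind this idea can be sketched in plain Python. This is not the AWS reference architecture itself, which relies on AWS services; the data shapes and the `keys_to_expire` function are assumptions made for illustration, with the inventory standing in for S3 Inventory and the access log standing in for parsed server access logs. One further assumption: objects that were never read fall back to their last-modified date.

```python
from datetime import datetime, timedelta, timezone


def keys_to_expire(inventory, access_log, now, max_idle_days):
    """Return keys whose most recent access (or modification) is older than the cutoff.

    inventory:  dict of object key -> last-modified datetime (hypothetical shape)
    access_log: iterable of (object key, access datetime) pairs (hypothetical shape)
    """
    cutoff = now - timedelta(days=max_idle_days)

    # Keep only the most recent access per key.
    last_access = {}
    for key, accessed_at in access_log:
        if key not in last_access or accessed_at > last_access[key]:
            last_access[key] = accessed_at

    expired = []
    for key, last_modified in inventory.items():
        # Objects with no logged access fall back to their last-modified date.
        last_used = max(last_access.get(key, last_modified), last_modified)
        if last_used < cutoff:
            expired.append(key)
    return sorted(expired)


# Made-up sample data for illustration.
utc = timezone.utc
now = datetime(2021, 8, 9, tzinfo=utc)
inventory = {
    "a.csv": datetime(2021, 1, 1, tzinfo=utc),
    "b.csv": datetime(2021, 1, 1, tzinfo=utc),
    "c.csv": datetime(2021, 8, 1, tzinfo=utc),
}
access_log = [
    ("a.csv", datetime(2021, 8, 5, tzinfo=utc)),
    ("b.csv", datetime(2021, 2, 1, tzinfo=utc)),
]
expired = keys_to_expire(inventory, access_log, now, max_idle_days=90)
print(expired)
```

Only `b.csv` is selected here: `a.csv` was read recently and `c.csv` was written recently, so both stay.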


HighScalability: Evolution Of Search Engines Architecture - Algolia New Search Architecture Part 1

Search engines play a vital role in information retrieval, which is a critical function of data engineering. The article evaluates some of the critical milestones in search engine architecture and the challenges those architectural styles face today.

http://highscalability.com/blog/2021/8/2/evolution-of-search-engines-architecture-algolia-new-search.html


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.