Data Engineering Weekly #117

The Weekly Data Engineering Newsletter

Feb 06, 2023

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Google: Google Research, 2022 & beyond: ML & computer systems

Google continues its series on advances in AI, and this week’s publication covers distributed systems for ML and hardware acceleration. The section on ML for large-scale production systems highlights a change to the YouTube cache replacement algorithm: a new hybrid algorithm that combines a simple heuristic with a learned model replaces the existing heuristic, improving the byte miss ratio at peak by ~9%.

https://ai.googleblog.com/2023/02/google-research-2022-beyond-ml-computer.html

Julia Evans: Examples of floating point problems & Examples of problems with integers

For a data engineer, understanding data types is as important as understanding data models and structures. These two posts are an excellent overview of the problems with floating-point and integer data types.

https://jvns.ca/blog/2023/01/13/examples-of-floating-point-problems/

https://jvns.ca/blog/2023/01/18/examples-of-problems-with-integers/
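
To make the class of problems concrete, here is a minimal Python sketch of three well-known pitfalls. The examples are my own illustration rather than ones lifted from the posts, and `wrap_int32` is a hypothetical helper that simulates a fixed-width integer, since Python integers do not overflow on their own.

```python
import math
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly,
# so the sum is not bit-for-bit equal to 0.3.
float_sum = 0.1 + 0.2
naive_equal = float_sum == 0.3
tolerant_equal = math.isclose(float_sum, 0.3)

# Decimal keeps exact base-10 values when built from strings.
decimal_equal = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")

# A 64-bit float has a 53-bit significand, so not every integer
# above 2**53 survives a round trip through float.
big_id = 2**53 + 1
id_survives_float = int(float(big_id)) == big_id


def wrap_int32(value: int) -> int:
    """Simulate two's-complement wraparound of a signed 32-bit integer."""
    return (value + 2**31) % 2**32 - 2**31


# Adding 1 to the largest signed 32-bit value wraps to the smallest.
overflowed = wrap_int32((2**31 - 1) + 1)

print(naive_equal, tolerant_equal, decimal_equal, id_survives_float, overflowed)
```

The practical takeaway for pipelines: compare floats with a tolerance, use a decimal type for money, and avoid passing large integer IDs through float columns or JSON numbers.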

Spotify: Unleashing ML Innovation at Spotify with Ray

Spotify writes about democratizing its ML infrastructure with the Ray platform. The blog walks through how, with a single CLI command, users can create their own Ray cluster with preinstalled ML tools, ready-to-run notebook tutorials, a VS Code server for in-browser editing, and SSH access.

https://engineering.atspotify.com/2023/02/unleashing-ml-innovation-at-spotify-with-ray/

Lyft: Powering Millions of Real-Time Decisions with LyftLearn Serving

The blog rightly points out that making real-time inferences with machine learning (ML) models at scale is complex. It narrates the challenges of serving real-time decisions while keeping service ownership clear, and how the LyftLearn Serving design solves these problems efficiently.

https://eng.lyft.com/powering-millions-of-real-time-decisions-with-lyftlearn-serving-9bb1f73318dc

Adrian Cockcroft: Percentiles don’t work: Analyzing the distribution of response times for web services

System performance significantly impacts business revenue, especially in e-commerce; one widely cited figure puts the loss at 1% of sales for every 100 milliseconds of added latency. Applying data engineering to system performance is one of my favorite topics, so I enjoyed reading this article. The blog highlights the challenges of measuring latency with averages and percentiles and discusses the alternatives.

https://adrianco.medium.com/percentiles-dont-work-analyzing-the-distribution-of-response-times-for-web-services-ace36a6a2a19
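
One general reason percentiles are awkward to work with is that they cannot be averaged across hosts or time windows, whereas histograms can be merged. The sketch below is my own illustration of that point, not code from the article; the latency numbers are made up and the nearest-rank definition of a percentile is an assumption.

```python
from collections import Counter


def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% * n)."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def percentile_from_histogram(hist, pct):
    """Same nearest-rank percentile, computed from value -> count buckets."""
    total = sum(hist.values())
    rank = max(1, (pct * total + 99) // 100)
    seen = 0
    for value in sorted(hist):
        seen += hist[value]
        if seen >= rank:
            return value


# Hypothetical latencies in milliseconds.
# host_a serves most of the traffic quickly; host_b serves little, slowly.
host_a = [10] * 990 + [50] * 10
host_b = [200] * 90 + [2000] * 10

# Wrong: averaging per-host p99 values ignores how much traffic each host saw.
averaged_p99 = (percentile(host_a, 99) + percentile(host_b, 99)) / 2

# Right: compute the percentile over all requests together.
true_p99 = percentile(host_a + host_b, 99)

# Histograms merge by adding counts, so they give the same answer
# without shipping raw samples around.
merged_hist = Counter(host_a) + Counter(host_b)
merged_p99 = percentile_from_histogram(merged_hist, 99)

print(averaged_p99, true_p99, merged_p99)
```

With these numbers the averaged figure is about five times the real p99, which is why storing mergeable distributions tends to be safer than storing precomputed percentiles.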

Xavier Amatriain: Blueprints for recommender system architectures: 10th-anniversary edition

The blog is an excellent recap of RecSys architecture, from the days of Netflix’s three-tier architecture to the latest developments. It covers four blueprints:

Netflix three-tier architecture
Eugene Yan’s 2 x 2 blueprint
Nvidia’s 4 stage blueprint
Fennel.ai’s 8 stage blueprint

https://amatriain.net/blog/RecsysArchitectures

Timescale: Best Practices for Time-Series Data Modeling: Narrow, Medium or Wide Table Layout

One Big Table (OBT) versus other schema modeling techniques is a hot topic in data engineering that triggers some interesting conversations. The blog walks through the techniques for modeling time-series data and the pros and cons of each.

https://www.timescale.com/blog/best-practices-for-time-series-data-modeling-narrow-medium-or-wide-table-layout-2/
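
As a rough illustration of the two ends of that spectrum, the sketch below pivots a narrow layout (one row per reading) into a wide layout (one column per metric) in plain Python. The column names, device IDs, and readings are hypothetical, and `narrow_to_wide` is a helper invented for this example.

```python
from collections import defaultdict

# Narrow layout: one row per (timestamp, device, metric) reading.
narrow_rows = [
    ("2023-02-06T00:00:00Z", "device-1", "temperature", 21.5),
    ("2023-02-06T00:00:00Z", "device-1", "humidity", 40.0),
    ("2023-02-06T00:01:00Z", "device-1", "temperature", 21.7),
]


def narrow_to_wide(rows, metrics):
    """Pivot narrow rows into one wide row per (timestamp, device)."""
    # Every wide row starts with all metric columns set to None (NULL).
    wide = defaultdict(lambda: dict.fromkeys(metrics))
    for ts, device, metric, value in rows:
        wide[(ts, device)][metric] = value
    return [
        {"ts": ts, "device": device, **values}
        for (ts, device), values in sorted(wide.items())
    ]


wide_rows = narrow_to_wide(narrow_rows, ["temperature", "humidity"])
for row in wide_rows:
    print(row)
```

The second wide row carries a `None` for humidity, which shows the trade-off in miniature: the narrow layout accepts new metrics without a schema change, while the wide layout is easier to query but needs a column, and often NULLs, for every metric.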

Swiggy: Building a mind reader at Swiggy using Data Science

The title looks scary 😱 Fear not; the blog narrates a recommender engine design challenge for a food delivery service. It highlights Swiggy’s system design for recommending food orders, along with the system’s limitations and design constraints.

https://bytes.swiggy.com/building-a-mind-reader-at-swiggy-using-data-science-5a5c38aa6c17

Super: dbt at Super; Orchestration, Continuous Integration & Observability

After running 500+ models, 2500+ tests, and 200+ sources in its dbt project, Super shares its dbt infrastructure. The blog discusses the proactive & reactive orchestration mechanism, its continuous integration system with lineage and test coverage, and observability on cost per model.

https://medium.com/super/dbt-at-snapcommerce-part-1-orchestration-964c9a87b072

https://medium.com/super/dbt-at-snapcommerce-part-2-continuous-integration-260d4e782eba

https://medium.com/super/dbt-at-super-part-3-observability-c8755109901f

Dremio: The Write-Audit-Publish Pattern via Apache Iceberg

Out-of-the-box support for auditing a data model before promoting it to production is critical to fulfilling a data contract. Iceberg has written about its Write-Audit-Publish support in the past; it is exciting to see that Apache Hudi has supported the pattern since 0.9.0. I can’t find any reference showing that Delta Lake or Snowflake supports this pattern. Please share in the comments if you find one.

https://www.dremio.com/resources/webinars/the-write-audit-publish-pattern-via-apache-iceberg/

https://hudi.apache.org/releases/release-0.9.0/#writer-side-improvements
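
For readers new to the pattern, here is a minimal, engine-agnostic sketch of the write, audit, and publish steps. It deliberately does not use the Iceberg or Hudi APIs: `ToyTable`, `write_audit_publish`, the check functions, and the `order_id` column are all hypothetical names for an in-memory stand-in.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Check = Callable[[List[Row]], bool]


class ToyTable:
    """Hypothetical in-memory table: readers only ever see `published`."""

    def __init__(self) -> None:
        self.published: List[Row] = []
        self.staged: Dict[str, List[Row]] = {}

    def write(self, batch_id: str, rows: List[Row]) -> None:
        # Write: land the new rows where readers cannot see them yet.
        self.staged[batch_id] = list(rows)

    def audit(self, batch_id: str, checks: List[Check]) -> bool:
        # Audit: run every data quality check against the staged rows.
        return all(check(self.staged[batch_id]) for check in checks)

    def publish(self, batch_id: str) -> None:
        # Publish: swap in a new list so readers see all rows or none.
        self.published = self.published + self.staged.pop(batch_id)


def write_audit_publish(
    table: ToyTable, batch_id: str, rows: List[Row], checks: List[Check]
) -> bool:
    table.write(batch_id, rows)
    if not table.audit(batch_id, checks):
        return False  # staged rows stay behind for inspection
    table.publish(batch_id)
    return True


def not_empty(rows: List[Row]) -> bool:
    return len(rows) > 0


def ids_present(rows: List[Row]) -> bool:
    return all(row.get("order_id") is not None for row in rows)


checks = [not_empty, ids_present]
table = ToyTable()
good = write_audit_publish(table, "batch-1", [{"order_id": 1}], checks)
bad = write_audit_publish(table, "batch-2", [{"order_id": None}], checks)
print(good, bad, table.published)
```

The failed batch never reaches `published`, which is the guarantee a data contract needs: consumers only see data that has passed the audit.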

All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
