Data Engineering Weekly #117
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Google: Google Research, 2022 & beyond: ML & computer systems
Google continues to write about its advances in AI, and this week’s publication covers distributed systems for ML and hardware acceleration. The section on ML for large-scale production systems highlights an improvement to the YouTube cache replacement algorithm: a new hybrid algorithm that combines a simple heuristic with a learned model replaces the existing heuristic, improving the byte miss ratio at peak by ~9%.
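To make the byte miss ratio metric concrete, here is a minimal sketch that replays a request trace through a plain LRU cache and reports missed bytes divided by requested bytes. LRU is only a stand-in heuristic chosen for illustration; it is not the YouTube algorithm or the hybrid learned policy, and the function name, trace, and capacities are assumptions.

```python
from collections import OrderedDict


def byte_miss_ratio(requests, capacity_bytes):
    """Replay (key, size) requests through an LRU cache.

    Returns missed bytes / requested bytes. Assumes a key always has the
    same size. LRU is an illustrative stand-in, not the policy from the post.
    """
    cache = OrderedDict()  # key -> size, least recently used first
    used = 0
    missed = 0
    total = 0
    for key, size in requests:
        total += size
        if key in cache:
            cache.move_to_end(key)  # hit: mark as most recently used
            continue
        missed += size
        if size > capacity_bytes:
            continue  # object can never fit, so do not cache it
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)  # evict the LRU entry
            used -= evicted_size
        cache[key] = size
        used += size
    return missed / total if total else 0.0


# Hypothetical trace: three 100-byte objects requested five times in total.
trace = [("a", 100), ("b", 100), ("a", 100), ("c", 100), ("b", 100)]
small_cache_ratio = byte_miss_ratio(trace, capacity_bytes=200)
large_cache_ratio = byte_miss_ratio(trace, capacity_bytes=300)
```

Comparing two replacement policies on the same trace with a function like this is one way to read a claim such as "~9% better byte miss ratio".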
Julia Evans: Examples of floating point problems & Examples of problems with integers
As a data engineer, understanding data types is as important as understanding data models and structures. The two posts give an excellent overview of the problems you can hit with floating-point and integer data types.
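A few of the classic pitfalls can be reproduced in a handful of lines. The sketch below is not taken from the posts; the variable names and the `fits_int32` helper are illustrative assumptions.

```python
import math
import struct
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly,
# so the sum is not exactly equal to 0.3.
float_sum = 0.1 + 0.2
naive_equal = float_sum == 0.3                 # exact comparison fails
tolerant_equal = math.isclose(float_sum, 0.3)  # comparison with a tolerance

# Decimal stores exact base-10 values, which is why it suits money.
decimal_equal = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")

# Above 2**53, a 64-bit float can no longer represent every integer,
# so large IDs silently change when they pass through a float column.
big_id = 2**53 + 1
id_survives_float = int(float(big_id)) == big_id


def fits_int32(value: int) -> bool:
    """Hypothetical helper: check whether a value fits a signed 32-bit column."""
    try:
        struct.pack("<i", value)
        return True
    except struct.error:
        return False
```

Python integers are arbitrary precision, so a check like `fits_int32` is useful mainly before writing into fixed-width columns in a warehouse or file format.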
Spotify: Unleashing ML Innovation at Spotify with Ray
Spotify writes about its ML infrastructure and how the Ray platform democratizes it. The blog walks through how, with a single CLI command, users can create their own Ray cluster with preinstalled ML tools, ready-to-run notebook tutorials, a VS Code server for in-browser editing, and SSH access.
Sponsored: Upsolver - Write a SQL Query, Get a Data-in-Motion Pipeline!
Pipelines for data in motion can quickly turn into DAG hell. Upsolver SQLake lets you process fast-moving data by simply writing a SQL query.
Streaming plus batch unified in a single platform.
Stateful processing at scale - joins, aggregations, upserts
Orchestration auto-generated from the data and SQL
Templates with sample data for Kafka/Kinesis/S3 sources -> S3/Athena/Snowflake/Redshift
Lyft: Powering Millions of Real-Time Decisions with LyftLearn Serving
The blog rightly points out that making real-time inferences with machine learning (ML) models at scale is complex. It narrates the challenges of real-time decision serving, including clear ownership of services, and how the LyftLearn Serving design solves these problems efficiently.
Adrian Cockcroft: Percentiles don’t work: Analyzing the distribution of response times for web services
System performance significantly impacts business revenue, especially in e-commerce; one widely cited figure is a 1% loss in sales for every 100 milliseconds of latency. Applying data engineering to system performance is one of my favorite topics, so I enjoyed reading this article. The blog highlights the challenges of measuring latency with averages and percentiles and discusses the alternatives.
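Two of the measurement traps are easy to demonstrate with made-up numbers: the mean of a skewed latency distribution describes no real request, and averaging per-host percentiles does not give the fleet-wide percentile. The sketch below is an illustration under those assumptions, not the analysis from the article; the host data and the nearest-rank `percentile` helper are hypothetical.

```python
import statistics


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceiling division
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds for two hosts.
host_a = [10] * 100               # every request is fast
host_b = [10] * 90 + [1000] * 10  # 10% of requests are very slow
combined = host_a + host_b

mean_latency = statistics.mean(combined)  # 59.5 ms, a value no request had
median_latency = percentile(combined, 50)

# Averaging per-host p99 values understates the fleet-wide p99.
averaged_p99 = (percentile(host_a, 99) + percentile(host_b, 99)) / 2
true_p99 = percentile(combined, 99)
```

With this data the averaged p99 is 505 ms while the p99 over all requests is 1000 ms, which is why percentiles should be computed from the combined raw data or from mergeable histograms rather than averaged.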
Sponsored: [New Report] Data Engineering Trends and Predictions Report
There’s certainly more to building a good data engineering strategy in 2023 than 2022’s biggest buzzword. In this report, check out 9 key technologies, cultural shifts, and processes primed to define the new year.
Xavier Amatriain: Blueprints for recommender system architectures: 10th-anniversary edition
The blog is an excellent recap of RecSys architecture, from the days of Netflix’s three-tier architecture to the latest developments. It covers four types of architecture:
Netflix three-tier architecture
Eugene Yan’s 2 x 2 blueprint
Nvidia’s 4-stage blueprint
Fennel.ai’s 8-stage blueprint
Timescale: Best Practices for Time-Series Data Modeling: Narrow, Medium or Wide Table Layout
One Big Table (OBT) vs. other schema modeling techniques is a hot topic in data engineering that triggers some interesting conversations. The blog narrates the various techniques for modeling time-series data and their pros and cons.
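As a rough illustration of the two ends of that spectrum, the sketch below pivots a narrow layout (one metric value per row) into a wide layout (one column per metric). The column names, sample readings, and the `narrow_to_wide` helper are assumptions made for this example, not code from the Timescale post.

```python
from collections import defaultdict

# Narrow layout: one row per (timestamp, device, metric) with a single value.
narrow_rows = [
    ("2023-02-01T00:00:00Z", "sensor-1", "temperature", 21.5),
    ("2023-02-01T00:00:00Z", "sensor-1", "humidity", 40.0),
    ("2023-02-01T00:01:00Z", "sensor-1", "temperature", 21.7),
]


def narrow_to_wide(rows, metrics):
    """Pivot narrow rows into one row per (timestamp, device)."""
    wide = defaultdict(lambda: dict.fromkeys(metrics))  # missing metrics stay None
    for ts, device, metric, value in rows:
        wide[(ts, device)][metric] = value
    return [
        {"ts": ts, "device": device, **values}
        for (ts, device), values in sorted(wide.items())
    ]


wide_rows = narrow_to_wide(narrow_rows, ["temperature", "humidity"])
```

The second wide row has no humidity reading, so that column is `None`. Sparse columns like this are one cost of the wide layout, while the narrow layout accepts new metrics without a schema change.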
Sponsored: Fireside Chat: The Future of CDPs
Join this live session with BARK CTO, Nari Sitaraman, & RudderStack Founder, Soumyadeb Mitra, on 2/15 at 9 AM PT to make sense of the CDP evolution and get practical advice on how to drive competitive advantage as a data leader in 2023.
Swiggy: Building a mind reader at Swiggy using Data Science
The title looks scary 😱 Fear not; the blog narrates a recommender engine design challenge for a food delivery service. The blog highlights Swiggy’s recommender engine system design to recommend food orders, the limitations, and the design constraints associated with the system design.
Super: dbt at Super; Orchestration, Continuous Integration & Observability
With 500+ models, 2500+ tests, and 200+ sources in its dbt project, Super shares its dbt infrastructure. The blog discusses the proactive and reactive orchestration mechanisms, the continuous integration system with lineage and test coverage, and observability on cost per model.
Dremio: The Write-Audit-Publish Pattern via Apache Iceberg
Out-of-the-box support for auditing before promoting a data model to a product is critical to fulfilling a data contract. Iceberg has written about its Write-Audit-Publish support in the past; it is exciting to see that Apache Hudi supports the pattern from 0.9.0. I can’t find any reference showing that Delta Lake or Snowflake supports this pattern. Please share in the comments if you find one.
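The shape of the Write-Audit-Publish pattern can be sketched without any table format at all: write new rows to a staging area, run audits against the staged rows, and publish only if every audit passes. The in-memory example below is a hypothetical illustration; it does not use Iceberg or Hudi APIs, and the function names and audit checks are assumptions.

```python
class AuditFailed(Exception):
    """Raised when a staged batch fails an audit and must not be published."""


def write_audit_publish(published, new_rows, audits):
    """Stage new_rows, run every audit, and append to published only on success."""
    staged = list(new_rows)      # write: rows land in staging only
    for audit in audits:         # audit: every check must pass
        if not audit(staged):
            raise AuditFailed(audit.__name__)
    published.extend(staged)     # publish: consumers see the rows only now
    return published


def no_null_ids(rows):
    return all(row.get("id") is not None for row in rows)


def non_negative_amounts(rows):
    return all(row["amount"] >= 0 for row in rows)


audits = [no_null_ids, non_negative_amounts]
published = [{"id": 1, "amount": 10}]

# A clean batch is published.
write_audit_publish(published, [{"id": 2, "amount": 5}], audits)

# A bad batch is rejected and the published data stays untouched.
try:
    write_audit_publish(published, [{"id": None, "amount": 7}], audits)
    rejected = False
except AuditFailed:
    rejected = True
```

In a real lakehouse the staging and publish steps would be handled by the table format's own mechanisms, so that the switch from staged to published is atomic for readers.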
All rights reserved ProtoGrowth Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.