Data Engineering Weekly #186

The Weekly Data Engineering Newsletter

Aug 26, 2024

Try Fully Managed Apache Airflow for FREE

Run Airflow without the hassle and management complexity. Take Astro (the fully managed Airflow solution) for a test drive today and unlock a suite of features designed to simplify, optimize, and scale your data pipelines. For a limited time, new sign-ups will receive a complimentary Airflow Fundamentals Certification exam (normally $150).


Conference Alert: Data Engineering for AI/ML

This is a virtual conference at the intersection of Data and AI. It is not a conference for the hype. It’s real users talking about real experiences.

You will not hear the word AGI. Instead, you will see engineers, product managers, and founders talking about the data engineering practices of AI/ML. It will be fun and educational.

- 40+ speakers

- 12th September 2024

- Three simultaneous virtual tracks

- Panels, Workshops, Lightning Talks, Keynotes, Fireside Chats, and Entertainment.

https://home.mlops.community/public/events/dataengforai

Mario Fischer: How Google Search ranking works

One of my educational reads this week is about how Google search ranking works. The author summarizes what has been learned from the leaked Google Search Engine documents and various public hearing documents from antitrust cases. Google’s search ranking system is a complex, multi-step process that begins with indexing new content, assigning it a unique DocID, and calculating its relevance based on keyword presence. The content then passes through various ranking systems like Mustang, Superroot, and NavBoost, which refine the results to the top 10 based on factors like content quality, user behavior, and link analysis.

https://searchengineland.com/how-google-search-ranking-works-445141

FourSquare: Modern Data Platform: An Unbundling of a Traditional Data Warehouse

When building their data platform, companies face a critical decision: adopt an all-in-one solution from vendors like Databricks, Snowflake, or AWS, or compose a custom platform using tools from different providers. The blog is a good overview of the various components in a typical data stack.

I think choosing between an all-in-one solution and best-of-breed tools will be a big decision in the industry in the coming years. Data Engineering Weekly is trying to address this issue with our buyer’s guide, starting with CDC.

https://location.foursquare.com/resources/blog/leadership/modern-data-platform-an-unbundling-of-a-traditional-data-warehouse/

RazorPay: Structuring the Analytics Team: Distributed vs. Centralized Approaches

Structuring the analytics organization to align with business goals is always challenging. Several companies have written about centralized vs. decentralized organization structures, and RazorPay makes solid recommendations on how to approach the problem.

https://engineering.razorpay.com/structuring-the-analytics-team-distributed-vs-centralized-approaches-ad91f4da9f89

Marc Olson: Continuous reinvention: A brief history of block storage at AWS

EBS has come a long way since its introduction, and we now prefer EBS over instance stores in many use cases. The author gives a fascinating account of the EBS journey, explains queueing theory with an amazing analogy, and shows the importance of comprehensive instrumentation in improving systems. This is a must-read this week.

https://allthingsdistributed.com/2024/08/continuous-reinvention-a-brief-history-of-block-storage-at-aws.html

Murat: Understanding the Performance Implications of Storage-Disaggregated Databases

The separation of storage and compute certainly brings a lot of flexibility in operating data stores. We have witnessed the rise of serverless, S3-dependent message queues and Postgres engines. The author gives an overview of the performance implications of disaggregated systems compared to traditional monolithic databases.

https://muratbuffalo.blogspot.com/2024/07/understanding-performance-implications.html

Colin Breck: Predicting the Future of Distributed Systems

Many systems—everything from relational databases, time-series databases, message queues, data warehouses, and services for application metrics—use object storage as a core part of their architecture.

I often wonder if we are building an infrastructure pyramid scheme on top of object storage. The author offers a different perspective with the one-way-door vs. two-way-door analogy, noting that object storage has been around for 20+ years. The prediction about innovation in the programming model is an interesting read.

https://blog.colinbreck.com/predicting-the-future-of-distributed-systems/

Max Meldrum: Introducing datafusion-uwheel, A Native DataFusion Optimizer for Time-based Analytics

µWheel is an event-driven aggregate management system for ingesting, indexing, and querying stream aggregates. The author writes about integrating µWheel with DataFusion (a query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format). The temporal aggregation and pruning with custom indices, which drastically reduce query execution times, are exciting, and I look forward to trying this out soon.
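
To get a feel for why pre-aggregated time indices can cut query time, here is a toy Python sketch. It is not the µWheel or DataFusion API; the class name, the one-hour bucket width, and the sample events are all assumptions made up for illustration. The idea it shows is only that a time-range aggregate can be answered from a handful of per-bucket sums instead of rescanning raw events.

```python
from collections import defaultdict

HOUR = 3600  # bucket width in seconds (an assumption for this sketch)


class HourlySumIndex:
    """Toy pre-aggregated index: one running sum per hour bucket (hypothetical)."""

    def __init__(self):
        self.buckets = defaultdict(float)

    def insert(self, ts: int, value: float) -> None:
        # Fold each event into its hour bucket at ingestion time.
        self.buckets[ts // HOUR] += value

    def range_sum(self, start: int, end: int) -> float:
        """Sum over [start, end), with both bounds aligned to hour boundaries."""
        if start % HOUR or end % HOUR:
            raise ValueError("this sketch only answers hour-aligned ranges")
        # Only the buckets inside the range are read; raw events are never rescanned.
        return sum(self.buckets.get(b, 0.0) for b in range(start // HOUR, end // HOUR))


index = HourlySumIndex()
for ts, value in [(10, 1.0), (3599, 2.0), (3600, 4.0), (7300, 8.0)]:
    index.insert(ts, value)

# Sum of the first two hours, answered from two bucket lookups.
total = index.range_sum(0, 2 * HOUR)
```

A real system would keep several granularities and handle unaligned ranges; this sketch refuses them to stay short.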

https://uwheel.rs/post/datafusion/

https://uwheel.rs/post/datafusion_uwheel/

Lyft: Protocol Buffer Design - Principles and Practices for Collaborative Development

The blog discusses the key design principles and best practices for collaborative development using protobuf. More than that, I included this blog to remind you of the importance of event governance and a structured eventing approach, which is critical for an organization to build a strong data foundation.

Treat events as first-class citizens, and remember that it is always the upstream that causes the failure.

https://eng.lyft.com/protocol-buffer-design-principles-and-practices-for-collaborative-development-8f5aa7e6ed85

Airbnb: Personal Data Classification

Airbnb writes about the criticality of data classification and the impact on user trust if we fail to classify data properly. The blog narrates the shift-left approach in data governance with three critical principles:

- Shift data classification from data to schema
- Shift classification from offline to online
- Shift from Data Steward to Data Owner
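
As a rough illustration of the first principle, classifying at the schema level could look like the Python sketch below. This is not Airbnb's system; the `classification` metadata key, its labels, and the `UserProfile` schema are hypothetical names chosen for the example. It shows that once tags live on the schema, personal fields can be listed, and untagged fields caught, without scanning any stored data.

```python
from dataclasses import dataclass, field, fields


@dataclass
class UserProfile:
    # The "classification" metadata key and its labels are hypothetical.
    user_id: int = field(metadata={"classification": "identifier"})
    email: str = field(metadata={"classification": "personal"})
    signup_country: str = field(metadata={"classification": "non_personal"})


def personal_fields(schema) -> list[str]:
    """Return field names tagged as personal, read from the schema alone."""
    return [f.name for f in fields(schema) if f.metadata.get("classification") == "personal"]


def unclassified_fields(schema) -> list[str]:
    """Return fields with no tag, so a check can fail before any data is written."""
    return [f.name for f in fields(schema) if "classification" not in f.metadata]


tagged = personal_fields(UserProfile)
missing = unclassified_fields(UserProfile)
```

A check like `unclassified_fields` could run when the schema changes, which is one way to read "offline to online", with the schema's owner supplying the tags.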

https://medium.com/airbnb-engineering/personal-data-classification-2d816d8ea516

Vu Trinh: I spent 8 hours learning Parquet. Here’s what I discovered

Apache Parquet has become the de facto columnar storage format for lakehouse table formats, and it is a must for data engineers to know its internals. The author did an amazing job of describing how Parquet stores data, along with its compression and metadata strategies.

https://medium.com/@vutrinh274/i-spent-8-hours-learning-parquet-heres-what-i-discovered-97add13fb28f

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
