Data Engineering Weekly #44

Weekly Data Engineering Newsletter

Welcome to the 44th edition of the data engineering newsletter. This week's edition covers Chris Riccomini's What the Heck is a Data Mesh, Atlan's the rise of the metadata lake, Google's data cascades in AI, Uber's evolution of its data science workbench, Benchling's version-controlled data aquarium, Shopify's deleting the undeletable, Expedia's self-service business intelligence, eBay's optimizing analytics data processing, Capital One's end-to-end ML models, Yelp's modernizing business data indexing, and Ashley Melanson's how dbt can transform your data pipeline.

Chris Riccomini: What the Heck is a Data Mesh?!

Data mesh is a widely discussed data engineering principle, and there have been many exciting discussions around it. The concept of "data as a product" is compelling, and many internet companies have adopted it with great success. The debate on how the data mesh principle encapsulates data as a product & decentralized ownership is an exciting space to watch. The author shares some insightful views on data mesh principles.

A couple of interesting Twitter threads on the article are also very educational on data mesh principles.

Atlan: The Rise of the Metadata Lake

Modern business operations increasingly depend on data to drive their business. As data takes the central role in business operations, the stakeholders interacting with the data are more diverse than ever. In this increasingly diverse data world, metadata holds the key to the elusive promised land. Is it time to think about a metadata lake? The blog narrates the role of the metadata lake in the modern data stack.

Google AI: Data Cascades in Machine Learning

Data is a foundational aspect of machine learning (ML) that can impact ML systems' performance, fairness, robustness, and scalability. Paradoxically, while building ML models is often highly prioritized, the work related to data is often the least prioritized aspect. The blog summarizes the recent ACM paper "Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI, and discusses how to address data cascades.

Uber: The Evolution of Data Science Workbench

Uber writes about the evolution of its data science workbench, narrating the efficient scheduling, easier Apache Spark integration with the workspace, and package dependency management. The three key learnings in the blog are an educational read:

  1. Build for the experts, design for the less technical users

  2. Don’t stop at building what’s known; empower people to look for the unknown

  3. Create communities with both data scientists and non-data scientists

Benchling: Building a version-controlled Data Aquarium

Benchling writes about the evolution of its data infrastructure from a legacy warehouse to a continuous data pipeline tuned to increase analyst velocity. The discussion around the challenges of implementing continuous integration, how data infrastructure differs from a traditional web application, and how Snowflake's zero-copy cloning helped achieve continuous data integration is an exciting read.

Shopify: Deleting the Undeletable

Deleting data is one of the most complicated problems in data engineering. Often the data infrastructure lacks a dependency graph and field-level context on PII data. Shopify writes an exciting blog highlighting the challenges of PII data and talks about its schematization system and its obfuscation & enrichment strategy.
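To make "field-level context on PII" concrete, here is a minimal sketch, not Shopify's actual system. The schema, the field names, and the obfuscate helper are all hypothetical; the idea is only that once each field is tagged as PII or not, obfuscation can be driven by the schema rather than by ad hoc knowledge.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    pii: bool = False  # field-level annotation: does this column hold personal data?


# Hypothetical schema for an illustrative "orders" record.
ORDERS_SCHEMA = [
    Field("order_id"),
    Field("email", pii=True),
    Field("shipping_address", pii=True),
    Field("total_cents"),
]


def obfuscate(record: dict, schema: list[Field], key: bytes) -> dict:
    """Return a copy of record with every PII-tagged field replaced by a keyed hash."""
    out = dict(record)
    for field in schema:
        if field.pii and out.get(field.name) is not None:
            # HMAC-SHA256 gives a stable token per (key, value) pair.
            digest = hmac.new(key, str(out[field.name]).encode("utf-8"), hashlib.sha256)
            out[field.name] = digest.hexdigest()
    return out
```

Because the hash is keyed, the same input maps to the same token under the same key, so joins on an obfuscated column still line up while the raw value never leaves the source system.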

Expedia: Powering Self-Service Business Intelligence across Expedia Group

The lakehouse design strikes a delicate balance between complicated data warehouses and inconsistent data lake systems. Expedia writes about its adoption of the lakehouse, the extension of the lakehouse to a domain-specific DataLakeMart, and OLAP systems.

eBay: Optimizing Analytics Data Processing on eBay’s New Open-Source-Based Platform

Tuning a data pipeline requires a layered approach to achieve SLA timelines. eBay writes about the various layers to consider while tuning Spark jobs: the system level, the process, table optimization, SQL optimization & Apache Spark job config parameter tuning. The structured debugging approach is a delight to read, and this is one area where data infrastructure needs a lot of attention, moving from manual tuning to automated pipeline tuning.

Capital One: End-to-End Models for Complex AI Tasks

The main advantage of machine learning over traditional software engineering is that it allows one to build a component that performs a task by training a model from data, which removes the need for a human to precisely specify how to perform the task. Why can't we adopt ML end to end rather than for only part of the task? Capital One writes about the pros & cons of adopting an end-to-end ML model and the challenges ahead to reach the promised land of the end-to-end ML model.

Yelp: Modernizing Business Data Indexing

Serving computed metrics to the end user with acceptable latency is critical for an enriched user experience. Yelp writes about its journey from a business data indexing system that queried MySQL tables to a stream-based CDC system that leverages Kafka, Flink, Apache Beam & Cassandra.

Ashley Melanson: Open Source Spotlight - How Dbt Can Transform Your Data Analytics Pipeline

I would be surprised if you have not heard of or played around with dbt by now. If you have not, the author has done a great write-up breaking down the components of dbt.

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.