Data Engineering Weekly #44

Weekly Data Engineering Newsletter

Jun 14, 2021

Welcome to the 44th edition of the data engineering newsletter. This week's edition covers Chris Riccomini's What the Heck is a Data Mesh, Atlan's the rise of the metadata lake, Google's data cascades in AI, Uber's evolution of its data science workbench, Benchling's version-controlled data aquarium, Shopify's deleting the undeletable, Expedia's self-service business intelligence, eBay's optimizing the analytics processing pipeline, Capital One's end-to-end ML models, Yelp's modernizing business data indexing, and Ashley Melanson's how dbt can transform your data pipeline.

Chris Riccomini: What the Heck is a Data Mesh?!

Data mesh is a widely discussed data engineering principle, and there have been many exciting discussions around it. The concept of "data as a product" is compelling, and many internet companies have adopted it with great success. The debate on how the data mesh principle encapsulates data as a product & decentralized ownership is an exciting space to watch. The author shares some insightful views on data mesh principles.

https://cnr.sh/essays/what-the-heck-data-mesh

A couple of interesting Twitter threads on the article are very educational on data mesh principles.

Josh Wills @josh_wills

This was an extremely good read on data meshes by @criccomini: cnr.sh/essays/what-th… (I love the vision of data mesh, but I don't see the incentives for non-ML/non-data developers to build and own data products.)

Erik Bernhardsson @fulhack

Tell me why I'm wrong but "data mesh" seems like another weird data fad to me... enforcing proper ownership of data makes sense, but I think the rest will be used to rationalize data silos and heterogenous data systems – terrible if you want to get value out of data

Atlan: The Rise of the Metadata Lake

Modern business operations increasingly depend on data to drive their business. As data takes a central role in business operations, the stakeholders interacting with the data are more diverse than ever. In this increasingly diverse data world, metadata holds the key to the elusive promised land. Is it time to think about a metadata lake? The blog narrates the role of the metadata lake in the modern data stack.

https://towardsdatascience.com/the-rise-of-the-metadata-lake-1e95127594de

Google AI: Data Cascades in Machine Learning

Data is a foundational aspect of machine learning (ML) that can impact ML systems' performance, fairness, robustness, and scalability. Paradoxically, while building ML models is often highly prioritized, the work related to data is often the least prioritized aspect. The blog summarizes the recent ACM paper "Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI, and discusses how to address the data cascading effects.

https://ai.googleblog.com/2021/06/data-cascades-in-machine-learning.html

Uber: The Evolution of Data Science Workbench

Uber writes about the evolution of its data science workbench, narrating the efficient scheduling, easier Apache Spark integration with the workspace, and package dependency management. The three key learnings in the blog are an educational read:

Build for the experts, design for the less technical users
Don’t stop at building what’s known; empower people to look for the unknown
Create communities with both data scientists and non-data scientists

https://eng.uber.com/evolution-ds-workbench/

Benchling: Building a version-controlled Data Aquarium

Benchling writes about the evolution of its data infrastructure from a legacy warehouse to a continuous data pipeline tuned to increase analyst velocity. The discussion around the challenges of implementing continuous integration, how data infrastructure differs from a traditional web application, and how Snowflake's zero-copy cloning helped achieve continuous data integration is an exciting read.

https://benchling.engineering/building-a-version-controlled-data-aquarium-976d17fbdd20

Shopify: Deleting the Undeletable

Deleting data is one of the most complicated problems in data engineering. Often the data infrastructure lacks a dependency graph and field-level context on PII data. Shopify writes an exciting blog highlighting the challenges of PII data and talks about its schematization system and its obfuscation & enrichment strategy.

https://shopifyengineering.myshopify.com/blogs/engineering/managing-pii-shopify-scale
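
To make the idea of field-level PII context concrete, here is a minimal sketch. It is not Shopify's implementation: the `SCHEMA` layout, the `obfuscate` helper, and the choice of a keyed hash (HMAC-SHA256) as the obfuscation step are all assumptions made for illustration.

```python
import hashlib
import hmac

# Hypothetical schema: each field carries a flag saying whether it holds PII.
SCHEMA = {
    "order_id": {"pii": False},
    "email": {"pii": True},
    "shipping_address": {"pii": True},
    "total": {"pii": False},
}


def obfuscate(record, schema, key):
    """Return a copy of `record` with PII fields replaced by keyed-hash tokens."""
    out = {}
    for field, value in record.items():
        # Fields missing from the schema are treated as PII, to fail safe.
        is_pii = schema.get(field, {"pii": True})["pii"]
        if is_pii:
            digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
            out[field] = digest.hexdigest()
        else:
            out[field] = value
    return out


record = {"order_id": 1, "email": "a@example.com", "total": 42.5}
masked = obfuscate(record, SCHEMA, key=b"demo-key")
```

With the PII flag attached to the schema, downstream jobs never need to guess which columns are sensitive. The same input and key always produce the same token, so joins on the obfuscated field still work in this sketch.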

Expedia: Powering Self-Service Business Intelligence across Expedia Group

The lakehouse design provides a delicate balance between complicated data warehouses and inconsistent data lake systems. Expedia writes about its adoption of the lakehouse, the extension of the lakehouse to a domain-specific DataLakeMart, and OLAP systems.

https://medium.com/expedia-group-tech/powering-self-service-business-intelligence-across-expedia-group-e3d029a7d1f6

eBay: Optimizing Analytics Data Processing on eBay’s New Open-Source-Based Platform

Tuning a data pipeline requires a layered approach to achieve SLA timelines. eBay writes about the various layers to consider while tuning Spark jobs, such as the system level, the process, table optimization, SQL optimization & Apache Spark job config parameter tuning. The structured debugging approach is a delight to read, and this is one area where data infrastructure needs a lot of attention, moving from manual tuning to automated pipeline tuning.

https://tech.ebayinc.com/engineering/optimizing-analytics-data-processing-on-ebays-new-open-source-based-platform/

Capital One: End-to-End Models for Complex AI Tasks

The main advantage of machine learning over traditional software engineering is that it allows one to build a component that performs a task by training a model from data, which removes the need for a human to precisely specify how to perform the task. Why can't we adopt ML end to end rather than for only part of a task? Capital One writes about the pros & cons of adopting an end-to-end ML model and the challenges ahead to reach the promised land of the end-to-end ML model.

https://medium.com/capital-one-tech/end-to-end-models-for-complex-ai-tasks-8c34080145cd

Yelp: Modernizing Business Data Indexing

Serving computed metrics to the end-user with acceptable latency is critical for an enriched user experience. Yelp writes about the journey of its business data indexing system from one that queried MySQL tables to a stream-based CDC system that leverages Kafka, Flink, Apache Beam & Cassandra.

https://engineeringblog.yelp.com/2021/06/modernizing-business-data-indexing.html
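
As a rough illustration of the CDC idea, the sketch below replays change events into an in-memory index instead of re-querying the source table. The event shape (`op`, `key`, `after`) and the `apply_change` helper are hypothetical and are not taken from Yelp's system.

```python
def apply_change(index, event):
    """Apply one change event to `index`, a dict keyed by business id."""
    op = event["op"]
    key = event["key"]
    if op in ("insert", "update"):
        # Merge the changed columns into whatever is already indexed.
        index[key] = {**index.get(key, {}), **event["after"]}
    elif op == "delete":
        # Ignore deletes for keys that were never indexed.
        index.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")
    return index


# Hypothetical stream of change events, in commit order.
events = [
    {"op": "insert", "key": "biz-1", "after": {"name": "Cafe", "rating": 4.0}},
    {"op": "update", "key": "biz-1", "after": {"rating": 4.5}},
    {"op": "insert", "key": "biz-2", "after": {"name": "Diner"}},
    {"op": "delete", "key": "biz-2"},
]

index = {}
for event in events:
    apply_change(index, event)
```

In a real pipeline the events would come from a log such as Kafka and the index would live in a serving store. The point of the sketch is only that the serving copy is kept current by replaying changes in order.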

Ashley Melanson: Open Source Spotlight - How Dbt Can Transform Your Data Analytics Pipeline

I would be surprised if you have not heard of or played around with dbt by now. If you have not so far, the author did a great write-up breaking down the components of dbt.

https://ashleymellz.medium.com/open-source-spotlight-how-dbt-can-transform-your-data-analytics-pipeline-c54cf9516cdf

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.

Data Engineering Weekly