Data Engineering Weekly #52

Weekly Data Engineering Newsletter

Aug 16, 2021

Welcome to the 52nd edition of the data engineering newsletter. This week's edition covers a few of my thoughts on the data mesh discussion, Pedram Navid's The Last Thing I'll Ever Say About the Data Mesh, lakeFS's what can replace the Hive Metastore, Apache Hudi's platform overview, Continual's Is Data-First AI the Next Big Thing?, OpenLineage's data quality facets with Airflow & Great Expectations, RudderStack's Why It's Hard for Engineering to Support Marketing, Uber's data platform cost optimization, SQLGlot, a Python SQL parser & transpiler for Spark SQL, Hive, Presto & Trino, and a catalog of AI/DL university lectures.

Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.

Thoughts on Data Mesh Discussions

Last week, Data Twitter had some interesting discussions around Data Mesh. Here are a few of my thoughts on that discussion.

Almost five years ago, I met Dave (not his real name) at a tech conference. Dave is the principal architect for one of the largest retail banks in Canada. We were exchanging notes on common challenges in data ingestion, observability, and data silos. I enthusiastically explained how the ingestion framework I worked on had built-in observability and a scalable architecture. Dave responded with a smile, "Ananth, do you think a large bank with capital at its disposal can't buy one of these systems? I don't have a data silo, but a people silo. People hold on to their data as a negotiation tool, so every data problem becomes a trade-off, resulting in inefficient workarounds. How will you fix the people silo and free the data?"

The people silo problem is still valid in most organizations. In my opinion, there is no scalability issue with a monolithic architecture where storage and compute can scale independently. Multi-tenant centralized storage with a clear separation of concerns can scale with proper tooling. dbt is addressing the silo with its domain view structure, but the instrumentation part is still challenging.

I genuinely believe a concept like Data Mesh and domain ownership is much needed as a way to validate data systems, similar to what the CAP theorem does for distributed systems. The CAP theorem is not perfect either, but it is good enough to validate designs against. Any vague and misleading concept leads to multiple interpretations, which only results in a chaotic culture.

Pedram Navid writes an excellent reflection on Data Mesh, covering both what is understood and the confusion that still requires clarification.

Pedram Navid: The Last Thing I'll Ever Say About the Data Mesh

On a positive note, the Data Mesh book now offers a code for limited-time free access to the O'Reilly platform, with a new chapter released every month.

Zhamak Dehghani @zhamakd

This code gives you free access to the O’Reilly platform (for a bit) learning.oreilly.com/get-learning/?… 2/3

I believe the goal of Data Mesh is to democratize data accessibility and break organizational silos. A reference architecture for data mesh and a clear demonstration of how domain ownership brings accountability would go a long way toward simplifying the concept.

lakeFS: Hive Metastore – Why It’s Still Here and What Can Replace It?

The Hive Metastore is a critical component that sits at the intersection of all query engines' paths, providing a virtualization layer between storage and compute. lakeFS writes an exciting article on how the Hive Metastore has endured for the last ten years while Hadoop's popularity declined. The article predicts the possible components that could succeed the Hive Metastore.

https://lakefs.io/hive-metastore-why-its-still-here-and-what-can-replace-it/

Apache Hudi: Apache Hudi - The Data Lake Platform

Apache Hudi pioneered the serverless transactional layer for event logs, which significantly shaped data infrastructure. The article gives an in-depth overview of Apache Hudi's building blocks and its future roadmap, aligned with its founding principle.

https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform/

Continual: Is Data-First AI the Next Big Thing?

Continual writes about the evolution of ML platforms from collaboration-centric to model-based to data-centric platforms. The blog is an exciting read on how each generation's platform abstraction led to the next generation of platforms, and on the democratization of ML/AI engineering over the last ten years.

https://continual.ai/post/is-data-first-ai-the-next-big-thing

OpenLineage: Expecting Great Quality with OpenLineage Facets

Data quality defines the success of a data-driven organization.

The blog is an excellent reminder of why no data is better than bad data. The article walks through tracing data quality using OpenLineage facets integrated with Airflow & Great Expectations.

https://openlineage.io/blog/dataquality_expectations_facet/

Uber: Cost-Efficient Open Source Big Data Platform at Uber

Ever-growing data generation adds pressure on the operating cost of data infrastructure. Cost optimization is a critical architectural constraint in modern data infrastructure. Uber writes about its experience optimizing cost across the data storage, compute, and query layers. On AWS, S3 tiered storage provides a similar optimization for the storage layer.

https://eng.uber.com/cost-efficient-big-data-platform/

SQLGlot: Python SQL Parser and Transpiler

Presto/Trino is an excellent query engine for the exploration stage of analysis, but it does not provide the level of fault tolerance that Spark SQL/Hive offer for production pipelines. Converting SQL from one engine to another is a painful task. I recently came across SQLGlot, which promises to automate it. I've not tested it, but I'm excited about this tool.

https://github.com/tobymao/sqlglot
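
For readers who want to try it, here is a minimal sketch of what a Presto-to-Spark conversion might look like using SQLGlot's `transpile` function. The helper name `presto_to_spark`, the sample query, and its table and column names are illustrative assumptions, and the exact output can differ between SQLGlot versions, so treat this as a starting point rather than a verified recipe.

```python
from typing import List


def presto_to_spark(sql: str) -> List[str]:
    """Transpile Presto SQL to Spark SQL; returns one string per statement."""
    # Imported lazily so sqlglot is only required when the helper is called.
    import sqlglot

    # `read` names the source dialect and `write` names the target dialect.
    return sqlglot.transpile(sql, read="presto", write="spark")


if __name__ == "__main__":
    # Hypothetical query; the table and column names are made up for illustration.
    query = (
        "SELECT user_id, APPROX_DISTINCT(session_id) AS sessions "
        "FROM events GROUP BY user_id"
    )
    for statement in presto_to_spark(query):
        print(statement)
```

Wrapping the call in a small helper keeps the dialect pair in one place, which may be handy when promoting exploratory Presto/Trino queries into a Spark SQL pipeline. It is still worth reviewing the generated SQL before running it in production.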

Advanced ML/DL University Lectures

I recently came across this catalog of advanced ML/DL university lectures. Kudos to the list's curator, and I hope it will benefit the Data Engineering Weekly community.

https://docs.google.com/spreadsheets/d/1KYJ9Z8f76WZGYpT2E5sjr5gL-O35Lpjm-SMmU00fplk/htmlview

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.
