Data Mesh Simplified: A Reflection Of My Thoughts On Data Mesh
A simplified narration of data mesh principles, and comparison with the Data Lake
The Rise of Data Mesh
Data Mesh is a set of data engineering principles introduced by Zhamak Dehghani from ThoughtWorks. I highly recommend reading "How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh" and "Data Mesh Principles and Logical Architecture".
Data Mesh is influenced by domain-driven design and emphasizes data ownership and shared tooling to generate, curate, and democratize data. Data Mesh principles are gaining adoption in some leading organizations. Yelp has talked about its data mesh journey, and so have Netflix and Zalando.
Though the literature around data mesh is maturing, I still see some confusion about Data Mesh in a few blogs. I'm not an authority on data mesh; this is an attempt to convey what I have learned about it. I'm eager to hear alternate viewpoints.
The sad state of data engineering
Now the fundamental question you may ask: why Data Mesh, and why now? To understand Data Mesh, we need to understand the current state of the data engineering world. It may not directly apply to your organization, but most data infrastructure remains in this sad state.
Imagine you are writing a dictionary that contains only the words, with no meanings. On top of that, you shuffle the words randomly, publish the dictionary without any index, and hire highly paid data engineers and analysts to decode it.
That is the current state of data infrastructure. Modern data infrastructure has sophisticated systems like Kafka and Spark, and the ability to emit and process events at petabyte scale. Yet the data generation process we follow is equivalent to writing a dictionary without meanings.
Wait, Didn't the Data Lake Solve It?
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is without having to structure the data.
As the definition suggests, a Data Lake focuses on centralized data storage to break down the organization's data silos. The central repository removes the entry barriers to integrating and analyzing data from various sources in an organization.
However, as the data lake grows, the complexity of managing the data grows with it.
The producer of the data generates data and sends it to the data lake.
The consumers down the line have no understanding of the producer's domain and struggle to understand the data in the lake.
The consumers then contact the data producer to understand the data. At that point, the domain expertise on the producer side depends on human knowledge that may or may not be available.
As the Data Lake grows, it becomes technical debt rather than a strategic advantage.
How Does Data Mesh Solve It?
Data Mesh is an enterprise data platform principle that brings together ideas from Distributed Domain-Driven Architecture, Self-serve Platform Design, and Thinking of Data as a Product.
Data Mesh focuses on treating data as a product. The feature team that works on a product feature has the domain understanding of the data; the data's consumer does not. Data Mesh therefore pushes data ownership to the feature team, which creates the data, attaches metadata, catalogs it, and stores it for easier consumption.
Curating and cataloging the data at creation time brings more visibility to the data and makes it easier to consume. The process also eliminates human knowledge silos and truly democratizes the data. It lets data consumers stop worrying about data discovery and focus on producing value from the data.
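To make the idea of cataloging at creation time concrete, here is a minimal sketch in Python. It is not taken from any particular tool: the `DataProduct` and `Catalog` classes, their fields, and the `orders` example are all hypothetical names I made up for illustration. The only point is that the producing team cannot publish a dataset until the owner, the description, and the meaning of every column are filled in.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DataProduct:
    """A dataset published by the team that owns the domain (hypothetical model)."""
    name: str
    owner_team: str
    description: str
    # Maps each column name to a human-readable meaning.
    columns: Dict[str, str] = field(default_factory=dict)


class Catalog:
    """An in-memory stand-in for a shared, self-serve data catalog (an assumption)."""

    def __init__(self) -> None:
        self._products: Dict[str, DataProduct] = {}

    def publish(self, product: DataProduct) -> None:
        # Reject "dictionary without meaning" datasets at creation time.
        undocumented = [c for c, meaning in product.columns.items() if not meaning.strip()]
        if not product.owner_team or not product.description or undocumented:
            raise ValueError(
                f"{product.name}: missing owner, description, "
                f"or column meanings {undocumented}"
            )
        self._products[product.name] = product

    def describe(self, name: str) -> DataProduct:
        # Consumers look up meaning here instead of asking the producer.
        return self._products[name]


catalog = Catalog()
catalog.publish(
    DataProduct(
        name="orders",
        owner_team="checkout",
        description="One row per completed order",
        columns={
            "order_id": "Unique order identifier",
            "total_cents": "Order total in cents",
        },
    )
)
```

A real platform would persist the catalog and enforce far richer contracts, but even this small check moves the domain knowledge from people's heads into the data product itself.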
How Is Data Lake Different From Data Mesh?
Data Lake is like a reporter writing an article for the New York Times. The reporter interviews the relevant people to write a story, fact-checks it, and delivers the reporter's narration to the readers.
Data Mesh is like writing a book for O'Reilly or a similar publisher. The publisher provides a foundational infrastructure for all the authors. The authors write their views, add an index and a glossary to the book, and deliver their own narration to the readers.
But, There Is Always A Catch
Data Mesh sounds very cool, but there is always a catch. This is a great Twitter thread summarizing the challenges in adoption.
As I mentioned, if we blindly adopt Data Mesh principles without the proper tooling, it can easily lead us back to the good old organizational data silo problem. As mentioned in the Tweet below, no tool can fix the problem.
The following threads narrate observations from the industry and how to adapt the Data Mesh principles.
Data Mesh is not a technology or a storage solution; instead, it is a set of principles for streamlining an organization's data management. As Gwen, Sriram, Kishore, and Vinoth pointed out, it is an invisible structure in most organizations and requires proper tooling to enable the Data Mesh principles.
The analogy of evolving monolithic architecture to microservices architecture fits well with the Data Mesh principles. If you are starting up, a monolithic architecture may work well for you. As you grow, focus on building tools to label, catalog, organize, and search your data, leading to the adoption of the Data Mesh principles.
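As a rough illustration of what labeling and searching could look like at the smallest possible scale, here is a sketch in Python. The dataset names and labels are invented for the example, and `build_index` and `search` are hypothetical helpers, not part of any real catalog product. It builds an inverted index from labels to datasets and returns the datasets that carry every requested label.

```python
from typing import Dict, List, Set

# Hypothetical catalog entries: dataset name -> labels attached by the owning team.
LABELS: Dict[str, Set[str]] = {
    "orders": {"checkout", "revenue", "pii-free"},
    "payments": {"checkout", "revenue", "pii"},
    "page_views": {"web", "engagement", "pii-free"},
}


def build_index(labels: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Invert dataset -> labels into label -> datasets for fast lookup."""
    index: Dict[str, Set[str]] = {}
    for dataset, tags in labels.items():
        for tag in tags:
            index.setdefault(tag, set()).add(dataset)
    return index


def search(index: Dict[str, Set[str]], *wanted: str) -> List[str]:
    """Return the sorted names of datasets that carry every requested label."""
    if not wanted:
        return []
    matches = [index.get(tag, set()) for tag in wanted]
    return sorted(set.intersection(*matches))


index = build_index(LABELS)
print(search(index, "revenue", "pii-free"))  # prints ['orders']
```

Something this simple is enough while the organization is small. The labels only help if the producing teams attach them, which is why the tooling and the ownership model have to grow together.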
Links are provided for informational purposes and do not imply endorsement. All views expressed in this blog are my own and do not represent current, former, or future employers' opinions.