Data Catalog - A Broken Promise
A critique of the Data Catalog and the future of knowledge management
Data catalogs are the most expensive data integration systems you never intended to build. The Data Catalog as a passive web portal for displaying metadata needs significant rethinking to fit modern data workflows, not just a “modern” prefix added to its name.
I know that is an expensive statement to make😊 To be fair, I’m a big fan of data catalogs, or metadata management, to be precise. I even authored a special edition of Data Engineering Weekly capturing the history of metadata tools.
What changed my thoughts on Data Catalog?
After overseeing a couple of data catalog implementations recently, I paused and started to question my belief. The question is essentially two-fold.
Why is it so expensive in terms of the level of effort to roll out a data catalog solution?
Despite the initial energy from the stakeholders, why does the usage of Data Catalogs keep declining?
Is that experience unique to me? To find out, I asked the data community for its thoughts on the Data Catalog in a poll on LinkedIn.
If you discount a few data catalog vendor votes from the poll, 26% shrinks to 20%. So it’s not me; 80% of people think Data Catalog is not a prime-time data workflow system but a handy tool that is sometimes somewhat useful.
The result reaffirms my experience with the Data Catalog, and it makes me even more curious to understand why. To understand better, let’s step back and examine the data catalog in pre-modern-era and modern-era Data Engineering.
The pre-modern(?) era of Data Catalog
Let’s define the pre-modern era as the state of Data Warehouses before the explosion of big data and the subsequent adoption of cloud data warehouses. Applications were deployed on a large monolithic web server, and all data warehouse changes went through a central data architecture team.
A couple of important characteristics of a Data Warehouse at this time:
The ETL tools and Data Warehouse appliances are limited in scope. There are not many sources to pull metadata from.
The footprint of people in an organization directly accessing the Data Warehouse is fairly limited; getting access to query the Data Warehouse directly is a privilege and a specialized skill.
The modern(?) era of Data Catalog
Hadoop significantly reduced the barrier to storing and accessing large volumes of data. Cloud Data Warehouses and LakeHouse systems further accelerated the ease of access to data. They also opened up data for multiple use cases such as A/B testing, AI/ML, Growth Engineering, and Data-Driven product features.
All combined, two important characteristics changed in the modern data infrastructure.
Data Warehouses now ingest data from multiple data sources, which was not possible before, giving unprecedented insights.
The ease of access and multiple use cases expose data warehouses to multiple organizational stakeholders and specialized tools to get the job done.
At a high level, we can define modern data engineering as Fn(work) in your favorite tool.
What does that mean? Let’s expand a bit more to demonstrate the current state of the data catalog with the modern data stack.
As you can see, we still adopt the age-old metadata integration strategy to build data catalogs. That has two consequences:
Rolling out a data catalog is expensive and time-consuming.
It creates a disjointed workflow, so folks rarely use the tool.
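To make the integration cost concrete, here is a minimal sketch of the pull-based metadata integration strategy described above. Everything in it is an assumption for illustration: the tool names, the payload shapes, and the `CatalogEntry` record are hypothetical and do not describe any particular catalog product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CatalogEntry:
    """Normalized metadata record the catalog displays (hypothetical shape)."""
    source: str
    name: str
    owner: str


# Each tool exposes metadata in its own shape, so each one needs its own
# connector. The payloads below are invented for illustration only.
def pull_from_warehouse() -> List[CatalogEntry]:
    raw = [{"table": "orders", "owned_by": "analytics"}]
    return [CatalogEntry("warehouse", r["table"], r["owned_by"]) for r in raw]


def pull_from_orchestrator() -> List[CatalogEntry]:
    raw = [{"dag_id": "load_orders", "team": "data-platform"}]
    return [CatalogEntry("orchestrator", r["dag_id"], r["team"]) for r in raw]


def pull_from_bi_tool() -> List[CatalogEntry]:
    raw = [{"dashboard": "Revenue", "author": "finance"}]
    return [CatalogEntry("bi_tool", r["dashboard"], r["author"]) for r in raw]


# One connector per tool: the connector count grows with the stack.
CONNECTORS: Dict[str, Callable[[], List[CatalogEntry]]] = {
    "warehouse": pull_from_warehouse,
    "orchestrator": pull_from_orchestrator,
    "bi_tool": pull_from_bi_tool,
}


def sync_catalog(
    connectors: Dict[str, Callable[[], List[CatalogEntry]]]
) -> List[CatalogEntry]:
    """Run every connector and collect the normalized entries."""
    entries: List[CatalogEntry] = []
    for pull in connectors.values():
        entries.extend(pull())
    return entries


catalog = sync_catalog(CONNECTORS)
```

Every new tool in the stack means another connector to write and maintain, and the result still lives in a separate portal away from where the work happens.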
Is Data Catalog a 1980s Solution for 2020’s Problems?
Anthony Algmin offers an interesting perspective, reflecting many of the things we sketched in this article, in a piece titled Make An Impact: The Data Catalog is a 1980s Solution for 2020’s Problems. The author nailed the fundamental problem with the Data Catalog.
“There’s a bigger problem (*with Data Catalogs). It is that we now need to make potential data catalog users aware that the data catalog exists and then train them how to use it! A system that neither creates the data nor does the analysis—not exactly what most sane people want to spend a lot of time learning. A big strike two!”
The author proposes two ways the Data Catalog can evolve further.
Lose the interface and embed the catalog into the data creation tools.
Expand from Data Catalog to Knowledge Engine: not just a passive web portal, but a system integrated into the data creation process, a.k.a. a Data Contract platform.
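As a rough illustration of what integrating into the data creation process could look like, the sketch below validates data against a contract at publish time and records the metadata as a side effect of that step. It is not taken from Algmin's article or from any data contract product; `DataContract`, `publish`, and `REGISTRY` are hypothetical names, and an in-memory dictionary stands in for a real metadata store.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class DataContract:
    """Declared expectations for a dataset (hypothetical shape)."""
    dataset: str
    owner: str
    columns: Dict[str, type]


class ContractViolation(Exception):
    """Raised when rows do not match the declared contract."""


def validate(contract: DataContract, rows: List[Dict[str, Any]]) -> None:
    """Check that every row has the declared columns with the declared types."""
    for i, row in enumerate(rows):
        missing = contract.columns.keys() - row.keys()
        if missing:
            raise ContractViolation(f"row {i}: missing columns {sorted(missing)}")
        for col, expected in contract.columns.items():
            if not isinstance(row[col], expected):
                raise ContractViolation(
                    f"row {i}: column {col!r} is not {expected.__name__}"
                )


# In-memory stand-in for a metadata store (an assumption for this sketch).
REGISTRY: Dict[str, DataContract] = {}


def publish(contract: DataContract, rows: List[Dict[str, Any]]) -> int:
    """Validate first, then register the contract as part of publishing."""
    validate(contract, rows)
    REGISTRY[contract.dataset] = contract
    return len(rows)


orders_contract = DataContract(
    dataset="orders",
    owner="analytics",
    columns={"order_id": int, "amount": float},
)
published = publish(orders_contract, [{"order_id": 1, "amount": 9.99}])
```

The point of the design is that the producer never visits a separate portal: the metadata is captured in the same step that creates the data, and data that breaks the contract is rejected before it is published.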
I don’t think data catalogs are going away soon, but data catalog tools should acknowledge the changes in the underlying system dynamics. We can’t keep designing a system for 20-year-old infrastructure.
What do you all think about the future of Data Catalogs? I look forward to your comments.