Data Catalog - A Broken Promise
A critique on Data Catalog, and the future of knowledge management
Data catalogs are the most expensive data integration systems you never intended to build. Data Catalog as a passive web portal to display metadata requires significant rethinking to adopt modern data workflow, not just adding “modern” in its prefix.
I know that is an expensive statement to make😊 To be fair, I’m a big fan of data catalogs, or metadata management, to be precise. I even authored a special edition capturing metadata tool’s history at Data Engineering Weekly.
What changed my thoughts on Data Catalog?
After overseeing a couple of data catalog implementations in recent times, it made me pause and started to question my belief. The question is essentially two-fold.
Why is it so expensive in terms of the level of effort to roll out a data catalog solution?
Despite the initial energy from the stakeholders, why does the usage of Data Catalogs keep declining?
Is that the experience unique to me? So I seek data community thoughts about Data Catalog in a poll on LinkedIn.
If you discount a few data catalog vendor votes from the poll, 26% shrinks to 20%. So it’s not me; 80% of people think Data Catalog is not a prime-time data workflow system but a handy tool that is sometimes somewhat useful.
The result reaffirms my experience with the Data Catalog, and also it triggers more curiosity in me to understand why so? To understand better, Let’s step back and examine the data catalog of pre-modern-era and modern-era
Data Engineering.The pre-modern(?) era of Data Catalog
Let’s call the pre-modern era; as the state of Data Warehouses before the explosion of big data and subsequent cloud data warehouse adoption. Applications deployed in a large monolithic web server with all the data warehouse changes go through a central data architecture team.
A couple of important characteristics of a Data Warehouse at this time
The ETL tools and Data Warehouse appliances are limited in scope. There are not many sources to pull the metadata.
The footprint of people in an organization directly accessing the Data Warehouse is fairly limited; getting access to query the Data Warehouse directly is a privilege and a specialized skill.
The modern(?) era of Data Catalog
Hadoop significantly reduced the barrier to storing and accessing large volumes of data. The cloud Data Warehouses & LakeHouse systems further accelerate the ease of access to the data. It also opens up data for multiple use cases such as A/B testing
, AI/ML, Growth Engineering, and Data-Driven product features, etc.,All combined, two important characteristics changed in the modern data infrastructure.
Data Warehouses now ingest data from multiple data sources
, which was not possible before, giving unprecedented insights.The ease of access and multiple use cases expose data warehouses to multiple organizational stakeholders and specialized tools to get the job done.
At a high level, we can define modern data engineering as Fn(work) in your favorite tool.
What does that mean? Let’s expand a bit more to demonstrate the current state of the data catalog with the modern data stack.
As you can see, We still adopt the age-old metadata integration strategy to build data catalogs. It makes rolling out the data catalogs.
Expensive and time-consuming
It creates a disjointed workflow which makes folks rarely use the tool
Is Data Catalog a 1980s Solution for 2020’s Problems?
Anthony Algmin writes an interesting perspective reflecting many of the things we sketched in the article with a title, Make An Impact: The Data Catalog is a 1980s Solution for 2020’s Problems. The author nailed the fundamental problem with the Data catalog.
“There’s a bigger problem (*with Data Catalogs). It is that we now need to make potential data catalog users aware that the data catalog exists and then train them how to use it! A system that neither creates the data nor does the analysis—not exactly what most sane people want to spend a lot of time learning. A big strike two!”
The author proposes two ways the Data Catalog can evolve further.
Lose the Interface and embedded into the data creation tools
Expand from Data Catalog to Knowledge Engine - Aka not just a passive web portal, but integrate into the data creation process, aka Data Contract platform.
Conclusion
I don’t think data catalogs are going away soon, but the data catalog tools should acknowledge the underlying system dynamics changes. We can’t design a system that works for 20-year-old infrastructure.
What do you all think about the future of Data Catalogs? I look forward to your comments.
References
https://vwo.com/blog/cro-best-practices-booking/
https://research.netflix.com/research-area/recommendations
https://engineering.gusto.com/what-is-growth-engineering/
https://slack.engineering/recommend-api/
https://airbyte.com/connectors
I like Kendall Willets' comment "the data catalog brings out the inner bureaucrat in all of us"... A data catalog can no longer just be internal wikidata (although documentation is still an important aspect of data governance)... I am co-founder of DAWIZZ, a company that publishes 'MyDataCatalogue' a data catalog which finally after 6 years has become more of a data discovery platform in structured data sources (databases, APIs, . . .) and in unstructured data sources (files, messaging, . . .). We usually say that we create additional metadata from the data itself
We have a project like this currently, and it gives me a very vintage feel, like the data dictionaries of yesteryear. It offers all the accuracy and freshness of software documentation, and it's bringing out the inner bureaucrat in all of us (or at least our data architecture group).
One difference that newer analytics stacks have is that they're often the primary store for event data; there's a tight loop with the product manager or analyst defining the events, devs deploying instrumentation to collect them, and the PM consuming them in the analytics store. The meaning and validity of the data changes very quickly in this situation, but it's easy to evolve if the cost of releasing and measuring is minimized.
I would bet a catalog would be out of date at least 95% of the time in this environment, for a number of reasons. For one instrumentation is often wrong; it has to be tested and debugged just to match the design. It also goes obsolete without being removed; it's a small tech debt that's often worth ignoring. And lastly, frequent releases are unlikely to be coordinated with catalog updates.