Data Catalog - A Broken Promise

Dec 30, 2022

A critique on Data Catalog, and the future of knowledge management

9 Comments

Jan 26, 2023 (Edited)

Ananth, very nicely written article. A data catalog / hub (whatever name we give it) is always a gigantic data integration effort. In my opinion, even the pre-modern era was no different. There were too many metadata producers.

When I initiated the WhereHows project in 2013 at LinkedIn, we had at least 3 orchestration engines: Azkaban, Appworx (UC4) and Informatica. Data then moved through HDFS into an MPP database, with some of the heavy-lifting ETL happening in Java MapReduce, Pig and finally Teradata SQL (final aggregation), ending up in reports built with MicroStrategy and Tableau.

I give the example of Google Maps: it is a wonderful application not just because it helps me get from a source (one data entity) to a destination (another data entity), but because it can superimpose nearby coffee shops with ratings when I need to wait somewhere. Often, bare-bones data lineage may not be able to answer all questions. What I noticed at LinkedIn is that data lineage mashed with process lineage (which pipelines / tasks were involved in the transformations, the pattern of data volume they carry every day) provides interesting insights for support, operational excellence, initial troubleshooting, etc.

We kept extending this notion to the source code repository because we needed to know the code owners, who made the latest changes, etc. To complicate the scenario, some metadata producers like UC4 and Informatica were very proprietary in nature. So yes, it was a very large, complex and fragile data integration project, and we needed to remember that breakage anywhere would make the lineage graph unusable.
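A minimal sketch of the idea above, data lineage joined with process lineage and code ownership, might look like the following. All dataset names, task names, owners, and the engine assigned to each task are hypothetical and chosen only for illustration; this is not how WhereHows itself models lineage.

```python
from collections import defaultdict, deque

# Hypothetical data lineage: (source dataset, target dataset, task that moved the data).
EDGES = [
    ("hdfs.events", "mpp.events_clean", "pig_clean_events"),
    ("mpp.events_clean", "teradata.daily_agg", "sql_daily_agg"),
    ("teradata.daily_agg", "tableau.kpi_report", "tableau_refresh"),
]

# Hypothetical process metadata, joined in from the scheduler and the code repository.
TASKS = {
    "pig_clean_events": {"engine": "Azkaban", "owner": "alice"},
    "sql_daily_agg": {"engine": "UC4", "owner": "bob"},
    "tableau_refresh": {"engine": "Informatica", "owner": "carol"},
}


def upstream(dataset, edges=EDGES, tasks=TASKS):
    """Walk the lineage graph backwards from `dataset`.

    Returns (source, target, task, owner) hops, nearest hop first.
    A task missing from `tasks` is reported with owner "unknown".
    """
    incoming = defaultdict(list)
    for src, dst, task in edges:
        incoming[dst].append((src, task))

    hops, seen, queue = [], {dataset}, deque([dataset])
    while queue:
        node = queue.popleft()
        for src, task in incoming[node]:
            owner = tasks.get(task, {}).get("owner", "unknown")
            hops.append((src, node, task, owner))
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return hops
```

Calling `upstream("tableau.kpi_report")` walks back to `hdfs.events` and names the task and owner at each hop, which is the kind of first-pass troubleshooting view described above. If one producer stops reporting, its hops simply disappear from `EDGES`, which illustrates how a single breakage can quietly truncate the graph.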

Stéphane LE LIONNAIS

Jan 18, 2023

I like Kendall Willets' comment: "the data catalog brings out the inner bureaucrat in all of us"... A data catalog can no longer just be internal wikidata (although documentation is still an important aspect of data governance)... I am co-founder of DAWIZZ, a company that publishes 'MyDataCatalogue', a data catalog which, after 6 years, has finally become more of a data discovery platform for structured data sources (databases, APIs, ...) and unstructured data sources (files, messaging, ...). We usually say that we create additional metadata from the data itself.

Kendall Willets

Dec 31, 2022

We have a project like this currently, and it gives me a very vintage feel, like the data dictionaries of yesteryear. It offers all the accuracy and freshness of software documentation, and it's bringing out the inner bureaucrat in all of us (or at least our data architecture group).

One difference that newer analytics stacks have is that they're often the primary store for event data; there's a tight loop with the product manager or analyst defining the events, devs deploying instrumentation to collect them, and the PM consuming them in the analytics store. The meaning and validity of the data changes very quickly in this situation, but it's easy to evolve if the cost of releasing and measuring is minimized.

I would bet a catalog would be out of date at least 95% of the time in this environment, for a number of reasons. For one, instrumentation is often wrong; it has to be tested and debugged just to match the design. It also goes obsolete without being removed; it's a small tech debt that's often worth ignoring. And lastly, frequent releases are unlikely to be coordinated with catalog updates.
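As a rough illustration of how such drift could be measured, here is a minimal sketch that compares the event names a catalog documents with the event names actually seen in the analytics store. The function name and the example event names are hypothetical.

```python
def catalog_drift(cataloged, observed):
    """Compare documented event names with event names actually observed.

    Returns the events that are emitted but undocumented, and the catalog
    entries that no longer match anything being emitted.
    """
    cataloged, observed = set(cataloged), set(observed)
    return {
        "undocumented": sorted(observed - cataloged),
        "stale": sorted(cataloged - observed),
    }


# Hypothetical example: a release renamed click_v1 to click_v2 without a catalog update.
report = catalog_drift(
    cataloged=["signup", "click_v1"],
    observed=["signup", "click_v2"],
)
```

This only catches name-level drift. Events that still fire but are no longer used, or whose meaning has changed, would need usage or schema signals that this sketch does not model.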

Greg Meyer

Dec 30, 2022

Data catalogs (or lineage, metadata, and definition) are most useful when generated automatically. +1 to the idea of adding catalog integration and identification features into existing tools, focused on:

1) who owns it

2) what does it do

3) does it depend on anything else
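The three questions above can be sketched as a small record that is derived from pipeline definitions rather than written by hand. The job dictionary layout (`output`, `owner`, `doc`, `inputs`) is an assumption made for this example and does not correspond to any particular orchestration tool.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    owner: str                                      # 1) who owns it
    description: str                                # 2) what does it do
    depends_on: list = field(default_factory=list)  # 3) does it depend on anything else


def entries_from_jobs(jobs):
    """Derive one catalog entry per job output from hypothetical job definitions."""
    return {
        job["output"]: CatalogEntry(
            name=job["output"],
            owner=job.get("owner", "unowned"),
            description=job.get("doc", ""),
            depends_on=sorted(job.get("inputs", [])),
        )
        for job in jobs
    }


# Hypothetical job definitions, as an existing tool might already hold them.
JOBS = [
    {"output": "daily_agg", "owner": "bob", "doc": "Daily KPI rollup", "inputs": ["events_clean"]},
    {"output": "events_clean", "inputs": ["raw_events"]},
]
catalog = entries_from_jobs(JOBS)
```

Because the entries are rebuilt from the job definitions on every run, a missing owner shows up as `"unowned"` instead of silently going out of date.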

Varun Saraogi

Dec 30, 2022

I still question myself each day when I chase data catalogs or the idea of selling them to customers. The data catalog as a concept is complex for the modern stack; one should only take inspiration from it when drawing metadata for the modern stack. Contracts have no doubt proved an effective tool, yet there is no solution for drawing up those contracts quickly for existing pipelines. Would love to be corrected here!

Daniel Beach

Dec 30, 2022

Data Catalogs never delivered on their promise. It's time for them to go the way of the dinosaur.

Koren

Dec 31, 2022

I think it depends on the intended usage. For analysts to find "interesting" tables: hard.

For system owners and data engineers to manage and govern the ecosystem (who owns what, lineage / dependencies, etc.): very important, and it does the job.

Reply (1)

GabTanTechInfo

Jan 2, 2023

If it is hard for analysts to find "interesting" tables then either the data catalog search / discovery feature is weak / bad, or - much more likely - it has not been supported with classifications, descriptions, comments and tags applied to those tables.

Is there a business information model linked to the tables / data assets? Or at least a glossary of business terms?

If not, then yeah, good luck to the analysts trying to find something just by column names...

Reply (1)

Koren

Jan 2, 2023

I agree. But even with those, at least in our organization, it worked better to correlate and recommend tables based on the data the analyst is currently working with (in the same system he is now looking at).

Or just have him "search" for information and get tables related to it.

The user story of "analyst looking for tables" is not a strong one (at least in our org, which has a big challenge of variety).
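One way to picture recommending tables based on what the analyst is currently working with is a co-usage count over past queries. This is a minimal sketch under the assumption that a query log is available as sets of table names used together; the function and table names are hypothetical, and it is not a description of the system used in the commenter's organization.

```python
from collections import Counter


def recommend(current_table, query_log, top_n=3):
    """Suggest tables most often queried together with `current_table`.

    `query_log` is an iterable of sets of table names, one set per past query.
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter()
    for tables in query_log:
        if current_table in tables:
            counts.update(t for t in tables if t != current_table)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [table for table, _ in ranked[:top_n]]


# Hypothetical query log.
LOG = [
    {"orders", "customers"},
    {"orders", "customers", "payments"},
    {"orders", "payments"},
    {"users", "sessions"},
]
suggestions = recommend("orders", LOG)
```

The suggestions come from observed usage rather than from descriptions or tags, so they keep working even where the catalog metadata is thin.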

Data Engineering Weekly