A critique on Data Catalog, and the future of knowledge management
Ananth, very nicely written article. A data catalog / hub (whatever name we give it) is always a gigantic data integration effort. In my opinion, even the pre-modern era was no different: there were too many metadata producers.
When I initiated the WhereHows project in 2013 at LinkedIn, we had at least 3 orchestration engines: Azkaban, Appworx (UC4) and Informatica. Data then moved through HDFS into an MPP database, with some of the heavy-lifting ETL happening in Java map-reduce and Pig, final aggregation in Teradata SQL, and reports ending up in MicroStrategy and Tableau.
I give the example of Google Maps: it is a wonderful application not just because it helps me get from a source (one data entity) to a destination (another data entity), but because it can superimpose nearby coffee shops with ratings when I need to wait somewhere. Often a bare-bones data lineage graph cannot answer all questions. What I noticed at LinkedIn is that data lineage mashed with process lineage (which pipelines/tasks were involved in the transformations, and the pattern of data volume they carry every day) provides interesting insights for support, operational excellence, initial troubleshooting, etc.
We kept extending this notion to the source code repository because we needed to know the code owners, who made the latest changes, and so on. To complicate the scenario, some metadata producers like UC4 and Informatica were very proprietary in nature. So yes, it was a very large, complex, and fragile data integration project, and we had to remember that breakage anywhere would make the lineage graph unusable.
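A minimal sketch of what "data lineage mashed with process lineage" could look like as one graph. This is purely illustrative (the node types, field names, and example pipeline are my own assumptions, not WhereHows' actual model): dataset nodes and process nodes live in the same structure, so you can ask both "what feeds this table?" and "which task carries the data, and who owns it?"

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str                    # "dataset" or "process" (pipeline/task)
    owner: str = "unknown"
    avg_daily_rows: int = 0      # process-lineage detail: typical volume

@dataclass
class LineageGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)   # (src_name, dst_name)

    def add(self, node: Node) -> None:
        self.nodes[node.name] = node

    def link(self, src: str, dst: str) -> None:
        self.edges.append((src, dst))

    def upstream(self, name: str) -> set:
        """Walk edges backwards: every node that ultimately feeds `name`."""
        found, stack = set(), [name]
        while stack:
            cur = stack.pop()
            for src, dst in self.edges:
                if dst == cur and src not in found:
                    found.add(src)
                    stack.append(src)
        return found

# Hypothetical pipeline: raw events -> aggregation job -> daily table
g = LineageGraph()
g.add(Node("events_raw", "dataset", owner="tracking-team"))
g.add(Node("daily_agg_job", "process", owner="dwh-team", avg_daily_rows=5_000_000))
g.add(Node("events_daily", "dataset", owner="dwh-team"))
g.link("events_raw", "daily_agg_job")
g.link("daily_agg_job", "events_daily")

print(g.upstream("events_daily"))  # both the job and the raw dataset
```

Because process nodes sit in the lineage path, a broken edge anywhere (say, a proprietary UC4 task that stops reporting) really does cut the upstream walk short, which is exactly the fragility described above.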
I like Kendall Willets' comment: "the data catalog brings out the inner bureaucrat in all of us"... A data catalog can no longer be just an internal wiki (although documentation is still an important aspect of data governance)... I am a co-founder of DAWIZZ, a company that publishes 'MyDataCatalogue', a data catalog which, after 6 years, has finally become more of a data discovery platform across structured data sources (databases, APIs, . . .) and unstructured data sources (files, messaging, . . .). We usually say that we create additional metadata from the data itself.
We have a project like this currently, and it gives me a very vintage feel, like the data dictionaries of yesteryear. It offers all the accuracy and freshness of software documentation, and it's bringing out the inner bureaucrat in all of us (or at least our data architecture group).
One difference that newer analytics stacks have is that they're often the primary store for event data; there's a tight loop with the product manager or analyst defining the events, devs deploying instrumentation to collect them, and the PM consuming them in the analytics store. The meaning and validity of the data changes very quickly in this situation, but it's easy to evolve if the cost of releasing and measuring is minimized.
I would bet a catalog would be out of date at least 95% of the time in this environment, for a number of reasons. For one, instrumentation is often wrong; it has to be tested and debugged just to match the design. It also goes obsolete without being removed; it's a small piece of tech debt that's often worth ignoring. And lastly, frequent releases are unlikely to be coordinated with catalog updates.
Data catalogs (or lineage, metadata, and definitions) are most useful when generated automatically. +1 to the idea of adding catalog integration and identification features into existing tools, focused on:
1) who owns it
2) what does it do
3) does it depend on anything else
I still question myself each day when I chase data catalogs or the idea of selling them to customers. The data catalog as a concept is too complex for the modern stack; one should only take inspiration from it when deriving metadata for the modern stack. Data contracts have no doubt proved an effective tool, yet there is no solution for drawing up those contracts quickly against existing pipelines. I would love to be corrected here!
Data Catalogs never delivered on their promise. It's time for them to go the way of the dinosaur.
I think it depends on the intended usage. For an analyst trying to find "interesting" tables: hard.
For system owners and data engineers managing and governing the ecosystem (who owns what, lineage/dependencies, etc.): very important, and it does the job.