Discussion about this post

Abhijit Bhattacharya

Ananth, very nicely written article. A data catalog / hub (whatever name we give it) is always a gigantic data integration effort. In my opinion, even the pre-modern era was no different. There were too many metadata producers.

When I initiated the WhereHows project in 2013 at LinkedIn, we had at least 3 orchestration engines - Azkaban, Appworx (UC4) & Informatica. Data then moved through HDFS into an MPP database, with some of the heavy-lifting ETL happening in Java MapReduce, Pig and finally Teradata SQL (final aggregation), ending up in reports built with MicroStrategy and Tableau.

I give the example of Google Maps - it is a wonderful application not just because it helps me get from a source (one data entity) to a destination (another data entity), but because it can superimpose nearby coffee shops with ratings when I need to wait somewhere. Often bare-bones data lineage may not be able to answer all questions. What I noticed at LinkedIn is that data lineage mashed with process lineage (which pipelines / tasks were involved in the transformations, the pattern of data volume they carry every day) provides interesting insights for support, operational excellence, initial troubleshooting etc.

We kept extending this notion to the source code repository because we needed to know the code owners, who made the latest changes etc. To complicate the scenario, some metadata producers like UC4 and Informatica were very proprietary in nature - so yes, it was a very large, complex and fragile data integration project, and we needed to remember that a breakage anywhere would make the lineage graph unusable.

Stéphane LE LIONNAIS

I like Kendall Willets' comment "the data catalog brings out the inner bureaucrat in all of us"... A data catalog can no longer be just an internal wikidata (although documentation is still an important aspect of data governance)... I am co-founder of DAWIZZ, a company that publishes 'MyDataCatalogue', a data catalog which, after 6 years, has finally become more of a data discovery platform for structured data sources (databases, APIs, ...) and unstructured data sources (files, messaging, ...). We usually say that we create additional metadata from the data itself
