Data Engineering Weekly

Data Catalog - A Broken Promise
A critique of the Data Catalog, and the future of knowledge management

Ananth Packkildurai
Dec 30, 2022

Data catalogs are the most expensive data integration systems you never intended to build. The Data Catalog as a passive web portal to display metadata requires significant rethinking to fit modern data workflows, not just adding “modern” as a prefix.

I know that is an expensive statement to make. 😊 To be fair, I’m a big fan of data catalogs, or metadata management, to be precise. I even authored a special edition capturing metadata tools’ history at Data Engineering Weekly.

Data Engineering Weekly #21: Metadata Edition
Welcome to the 21st edition of the data engineering newsletter. The 21st edition of the newsletter focuses on the recent breakthroughs in metadata management. I believe the next big set of challenges in data engineering is all about efficient data management…

What changed my thoughts on Data Catalog?

Overseeing a couple of data catalog implementations in recent times made me pause and start to question my belief. The question is essentially two-fold.

  1. Why is it so expensive in terms of the level of effort to roll out a data catalog solution?

  2. Despite the initial energy from the stakeholders, why does the usage of Data Catalogs keep declining?

Is that experience unique to me? So I sought the data community’s thoughts about Data Catalogs in a poll on LinkedIn.

How happy are you with your data catalogs?

If you discount a few data catalog vendor votes from the poll, the 26% shrinks to 20%. So it’s not just me; 80% of people think the Data Catalog is not a prime-time data workflow system but a handy tool that is sometimes somewhat useful.

The result reaffirms my experience with the Data Catalog, and it also triggers more curiosity to understand why. To understand better, let’s step back and examine the data catalog of the pre-modern and modern eras of Data Engineering. [1]

The pre-modern(?) era of Data Catalog

Let’s call the pre-modern era the state of Data Warehouses before the explosion of big data and the subsequent cloud data warehouse adoption. Applications were deployed in a large monolithic web server, and all data warehouse changes went through a central data architecture team.

A couple of important characteristics of a Data Warehouse at this time:

  1. The ETL tools and Data Warehouse appliances are limited in scope. There are not many sources from which to pull metadata.

  2. The footprint of people in an organization directly accessing the Data Warehouse is fairly limited; getting access to query the Data Warehouse directly is a privilege and a specialized skill.

The modern(?) era of Data Catalog

Hadoop significantly reduced the barrier to storing and accessing large volumes of data. The cloud Data Warehouse & LakeHouse systems further accelerated the ease of access to the data. It also opened up data for multiple use cases such as A/B testing [2], AI/ML [3], Growth Engineering [4], and data-driven product features [5].

All combined, two important characteristics changed in the modern data infrastructure.

  1. Data Warehouses now ingest data from multiple data sources [6], which was not possible before, giving unprecedented insights.

  2. The ease of access and multiple use cases expose data warehouses to multiple organizational stakeholders and specialized tools to get the job done.

At a high level, we can define modern data engineering as Fn(work) in your favorite tool.

What does that mean? Let’s expand a bit more to demonstrate the current state of the data catalog with the modern data stack.

Despite the new stack, we still adopt the age-old metadata integration strategy to build data catalogs: crawl each source system and copy its metadata into a central store. It makes rolling out data catalogs:

  1. Expensive and time-consuming

  2. Disjointed from everyday workflows, so folks rarely use the tool
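The crawl-and-copy pattern above can be sketched in a few lines. This toy uses SQLite’s own catalog tables as a stand-in for a warehouse’s information_schema; the function name and output shape are illustrative, not any vendor’s API. A real catalog repeats this crawl for every tool in the stack, which is exactly where the expense comes from.

```python
import sqlite3

def crawl_table_metadata(conn):
    """Pull table and column metadata from a database's own catalog.

    A toy stand-in for the pull-based crawl a data catalog runs against
    each source system (warehouses expose the same idea via
    information_schema).
    """
    metadata = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table_name,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f"PRAGMA table_info({table_name})").fetchall()
        metadata[table_name] = [
            {"name": col[1], "type": col[2], "nullable": not col[3]}
            for col in columns
        ]
    return metadata

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL NOT NULL)")
print(crawl_table_metadata(conn))
```

Note that everything this crawl captures is a snapshot: the moment the source schema changes, the central copy is stale until the next crawl runs.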

Is Data Catalog a 1980s Solution for 2020’s Problems?

Anthony Algmin offers an interesting perspective, reflecting many of the points sketched in this article, in a piece titled Make An Impact: The Data Catalog is a 1980s Solution for 2020’s Problems. The author nails the fundamental problem with the Data Catalog.

“There’s a bigger problem (*with Data Catalogs). It is that we now need to make potential data catalog users aware that the data catalog exists and then train them how to use it! A system that neither creates the data nor does the analysis—not exactly what most sane people want to spend a lot of time learning. A big strike two!”

The author proposes two ways the Data Catalog can evolve further.

  1. Lose the interface and embed the catalog into the data creation tools

  2. Expand from Data Catalog to Knowledge Engine, i.e., not just a passive web portal but integrated into the data creation process, such as a Data Contract platform
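As a thought experiment, here is a minimal sketch of the second path: a contract declared by the data producer and enforced at record-creation time, instead of metadata documented after the fact in a passive portal. The contract format, dataset, field names, and helper function are all hypothetical assumptions, not any specific Data Contract platform’s API.

```python
# Hypothetical data contract, declared alongside the code that produces
# the data. The schema lives where the data is created, not in a portal.
CONTRACT = {
    "dataset": "orders",
    "owner": "checkout-team",
    "fields": {
        "order_id": int,
        "amount_usd": float,
        "currency": str,
    },
}

def validate(record: dict, contract: dict) -> list:
    """Return a list of contract violations for one record (empty if valid)."""
    errors = []
    for field, expected_type in contract["fields"].items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(record[field]).__name__}")
    return errors

good = {"order_id": 42, "amount_usd": 19.99, "currency": "USD"}
bad = {"order_id": "42", "amount_usd": 19.99}
print(validate(good, CONTRACT))
print(validate(bad, CONTRACT))
```

Because validation happens in the producer’s workflow, the metadata can never drift from the data the way a crawled catalog’s snapshot does.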

Conclusion

I don’t think data catalogs are going away soon, but the data catalog tools should acknowledge that the underlying system dynamics have changed. We can’t keep designing a system that works for 20-year-old infrastructure.

What do you all think about the future of Data Catalogs? I look forward to your comments.

References

1. @ananthdurai on Twitter, Apr 21, 2022: “I can't believe the modern data stack is already ten years old!!! Time flies. If you got confused, Redshift's initial release was oct-2012 :-) 🧵 What is your take on the age of the modern data stack?”

2. https://vwo.com/blog/cro-best-practices-booking/

3. https://research.netflix.com/research-area/recommendations

4. https://engineering.gusto.com/what-is-growth-engineering/

5. https://slack.engineering/recommend-api/

6. https://airbyte.com/connectors

Comments
Stéphane LE LIONNAIS
Jan 18

I like Kendall Willets' comment "the data catalog brings out the inner bureaucrat in all of us"... A data catalog can no longer just be internal wikidata (although documentation is still an important aspect of data governance)... I am co-founder of DAWIZZ, a company that publishes 'MyDataCatalogue' a data catalog which finally after 6 years has become more of a data discovery platform in structured data sources (databases, APIs, . . .) and in unstructured data sources (files, messaging, . . .). We usually say that we create additional metadata from the data itself

Kendall Willets
Dec 31, 2022

We have a project like this currently, and it gives me a very vintage feel, like the data dictionaries of yesteryear. It offers all the accuracy and freshness of software documentation, and it's bringing out the inner bureaucrat in all of us (or at least our data architecture group).

One difference that newer analytics stacks have is that they're often the primary store for event data; there's a tight loop with the product manager or analyst defining the events, devs deploying instrumentation to collect them, and the PM consuming them in the analytics store. The meaning and validity of the data changes very quickly in this situation, but it's easy to evolve if the cost of releasing and measuring is minimized.

I would bet a catalog would be out of date at least 95% of the time in this environment, for a number of reasons. For one instrumentation is often wrong; it has to be tested and debugged just to match the design. It also goes obsolete without being removed; it's a small tech debt that's often worth ignoring. And lastly, frequent releases are unlikely to be coordinated with catalog updates.
