Introducing Schemata - A Decentralized Schema Modeling Framework For Modern Data Stack

Data Contracts & Decentralized Data Ownership, and more

Aug 21, 2022

I’m thrilled to write about Schemata, a decentralized schema modeling framework for data contracts. Oh, wait, all the jargon, what is it? Let me take you all on the Schemata journey. You can find the source code and the documentation in the GitHub repo below.

GitHub Repo: https://github.com/ananthdurai/schemata

Why do we need Schemata?

The TV Remote Under The Couch Problem:

Let’s first admit it: the era of running the “Data Warehouse in a Box” is history. Cloud data warehouses and the modern data stack have significantly reduced infrastructure complexity and eased access to data. Data warehouses are now more developer-friendly, which makes it easier to ask questions, transform the data, and derive insights.

The simplification of data processing also brings a problem, or, as my good friend Tamás Németh puts it, “The TV Remote Under the Couch Problem.”

You want to watch TV, and you have misplaced the remote under the couch. If you find it, you’re lucky. If not, you ask if anyone has seen the remote. If nothing works out, you buy a new remote. In most cases, that is a single-click “Buy Now.”

Data in the modern data stack is like your missing remote under the couch. If you can’t find the data you need, writing a new dbt transformation job is relatively quick, but it carries a significant cost. We can’t keep buying TV remotes, because doing so degrades the Benefit-Cost Ratio (BCR).

The Garbage In, Garbage Out (GIGO) problem:

The TV Remote Under the Couch problem has a far more dangerous consequence: it creates a Garbage In, Garbage Out model.

Until Schemata, there was no systematic way to measure the integrity of a data model. We keep building new data models with no feedback loop to balance the cost and integrity of the data assets, and every unchecked model feeds more garbage downstream.

The GIGO problem is like a virus in an organization’s knowledge management system, which is a significant business differentiator for many companies.

The Producer-Consumer problem in Data Lake:

The Data Lake (or LakeHouse) has become the de facto architecture pattern to source events and produce analytical insights. The Data Lake inherently creates a producer-consumer relationship between the product feature team and the data engineering team. As the Data Lake grows, so does the complexity of data management. Let’s take an everyday data flow in a typical data management setup.

The data producer generates data for the product feature they develop and sends it to the data lake (as Protobuf, Avro, or Thrift if you’re lucky, or as JSON if you like data adventures).
The consumers down the line have no understanding of the producer’s domain and struggle to make sense of the data in the data lake.
The consumers then reach out to the producer’s domain expert to understand the data. The domain expert may not have the context, or the human knowledge may no longer be available.

The Data Lake turns into technical debt rather than a strategic advantage: it becomes trash storage instead of a store of data assets.

How does Schemata solve the Garbage-In Garbage-Out (GIGO) problem?

Schemata Enables Domain-Oriented Data Ownership

Schemata focuses on treating data as a product. The feature team that works on the product feature has the domain understanding of the data; the data's consumer does not. Schemata gives data ownership to the feature team, which creates the data, attaches metadata, catalogs it, and stores it for easier consumption.

Curating and cataloging the data at the point of creation brings more visibility and makes the data easier to consume. The process also eliminates human knowledge silos and truly democratizes the data. It lets data consumers stop worrying about data discovery and focus on producing value from the data.
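To make the idea concrete, here is a minimal Python sketch of what attaching ownership metadata at creation time could look like. The `SchemaDescriptor` class, its field names, and the `missing_metadata` check are hypothetical illustrations, not Schemata's actual API; the repo documents how Schemata really annotates schemas.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaDescriptor:
    """Hypothetical schema record; not Schemata's real data model."""
    name: str
    owner: str = ""          # feature team that produces the data
    domain: str = ""         # business domain the data belongs to
    description: str = ""    # human-readable meaning of the data
    fields: dict = field(default_factory=dict)  # field name -> field description


def missing_metadata(schema: SchemaDescriptor) -> list:
    """Return the names of metadata entries the producer left empty."""
    problems = [attr for attr in ("owner", "domain", "description")
                if not getattr(schema, attr).strip()]
    problems += [f"fields.{name}" for name, doc in schema.fields.items()
                 if not doc.strip()]
    return problems


order_placed = SchemaDescriptor(
    name="OrderPlaced",
    owner="checkout-team",
    domain="orders",
    description="Emitted when a customer completes checkout.",
    fields={"order_id": "Unique order identifier", "user_id": ""},
)

print(missing_metadata(order_placed))  # ['fields.user_id']
```

A check like this could run when the producer proposes a schema change, so the gap is caught by the team that has the domain knowledge rather than by a consumer months later.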

Schemata Facilitates Decentralized Data Modeling

Traditionally, upfront data modeling comes with a cost. A centralized data architecture or modeling team often coordinates with multiple teams to design an enterprise data model. It is hard for any one person to hold the entire company's data architecture in their head, and data modeling tools often don't reflect the current state of the data model. Decentralized data modeling is the only scalable approach, and Schemata enables a bottom-up, crowdsourced data modeling approach to democratize data access in an organization.

Schemata Brings DevOps Principles to Data Modeling

The decentralized data modeling principle brings a unique collaborative approach to managing the data asset's lifecycle. It brings proven DevOps principles like ownership, accountability, collaboration, automation, continuous improvement, and customer-centric action to data management.

Schemata Enforces the Connectivity & Integrity of the Data Model

Data is inherently social in nature.

A significant challenge of decentralized data management is that a lack of connectivity among data sets degrades their usability. Schemata is an opinionated data modeling framework that programmatically measures the connectivity of the data model and assigns a score to it. We call this the Schemata Score.

Observability metrics like SLOs and the Apdex score inspired the Schemata Score. A lower Schemata Score means the data model is less connected. The score lets teams collaboratively fix the data model and bring uniformity to the data.
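As a rough illustration of scoring connectivity, the Python sketch below treats the data model as a graph where schemas are nodes and references between them are edges. The function name `connectivity_scores`, the sample schema names, and the ratio itself (linked schemas divided by all other schemas) are assumptions made for this example; they are not Schemata's actual formula.

```python
def connectivity_scores(references: dict) -> dict:
    """Score each schema by the share of other schemas it is linked to.

    `references` maps a schema name to the set of schema names it refers to.
    The ratio used here is an illustrative assumption, not Schemata's formula.
    """
    # Collect every schema that appears as a source or as a target.
    names = set(references)
    for targets in references.values():
        names |= set(targets)

    # Treat references as undirected links and ignore self-references.
    neighbors = {name: set() for name in names}
    for source, targets in references.items():
        for target in targets:
            if target != source:
                neighbors[source].add(target)
                neighbors[target].add(source)

    others = max(len(names) - 1, 1)  # avoid dividing by zero for a single schema
    return {name: round(len(linked) / others, 2)
            for name, linked in sorted(neighbors.items())}


# Hypothetical model: one event references two entities, one log stands alone.
model = {
    "User": set(),
    "Product": set(),
    "OrderPlaced": {"User", "Product"},
    "LegacyClickLog": set(),
}

for name, score in connectivity_scores(model).items():
    print(f"{name}: {score}")
```

In this toy model, `LegacyClickLog` scores 0.0 because nothing links to or from it, which is the kind of isolated data set a low score is meant to surface.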

Conclusion:

Since the inception of Hadoop & MapReduce, the data engineering community has focused heavily on commoditizing data transformation. Abstractions like Hive, Pig, Crunch, et al. were built on top of Hadoop to further simplify data transformation. From Apache Spark to dbt, the data engineering community has made a significant leap forward in simplifying data transformation.

However, during this time, we made little to no improvement in data modeling & data contracts to build a proper knowledge management system. A data catalog acts like a search index for data assets. There are many exciting breakthroughs in the data discovery space, but they are not good enough to build a systematic feedback loop and knowledge graph.

Is Schemata the savior? Honestly, I don’t know the answer, and I’m looking forward to hearing from you all on this. But I know that a Schemata-like feedback loop is missing in the modern data stack.

Curious to know more about Schemata?