Data Engineering Weekly
Data Engineering Weekly Radio #120

By - Ananth & Ashwin

We are back with Data Engineering Weekly Radio for edition #120. In each episode, we take two or three articles from that week's Data Engineering Weekly edition and analyze them in depth.

From edition #120, we took the following articles:

Topic 1: Colin Campbell: The Case for Data Contracts - Preventative data quality rather than reactive data quality

In this episode, we focus on the importance of data contracts in preventing data quality issues. We discuss an article by Colin Campbell highlighting the need for a data catalog and the market scope for data contract solutions. We also touch on the idea that data creation will be a decentralized process and the role of tools like data contracts in enabling successful decentralized data modeling. We emphasize the importance of creating high-quality data and the need for technological and organizational solutions to achieve this goal.
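To make "preventative rather than reactive" concrete, here is a minimal sketch of a producer-side contract check. It is an illustration only, not the approach from Colin's article and not the Schemata API: the contract fields (`order_id`, `amount_cents`, `currency`) and the helpers `violations` and `emit` are hypothetical names invented for this example.

```python
# Hypothetical contract: field name -> required Python type.
# These fields are made up for illustration; a real contract would
# be agreed between the data producer and its consumers.
CONTRACT = {"order_id": str, "amount_cents": int, "currency": str}


def violations(record, contract=CONTRACT):
    """Return a list of contract violations; an empty list means the record conforms."""
    errors = []
    for name, expected in contract.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            errors.append(f"{name}: expected {expected.__name__}, got {actual}")
    return errors


def emit(record, sink):
    """Append the record to the sink only if it satisfies the contract."""
    errors = violations(record)
    if errors:
        # Preventative: the bad record is rejected at the producer,
        # before any downstream consumer can read it.
        raise ValueError("; ".join(errors))
    sink.append(record)
```

The point of the sketch is where the check runs: at the moment the data is created, so a violation fails loudly at the source instead of surfacing later in a downstream dashboard or model.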

Key highlights of the conversation

"Preventative data quality rather than reactive data quality. It should start with contracts." - Colin Campbell, author of the article.

"Contracts put a preventive structure in place" - Ashwin.

"The successful data-driven companies all do one thing very well. They create high-quality data." - Ananth.

Colin’s Substack
The Case for Data Contracts
Summary Companies lose large amounts of time and money due to data quality issues with data that feeds into a value-generating use (e.g., data quality issues with “production data”). After data is produced, it must travel through multiple uncoordinated software vendors that store and manipulate data, often resulting in poor data quality that breaks the s…

Ananth’s post on Schemata

Data Engineering Weekly
Introducing Schemata - A Decentralized Schema Modeling Framework For Modern Data Stack
I’m thrilled to write about Schemata, a decentralized schema modeling framework for data contracts. Oh, wait, all the jargon, what is it? Let me take you all on the Schemata journey. You can find the source code and the documentation here. GitHub Repo…

Topic 2: Yerachmiel Feltzman: Action-Position data quality assessment framework

In this conversation, we discuss the Action-Position framework for data quality assessment. The framework helps define what action to take based on the severity of the data quality problem. We also discuss two data quality patterns that differ in when the audit runs: Write-Audit-Publish (WAP) writes the data first, audits it, and then publishes it, while Audit-Write-Publish (AWP) audits the data before writing and publishing it. We encourage readers to share their best practices for addressing data quality issues.
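The difference between the two patterns is easiest to see side by side. The sketch below is a toy, in-memory illustration under our own assumptions, not the implementation from the article or from the Dremio webinar: the `Table` class, the `staging` and `published` areas, and the two audit rules are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Toy table with a staging area and a published area (both hypothetical)."""
    staging: list = field(default_factory=list)
    published: list = field(default_factory=list)


def audit(rows):
    """Return a list of problems; an empty list means the audit passed."""
    problems = []
    for i, row in enumerate(rows):
        if row.get("user_id") is None:
            problems.append(f"row {i}: user_id is null")
        amount = row.get("amount")
        if amount is None or amount < 0:
            problems.append(f"row {i}: amount is missing or negative")
    return problems


def write_audit_publish(table, rows):
    """WAP: write to staging first, audit what was written, then publish."""
    table.staging = list(rows)               # write
    problems = audit(table.staging)          # audit
    if problems:
        return problems                      # staged rows stay unpublished
    table.published.extend(table.staging)    # publish
    table.staging = []
    return []


def audit_write_publish(table, rows):
    """AWP: audit the incoming rows before anything is written."""
    problems = audit(rows)                   # audit
    if problems:
        return problems                      # nothing was written at all
    table.staging = list(rows)               # write
    table.published.extend(table.staging)    # publish
    table.staging = []
    return []
```

In this toy version, a failed WAP run leaves the bad rows in staging where they can be inspected, while a failed AWP run never writes them. In both cases consumers reading the published area never see rows that failed the audit.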

Are you using a data quality framework in your organization? Do you have any best practices for addressing data quality issues? What do you think of the Action-Position data quality framework? Please add your comments in the Substack chat.

https://medium.com/everything-full-stack/action-position-data-quality-assessment-framework-d833f6b77b7

Dremio WAP pattern: https://www.dremio.com/resources/webinars/the-write-audit-publish-pattern-via-apache-iceberg/


Topic 3: Guy Fighel - Stop emphasizing the Data Catalog

We discuss the limitations of data catalogs and the author’s view of the semantic layer as an alternative. The author argues that data catalogs are passive and quickly become outdated, and that a stronger contract with enforced data quality could be a better solution. We also highlight the cost factors of implementing a data catalog and suggest that a more decentralized approach may be necessary to keep up with the increasing number of data sources. Innovation in this space is needed to improve how organizations discover and consume data assets.

Something to think about in this conversation

"If you don't catalog everything and we only catalog what is required for the purpose of business decision-making, does that solve the data catalog problem in an organization?"

https://www.linkedin.com/pulse/stop-emphasizing-data-catalog-guy-fighel/

Data Engineering Weekly
Data Catalog - A Broken Promise
Data catalogs are the most expensive data integration systems you never intended to build. Data Catalog as a passive web portal to display metadata requires significant rethinking to adopt modern data workflow, not just adding “modern” in its prefix…

Appears in episode
Ananth Packkildurai
Aswin James Christy