Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Editor’s Note: Who Owns the Data Contract?
The tweet about Data Quality & Data Contract ownership triggered an interesting conversation. I sat down with David Jayatillake & Kevin Hu of Metaplane to discuss the ownership and organizational dynamics.
https://www.linkedin.com/video/event/urn:li:ugcPost:7036005867741159424/
The conversation also led to an interesting thread in the data-folks Mastodon channel. Link to the conversation:
https://techhub.social/@datacequia/109966388223034317
Stanford HAI: Generative AI - Perspectives from Stanford HAI
ChatGPT can write an essay, Midjourney can create beautiful illustrations, and MusicLM can compose a jingle. On the one hand, these models may seamlessly complement human labor, making us more productive and creative; on the other, they could amplify the biases we already experience or undermine our trust in information. Stanford HAI published its perspective on Generative AI in this extensive report.
https://hai.stanford.edu/generative-ai-perspectives-stanford-hai
Meta: Four Analytics Best Practices We Adopted — and Why You Should Too
Meta writes about four analytics best practices it adopted to ensure trustworthy and responsible data-driven decisions across the company. At the core of these practices is Meta’s Ground Truth Maturity Framework (GTMF).
Google: Datasets at your fingertips in Google Search
Easy access to datasets solves 80% of the problem in data engineering. Google’s Dataset Search, a dedicated search engine for datasets, powers this feature and indexes more than 45 million datasets from more than 13,000 websites.
https://ai.googleblog.com/2023/02/datasets-at-your-fingertips-in-google.html
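Dataset Search discovers datasets by crawling schema.org Dataset markup embedded in publishers’ pages. A minimal sketch in Python of generating that JSON-LD for a page; the dataset details are placeholders:

```python
# Generate schema.org "Dataset" JSON-LD, the markup Dataset Search indexes.
# All names, URLs, and values below are illustrative placeholders.
import json

dataset_markup = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "City Bike Trips 2022",
    "description": "Anonymized bike-share trips for 2022.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [{
        "@type": "DataDownload",
        "encodingFormat": "CSV",
        "contentUrl": "https://example.com/trips-2022.csv",
    }],
}

# Embed this tag in the page's HTML so crawlers can pick it up.
print(f'<script type="application/ld+json">{json.dumps(dataset_markup)}</script>')
```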
Netflix: Data ingestion pipeline with Operation Management
Netflix writes about a unique challenge in its annotation pipeline: the need to support multiple runs of the same annotation tasks. In a typical date-versioned partition table, we overwrite the partition or swap the version from one bucket location to another; Netflix, however, requires these outputs to be searchable and findable as soon as the job finishes. The blog narrates how they overcame the challenge with a combination of Cassandra & Elasticsearch.
https://netflixtechblog.medium.com/data-ingestion-pipeline-with-operation-management-3c5c638740a8
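A minimal sketch of the dual-write idea, assuming an illustrative schema rather than Netflix’s actual one: persist each annotation run in Cassandra keyed by task and run id, and index it into Elasticsearch so the output is searchable the moment the job finishes.

```python
import uuid

from cassandra.cluster import Cluster
from elasticsearch import Elasticsearch

# Illustrative keyspace, table, and index names; not Netflix's real schema.
cassandra = Cluster(["127.0.0.1"]).connect("annotations")
es = Elasticsearch("http://localhost:9200")

def publish_annotation_run(task_id: str, payload: dict) -> str:
    run_id = str(uuid.uuid4())
    # Each run is a new row, so re-running a task never overwrites
    # earlier outputs.
    cassandra.execute(
        "INSERT INTO annotation_runs (task_id, run_id, payload) VALUES (%s, %s, %s)",
        (task_id, run_id, str(payload)),
    )
    # Index immediately so the run is findable without waiting for a
    # partition swap in a warehouse table.
    es.index(index="annotation-runs", id=f"{task_id}:{run_id}", document=payload)
    return run_id
```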
Sponsored: [New Guide] The Ultimate Guide to Data Mesh Architecture
If implementing data mesh is high on your list of priorities, you’re not alone. As organizations scale their use of data, centralized architectures can prevent data teams from keeping pace with stakeholder demands and system needs. In this guide, learn from the strategies deployed by leading data teams that have successfully implemented data mesh.
Get The Guide
Peter Bruins: Some reflections on talking with Data leaders
Data Mesh, Data Product, and Data Contract are all concepts trying to address the same underlying problem, and it is a billion-dollar problem to solve. The author leaves us with a bigger question: ownership plays a central role in all these concepts, but what is the incentive to take ownership?
https://www.linkedin.com/pulse/some-reflections-talking-data-leaders-peter-bruins/
Faire: The great migration from Redshift to Snowflake
Is Redshift dying? I’m seeing an increasing pattern of people migrating from Redshift to Snowflake or a Lakehouse. Faire wrote a detailed blog on the reasoning behind its Redshift-to-Snowflake migration, the journey, and its key takeaways.
https://craft.faire.com/the-great-migration-from-redshift-to-snowflake-173c1fb59a52
Faire also open-sourced some of its utility scripts to make moving from Redshift to Snowflake easier.
https://github.com/Faire/snowflake-migration
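To give a feel for the kind of translation such a migration involves, here is an illustrative sketch, not Faire’s actual tooling, that maps a few Redshift-specific column types to Snowflake equivalents and drops table properties Snowflake doesn’t have:

```python
import re

# Redshift-to-Snowflake type mapping (a small, illustrative subset).
TYPE_MAP = {
    r"\bSUPER\b": "VARIANT",    # semi-structured data
    r"\bVARBYTE\b": "BINARY",
    r"\bBPCHAR\b": "CHAR",
    r"\bFLOAT4\b": "FLOAT",
    r"\bINT8\b": "BIGINT",
}

def translate_ddl(redshift_ddl: str) -> str:
    """Rewrite Redshift column types and properties into Snowflake DDL."""
    snowflake_ddl = redshift_ddl
    for pattern, replacement in TYPE_MAP.items():
        snowflake_ddl = re.sub(pattern, replacement, snowflake_ddl, flags=re.IGNORECASE)
    # DISTSTYLE/DISTKEY/SORTKEY have no Snowflake equivalent; drop them.
    snowflake_ddl = re.sub(
        r"\b(DISTSTYLE \w+|DISTKEY\s*\([^)]*\)|SORTKEY\s*\([^)]*\))",
        "", snowflake_ddl, flags=re.IGNORECASE,
    )
    return snowflake_ddl

print(translate_ddl("CREATE TABLE t (id INT8, doc SUPER) DISTKEY(id) SORTKEY(id);"))
```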
Sponsored: Three architectures to make your data stack more efficient in 2023
RudderStack details three practical ways to make your data stack more cost-effective and your data team more efficient.
Eliminating integration engineering work.
Using your data warehouse instead of an expensive CDP.
Using APIs, instead of expensive 3rd party services, for data enrichment.
It includes details on an interesting use of a data transformation (written in Python), a webhook, and an internal signup API to streamline app signups from their marketing site. A key highlight of the blog:
Their data team leveraged a cost-effective geolocation API to solve the problem. In a RudderStack Transformation, they passed the user’s IP address to the service and appended the returned region to the payload, which was passed into a custom field in the marketing platform. The marketing team was then able to automatically segment users into regional lists and trigger location-based offers in real-time.
https://www.rudderstack.com/blog/three-architectures-to-make-your-data-stack-more-efficient-in-2023/
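A minimal sketch of that enrichment pattern, assuming RudderStack’s Python transformation interface (transformEvent) and a hypothetical geolocation endpoint; the blog’s actual service and field names may differ:

```python
import requests

GEO_API = "https://geo.example.com/lookup"  # hypothetical endpoint

def transformEvent(event, metadata):
    # Pull the user's IP from the event context, if present.
    ip = event.get("context", {}).get("ip")
    if ip:
        resp = requests.get(GEO_API, params={"ip": ip}, timeout=2)
        if resp.ok:
            # Append the resolved region so the marketing platform can
            # segment users into regional lists downstream.
            event.setdefault("traits", {})["region"] = resp.json().get("region")
    return event
```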
Oda: Data as a product at Oda
Oda writes an exciting blog about “Data as a Product,” describing why we must treat data as a product, what it means to treat a dashboard as a product, and the ownership model for data products.
https://medium.com/oda-product-tech/data-as-a-product-at-oda-fda97695e820
The blog highlights six key principles of the value creation of data.
Domain knowledge + discipline expertise
Distributed Data Ownership and shared Data Ownership
Data as a Product
Enablement over Handover
Impact through Exploration and Experimentation
Proactive attitude towards Data Privacy & Ethics
HubSpot: Saving Millions on Logging: Finding & Delivering Savings
Sure, storage is cheap, but how do you define cheap? I see a pattern of increased attention to optimizing storage costs by applying efficient compression. Uber has written about cost efficiency at scale in big data file formats. HubSpot writes about one such saving from converting its log format from JSON to Snappy-compressed ORC.
https://product.hubspot.com/blog/savings-logging-part1
https://product.hubspot.com/blog/savings-logging-part2
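A back-of-the-envelope sketch of the conversion HubSpot describes: reading newline-delimited JSON logs and rewriting them as Snappy-compressed ORC with PySpark. Paths are placeholders, and HubSpot’s actual pipeline is certainly more involved.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-orc").getOrCreate()

# One JSON object per line; Spark infers the schema on read.
logs = spark.read.json("s3://logs/raw/2023-03-01/")

(
    logs.write
    .option("compression", "snappy")  # Snappy trades some ratio for fast decode
    .orc("s3://logs/orc/2023-03-01/")
)
```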
Data Council - Austin 2023 Discount Code
Data Council - Austin 2023 is nearing, and I’m excited to meet all the data practitioners in person. Data Engineering Weekly readers can use the DataWeekly20 promo code to get a 20% discount on the ticket price.
Link to Register: https://www.datacouncil.ai/austin
Promo Code: DataWeekly20
Expedia: Market Segmentation for Geo-Testing at Scale
Accurately measuring the effect of digital campaigns has become harder with privacy changes initiated by Apple, the decline of third-party cookie data, increased incognito browsing, information loss from cross-device usage, and multiple touches along the customer journey. The answer, according to Meta, is geo-testing. Expedia writes about how it runs market segmentation for geo-testing at scale, common geo-testing challenges, and how market segmentation resolves them.
https://medium.com/expedia-group-tech/market-segmentation-for-geo-testing-at-scale-8d593e0aa755
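A toy sketch of matched-market assignment, one common way to build geo-test segments: sort markets by a pre-period metric, pair neighbors, and randomize each pair into treatment and control. Expedia’s actual method is more sophisticated; this only illustrates the core idea, and all market data is made up.

```python
import random

# Pre-period bookings per market (illustrative values).
markets = {"AUS": 120.0, "DEN": 118.5, "PHX": 240.0, "SEA": 236.0}

# Sort by the pre-period metric so adjacent markets behave similarly.
ordered = sorted(markets, key=markets.get)

treatment, control = [], []
for a, b in zip(ordered[::2], ordered[1::2]):
    # Randomize within each matched pair so assignment stays unbiased.
    t, c = random.sample([a, b], 2)
    treatment.append(t)
    control.append(c)

print("treatment:", treatment, "control:", control)
```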
Indicium Engineering: audit_helper in dbt - bringing data auditing to a higher level
Evolving a model from one version to another or migrating it to a different target is inevitable in a data pipeline, and validating that the new output still matches the old is the hard part. Indicium writes about how it uses dbt’s audit_helper package to automate that validation.
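audit_helper’s compare_relations macro essentially diffs two versions of a model. A minimal pandas sketch of the same idea, assuming the old and new outputs share a primary key; table and column names are illustrative.

```python
import pandas as pd

# Stand-ins for the old and new versions of a model's output.
old = pd.DataFrame({"id": [1, 2, 3], "amount": [10, 20, 30]})
new = pd.DataFrame({"id": [1, 2, 4], "amount": [10, 25, 40]})

# An outer join with indicator=True tells us where each row came from.
merged = old.merge(new, on="id", how="outer", suffixes=("_old", "_new"), indicator=True)

only_old = merged[merged["_merge"] == "left_only"]    # rows dropped by the new model
only_new = merged[merged["_merge"] == "right_only"]   # rows added by the new model
changed = merged[
    (merged["_merge"] == "both") & (merged["amount_old"] != merged["amount_new"])
]

print(f"dropped={len(only_old)} added={len(only_new)} changed={len(changed)}")
```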
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.