Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Editor’s Note: Who Owns the Data Contract?
The tweet about Data Quality & Data Contract ownership triggered an interesting conversation. I sat down with David Jayatillake & Kevin Hu of Metaplane to discuss the ownership and organizational dynamics.
https://www.linkedin.com/video/event/urn:li:ugcPost:7036005867741159424/
The conversation also led to an interesting thread in the data-folks Mastodon channel. Link to the conversation:
https://techhub.social/@datacequia/109966388223034317
Stanford HAI: Generative AI - Perspectives from Stanford HAI
ChatGPT can write an essay, Midjourney can create beautiful illustrations, and MusicLM can compose a jingle. On the one hand, these models may seamlessly complement human labor, making us more productive and creative; on the other, they could amplify the biases we already experience or undermine our trust in information. Stanford HAI published its perspective on Generative AI in this extensive report.
https://hai.stanford.edu/generative-ai-perspectives-stanford-hai
Meta: Four Analytics Best Practices We Adopted — and Why You Should Too
Meta writes about four analytics best practices it adopted to ensure trustworthy and responsible data-driven decisions across the company. At the core of these practices is Meta’s Ground Truth Maturity Framework (GTMF).
Google: Datasets at your fingertips in Google Search
Easy access to datasets solves 80% of the problem in data engineering. Google’s Dataset Search, a dedicated search engine for datasets, powers this feature and indexes more than 45 million datasets from more than 13,000 websites.
https://ai.googleblog.com/2023/02/datasets-at-your-fingertips-in-google.html
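Dataset Search discovers datasets by crawling schema.org Dataset markup embedded in publishers’ pages. A minimal sketch in Python of generating that JSON-LD for a page; the dataset details are placeholders:

```python
# Generate schema.org "Dataset" JSON-LD, the markup Dataset Search indexes.
# All names, URLs, and values below are illustrative placeholders.
import json

dataset_markup = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "City Bike Trips 2022",
    "description": "Anonymized bike-share trips for 2022.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [{
        "@type": "DataDownload",
        "encodingFormat": "CSV",
        "contentUrl": "https://example.com/trips-2022.csv",
    }],
}

# Embed this tag in the page's HTML so crawlers can pick it up.
print(f'<script type="application/ld+json">{json.dumps(dataset_markup)}</script>')
```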
Netflix: Data ingestion pipeline with Operation Management
Netflix writes about a unique challenge in its annotation pipeline: the need to support multiple runs of the same annotation tasks. In a typical date-versioned partition table, we overwrite the partition or swap the version from one bucket location to another; Netflix, however, requires these outputs to be searchable and findable as soon as the job finishes. The blog narrates how they overcame the challenge with a combination of Cassandra & Elasticsearch.
https://netflixtechblog.medium.com/data-ingestion-pipeline-with-operation-management-3c5c638740a8
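A minimal sketch of the dual-write idea, assuming an illustrative schema rather than Netflix’s actual one: persist each annotation run in Cassandra keyed by task and run id, and index it into Elasticsearch so the output is searchable the moment the job finishes.

```python
import uuid

from cassandra.cluster import Cluster
from elasticsearch import Elasticsearch

# Illustrative keyspace, table, and index names; not Netflix's real schema.
cassandra = Cluster(["127.0.0.1"]).connect("annotations")
es = Elasticsearch("http://localhost:9200")

def publish_annotation_run(task_id: str, payload: dict) -> str:
    run_id = str(uuid.uuid4())
    # Each run is a new row, so re-running a task never overwrites
    # earlier outputs.
    cassandra.execute(
        "INSERT INTO annotation_runs (task_id, run_id, payload) VALUES (%s, %s, %s)",
        (task_id, run_id, str(payload)),
    )
    # Index immediately so the run is findable without waiting for a
    # partition swap in a warehouse table.
    es.index(index="annotation-runs", id=f"{task_id}:{run_id}", document=payload)
    return run_id
```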
Sponsored: [New Guide] The Ultimate Guide to Data Mesh Architecture
If implementing data mesh is high on your list of priorities, you’re not alone. As organizations scale their use of data, centralized architectures can prevent data teams from keeping pace with stakeholder demands and system needs. In this guide, learn from the strategies deployed by leading data teams that have successfully implemented data mesh.
Get The Guide
Peter Bruins: Some reflections on talking with Data leaders
Data Mesh, Data Product, and Data Contract are all concepts trying to address the same underlying problem, and it is a billion-dollar problem to solve. The author leaves us with a bigger question: ownership plays a central role in all these concepts, but what is the incentive to take ownership?
https://www.linkedin.com/pulse/some-reflections-talking-data-leaders-peter-bruins/
Faire: The great migration from Redshift to Snowflake
Is Redshift dying? I’m seeing an increasing pattern of people migrating from Redshift to Snowflake or a Lakehouse. Faire wrote a detailed blog on the reasoning behind its Redshift-to-Snowflake migration, the journey, and its key takeaways.
https://craft.faire.com/the-great-migration-from-redshift-to-snowflake-173c1fb59a52
Faire also open-sourced some of its utility scripts to make moving from Redshift to Snowflake easier.
https://github.com/Faire/snowflake-migration
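To give a feel for the kind of translation such a migration involves, here is an illustrative sketch, not Faire’s actual tooling, that maps a few Redshift-specific column types to Snowflake equivalents and drops table properties Snowflake doesn’t have:

```python
import re

# Redshift-to-Snowflake type mapping (a small, illustrative subset).
TYPE_MAP = {
    r"\bSUPER\b": "VARIANT",    # semi-structured data
    r"\bVARBYTE\b": "BINARY",
    r"\bBPCHAR\b": "CHAR",
    r"\bFLOAT4\b": "FLOAT",
    r"\bINT8\b": "BIGINT",
}

def translate_ddl(redshift_ddl: str) -> str:
    """Rewrite Redshift column types and properties into Snowflake DDL."""
    snowflake_ddl = redshift_ddl
    for pattern, replacement in TYPE_MAP.items():
        snowflake_ddl = re.sub(pattern, replacement, snowflake_ddl, flags=re.IGNORECASE)
    # DISTSTYLE/DISTKEY/SORTKEY have no Snowflake equivalent; drop them.
    snowflake_ddl = re.sub(
        r"\b(DISTSTYLE \w+|DISTKEY\s*\([^)]*\)|SORTKEY\s*\([^)]*\))",
        "", snowflake_ddl, flags=re.IGNORECASE,
    )
    return snowflake_ddl

print(translate_ddl("CREATE TABLE t (id INT8, doc SUPER) DISTKEY(id) SORTKEY(id);"))
```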
Sponsored: Three architectures to make your data stack more efficient in 2023
RudderStack details three practical ways to make your data stack more cost-effective and your data team more efficient.
Eliminating integration engineering work.
Using your data warehouse instead of an expensive CDP.
Using APIs, instead of expensive 3rd party services, for data enrichment.
It includes details on an interesting use of a data transformation (written in Python), a webhook, and an internal signup API to streamline app signups from their marketing site. A key highlight of the blog:
Their data team leveraged a cost-effective geolocation API to solve the problem. In a RudderStack Transformation, they passed the user’s IP address to the service and appended the returned region to the payload, which was passed into a custom field in the marketing platform. The marketing team was then able to automatically segment users into regional lists and trigger location-based offers in real-time.
https://www.rudderstack.com/blog/three-architectures-to-make-your-data-stack-more-efficient-in-2023/
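A minimal sketch of that enrichment pattern, assuming RudderStack’s Python transformation interface (transformEvent) and a hypothetical geolocation endpoint; the blog’s actual service and field names may differ:

```python
import requests

GEO_API = "https://geo.example.com/lookup"  # hypothetical endpoint

def transformEvent(event, metadata):
    # Pull the user's IP from the event context, if present.
    ip = event.get("context", {}).get("ip")
    if ip:
        resp = requests.get(GEO_API, params={"ip": ip}, timeout=2)
        if resp.ok:
            # Append the resolved region so the marketing platform can
            # segment users into regional lists downstream.
            event.setdefault("traits", {})["region"] = resp.json().get("region")
    return event
```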
Oda: Data as a product at Oda
Oda writes an exciting blog about “Data as a Product,” describing why we must treat data as a product, what it means to treat a dashboard as a product, and the ownership model for data products.
https://medium.com/oda-product-tech/data-as-a-product-at-oda-fda97695e820
The blog highlights six key principles of the value creation of data.
Domain knowledge + discipline expertise
Distributed Data Ownership and shared Data Ownership
Data as a Product
Enablement over Handover
Impact through Exploration and Experimentation
Proactive attitude towards Data Privacy & Ethics
HubSpot: Saving Millions on Logging: Finding & Delivering Savings
Sure, storage is cheap, but how do you define cheap? I see a pattern of increased attention to optimizing storage costs by applying efficient compression. Uber has written about cost efficiency at scale in big data file formats. HubSpot writes about one such saving from converting its log format from JSON to Snappy-compressed ORC.
https://product.hubspot.com/blog/savings-logging-part1
https://product.hubspot.com/blog/savings-logging-part2
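A back-of-the-envelope sketch of the conversion HubSpot describes: reading newline-delimited JSON logs and rewriting them as Snappy-compressed ORC with PySpark. Paths are placeholders, and HubSpot’s actual pipeline is certainly more involved.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-orc").getOrCreate()

# One JSON object per line; Spark infers the schema on read.
logs = spark.read.json("s3://logs/raw/2023-03-01/")

(
    logs.write
    .option("compression", "snappy")  # Snappy trades some ratio for fast decode
    .orc("s3://logs/orc/2023-03-01/")
)
```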
Data Council - Austin 2023 Discount Code
Data Council - Austin 2023 is nearing, and I’m excited to meet all the data practitioners in person. Data Engineering Weekly readers can use the DataWeekly20 promo code to get a 20% discount on the ticket price.
Link to Register: https://www.datacouncil.ai/austin
Promo Code: DataWeekly20
Expedia: Market Segmentation for Geo-Testing at Scale
Accurately measuring the effect of digital campaigns has become harder with privacy changes initiated by Apple, the decline of third-party cookie data, increased incognito browsing, information loss from cross-device usage, and multiple touches along the customer journey. The answer, according to Meta, is geo-testing. Expedia writes about how it runs market segmentation for geo-testing at scale, common geo-testing challenges, and how market segmentation resolves them.
https://medium.com/expedia-group-tech/market-segmentation-for-geo-testing-at-scale-8d593e0aa755
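A toy sketch of matched-market assignment, one common way to build geo-test segments: sort markets by a pre-period metric, pair neighbors, and randomize each pair into treatment and control. Expedia’s actual method is more sophisticated; this only illustrates the core idea, and all market data is made up.

```python
import random

# Pre-period bookings per market (illustrative values).
markets = {"AUS": 120.0, "DEN": 118.5, "PHX": 240.0, "SEA": 236.0}

# Sort by the pre-period metric so adjacent markets behave similarly.
ordered = sorted(markets, key=markets.get)

treatment, control = [], []
for a, b in zip(ordered[::2], ordered[1::2]):
    # Randomize within each matched pair so assignment stays unbiased.
    t, c = random.sample([a, b], 2)
    treatment.append(t)
    control.append(c)

print("treatment:", treatment, "control:", control)
```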
Indicium Engineering: audit_helper in dbt - bringing data auditing to a higher level
Evolving a model from one version to another or migrating it to a different target is inevitable in a data pipeline, and validating that the new output still matches the old is the hard part. Indicium writes about how it uses dbt’s audit_helper package to automate that validation.
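audit_helper’s compare_relations macro essentially diffs two versions of a model. A minimal pandas sketch of the same idea, assuming the old and new outputs share a primary key; table and column names are illustrative.

```python
import pandas as pd

# Stand-ins for the old and new versions of a model's output.
old = pd.DataFrame({"id": [1, 2, 3], "amount": [10, 20, 30]})
new = pd.DataFrame({"id": [1, 2, 4], "amount": [10, 25, 40]})

# An outer join with indicator=True tells us where each row came from.
merged = old.merge(new, on="id", how="outer", suffixes=("_old", "_new"), indicator=True)

only_old = merged[merged["_merge"] == "left_only"]    # rows dropped by the new model
only_new = merged[merged["_merge"] == "right_only"]   # rows added by the new model
changed = merged[
    (merged["_merge"] == "both") & (merged["amount_old"] != merged["amount_new"])
]

print(f"dropped={len(only_old)} added={len(only_new)} changed={len(changed)}")
```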
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.