Data Engineering Weekly #121

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Mar 6
Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.


Editor’s Note: Who Owns the Data Contract?

@ananthdurai
Shift Left in Data Quality: The producer should own the data quality is a misconception. Producers are Accountable for implementing quality rules, whereas the Consumers are Responsible for Data Quality.
11:14 AM ∙ Feb 20, 2023

The tweet about Data Quality and Data Contract ownership triggered an interesting conversation. I sat down with David Jayatillake and Kevin Hu of Metaplane to discuss ownership and organizational dynamics.

https://www.linkedin.com/video/event/urn:li:ugcPost:7036005867741159424/

The conversation also led to an interesting thread in the data-folks Mastodon community. Link to the conversation:

https://techhub.social/@datacequia/109966388223034317

Discussion with Andrew Padilla

Stanford HAI: Generative AI - Perspectives from Stanford HAI

ChatGPT might write an essay, Midjourney could create beautiful illustrations, or MusicLM could compose a jingle. On the one hand, they may seamlessly complement human labor, making us more productive and creative; on the other, they could amplify the bias we already experience or undermine our trust in information. Stanford HAI published its perspective on Generative AI in this extensive report.

https://hai.stanford.edu/generative-ai-perspectives-stanford-hai


Meta: Four Analytics Best Practices We Adopted — and Why You Should Too

Meta writes about four analytical best practices that help ensure trustworthy, responsible, data-driven decisions across the company. The practices are grounded in Meta’s Ground Truth Maturity Framework (GTMF).

https://medium.com/@AnalyticsAtMeta/four-analytics-best-practices-we-adopted-and-why-you-should-too-a1058ce5f8af


Google: Datasets at your fingertips in Google Search

Easy access to datasets solves 80% of the problem in data engineering. Google writes about Dataset Search, a dedicated search engine that powers the datasets feature in Google Search and indexes more than 45 million datasets from more than 13,000 websites.

https://ai.googleblog.com/2023/02/datasets-at-your-fingertips-in-google.html


Netflix: Data ingestion pipeline with Operation Management

Netflix writes about a unique challenge in its annotation pipeline: the need to support multiple runs of the same annotation task. In a typical date-versioned partition table, you override the partition or swap the version from one bucket location to another; however, Netflix requires these outputs to be searchable and findable as soon as the job finishes. The blog narrates how they overcame the challenge with a combination of Cassandra and Elasticsearch.

https://netflixtechblog.medium.com/data-ingestion-pipeline-with-operation-management-3c5c638740a8
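The "swap the version from one bucket location to another" idea can be sketched in miniature. This is a toy illustration, not Netflix's implementation: each logical partition points at a versioned storage location, and a finished job publishes by repointing the partition, so readers pick up the new output immediately.

```python
# Toy sketch (not Netflix's actual system): a pointer table from logical
# partition to its current versioned bucket location. A rerun of the same
# task publishes a new version by repointing the partition in one step.

partition_pointer = {}  # logical partition -> current bucket location

def publish(partition: str, version: int) -> None:
    """Repoint a partition at a newly written versioned location."""
    partition_pointer[partition] = f"s3://bucket/{partition}/v{version}/"

def resolve(partition: str) -> str:
    """Readers resolve the pointer at query time."""
    return partition_pointer[partition]

publish("date=2023-03-01", 1)
publish("date=2023-03-01", 2)  # rerun of the same task supersedes v1
print(resolve("date=2023-03-01"))  # s3://bucket/date=2023-03-01/v2/
```

The point of the pattern is that old outputs are never rewritten in place; only the pointer moves, which is what makes repeated runs of the same task cheap to publish.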


Sponsored: [New Guide] The Ultimate Guide to Data Mesh Architecture

If implementing data mesh is high on your list of priorities, you’re not alone. As organizations scale their use of data, centralized architectures can prevent data teams from keeping pace with stakeholder demands and system needs. In this guide, learn through strategies deployed by leading data teams that have successfully implemented data mesh.
Get The Guide


Peter Bruins: Some reflections on talking with Data leaders

@ananthdurai
As the size of the organization grows, data maturity shrinks. The complexity of the data outgrows its usability. Has anyone seen this pattern? Curious to know data folks' thoughts on it.
6:52 PM ∙ Feb 10, 2022

Data Mesh, Data Product, and Data Contract are all concepts trying to address this problem, and it is a billion-dollar problem to solve. The author leaves a bigger question: ownership plays a central role in all these concepts, but what is the incentive to take ownership?

https://www.linkedin.com/pulse/some-reflections-talking-data-leaders-peter-bruins/


Faire: The great migration from Redshift to Snowflake

Is Redshift dying? I’m seeing an increasing pattern of people migrating from Redshift to Snowflake or a Lakehouse. Faire wrote a detailed blog on the reasoning behind its Redshift-to-Snowflake migration, the journey, and the key takeaways.

https://craft.faire.com/the-great-migration-from-redshift-to-snowflake-173c1fb59a52

Faire also open-sourced some utility scripts to make it easier to move from Redshift to Snowflake:

https://github.com/Faire/snowflake-migration


Sponsored: Three architectures to make your data stack more efficient in 2023

RudderStack details three practical ways to make your data stack more cost-effective and your data team more efficient.

  1. Eliminating integration engineering work.

  2. Using your data warehouse instead of an expensive CDP.

  3. Using APIs, instead of expensive 3rd party services, for data enrichment.

It includes details on an interesting use of a data transformation (written in Python), a webhook, and an internal signup API to streamline app signups from their marketing site. A key highlight of the blog:

Their data team leveraged a cost-effective geolocation API to solve the problem. In a RudderStack Transformation, they passed the user’s IP address to the service and appended the returned region to the payload, which was passed into a custom field in the marketing platform. The marketing team was then able to automatically segment users into regional lists and trigger location-based offers in real-time.

https://www.rudderstack.com/blog/three-architectures-to-make-your-data-stack-more-efficient-in-2023/
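The enrichment step described above can be sketched as a small transformation. This is a hedged illustration, not RudderStack's actual API: `geolocate` stands in for the paid geolocation service, stubbed here with a hypothetical IP-prefix lookup.

```python
# Sketch of the enrichment pattern: look up a region from the event's IP
# and append it to the payload as a custom field. `geolocate` is a stub for
# a real geolocation API; the prefix table is purely illustrative.

def geolocate(ip: str) -> str:
    """Stub for a geolocation API call: map an IP address to a region."""
    region_by_prefix = {"203.": "APAC", "81.": "EMEA", "12.": "AMER"}
    for prefix, region in region_by_prefix.items():
        if ip.startswith(prefix):
            return region
    return "UNKNOWN"

def transform_event(event: dict) -> dict:
    """Append the looked-up region to the payload without mutating it."""
    enriched = dict(event)
    enriched["region"] = geolocate(event.get("ip", ""))
    return enriched

signup = {"userId": "u-42", "ip": "81.2.69.142"}
print(transform_event(signup)["region"])  # EMEA for an 81.* address
```

Downstream, the marketing platform only sees the enriched field, which is what lets it segment users into regional lists without its own geolocation logic.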


Oda: Data as a product at Oda

Oda writes an exciting blog about “Data as a Product,” describing why we must treat data as a product and dashboards as products, and the ownership model for data products.

https://medium.com/oda-product-tech/data-as-a-product-at-oda-fda97695e820

The blog highlights six key principles for value creation with data:

  1. Domain knowledge + discipline expertise

  2. Distributed Data Ownership and shared Data Ownership

  3. Data as a Product

  4. Enablement over Handover

  5. Impact through Exploration and Experimentation

  6. Proactive attitude towards Data Privacy & Ethics

https://medium.com/oda-product-tech/the-six-principles-for-how-we-run-data-insight-at-oda-ba7185b5af39


Hubspot: Saving Millions on Logging: Finding & Delivering Savings

Sure, storage is cheap, but how do you define cheap? I see increasing attention to optimizing storage costs by applying efficient compression and file formats; Uber has written about cost efficiency at scale in big data file formats. HubSpot writes about one such saving from converting its log format from JSON to Snappy-compressed ORC.

https://product.hubspot.com/blog/savings-logging-part1

https://product.hubspot.com/blog/savings-logging-part2
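The size win comes largely from how repetitive log data is. This is not HubSpot's Snappy+ORC pipeline; it is a stdlib-only illustration of the underlying effect, using zlib on JSON lines with repeated keys to show how much such data shrinks under compression.

```python
# Illustration only: repetitive JSON log lines compress dramatically, which
# is the effect a columnar format plus Snappy exploits at much larger scale.
import json
import zlib

# Purely illustrative log records with heavily repeated keys and values.
records = [
    {"level": "INFO", "service": "checkout", "msg": "request served", "ms": i % 50}
    for i in range(1000)
]
raw = "\n".join(json.dumps(r) for r in records).encode()
compressed = zlib.compress(raw, level=6)

ratio = len(raw) / len(compressed)
print(f"raw={len(raw)}B compressed={len(compressed)}B ratio={ratio:.1f}x")
```

A columnar format like ORC goes further than this sketch by grouping each field's values together before compressing, so the compressor sees even longer repetitive runs.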


Data Council - Austin 2023 Discount Code

Data Council - Austin 2023 is nearing, and I’m excited to meet all the data practitioners in person. Data Engineering Weekly readers can use the DataWeekly20 promo code to get a 20% discount on the ticket price.

Link to Register: https://www.datacouncil.ai/austin

Promo Code: DataWeekly20


Expedia: Market Segmentation for Geo-Testing at Scale

Accurately measuring the effect of digital campaigns has been hampered by privacy changes initiated by Apple, a decline in third-party cookie data, increased usage of incognito browsing, information loss due to cross-device usage, and multiple touches along the customer journey. The answer, according to Meta, is Geo-Testing. Expedia writes about how it runs market segmentation for Geo-Testing at scale, common Geo-Testing challenges, and how to use market segmentation to resolve them.

https://medium.com/expedia-group-tech/market-segmentation-for-geo-testing-at-scale-8d593e0aa755
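A common geo-testing move, sketched here under my own assumptions rather than as Expedia's method, is to pair markets with similar baseline metrics and split each pair into test and control, so the two groups start out balanced.

```python
# Hypothetical sketch: pair markets by baseline metric, then assign one of
# each pair to test and one to control. Market names and numbers are made up.

markets = {"AUS": 120, "DFW": 118, "SEA": 95, "PDX": 93, "NYC": 300, "BOS": 290}

# Rank by the baseline metric and pair adjacent markets.
ranked = sorted(markets, key=markets.get)
pairs = [ranked[i:i + 2] for i in range(0, len(ranked), 2)]

test_group = [p[0] for p in pairs]
control_group = [p[1] for p in pairs if len(p) > 1]
print(pairs)       # [['PDX', 'SEA'], ['DFW', 'AUS'], ['BOS', 'NYC']]
print(test_group)  # ['PDX', 'DFW', 'BOS']
```

Pairing before assignment is what keeps a huge market like NYC from landing in one group and skewing the comparison.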


Indicium Engineering: audit_helper in dbt - bringing data auditing to a higher level

Evolving a model from one version to another, or migrating it to another target, is inevitable in a data pipeline. Indicium writes about how it uses dbt’s audit_helper package to validate such changes.

https://medium.com/indiciumtech/audit-helper-in-dbt-bringing-data-auditing-to-a-higher-level-3afe0385cd5
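audit_helper itself is a dbt package that generates comparison SQL, but the core idea it implements can be sketched in plain Python: diff the old and new versions of a model row by row and report what matched and what changed. The table contents below are made up for illustration.

```python
# Stdlib sketch of the comparison audit_helper performs in SQL: count rows
# present in both versions of a model and surface rows unique to each side.
from collections import Counter

old_model = [("u1", 100), ("u2", 200), ("u3", 300)]
new_model = [("u1", 100), ("u2", 250), ("u4", 400)]

old_rows, new_rows = Counter(old_model), Counter(new_model)
in_both = old_rows & new_rows   # rows identical in both versions
only_old = old_rows - new_rows  # rows lost or changed by the refactor
only_new = new_rows - old_rows  # rows introduced or changed

total = sum((old_rows | new_rows).values())
print(f"matched: {sum(in_both.values())}/{total}")
print(f"only in old: {sorted(only_old)}")
print(f"only in new: {sorted(only_new)}")
```

Using a Counter rather than a set keeps duplicate rows honest, mirroring how a SQL comparison over full rows would count them.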


All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.

© 2023 Ananth Packkildurai