Data Engineering Weekly

Data Engineering Weekly #106

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Nov 7, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.


Editor’s Note: #TwitterMigration Mastodon & the definition of Data Contract

Last week I switched from Twitter to Mastodon. Thanks to David Jayatillake for setting up data-folks.masto.host. Honestly, I was a bit suspicious, but to my surprise, it is far better than I expected. I have had more high-quality engagement with data folks there than on Twitter, and without the distraction. If you’re a data professional, please join at data-folks.masto.host. I’m at ananth@data-folks.masto.host. I’m following most of the data professionals, so you can easily build your network from my following list.


Top of Mind on Data Contract

Over the last two weeks, a few data folks reached out to me about Data Contract and what it is. The term “Contract” is always a source of confusion. People think of a “Contract” in the traditional sense, as something static and bureaucratic. I often use the term “Schema Ops” for this very reason. Here is my definition of a Data Contract:

A data contract (Schema Ops) is not static or a one-time task. The data contract flow originates from the data producer. As adoption grows, the consumers start amending expectations and expect enrichment of their contracts. A data contract is a continuous and collaborative system because the business context and requirements won’t be static.

I plan to write a series of blogs on Schemata and Data Contract in the coming weeks. I know I told you this before, so George R. R. Martin kindly stepped in for me to give the update for my promised blog posts.


Uber: How Uber Optimizes the Timing of Push Notifications using ML and Linear Programming

In-house notifications are a significant lead generator for online commerce. Uber writes about the complexity of the problem statement and how it adopted linear programming (linear optimization) to achieve the best outcome.

https://www.uber.com/en-US/blog/how-uber-optimizes-push-notifications-using-ml/


Meta: Improving Instagram notification management with machine learning and causal inference

Meta writes about a similar system for improving notifications with ML. The blog discusses the tradeoff between the user experience and the CTR model for notifications, and the adoption of a causal inference model for notification management systems.

https://engineering.fb.com/2022/10/31/ml-applications/instagram-notification-management-machine-learning/


Pinterest: How Pinterest Leverages Realtime User Actions in Recommendation to Boost Homefeed Engagement Volume

Looping real-time user interaction events into the recommendation engine can significantly improve the user experience. Pinterest writes about one such system for its Homefeed and how it leverages real-time user actions in recommendations to boost Homefeed engagement volume.

https://medium.com/pinterest-engineering/how-pinterest-leverages-realtime-user-actions-in-recommendation-to-boost-homefeed-engagement-volume-165ae2e8cde8


eBay: Increase A/B Testing Power by Combining Experiments

eBay writes about its adoption of the weighted z-test, which can combine readouts (including p-values, lift, CI, etc.) from multiple independent experiments for the same hypothesis. I’m looking forward to reading more on this topic.

https://tech.ebayinc.com/engineering/increase-a-b-testing-power-by-combining-experiments/
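
To make the weighted z-test concrete, here is a minimal sketch of the textbook form (weighted Stouffer's method) for combining one-sided p-values from independent experiments. It is an illustration under stated assumptions, not eBay's implementation: the function name, the example p-values, and the choice of square-root-of-sample-size weights are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def weighted_z_test(p_values, weights):
    """Combine one-sided p-values from independent experiments.

    Textbook weighted Stouffer's method; every p-value must lie strictly
    between 0 and 1. The name and signature are illustrative only.
    """
    if not p_values or len(p_values) != len(weights):
        raise ValueError("need one weight per p-value")
    std_normal = NormalDist()
    # A small one-sided p-value maps to a large positive z-score.
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in p_values]
    # Weighted sum of z-scores, rescaled so it is again standard normal
    # under the null hypothesis.
    combined_z = sum(w * z for w, z in zip(weights, z_scores)) / sqrt(
        sum(w * w for w in weights)
    )
    combined_p = 1.0 - std_normal.cdf(combined_z)
    return combined_z, combined_p


# Hypothetical readouts: two experiments, neither significant at 0.05 alone.
p_values = [0.08, 0.11]
# Assumed weighting: square root of each experiment's sample size.
weights = [sqrt(10_000), sqrt(40_000)]
combined_z, combined_p = weighted_z_test(p_values, weights)
print(f"combined z = {combined_z:.3f}, combined p = {combined_p:.4f}")
```

Two experiments that each point the same way can reach significance together even when neither does alone, which is where the extra testing power comes from.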


Sponsored: [New eBook] The Ultimate Data Observability Platform Evaluation Guide

Considering investing in a data quality solution? Before you add another tool to your data stack, check out our latest guide for 10 things to consider when evaluating data observability platforms, including scalability, time to value, and ease of setup.

Access Your Free Copy for Data Engineering Weekly Readers


Trivago: Explore-exploit dilemma in the Ranking model

A fascinating read of the week about the Explore-Exploit dilemma in the ranking model.

In the context of Trivago, exploitation means showing users accommodations that have historically performed well. Exploration means showing accommodations that have never been shown to the user, in the hope of finding ones that perform better than those currently shown.

Trivago concludes that one can overcome this by combining classical approaches to exploration with model-based approaches to systematically identify the most promising inventory in the unknown pool.

https://tech.trivago.com/post/2022-11-04-explore-exploit-dilemma-in-ranking-model/
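
As a rough illustration of the dilemma, here is a sketch of epsilon-greedy, one of the classical exploration approaches. It is not Trivago's method: the function name, the click-through numbers, and the accommodation names are hypothetical.

```python
import random


def pick_accommodation(known_ctr, unexplored, epsilon, rng):
    """Epsilon-greedy choice between exploring and exploiting.

    known_ctr:  dict of accommodation -> historical click-through rate
    unexplored: list of accommodations never shown to the user
    epsilon:    probability of exploring on any given request
    """
    # Explore: with probability epsilon, show something never shown before.
    if unexplored and rng.random() < epsilon:
        return rng.choice(unexplored)
    # Exploit: otherwise show the best historical performer.
    return max(known_ctr, key=known_ctr.get)


# Hypothetical inventory.
known_ctr = {"hotel_a": 0.12, "hotel_b": 0.09}
unexplored = ["hotel_x", "hotel_y", "hotel_z"]
rng = random.Random(7)  # seeded only to make the demo repeatable

shown = [pick_accommodation(known_ctr, unexplored, 0.2, rng) for _ in range(10)]
print(shown)
```

Plain epsilon-greedy explores the unknown pool uniformly at random. A model-based approach would replace the random choice with a ranking of the unexplored inventory by predicted promise.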


Microsoft: How well do you know your Machine Learning models

Machine Learning increasingly drives important decisions in our lives, from credit scores to loan approval to where to eat and shop. But how well do we know the Machine Learning models?

Machine Learning (ML) model explainability is the practice of analyzing and surfacing the inner workings of a Machine Learning model or other "black box" algorithm to make it more transparent.

The blog narrates how the Azure InterpretML service can help us better understand ML models' predictions.

Part 1: https://medium.com/data-science-at-microsoft/how-well-do-you-know-your-machine-learning-models-part-1-of-2-35979512ceba

Part 2: https://medium.com/data-science-at-microsoft/how-well-do-you-know-your-machine-learning-models-part-2-of-2-c36e8184bab4


Sponsored: How HealthMatch built a HIPAA-compliant data stack with RudderStack and Customer.io

Learn how HealthMatch built a HIPAA-compliant data stack with Customer.io and RudderStack to reduce reliance on developers for messaging use cases. After only a week of implementation time, their team launched a targeted SMS campaign with the new stack that drove $130k in revenue within 24 hours. Register today and join live on Wednesday 11/9, at 12PT / 3ET.

https://www.rudderstack.com/events/how-healthmatchio-used-customerio-and-rudderstack-to-launch-their-new-business-model-in-24-hours/


Shailey Dash: Decision Trees Explained — Entropy, Information Gain, Gini Index, CCP Pruning

Continuing our quest to learn more about ML models, the author writes about how the Decision Tree works. Though Decision Trees look simple and intuitive, there is nothing straightforward about how the algorithm decides on splits and how tree pruning occurs. I learned a ton from this article.

https://towardsdatascience.com/decision-trees-explained-entropy-information-gain-gini-index-ccp-pruning-4d78070db36c
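
For a feel of how a split gets scored, here is a small sketch of the standard definitions of entropy, Gini index, and information gain. The function names and the toy labels are my own and are not taken from the article.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction from splitting parent into left and right."""
    n = len(parent)
    weighted_child_entropy = (
        len(left) / n * entropy(left) + len(right) / n * entropy(right)
    )
    return entropy(parent) - weighted_child_entropy


# Toy node with a 50/50 class mix, and a split that separates it perfectly.
parent = ["yes"] * 4 + ["no"] * 4
left, right = ["yes"] * 4, ["no"] * 4

print(entropy(parent), gini(parent), information_gain(parent, left, right))
```

A tree builder evaluates candidate splits this way and keeps the one with the largest gain (or the largest drop in Gini index).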


Abi Aryan: This has been such an excellent year for software system design in ML

It is indeed an excellent year for software system design for ML; as you may have noticed in this week's edition, most of the articles discuss ML system design. The author compiled some exciting papers on MLOps, and I am looking forward to reading more of them.

https://datadrivenbabe.substack.com/p/this-has-been-such-an-excellent-year


The Eternal Suffering of Data Practitioners: Part 1

As a data practitioner, it is inevitable to be exposed to a stream of requests from various stakeholders. How should one approach it systematically to elevate the data function and improve customer satisfaction? The author gives some valuable strategies for doing so.

https://pedram.substack.com/p/the-eternal-suffering-of-data-practitioners


Inventa: How we slimmed down Slim CI for dbt Cloud

Adopting any solution brings some flakiness that requires further optimization. Inventa writes about such challenges with dbt Cloud's CI/CD system and how it optimized it. TIL about Slim CI, and I am looking forward to reading more about it.

https://medium.com/building-inventa/how-we-slimmed-down-slim-ci-for-dbt-cloud-6a944e7574e2


Yousign: Snowflake RBAC Implementation with Permifrost

Identity and access management is a critical need for data infrastructure. There is a need for a lightweight solution in this space, and I am delighted to see Permifrost from GitLab. Yousign's team writes about how it adopted Permifrost in its infrastructure.

https://medium.com/yousign-engineering-product/snowflake-rbac-implementation-with-permifrost-3d30652825ad


Zapr: How We Enhanced Productivity of Zapr’s Data Platform and Saved Costs

For all the criticism of Hadoop, Hive, and their ecosystem, one thing they got right is the Hive metastore. Every data processing engine has one metadata store to integrate with. The cloud data warehouses and LakeHouse systems have broken that promise ever since, and it is a constant struggle to sync metadata across different systems.

Zapr talks about one such challenge with Hive metastore and Glue catalog and its approach to bringing efficiency.

https://kpskarthick1.medium.com/how-we-enhanced-productivity-of-zaprs-data-platform-and-saved-costs-5ab5f3a42aa8


All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.

1 Comment
Nitin Khaitan
Writes Nitin’s Newsletter
Nov 11, 2022

Thanks for sharing your thought about the data-driven organisation.

Below is a link to a good article about making a data-driven organisation: https://medium.com/towards-polyglot-architecture/design-thinking-toward-data-driven-organisation-473060f44feb

Could you share your thoughts as well?
