Data Engineering Weekly #106

The Weekly Data Engineering Newsletter

Nov 07, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Editor’s Note: #TwitterMigration Mastodon & the definition of Data Contract

Last week I switched from Twitter to Mastodon. Thanks to David Jayatillake for setting up data-folks.masto.host. Honestly, I was a bit suspicious, but to my surprise, it is far better than I expected. I’m having more high-quality engagement with data folks than I did on Twitter, and without the distraction. If you’re a data professional, please join at data-folks.masto.host. I’m at ananth@data-folks.masto.host. I follow most of the data professionals there, so you can easily build your network from my following list.

Top of Mind on Data Contract

Over the last two weeks, a few data folks reached out to me about Data Contract and what it is. The term “Contract” is always a source of confusion. People think of a “Contract” in the traditional sense, as something static and bureaucratic. I often use the term “Schema Ops” for this very reason. Here is my definition of a Data Contract:

A data contract (Schema Ops) is not static or a one-time task. The data contract flow originates from the data producer. As adoption grows, the consumers start amending their expectations and expect enrichment of their contracts. A data contract is a continuous and collaborative system because the business context and requirements won’t be static.

I plan to write a series of blogs on Schemata and Data Contract in the coming weeks. I know I told you this before, so George R. R. Martin kindly stepped in for me to give the update for my promised blog posts.

Uber: How Uber Optimizes the Timing of Push Notifications using ML and Linear Programming

In-House notifications are a significant lead generator for online commerce. Uber writes about the complexity of the problem statement and how it adopted linear programming (linear optimization) to achieve the best outcome.

https://www.uber.com/en-US/blog/how-uber-optimizes-push-notifications-using-ml/

Meta: Improving Instagram notification management with machine learning and causal inference

Meta writes about a similar system for improving notifications with ML. The blog discusses the tradeoff between the user experience and the CTR model for notifications, and the adoption of a causal inference model for notification management systems.

https://engineering.fb.com/2022/10/31/ml-applications/instagram-notification-management-machine-learning/

Pinterest: How Pinterest Leverages Realtime User Actions in Recommendation to Boost Homefeed Engagement Volume

Looping real-time user interaction events into the recommendation engine can significantly improve the user experience. Pinterest writes about one such system for its Homefeed and how it leverages real-time user actions in recommendations to boost Homefeed engagement volume.

https://medium.com/pinterest-engineering/how-pinterest-leverages-realtime-user-actions-in-recommendation-to-boost-homefeed-engagement-volume-165ae2e8cde8

eBay: Increase A/B Testing Power by Combining Experiments

eBay writes about its adoption of the weighted z-test, which can combine readouts (including p-values, lift, CI, etc.) from multiple independent experiments for the same hypothesis. I’m looking forward to reading more on this topic.

https://tech.ebayinc.com/engineering/increase-a-b-testing-power-by-combining-experiments/
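To make the idea of combining experiments concrete, here is a minimal Python sketch of a weighted z-test (often called the weighted Stouffer method). It is not taken from eBay's post: the function name, the use of one-sided p-values, and the choice of the square root of the sample size as the weight are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def combine_weighted_z(p_values, weights):
    """Combine one-sided p-values from independent experiments.

    Hypothetical helper: each p-value is turned into a z-score, the
    z-scores are averaged with the given weights, and the result is
    converted back into a single one-sided p-value.
    """
    if not p_values or len(p_values) != len(weights):
        raise ValueError("p_values and weights must be non-empty and equal in length")

    norm = NormalDist()  # standard normal distribution
    # A small one-sided p-value maps to a large positive z-score.
    z_scores = [norm.inv_cdf(1.0 - p) for p in p_values]

    # Weighted sum of z-scores, rescaled so it is again standard normal
    # under the null hypothesis.
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = sqrt(sum(w * w for w in weights))
    combined_z = numerator / denominator

    combined_p = 1.0 - norm.cdf(combined_z)
    return combined_z, combined_p


# Two experiments that each miss the usual 0.05 threshold on their own.
sample_sizes = [10_000, 10_000]          # assumed sample sizes
weights = [sqrt(n) for n in sample_sizes]  # assumed weighting scheme
z, p = combine_weighted_z([0.08, 0.08], weights)
print(f"combined z = {z:.3f}, combined p = {p:.4f}")
```

In this toy run, two readouts with p = 0.08 each combine into a p-value below 0.05, which is the gain in power the title refers to. The method assumes the experiments are independent and test the same hypothesis in the same direction.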

Trivago: Explore-exploit dilemma in the Ranking model

A fascinating read of the week about the Explore-Exploit dilemma in the ranking model.

In the context of Trivago, exploitation means showing users accommodations that have historically performed well. Exploration means showing accommodations that have never been shown to the user, in the hope of finding some that will perform better than those currently shown.

Trivago concludes that one can overcome this by combining classical approaches to exploration with model-based approaches to systematically identify the most promising inventory in the unknown pool.

https://tech.trivago.com/post/2022-11-04-explore-exploit-dilemma-in-ranking-model/
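For readers new to the dilemma, a classical baseline is the epsilon-greedy rule. The Python sketch below only illustrates that baseline, not Trivago's ranking model: the function name, the click statistics, and the epsilon values are hypothetical.

```python
import random


def pick_accommodation(stats, epsilon, rng):
    """Pick one item with an epsilon-greedy rule.

    stats maps an item name to (clicks, impressions). Items with zero
    impressions form the unexplored pool.
    """
    if not stats:
        raise ValueError("stats must not be empty")

    unseen = [item for item, (_, shown) in stats.items() if shown == 0]
    ctr = {item: clicks / shown for item, (clicks, shown) in stats.items() if shown > 0}

    # Explore: with probability epsilon, show something never shown before.
    if unseen and rng.random() < epsilon:
        return rng.choice(unseen)

    # Nothing has history yet, so exploring is the only option.
    if not ctr:
        return rng.choice(unseen)

    # Exploit: show the item with the best historical click-through rate.
    return max(ctr, key=ctr.get)


# Hypothetical inventory: two items with history, two never shown.
stats = {
    "hotel_a": (30, 1000),
    "hotel_b": (55, 1000),
    "hotel_c": (0, 0),
    "hotel_d": (0, 0),
}
rng = random.Random(7)
picks = [pick_accommodation(stats, epsilon=0.1, rng=rng) for _ in range(10)]
print(picks)
```

Plain epsilon-greedy picks from the unknown pool at random. The model-based approach the post describes would, as I read it, replace that random choice with a prediction of which unseen items look most promising.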

Microsoft: How well do you know your Machine Learning models

Machine Learning increasingly drives important decisions in our lives, from credit scores to loan approval to where to eat and shop. But how well do we know the Machine Learning models?

Machine Learning (ML) model explainability is analyzing and surfacing the inner workings of a Machine Learning model or other "black box" algorithms to make them more transparent.

The blog narrates how the Azure InterpretML service can help us better understand ML models' predictions.

Part 1: https://medium.com/data-science-at-microsoft/how-well-do-you-know-your-machine-learning-models-part-1-of-2-35979512ceba

Part 2: https://medium.com/data-science-at-microsoft/how-well-do-you-know-your-machine-learning-models-part-2-of-2-c36e8184bab4

Shailey Dash: Decision Trees Explained — Entropy, Information Gain, Gini Index, CCP Pruning

Continuing our quest to learn more about ML models, the author writes about how the Decision Tree works. Though Decision Trees look simple and intuitive, there is nothing straightforward about how the algorithm decides on splits and how tree pruning occurs. I learned a ton from this article.

https://towardsdatascience.com/decision-trees-explained-entropy-information-gain-gini-index-ccp-pruning-4d78070db36c
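The split criteria named in the title are easy to compute by hand. Here is a small Python sketch of entropy, the Gini index, and information gain for a binary split. The function names and the toy labels are my own and are not taken from the article.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children


# Toy node with an even class mix, split two different ways.
parent = ["yes", "yes", "no", "no"]
clean_split = (["yes", "yes"], ["no", "no"])  # each child is pure
poor_split = (["yes", "no"], ["yes", "no"])   # children look like the parent

print("parent entropy:", entropy(parent))
print("parent gini:", gini(parent))
print("gain, clean split:", information_gain(parent, *clean_split))
print("gain, poor split:", information_gain(parent, *poor_split))
```

A tree builder evaluates candidate splits like these and keeps the one with the highest gain (or the lowest weighted impurity). Cost-complexity pruning then trades tree size against fit, which this sketch does not cover.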

Abi Aryan: This has been such an excellent year for software system design in ML

It has indeed been an excellent year for software system design for ML; as you may have noticed in this week's edition, most of the articles discuss ML system design. The author compiled some exciting papers on MLOps and is looking forward to reading more of them.

https://datadrivenbabe.substack.com/p/this-has-been-such-an-excellent-year

The Eternal Suffering of Data Practitioners: Part 1

As a data practitioner, you will inevitably be exposed to a stream of requests from various stakeholders. How should one approach them systematically to elevate the data function and improve customer satisfaction? The author gives some valuable strategies for doing so.

https://pedram.substack.com/p/the-eternal-suffering-of-data-practitioners

Inventa: How we slimmed down Slim CI for dbt Cloud

There is always some flakiness in adopting any solution, and it requires further optimization. Inventa writes about such optimization challenges with dbt Cloud's CI/CD system and how it addressed them. TIL about Slim CI, and I'm looking forward to reading more about it.

https://medium.com/building-inventa/how-we-slimmed-down-slim-ci-for-dbt-cloud-6a944e7574e2

Yousign: Snowflake RBAC Implementation with Permifrost

Identity and access management is a critical need for data infrastructure. There is a need for a lightweight solution in this space, and I'm delighted to see Permifrost from GitLab. Yousign's team writes about how it adopted Permifrost in its infrastructure.

https://medium.com/yousign-engineering-product/snowflake-rbac-implementation-with-permifrost-3d30652825ad

Zapr: How We Enhanced Productivity of Zapr’s Data Platform and Saved Costs

For all the criticism of Hadoop, Hive, and its ecosystem, one thing it got right is the Hive metastore. Every data processing engine had one metadata store to integrate with. The cloud data warehouses and LakeHouse systems have broken that promise ever since, and it is a constant struggle to sync metadata across different systems.

Zapr talks about one such challenge with Hive metastore and Glue catalog and its approach to bringing efficiency.

https://kpskarthick1.medium.com/how-we-enhanced-productivity-of-zaprs-data-platform-and-saved-costs-5ab5f3a42aa8

All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.

Nitin Khaitan

Nov 11, 2022

Thanks for sharing your thoughts about the data-driven organisation.

Below is a link to a good article about making a data-driven organisation: https://medium.com/towards-polyglot-architecture/design-thinking-toward-data-driven-organisation-473060f44feb

Could you share your thoughts as well?
