Data Engineering Weekly #106
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Editor’s Note: #TwitterMigration Mastodon & the definition of Data Contract
Last week I switched from Twitter to Mastodon. Thanks to David Jayatillake for setting up data-folks.masto.host. Honestly, I was a bit skeptical, but to my surprise, it is far better than I expected. I've had more high-quality engagement with data folks there than on Twitter, without the distraction. If you're a data professional, please join at data-folks.masto.host. I'm at firstname.lastname@example.org. I follow most of the data professionals there, so you can easily build your network from my following list.
Top of Mind on Data Contract
Over the last two weeks, a few data folks reached out to me asking about Data Contracts and what they are. The term "contract" is a perennial source of confusion: people think of a "contract" in the traditional sense as static and bureaucratic. I often use the term "Schema Ops" for this very reason. Here is my definition of a Data Contract:
A data contract/Schema Ops is not static or a one-time task. The data contract flow originates from the data producer. As adoption grows, consumers start amending expectations and asking for enrichments to the contract. A data contract is a continuous and collaborative system, because the business context and requirements won't stay static.
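To make the idea concrete, here is a minimal, hypothetical sketch of a data contract as code. All names here are illustrative (this is not Schemata's actual API): a producer publishes a v1 contract, and a consumer later amends it with an enrichment expectation.

```python
def validate(record: dict, contract: dict) -> list:
    """Return a list of contract violations for a single record."""
    violations = []
    for field, expected_type in contract.items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(f"{field}: expected {expected_type.__name__}")
    return violations

# v1: the contract as originally defined by the data producer
order_contract_v1 = {"order_id": str, "amount": float}

# v2: a consumer amends the contract with an enrichment expectation
order_contract_v2 = {**order_contract_v1, "currency": str}

record = {"order_id": "o-1", "amount": 9.99}
print(validate(record, order_contract_v1))  # no violations
print(validate(record, order_contract_v2))  # producer must now add currency
```

The point of the sketch is the v1-to-v2 transition: the contract evolves as consumers' expectations grow, rather than being frozen at publication time.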
I plan to write a series of blogs on Schemata and Data Contract in the coming weeks. I know I told you this before, so George R. R. Martin kindly stepped in for me to give the update for my promised blog posts.
Uber: How Uber Optimizes the Timing of Push Notifications using ML and Linear Programming
Push notifications are a significant engagement driver for online commerce. Uber writes about the complexity of the problem and how it adopted linear programming (linear optimization) to achieve the best outcome.
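As a toy illustration of the optimization framing (not Uber's actual formulation), the core decision is assigning each user a send-time slot to maximize total predicted open probability under per-slot capacity limits. At this toy scale we can simply enumerate; at Uber's scale the same assignment problem is expressed and solved as a linear program. All numbers below are made up.

```python
from itertools import product

# predicted open probability per (user, slot) -- illustrative numbers
p = {
    "alice": [0.30, 0.10],
    "bob":   [0.25, 0.20],
    "carol": [0.05, 0.40],
}
n_slots, capacity = 2, 2  # at most 2 notifications per slot

best_score, best_plan = -1.0, None
for plan in product(range(n_slots), repeat=len(p)):
    # enforce the per-slot capacity constraint
    if any(plan.count(s) > capacity for s in range(n_slots)):
        continue
    score = sum(p[user][slot] for user, slot in zip(p, plan))
    if score > best_score:
        best_score, best_plan = score, dict(zip(p, plan))

print(best_plan, round(best_score, 2))
```

Brute force is exponential in the number of users, which is exactly why a relaxation to a linear program is attractive at production scale.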
Meta: Improving Instagram notification management with machine learning and causal inference
Meta writes about a similar notification-improvement system built with ML. The blog discusses the trade-off between user experience and the CTR model for notifications, and the adoption of a causal inference model in the notification management system.
Pinterest: How Pinterest Leverages Realtime User Actions in Recommendation to Boost Homefeed Engagement Volume
Looping real-time user interaction events into the recommendation engine can significantly improve the user experience. Pinterest writes about one such system for its Homefeed and how real-time user actions boost engagement.
eBay: Increase A/B Testing Power by Combining Experiments
eBay writes about its adoption of the weighted z-test, which can combine readouts (including p-values, lift, CI, etc.) from multiple independent experiments testing the same hypothesis. I'm looking forward to reading more on this topic.
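For intuition, here is a stdlib-only sketch of a weighted z-test (Stouffer's method) for combining one-sided p-values from independent experiments, weighting each by, say, its sample size. The numbers are illustrative, and this is a simplified version of what the eBay post describes, not their implementation.

```python
from math import sqrt
from statistics import NormalDist

def weighted_z_test(p_values, weights):
    """Combine one-sided p-values from independent tests of the same hypothesis."""
    nd = NormalDist()
    z_scores = [nd.inv_cdf(1 - p) for p in p_values]  # map each p-value to a z-score
    z_combined = sum(w * z for w, z in zip(weights, z_scores)) / sqrt(
        sum(w * w for w in weights)
    )
    return 1 - nd.cdf(z_combined)  # back to a combined p-value

# two experiments, neither individually significant at the 0.05 level
combined = weighted_z_test([0.08, 0.06], weights=[1000, 1500])
print(round(combined, 4))  # combined evidence crosses the 0.05 threshold
```

The interesting property, which the eBay post exploits, is that two individually inconclusive experiments can yield a significant combined result.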
Sponsored: [New eBook] The Ultimate Data Observability Platform Evaluation Guide
Considering investing in a data quality solution? Before you add another tool to your data stack, check out our latest guide for 10 things to consider when evaluating data observability platforms, including scalability, time to value, and ease of setup.
Access Your Free Copy for Data Engineering Weekly Readers
Trivago: Explore-exploit dilemma in the Ranking model
A fascinating read of the week about the Explore-Exploit dilemma in the ranking model.
In Trivago's context, exploitation means showing users accommodations that have historically performed well. Exploration means showing accommodations that have never been shown to the user, in the hope of finding some that will outperform those currently shown.
Trivago concludes that one can overcome this by combining classical approaches to exploration with model-based approaches to systematically identify the most promising inventory in the unknown pool.
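A minimal epsilon-greedy policy is the classical exploration approach the conclusion refers to (as opposed to Trivago's model-based refinement): mostly show the historically best-performing accommodation, and occasionally show a never-shown one. All item names and CTRs below are illustrative.

```python
import random

def pick_accommodation(ctr_by_item: dict, unseen: list, epsilon: float = 0.1):
    """Exploit the best-known item, or explore an unseen one with probability epsilon."""
    if unseen and random.random() < epsilon:
        return random.choice(unseen)              # explore: never-shown inventory
    return max(ctr_by_item, key=ctr_by_item.get)  # exploit: best historical CTR

random.seed(0)  # fixed seed so the run is reproducible
ctr = {"hotel_a": 0.12, "hotel_b": 0.07}
shown = [pick_accommodation(ctr, unseen=["hotel_c"]) for _ in range(1000)]
print(shown.count("hotel_a"), shown.count("hotel_c"))
```

With epsilon = 0.1, roughly 10% of impressions go to the unknown pool; a model-based approach instead ranks that pool to spend the exploration budget on its most promising items.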
Microsoft: How well do you know your Machine Learning models
Machine learning increasingly drives important decisions in our lives, from credit scores to loan approvals to where we eat and shop. But how well do we know these machine learning models?
Machine Learning (ML) model explainability is the practice of analyzing and surfacing the inner workings of an ML model or other "black box" algorithm to make it more transparent.
The blog narrates how the Azure InterpretML service can help us better understand an ML model's predictions.
Part 1: https://medium.com/data-science-at-microsoft/how-well-do-you-know-your-machine-learning-models-part-1-of-2-35979512ceba
Part 2: https://medium.com/data-science-at-microsoft/how-well-do-you-know-your-machine-learning-models-part-2-of-2-c36e8184bab4
Sponsored: How HealthMatch built a HIPAA-compliant data stack with RudderStack and Customer.io
Learn how HealthMatch built a HIPAA-compliant data stack with Customer.io and RudderStack to reduce reliance on developers for messaging use cases. After only a week of implementation time, their team launched a targeted SMS campaign with the new stack that drove $130k in revenue within 24 hours. Register today and join live on Wednesday 11/9, at 12PT / 3ET.
Shailey Dash: Decision Trees Explained — Entropy, Information Gain, Gini Index, CCP Pruning
Continuing our quest to learn more about ML models, the author writes about how decision trees work. Though decision trees look simple and intuitive, there is nothing straightforward about how the algorithm decides on splits or how tree pruning occurs. I learned a ton from this article.
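The split criteria the article covers are compact enough to sketch directly. Below is a stdlib-only illustration of entropy, the Gini index, and the information gain of one candidate split; the labels are made up for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label set; 1.0 means a 50/50 binary split."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity; 0.5 is the maximum for a binary label set."""
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5                  # maximally impure node
left, right = ["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4
print(round(entropy(parent), 3), round(gini(parent), 3))  # 1.0 0.5
print(round(information_gain(parent, left, right), 3))
```

A decision tree evaluates many candidate splits this way and picks the one with the highest gain (or lowest Gini), which is where the non-obvious behavior the article unpacks comes from.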
Abi Aryan: This has been such an excellent year for software system design in ML
It is indeed an excellent year for software system design for ML; as you may have noticed in this week's edition, most of the articles discuss ML system design. The author compiled some exciting papers on MLOps, and I'm looking forward to reading more of them.
The Eternal Suffering of Data Practitioners: Part 1
As a data practitioner, it is inevitable that you will be exposed to a stream of requests from various stakeholders. How should one approach it systematically to elevate the data function and improve customer satisfaction? The author offers some valuable strategies on this.
Inventa: How we slimmed down Slim CI for dbt Cloud
Adopting any new solution always comes with rough edges that require further optimization. Inventa writes about such optimization challenges with dbt Cloud's CI/CD system and how it addressed them. TIL about Slim CI, and I'm looking forward to reading more about it.
Yousign: Snowflake RBAC Implementation with Permifrost
Identity and access management is a critical need for data infrastructure. There is a need for a lightweight solution in this space, and I'm delighted to see Permifrost from GitLab. Yousign's team writes about how it adopted Permifrost in its infrastructure.
Zapr: How We Enhanced Productivity of Zapr’s Data Platform and Saved Costs
For all the criticism of Hadoop, Hive, and their ecosystem, one thing they got right is the Hive metastore: every data processing engine had a single metadata store to integrate with. Cloud data warehouses and lakehouse systems have since broken that promise, and syncing metadata across different systems is a constant struggle.
Zapr writes about one such challenge with the Hive metastore and the Glue catalog, and its approach to making the sync more efficient.
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.