Data Engineering Weekly #129
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
DoorDash: Five Big Areas for Using Generative AI
Generative AI took the industry by storm, and every company is trying to figure out what it means for them. DoorDash writes about how it is exploring Generative AI and applying it to boost its business in five areas:
Assisting customers in completing tasks
Better tailored and interactive discovery [recommendations]
Generating personalized content and merchandising
Extracting structured information
Enhancing employee productivity
Checkout.com: Building dbt CI/CD at scale
Checkout.com writes about its CI/CD process for deploying dbt at scale: over 27 dbt projects focused on different aspects of the business, 100+ dbt developers contributing to these projects every week, and 1,000+ models.
Mikkel Dengsøe: Europe data salary benchmark 2023
Fascinating findings on data salaries across European countries. The key findings:
Germany-based roles pay less.
London- and Dublin-based roles have the highest compensation. The Dublin sample skews senior, with 55% of reported salaries at the senior level, so this says more about the sample than about Dublin jobs paying more than London.
Jobs at the 75th percentile in Amsterdam, London, and Dublin pay nearly 50% more than those in Berlin.
Intuit: How to Streamline ML Model Deployment? Automated Sanity Checks
It is crucial to guarantee the reliability and accuracy of ML models before deploying them to production. Intuit provides insightful information about running a set of tests, called machine learning model sanity checks, in a pre-production environment. These tests aim to identify systematic errors and biases in the models, helping to ensure that they function as intended once deployed to production. The process follows four simple steps:
Ensure online model scores are sunk to the output store.
Rescore with the offline model.
Compare the online model score to the offline model score.
Check input data.
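The four steps above can be sketched as a single check routine. This is an illustrative reconstruction, not Intuit's actual code; the function names, score range, and tolerance are assumptions.

```python
# Hypothetical sketch of pre-production model sanity checks: verify online
# scores landed in the output store, rescore offline, compare, and validate
# inputs. All names and thresholds here are illustrative assumptions.

def sanity_check(online_scores, offline_model, inputs, tolerance=1e-6):
    """Compare scores persisted by the online model against an offline rescore."""
    # Step 1: ensure online scores actually reached the output store.
    missing = [k for k in inputs if k not in online_scores]
    if missing:
        return False, f"missing online scores for: {missing}"

    # Step 2: rescore the same inputs with the offline model.
    offline_scores = {k: offline_model(v) for k, v in inputs.items()}

    # Step 3: compare online vs. offline scores within a tolerance.
    drifted = {
        k: (online_scores[k], offline_scores[k])
        for k in inputs
        if abs(online_scores[k] - offline_scores[k]) > tolerance
    }
    if drifted:
        return False, f"score mismatch: {drifted}"

    # Step 4: basic input-data checks (assumed valid range of [0, 1]).
    bad_inputs = {k: v for k, v in inputs.items() if v is None or not 0 <= v <= 1}
    if bad_inputs:
        return False, f"suspicious inputs: {bad_inputs}"

    return True, "ok"
```

In practice each step would run against the real output store and feature pipeline; the value of the pattern is catching online/offline skew before it reaches production traffic.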
Sponsored: [Virtual Data Panel] Measuring Data Team ROI
As data leaders, one of our top priorities is to measure ROI. From tracking the efficacy of marketing campaigns to understanding the root cause of new spikes in user engagement, we’re tasked with keeping tabs on the business's health at all levels. But what about the ROI of our own teams? Watch a panel of data leaders as they discuss how to build strategies for measuring data team ROI.
Trivago: Implementing Data Validation with Great Expectations in Hybrid Environments
The article by Trivago discusses the integration of data validation with Great Expectations. It presents a well-balanced case study that emphasizes the significance of data validation and the necessity for sophisticated statistical validation methods.
Microsoft: Reasoning about change: Lessons learned from building a near real-time system for Azure pricing
The article discusses the difficulties of processing continuous data streams to compute over unbounded events and global state. However, I don’t fully understand the versioning solution and how the systems coordinate it. Also, is the versioning approach similar to the LakeHouse system design? 🤔
Sponsored: How We Optimized RudderStack’s Identity Resolution Algorithm for Performance
We realized we didn’t need an edge-cluster mapping table with the left and right nodes as columns. We could instead use a node-cluster mapping table and have each edge add a mini-cluster to it upon initialization.
Principal AI/ML Engineer Justin Driemeyer details the steps the team at RudderStack took to optimize their identity resolution algorithm for performance, after an update to meet a requirement for point-in-time correct materials resulted in unacceptably long runtimes.
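The node-cluster mapping idea can be sketched as a union-find structure: every node points at a cluster representative, and each new edge merges two mini-clusters. This is an illustrative reconstruction of the concept, not RudderStack's actual (SQL-based) implementation.

```python
class NodeClusterMap:
    """Illustrative node -> cluster mapping (union-find with path compression).

    Instead of storing each edge as a (left, right) row and repeatedly joining
    to discover clusters, every node maps to a cluster representative; adding
    an edge unions the two nodes' clusters ("each edge adds a mini-cluster").
    """

    def __init__(self):
        self.parent = {}

    def find(self, node):
        # First sighting of a node creates a singleton mini-cluster.
        self.parent.setdefault(node, node)
        root = node
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression keeps later lookups cheap.
        while self.parent[node] != root:
            self.parent[node], node = root, self.parent[node]
        return root

    def add_edge(self, left, right):
        # Merging two identifiers unions their clusters.
        self.parent[self.find(left)] = self.find(right)
```

With this shape, resolving an identity is a single lookup per node rather than a recursive traversal over an edge table.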
Expedia: How Expedia Reviews Engineering Is Using Event Streams as a Source Of Truth
“Events as a source of truth” is a simple but powerful idea: persist the state of a business entity as a sequence of state-changing events. How do you build such a system? Expedia writes about its review stream system to demonstrate how it adopted the event-first approach.
Justin Swansburg: Top Ten Mistakes Data Scientists Make While Building Churn Models
Churn prediction is vital for a company. A timely prediction can save significant customer acquisition cost and increase customer retention. The article is an excellent reference for avoiding common mistakes when building churn models.
WTTJ Tech: From PostgreSQL to Snowflake - A data migration story
WTTJ Tech has an interesting story to share about data migration. In this case, they moved from PostgreSQL to Snowflake instead of Redshift. The article's highlight is that it covers the technical aspects of the transition and details the full migration plan, including moving dashboards and other important elements.
Alejandro Perez: Saving 💵 With BigQuery & dbt
The current economic situation has brought a renewed focus on cost control at many companies. Every 1% saved on cost should lead to celebration and acknowledgment. The author shares a few hacks for running dbt on BigQuery to save 💵.
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.