Data Engineering Weekly #129

The Weekly Data Engineering Newsletter

May 01, 2023

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

DoorDash identifies five big areas for using Generative AI

Generative AI took the industry by storm, and every company is trying to figure out what it means for them. DoorDash writes about its exploration of Generative AI and where applying it could boost its business. It identifies five areas:

Assisting customers in completing tasks
Better tailored and interactive discovery [Recommendation]
Generation of personalized content and merchandising
Extraction of structured information
Enhancement of employee productivity

https://doordash.engineering/2023/04/26/doordash-identifies-five-big-areas-for-using-generative-ai/

Checkout.com: Building dbt CI/CD at scale

Checkout.com writes about its CI/CD pipeline for deploying dbt at scale: more than 27 dbt projects covering different areas of the business, 100+ dbt developers contributing to these projects every week, and 1,000+ models.

https://medium.com/checkout-com-techblog/building-dbt-ci-cd-at-scale-365358f64b6f

Mikkel Dengsøe: Europe data salary benchmark 2023

Fascinating findings on data salaries across European countries. The key findings are:

Germany-based roles pay less.
London and Dublin-based roles have the highest compensation. The Dublin sample is skewed toward senior roles, with 55% of reported salaries being senior, so this says more about the sample than about Dublin jobs paying more than London ones.
Jobs at the 75th percentile in Amsterdam, London, and Dublin pay nearly 50% more than those in Berlin.

https://medium.com/@mikldd/europe-data-salary-benchmark-2023-b68cea57923d

Intuit: How to Streamline ML Model Deployment? Automated Sanity Checks

It is crucial to guarantee the reliability and accuracy of ML models before deploying them to production. Intuit describes how it runs a set of tests, called machine learning model sanity checks, in a pre-production environment. These tests aim to identify systematic errors and biases in the models, helping to ensure they function as intended in production. The checks follow a simple four-step process:

Ensure online model scores sink to the output store.
Rescore with the offline model.
Compare the online model score to the offline model score.
Check input data.
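
To make the four steps concrete, here is a minimal Python sketch. It is not Intuit's implementation: the `run_sanity_checks` function, the `SanityReport` class, the dictionary-based "output store", and the tolerance value are all assumptions made for illustration. The sketch checks inputs first, because a row with missing features cannot be rescored.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class SanityReport:
    """Request IDs that failed each check (hypothetical structure)."""
    missing_online: List[str] = field(default_factory=list)  # no score in the output store
    mismatched: List[str] = field(default_factory=list)      # online != offline score
    bad_inputs: List[str] = field(default_factory=list)      # missing or non-finite features

    @property
    def passed(self) -> bool:
        return not (self.missing_online or self.mismatched or self.bad_inputs)


def run_sanity_checks(
    inputs: Dict[str, Dict[str, Optional[float]]],
    online_scores: Dict[str, float],
    offline_model: Callable[[Dict[str, float]], float],
    tolerance: float = 1e-6,  # assumed acceptable absolute difference
) -> SanityReport:
    report = SanityReport()
    for request_id, features in inputs.items():
        # Check input data: every feature must be present and finite.
        if any(v is None or not math.isfinite(v) for v in features.values()):
            report.bad_inputs.append(request_id)
            continue
        # Ensure the online score reached the output store.
        if request_id not in online_scores:
            report.missing_online.append(request_id)
            continue
        # Rescore the same input with the offline model.
        offline_score = offline_model(features)
        # Compare the online score to the offline score.
        if not math.isclose(online_scores[request_id], offline_score, abs_tol=tolerance):
            report.mismatched.append(request_id)
    return report


def offline_model(features: Dict[str, float]) -> float:
    # Stand-in for the real offline model (assumption for the example).
    return 0.5 * features["a"] + 0.25 * features["b"]


inputs = {
    "r1": {"a": 1.0, "b": 2.0},
    "r2": {"a": 0.0, "b": 4.0},
    "r3": {"a": 2.0, "b": None},
    "r4": {"a": 1.0, "b": 1.0},
}
online_scores = {"r1": 1.0, "r2": 0.9, "r3": 0.3}

report = run_sanity_checks(inputs, online_scores, offline_model)
print(report)
```

In this toy run, `r2` is flagged as a score mismatch, `r3` as bad input, and `r4` as missing from the output store, so `report.passed` is false and the deployment would be held back.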

https://medium.com/intuit-engineering/how-to-streamline-ml-model-deployment-automated-sanity-checks-64a23166fdc5

Trivago: Implementing Data Validation with Great Expectations in Hybrid Environments

Trivago writes about implementing data validation with Great Expectations. It presents a well-balanced case study that emphasizes the importance of data validation and the need for more sophisticated statistical validation methods.

https://tech.trivago.com/post/2023-04-25-implementing-data-validation-with-great-expectations-in-hybrid-environments.html

Microsoft: Reasoning about change: Lessons learned from building a near real-time system for Azure pricing

The article discusses the difficulties of processing continuous data streams to compute global state over unbounded events. However, I don’t fully understand the versioning solution or how the systems coordinate it. Also, is the versioning approach similar to the Lakehouse system design? 🤔

https://medium.com/data-science-at-microsoft/reasoning-about-change-lessons-learned-from-building-a-near-real-time-system-for-azure-pricing-34049816ffbd

Expedia: How Expedia Reviews Engineering Is Using Event Streams as a Source Of Truth

“Events as a source of truth” is a simple but powerful idea: persist the state of a business entity as a sequence of state-changing events. How do you build such a system? Expedia writes about its review stream system to demonstrate how it adopted the event-first approach.
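
As a rough illustration of the idea (not Expedia's actual design), the Python sketch below keeps an append-only list of events and derives the current state of a review by replaying them. The event kinds, the `ReviewEvent` and `ReviewState` classes, and the `apply_event` and `replay` functions are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ReviewEvent:
    review_id: str
    kind: str                   # hypothetical kinds: submitted, edited, approved, removed
    text: Optional[str] = None  # only carried by submitted/edited events


@dataclass(frozen=True)
class ReviewState:
    review_id: str
    text: str
    status: str                 # pending, published, or removed


def apply_event(state: Optional[ReviewState], event: ReviewEvent) -> ReviewState:
    """Return the next state; the event log itself is never mutated."""
    if event.kind == "submitted":
        return ReviewState(event.review_id, event.text or "", "pending")
    if state is None:
        raise ValueError(f"{event.kind!r} event arrived before 'submitted'")
    if event.kind == "edited":
        # An edit sends the review back for moderation in this sketch.
        return ReviewState(state.review_id, event.text or state.text, "pending")
    if event.kind == "approved":
        return ReviewState(state.review_id, state.text, "published")
    if event.kind == "removed":
        return ReviewState(state.review_id, state.text, "removed")
    raise ValueError(f"unknown event kind: {event.kind!r}")


def replay(events: Iterable[ReviewEvent]) -> Optional[ReviewState]:
    """Rebuild the current state by folding over the event sequence."""
    state: Optional[ReviewState] = None
    for event in events:
        state = apply_event(state, event)
    return state


log = [
    ReviewEvent("rev-1", "submitted", "Great stay"),
    ReviewEvent("rev-1", "edited", "Great stay, friendly staff"),
    ReviewEvent("rev-1", "approved"),
]

current = replay(log)
as_of_edit = replay(log[:2])  # state at an earlier point in time
print(current)
print(as_of_edit)
```

Because the log is the source of truth, any past state can be reconstructed by replaying a prefix of the events, which is what `as_of_edit` demonstrates.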

https://medium.com/expedia-group-tech/how-expedia-reviews-engineering-is-using-event-streams-as-a-source-of-truth-d3df616cccd8

Justin Swansburg: Top Ten Mistakes Data Scientists Make While Building Churn Models

Churn prediction is vital for a company. A timely prediction can save substantial customer acquisition costs and increase customer retention. The article is an excellent reference for avoiding common mistakes when building churn models.

https://medium.com/@swansburg.justin/top-ten-mistakes-data-scientists-make-while-building-churn-models-d773bb7deaa5

WTTJ Tech: From PostgreSQL to Snowflake - A data migration story

WTTJ Tech has an interesting story to share about data migration. In this case, they moved from PostgreSQL to Snowflake rather than to Redshift. The article's highlight is that it covers the technical aspects of the transition and details the full migration plan, including moving dashboards and other important elements.

https://medium.com/wttj-tech/from-postgresql-to-snowflake-a-data-migration-story-5fd17f778019

Alejandro Perez: Saving 💵 With BigQuery & dbt

The current economic situation has many companies taking a fresh look at cost control. Every 1% of cost savings deserves celebration and acknowledgment. The author shares a few hacks for running dbt on BigQuery to save 💵.

https://medium.com/@alexroperez4/saving-with-bigquery-dbt-35937b1cf628

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
