Data Engineering Weekly #84

The Weekly Data Engineering Newsletter

Apr 25, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Ananth Packkildurai: Back To The Future - Emerging Trends In Data Engineering

I gave a talk on emerging trends in data engineering at CrunchConf last October. The video is now published.

Speaker Deck: https://speakerdeck.com/vananth22/back-to-the-future-emerging-trends-in-data-engineering

Meta: Inside Meta's AI optimization platform for engineers across the company

Meta writes about Looper, an AI platform that supports the complete machine learning lifecycle, from model training, deployment, and inference all the way to evaluation and tuning of products.

Hello again, Bundling vs. Unbundling.

A couple of things stand out in the blog:

It is a declarative AI system, which means that product engineers only need to declare the functionality they want. The system fills in the software implementation based on the declaration.
While other AI platforms often perform inference offline in batch mode, Looper operates in real-time.

https://ai.facebook.com/blog/looper-meta-ai-optimization-platform-for-engineers/

Lyft: Challenges in Experimentation

Customers, competitors, and the direction of the economy are each unpredictable in their own way. Experimentation is vital for testing product changes and building the evidence that drives significant decisions. Lyft writes an exciting blog on the challenges of supporting a culture of experimentation.

https://eng.lyft.com/challenges-in-experimentation-be9ab98a7ef4

dbt Labs: A Good Problem to Have…

The scheduler is a core part of data transformation. dbt Labs writes about the scalability challenges with dbt and the recent improvements. I'm looking forward to part 2 to understand the dbt Cloud scheduler better!

https://www.getdbt.com/blog/a-good-problem-to-have/

Zalando: Machine Learning Platform - Architecture and tooling behind machine learning at Zalando

Zalando writes about the architecture and tooling behind its ML platform. ZFlow, built on top of AWS Step Functions, and the custom web interface built on top of Backstage both look interesting.

https://engineering.zalando.com/posts/2022/04/zalando-machine-learning-platform.html

DoorDash: Building the Model Behind DoorDash’s Expansive Merchant Selection

DoorDash writes about how it expands its merchant selection by onboarding high-value merchants so that the selection in every market matches customer demand. The modeling strategy that ties customer preference to merchant onboarding looks interesting, but I wonder how the team maintains algorithmic fairness. Any AI bias can lead to social imbalance, and the blog does not mention how the team handles it.

https://doordash.engineering/2022/04/19/building-merchant-selection/

Blinkit: Evolution of Redash at Blinkit

Blinkit writes about its use of Redash, narrating the challenges of running the SQL dashboarding tool and how the team solved them.

https://lambda.blinkit.com/evolution-of-redash-at-blinkit-fb50a64770bf

Mikkel Dengsøe: Data tests and the broken windows theory

Building trust in data across an organization is the most crucial function of a data team. The author compares the broken windows theory with the data testing function.

https://mikkeldengsoe.substack.com/p/broken-windows

Lil’Log: Learning with not Enough Data

Perfectly labeled data is often hard to obtain, given the cost and human effort involved. Yet labeled data is critical for supervised learning tasks. In a three-part series, the author discusses the approaches to take when there is not enough labeled data.

Learning with not Enough Data Part 1: Semi-Supervised Learning

Learning with not Enough Data Part 2: Active Learning

Learning with not Enough Data Part 3: Data Generation

Booking.com: Overtracking and trigger analysis - reducing sample sizes while INCREASING the sensitivity of experiments

An exciting article from Booking.com discusses the danger of overtracking, that is, tracking users who can't be in the treatment category. Overtracking affects the variance of the experimentation metrics and dilutes the treatment effect, making the effect harder to detect.

https://booking.ai/overtracking-and-trigger-analysis-how-to-reduce-sample-sizes-and-increase-the-sensitivity-of-71755bad0e5f
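
To make the dilution argument concrete, here is a minimal back-of-the-envelope sketch. It is not the method from the Booking.com article. It uses the textbook sample-size approximation for a two-sided, two-sample z-test and assumes, for simplicity, that the metric has the same standard deviation among all tracked users as among eligible users. The function names and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist


def users_per_arm(effect, sigma, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-sided, two-sample z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / effect) ** 2)


def compare_designs(effect_on_eligible, sigma, eligible_share):
    """Compare tracked users needed per arm with and without overtracking.

    eligible_share is the fraction of tracked users who can actually
    experience the change. Simplifying assumption: sigma is identical
    in the full tracked population and in the eligible subset.
    """
    # Ineligible users contribute zero effect, so the average effect
    # over everyone tracked shrinks by the eligible share.
    diluted_effect = effect_on_eligible * eligible_share
    overtracked = users_per_arm(diluted_effect, sigma)

    # Trigger analysis keeps only eligible users and sees the full effect.
    triggered = users_per_arm(effect_on_eligible, sigma)
    # Tracked traffic required to collect that many eligible users.
    tracked_for_triggered = math.ceil(triggered / eligible_share)

    return overtracked, triggered, tracked_for_triggered


if __name__ == "__main__":
    overtracked, triggered, tracked = compare_designs(
        effect_on_eligible=1.0, sigma=10.0, eligible_share=0.2
    )
    print(f"Overtracked analysis, tracked users per arm: {overtracked}")
    print(f"Trigger analysis, eligible users per arm:    {triggered}")
    print(f"Trigger analysis, tracked users per arm:     {tracked}")
```

Under these assumptions, the overtracked analysis needs roughly 1 / eligible_share times as much tracked traffic as the trigger analysis to reach the same power. In practice the variance also differs between the two populations, so treat the ratio as an illustration only.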

Meryam Bukhari: What's the role of an ML PM?

Many companies adopt a product-over-project strategy and treat their internal platform as a product. The author discusses the role of a product manager in building ML-based products.

https://meryam.substack.com/p/whats-the-role-of-an-ml-pm

All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
