Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Ananth Packkildurai: Back To The Future - Emerging Trends In Data Engineering
I gave a talk on emerging trends in data engineering at CrunchConf last October. The video is now published.
Speaker Deck: https://speakerdeck.com/vananth22/back-to-the-future-emerging-trends-in-data-engineering
Meta: Inside Meta's AI optimization platform for engineers across the company
Meta writes about Looper, an AI platform that supports the complete machine learning lifecycle, from model training, deployment, and inference all the way to evaluation and tuning of products.
Hello again, bundling vs. unbundling.
A couple of things stand out in the blog:
It is a declarative AI system, which means that product engineers only need to declare the functionality they want. The system fills in the software implementation based on the declaration.
While other AI platforms often perform inference offline in batch mode, Looper operates in real-time.
https://ai.facebook.com/blog/looper-meta-ai-optimization-platform-for-engineers/
Lyft: Challenges in Experimentation
Customers, competitors, and the direction of the economy are each unpredictable in their own way. Experimentation is vital for testing product changes and building the evidence that drives significant decisions. Lyft writes an exciting blog on the challenges of supporting a culture of experimentation.
https://eng.lyft.com/challenges-in-experimentation-be9ab98a7ef4
dbt Labs: A Good Problem to Have…
The scheduler is a core part of data transformation. dbt Labs writes about the scalability challenges with dbt and the recent improvements. I'm looking forward to part 2 to understand the dbt Cloud scheduler better!
https://www.getdbt.com/blog/a-good-problem-to-have/
Sponsored: Firebolt - Database Performance is Not About Performance
In this blog, we argue that performance is actually not about performance at all! We’ll contextualize real-world customer needs for data warehouse performance, and we’ll even make a bold prediction about the future of data warehousing (preview - it’s all about the new CDW).
https://www.firebolt.io/blog/future-of-performance-is-not-about-performance
Zalando: Machine Learning Platform - Architecture and tooling behind machine learning at Zalando
Zalando writes about the architecture and tooling behind its ML platform. ZFlow, built on top of AWS Step Functions, and the custom web interface on top of Backstage look interesting.
https://engineering.zalando.com/posts/2022/04/zalando-machine-learning-platform.html
DoorDash: Building the Model Behind DoorDash’s Expansive Merchant Selection
DoorDash writes about the model behind its merchant selection, which helps onboard high-value merchants so that the selection in every market matches customer demand. The modeling strategy that ties customer preference to merchant onboarding looks interesting, but I wonder how the team maintains algorithmic fairness. Any AI bias can lead to social imbalance, yet the blog does not mention how it is handled.
https://doordash.engineering/2022/04/19/building-merchant-selection/
Sponsored: RudderStack - The Data Stack Show Live: Solving the Data Quality Problem
Data quality issues are universal, and dealing with them at scale is toil. Join The Data Stack Show on Wednesday at 10 PT for a live recording with some of the brightest minds working to solve the problem. Leaders at Bigeye, Great Expectations, Lightup, and Metaplane will discuss why data quality is so challenging and how to fix it.
https://datastackshow.com/live-data-quality/
Blinkit: Evolution of Redash at Blinkit
Blinkit writes about its use of Redash, narrating the challenges of running the SQL dashboarding tool and how the team solved them.
https://lambda.blinkit.com/evolution-of-redash-at-blinkit-fb50a64770bf
Mikkel Dengsøe: Data tests and the broken windows theory
Building trust in data across an organization is the most crucial function of a data team. The author compares data testing with the broken windows theory.
https://mikkeldengsoe.substack.com/p/broken-windows
Lil’Log: Learning with not Enough Data
Perfectly labeled data is often hard to obtain given the cost and human effort involved. Yet labeled data is critical for supervised learning tasks. In a three-part series, the author discusses the approaches to take when there is not enough labeled data.
Learning with not Enough Data Part 1: Semi-Supervised Learning
Learning with not Enough Data Part 2: Active Learning
Learning with not Enough Data Part 3: Data Generation
Sponsored: Monte Carlo Data - The Modern Data Leader’s Playbook
Learn how today’s best data engineering and analytics leaders are staying ahead of the competition in our exclusive guide.
Download the modern data leader’s playbook
Booking.com: Overtracking and trigger analysis - reducing sample sizes while INCREASING the sensitivity of experiments
An exciting article from Booking.com discusses the danger of overtracking, that is, tracking users who can't be in the treatment category. Overtracking affects the variance of the experimentation metrics and dilutes the treatment effect, making the effect harder to detect.
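To make the dilution argument concrete, here is a minimal back-of-the-envelope sketch in Python. It is not taken from the Booking.com article. It assumes a simplified two-arm test in which only a fraction `p` of tracked users can be affected, those users see a true effect `delta`, everyone else sees zero effect, and the metric has the same standard deviation `sigma` in both populations. All function names and numbers are hypothetical.

```python
import math


def z_all_tracked(n_per_arm, p, delta, sigma):
    """Expected z-score when every tracked user is analysed.

    Only a fraction p of users can be affected, so the average
    effect over all tracked users shrinks to p * delta.
    """
    diluted_effect = p * delta
    standard_error = sigma * math.sqrt(2.0 / n_per_arm)
    return diluted_effect / standard_error


def z_triggered_only(n_per_arm, p, delta, sigma):
    """Expected z-score when only triggered users are analysed.

    The sample shrinks to n_per_arm * p users per arm, but the
    effect among them is the undiluted delta.
    """
    triggered_per_arm = n_per_arm * p
    standard_error = sigma * math.sqrt(2.0 / triggered_per_arm)
    return delta / standard_error


# Hypothetical numbers: 100k tracked users per arm, 25% of them triggered.
n_per_arm, p, delta, sigma = 100_000, 0.25, 0.02, 1.0

z_all = z_all_tracked(n_per_arm, p, delta, sigma)
z_trig = z_triggered_only(n_per_arm, p, delta, sigma)

print(f"z-score, all tracked users:    {z_all:.3f}")
print(f"z-score, triggered users only: {z_trig:.3f}")
print(f"sensitivity gain:              {z_trig / z_all:.3f}")
```

Under these assumptions the triggered analysis improves the expected z-score by a factor of 1/sqrt(p), even though it uses fewer users. Real experiments rarely satisfy the equal-variance assumption exactly, so treat the factor as an intuition rather than a guarantee.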
Meryam Bukhari: What's the role of an ML PM?
Many companies adopt a product-over-project strategy and treat their internal platforms as products. The author discusses the role of a product manager in building ML-based products.
https://meryam.substack.com/p/whats-the-role-of-an-ml-pm
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.