Data Engineering Weekly #57

Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.

Event: Join Impact 2021 on November 3, 2021: The First-Ever Data Observability Summit. Join Today’s Leading Data Pioneers

Hear from data leaders pioneering the technologies and processes shaping data engineering. Featuring the first Chief Data Scientist of the U.S., the founder of Data Mesh, and many more!

Click To Get Your Free Ticket For All Data Engineering Weekly Readers

Vicki Boykis: Reaching MLE (machine learning enlightenment)

Vicki Boykis tells an engaging story about the current state of the ML engineer and about embracing the reality of building production-ready machine learning insights. The highlight of the blog:

Machine learning systems are new. We’re still in the steam-powered days of machine learning, yet machine learning is not simply machine learning. It is, at this stage, more engineering than simply machine learning. We’re building more and more on older systems, abstracting away complexity, and in the process creating newer and newer levels of it that we now have to manage and hold in our heads. Many of the algorithms have been written. Much of the work we do, both in machine learning and in development today in general, will be glue work and vendor work.

Uber: Real-Time Exactly-Once Ad Event Processing with Apache Flink, Kafka, and Pinot

Flink’s checkpoint support for a two-phase commit protocol enables exactly-once semantics in Apache Flink. Uber writes about extending Flink’s exactly-once semantics with Apache Pinot’s upsert support to achieve exactly-once semantics across the end-to-end pipeline.
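To see why upsert support helps here, consider a minimal Python sketch. It is not Uber's implementation and does not use the Flink or Pinot APIs; `AdEvent`, `UpsertTable`, and `record_id` are hypothetical names made up for illustration. The assumption is that every event carries a unique key, so an event replayed after a failure overwrites its earlier row instead of being counted twice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdEvent:
    record_id: str  # hypothetical unique key carried by every event
    ad_id: str
    clicks: int


class UpsertTable:
    """Toy stand-in for a store with upsert support: one row per primary key."""

    def __init__(self) -> None:
        self._rows = {}

    def upsert(self, event: AdEvent) -> None:
        # Writing an existing key overwrites the row rather than appending a duplicate.
        self._rows[event.record_id] = event

    def total_clicks(self, ad_id: str) -> int:
        return sum(row.clicks for row in self._rows.values() if row.ad_id == ad_id)


events = [
    AdEvent("r1", "ad-1", 3),
    AdEvent("r2", "ad-1", 5),
    AdEvent("r3", "ad-2", 2),
]

table = UpsertTable()
for event in events:
    table.upsert(event)

# Simulate a replay after a failure: the last two events are delivered again.
for event in events[1:]:
    table.upsert(event)

total_ad_1 = table.total_clicks("ad-1")
total_ad_2 = table.total_clicks("ad-2")
print(total_ad_1, total_ad_2)  # 8 2, the same totals as without the replay
```

With an append-only table, the replay would have inflated the totals to 13 and 4; keying writes on a unique identifier makes redelivery harmless.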

Airbnb: How Airbnb Enables Consistent Data Consumption at Scale

Airbnb publishes the third part of its series on data consistency at scale, covering Minerva, its metrics infrastructure. The metrics-centric approach, in contrast to the traditional table-centric BI approach, is an exciting read.

Microsoft: How we used ML — and heuristic data labeling — to help customers with their cloud migration

Lifting and shifting infrastructure is always challenging, with many unknowns. Microsoft writes an exciting blog on how it uses ML to bring observability to the migration process.

PayPal: Comparing BigQuery Processing and Spark Dataproc

PayPal compares the performance and cost of running analytics queries on BigQuery vs. Spark on Dataproc. It is interesting to see the trend of big internet companies moving more analytical workloads to the cloud.

Sponsored: RudderStack - Churn Prediction With BigQueryML to Increase Mobile Game Revenue

Here’s an interesting case study on how machine learning can directly impact the bottom line. RudderStack outlines how the app developer Torpedo Labs uses BigQuery ML to identify high-value mobile game players who are dangerously close to churning.

Adevinta: Treating data as a product at Adevinta

Adevinta writes about its journey to adopt data as a product and the infrastructure changes around it, which led to a 13% increase in weekly querying users. I’m curious to see the learnings from the approach in which each data product has its own code repository, isolated from the rest.

James Le: What I Learned From Attending Tecton apply(meetup) 2021

James Le shared comprehensive notes from the recent Tecton apply(meetup) 2021. The rule-based data profiling from GE, ML software hierarchy, and interactive ML are exciting reads.

PayPal: A Journey from Software to Machine Learning Engineer at iZettle

A great experience-sharing blog post on the journey from software engineering to machine learning engineering, highlighting the learning process and what books and courses can't teach.

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.