Data Engineering Weekly #137

The Weekly Data Engineering Newsletter

Jul 03, 2023

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack Profiles takes the SaaS guesswork, and SQL grunt work out of building complete customer profiles, so you can quickly ship actionable, enriched data to every downstream team. See how it works today.

Editors Note: 🔥 DEW is thrilled to announce a developer-centric Data Eng & AI conference in the tech hub of Bengaluru, India, on October 12th! 🔥

Data Engineering is advancing rapidly, and it's not just about keeping up - it's about shaping the future! As part of this dynamic community, DEW thrilled to announce the Data Eng & AI conference to strengthen the community even stronger.

This conference isn't just about talks - it's about collaboration, learning, and leading the way. So, let's shape the future of Data Engineering together. Please send your talk proposal today, and let's collaborate to make this the most exciting, insightful, and interactive Data Eng & AI conference yet! Please watch out for this space for the conference registration page soon!!!🌟

Submit Your Conference Talk Proposal Here

LinkedIn: Declarative Data Pipelines with Hoptimator

Data pipeline often requires multiple hop through the system to serve the end customers. LinkedIn write about Hoptimator for auto generated Flink pipeline with multiple stages of systems. Though the system not supporting any LinkedIn production ecosystem, it is an exciting open source project to watch.

https://engineering.linkedin.com/blog/2023/declarative-data-pipelines-with-hoptimator

Apache Doris: Building a Data Warehouse for Traditional Industry

The blog narates the incremental update and the overall update strategy of a typical legacy datawarehouses.

https://blog.devgenius.io/building-a-data-warehouse-for-traditional-industry-722513505c0c

BlaBlaCar: Scaling Data Teams- 5 Learnings from BlaBlaCar

BlaBlaCar shares its lessons learned from 45 people team divided into six teams: 5 multidisciplinary squads and one platform-oriented team. I noticed a Twitter conversation debating whether the data team reports to the CFO and BlaBlaCar aligns with the CTO org.

https://medium.com/blablacar/scaling-data-teams-5-learnings-from-blablacar-9e00949957f3

Ranjay Kumar: The Significance of First and Last Day of the Month in Data Engineering

The article gave me a very interesting perspective, and I thought about how the oncall structure should be extra cautious during these business milestones. Aligning Data Pipeline operations with the business milestone dates is an interesting experiment to try at work for me.

https://medium.com/@ranjayd/the-first-and-last-day-of-the-month-in-data-engineering-dcf39995d6bc

Sven Balnojan: Explain like I’m 5 — Vector Database Hype

Of all the LLM hype, I'm very excited about the emergence of vector databases. The author made an excellent attempt to describe what is vector databases to a 5-year-old.

https://medium.com/geekculture/explain-like-im-5-vector-database-hype-bd936fd319ff

Timescale: PostgreSQL as a Vector Database: Create, Store, and Query OpenAI Embeddings With pgvector

Vector databases are great, but it takes years to mature. Can't we use the vector feature in the existing databases? MongoDB Atlas recently announced the vector search capabilities for MongoDB. Similarly, the blog narrates how to use vector searching in Postgres using PGVector.

https://www.timescale.com/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/

Lyft: Building Real-time Machine Learning Foundations at Lyft

The steps of the ML Model Development Life Cycle at Lyft.

Lyft writes about LyftLearn, its internal real-time machine learning platform. The blog narrates various components & challenges of building a real-time machine learning infrastructure along with the business case to support the real-time system.

https://eng.lyft.com/building-real-time-machine-learning-foundations-at-lyft-6dd99b385a4e

DoorDash: How DoorDash Built an Ensemble Learning Model for Time Series Forecasting

Ensemble learning is a machine learning concept where multiple models (often called "base models" or "weak learners") are trained to solve the same problem. The primary idea behind ensemble methods is that a group of weak learners can come together to form a strong learner.

Figure 1: Two-layer (base and ensemble) model workflow and architecture

DoorDash writes about ELITE, an inhouse ensemble model for realtime forecasting to balance the cost, speed & correctness.

https://doordash.engineering/2023/06/20/how-doordash-built-an-ensemble-learning-model-for-time-series-forecasting/

Capital One: ARIMA model tips for time series forecasting in Python

The ARIMA model is widely used in industry due to its versatility and interpretability. It adeptly handles diverse time series data, capturing trends and seasonality for reliable forecasting. Capital One writes about an introduction to ARIMA models for time series forecasting in Python with tips, best practices, and examples.

https://www.capitalone.com/tech/machine-learning/arima-model-time-series-forecasting/

Spotify: Experimenting with Machine Learning to Target In-App Messaging

The usage of Machine Learning spread across all aspect of the application lifecycle. Spotify writes about its usage of machine learning to target in-app messages. The article remind me to re-read about the contextual bandit which is an excellent refresher.

https://engineering.atspotify.com/2023/06/experimenting-with-machine-learning-to-target-in-app-messaging/

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.

Data Engineering Weekly