Data Engineering Weekly Is Brought to You by RudderStack
RudderStack Profiles takes the SaaS guesswork, and SQL grunt work out of building complete customer profiles, so you can quickly ship actionable, enriched data to every downstream team. See how it works today.
Editors Note: 🔥 DEW is thrilled to announce a developer-centric Data Eng & AI conference in the tech hub of Bengaluru, India, on October 12th! 🔥
Data Engineering is advancing rapidly, and it's not just about keeping up - it's about shaping the future! As part of this dynamic community, DEW thrilled to announce the Data Eng & AI conference to strengthen the community even stronger.
This conference isn't just about talks - it's about collaboration, learning, and leading the way. So, let's shape the future of Data Engineering together. Please send your talk proposal today, and let's collaborate to make this the most exciting, insightful, and interactive Data Eng & AI conference yet! Please watch out for this space for the conference registration page soon!!!🌟
Submit Your Conference Talk Proposal Here
LinkedIn: Declarative Data Pipelines with Hoptimator
Data pipeline often requires multiple hop through the system to serve the end customers. LinkedIn write about Hoptimator for auto generated Flink pipeline with multiple stages of systems. Though the system not supporting any LinkedIn production ecosystem, it is an exciting open source project to watch.
https://engineering.linkedin.com/blog/2023/declarative-data-pipelines-with-hoptimator
Apache Doris: Building a Data Warehouse for Traditional Industry
The blog narates the incremental update and the overall update strategy of a typical legacy datawarehouses.
https://blog.devgenius.io/building-a-data-warehouse-for-traditional-industry-722513505c0c
BlaBlaCar: Scaling Data Teams- 5 Learnings from BlaBlaCar
BlaBlaCar shares its lessons learned from 45 people team divided into six teams: 5 multidisciplinary squads and one platform-oriented team. I noticed a Twitter conversation debating whether the data team reports to the CFO and BlaBlaCar aligns with the CTO org.
https://medium.com/blablacar/scaling-data-teams-5-learnings-from-blablacar-9e00949957f3
Sponsored: Atlan AI – The First Copilot for Data Teams
How do you create documentation for 1000s of data assets in minutes? Write SQL queries without learning SQL? Find the right, trusted data by simply asking a question. If you’re searching for an answer, it’s Atlan AI.
Atlan AI leverages metadata that Atlan captures across the data stack to make AI part of your data stack. You can get hours back by letting Atlan AI draft your documentation and write your queries. And you won’t have to ask 3 different people in your team just to find the right, trusted data — you just need your AI copilot.
Watch the Atlan AI demo and join the Atlan AI waitlist →
Ranjay Kumar: The Significance of First and Last Day of the Month in Data Engineering
The article gave me a very interesting perspective, and I thought about how the oncall structure should be extra cautious during these business milestones. Aligning Data Pipeline operations with the business milestone dates is an interesting experiment to try at work for me.
https://medium.com/@ranjayd/the-first-and-last-day-of-the-month-in-data-engineering-dcf39995d6bc
Sven Balnojan: Explain like I’m 5 — Vector Database Hype
Of all the LLM hype, I'm very excited about the emergence of vector databases. The author made an excellent attempt to describe what is vector databases to a 5-year-old.
https://medium.com/geekculture/explain-like-im-5-vector-database-hype-bd936fd319ff
Timescale: PostgreSQL as a Vector Database: Create, Store, and Query OpenAI Embeddings With pgvector
Vector databases are great, but it takes years to mature. Can't we use the vector feature in the existing databases? MongoDB Atlas recently announced the vector search capabilities for MongoDB. Similarly, the blog narrates how to use vector searching in Postgres using PGVector.
Sponsored: New eBook: Why You Need A Good Data Product Manager (And What They Do)
More data teams are putting data product managers at the helm of their most important data projects and assets. Should you? Our latest guide defines the roles and reponsibilities of data product managers, key differences between managers and engineers, how to measure success, and more.
Lyft: Building Real-time Machine Learning Foundations at Lyft
Lyft writes about LyftLearn, its internal real-time machine learning platform. The blog narrates various components & challenges of building a real-time machine learning infrastructure along with the business case to support the real-time system.
https://eng.lyft.com/building-real-time-machine-learning-foundations-at-lyft-6dd99b385a4e
DoorDash: How DoorDash Built an Ensemble Learning Model for Time Series Forecasting
Ensemble learning is a machine learning concept where multiple models (often called "base models" or "weak learners") are trained to solve the same problem. The primary idea behind ensemble methods is that a group of weak learners can come together to form a strong learner.
DoorDash writes about ELITE, an inhouse ensemble model for realtime forecasting to balance the cost, speed & correctness.
Sponsored: Introducing RudderStack Profiles: Easily Build Complete Customer Profiles in Your Warehouse
With Profiles, you can create features using pre-defined projects in the user interface or a version-controlled config file, and all of the queries and computations are taken care of automatically.
Now, every team can build a customer360 in Snowflake (see our Snowflake Integrations) with RudderStack Profiles. The new product is a data unification tool that handles the heavy lifting of identity resolution for you, streamlines user feature development, and automatically builds a customer 360 table so you can ship high-impact projects faster.
Capital One: ARIMA model tips for time series forecasting in Python
The ARIMA model is widely used in industry due to its versatility and interpretability. It adeptly handles diverse time series data, capturing trends and seasonality for reliable forecasting. Capital One writes about an introduction to ARIMA models for time series forecasting in Python with tips, best practices, and examples.
https://www.capitalone.com/tech/machine-learning/arima-model-time-series-forecasting/
Spotify: Experimenting with Machine Learning to Target In-App Messaging
The usage of Machine Learning spread across all aspect of the application lifecycle. Spotify writes about its usage of machine learning to target in-app messages. The article remind me to re-read about the contextual bandit which is an excellent refresher.
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.