Data Engineering Weekly #89
The Weekly Data Engineering Newsletter
Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Airbnb: Graph Machine Learning at Airbnb
We can frame many real-world machine learning and data analytics problems as graph problems. Yet we do not leverage graph modeling as much as we could, and we tend to treat it as an independent entity; why that is remains a constant question for me. Airbnb writes an exciting blog describing the benefits of using graphs for machine learning.
https://medium.com/airbnb-engineering/graph-machine-learning-at-airbnb-f868d65f36ee
LinkedIn: Towards data quality management at LinkedIn
LinkedIn writes about its data health monitor architecture, which tracks data freshness and schema changes to assess the health of the data. It will be interesting to see the architecture evolve from “monitoring after the fact” to a preventive solution.
https://engineering.linkedin.com/blog/2022/towards-data-quality-management-at-linkedin
Spotify: How We Built Infrastructure to Run User Forecasts at Spotify
Forecasting key business metrics quarterly, weekly, or daily helps a business monitor performance, make decisions, and improve its product offerings. Spotify writes about its forecasting infrastructure and the lessons learned.
Grab: Automated Experiment Analysis - Making experimental analysis scalable.
Grab writes about automating the basic analytics behind its experiments. The separation of metrics configuration from computation, the classification of datasets into bronze and gold, and the star schema modeling practices are exciting to read.
https://engineering.grab.com/automated-experiment-analysis
Sponsored: Firebolt - Embedded Analytics vs Data Apps
Data Apps is still a loosely defined term, and there’s a lot of debate and confusion about what it really means and how it differs from traditional dashboarding and embedded analytics. Boaz Farkash shares his point of view on the subject.
https://www.firebolt.io/blog/embedded-analytics-vs-data-apps
Etsy: Using Real-Time Streaming to Power Etsy's Offsite Ads
Streaming data from a transactional database to analytical applications is always challenging, and the infrastructure that has to be stitched together is expensive to maintain. Etsy writes about one of the challenges it faced in building an analytical app using change data capture.
https://www.etsy.com/codeascraft/using-real-time-streaming-to-power-etsy-offsite-ads
Gradient Flow: Distributed Computing for AI - A Status Report
Gradient Flow lays out the roles of, and needs for, distributed computing at each stage of the ML lifecycle to demonstrate the overlap between AI and distributed computing.
https://gradientflow.com/distributed-computing-for-ai-a-status-report/
Petr Janda: A path towards a data platform that aligns data, value, and people
The rapid growth in data sources and use cases brings challenges to modern cloud data infrastructure. The blog narrates those challenges and explains why the data product approach solves these emerging issues.
https://petrjanda.substack.com/p/a-path-towards-a-data-platform-that
Chad Sanderson: The Death of Data Modeling - Pt. 1
Is data modeling dead? The author narrates the importance of data modeling and why it’s hard to adopt with modern data technologies. The traditional approach of a centralized data modeling team won’t scale, and the author calls for rethinking data modeling.
https://dataproducts.substack.com/p/the-death-of-data-modeling-pt-1
Sponsored: RudderStack - What is the Growth Stack?
A detailed guide to building the Growth Stack—an architecture to centralize every data point into a comprehensive source of truth and activate that centralized data in downstream tools. The growth stack is phase two of RudderStack's Data Maturity Journey framework.
https://www.rudderstack.com/blog/what-is-the-growth-stack
Jarek Potiuk: Airflow Summit 2022 — The Best Of
Airflow Summit hosted some excellent talks on data engineering, and the author summarizes the conference talks here.
https://potiuk.com/airflow-summit-2022-the-best-of-373bee2527fa
Zalando: Accelerate testing in Apache Airflow through DAG versioning
Zalando writes about how to version Airflow DAGs on a single server through isolated pipeline and data environments, enabling more convenient simulation and testing.
Altexsoft: Customer Churn Prediction Using Machine Learning - Main Approaches and Models
An excellent overview of how SaaS companies handle customer churn prediction, along with the machine learning models and approaches used to predict churn.
https://www.kdnuggets.com/2019/05/churn-prediction-machine-learning.html
HomeToGo Engineering: How HomeToGo connected dbt and Superset to make metadata more accessible and reduce analytical overhead
HomeToGo writes about integrating dbt metadata with Superset, making the metadata easier for data consumers to access. The approach of moving metric definitions out of Superset into a more source-controlled solution and using the dbt manifest as the source of truth is fascinating.
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.