Data Engineering Weekly #93
The Weekly Data Engineering Newsletter
Sponsored: Rudderstack - DROP the Modern Data Stack
It’s time to make sense of today’s data tooling ecosystem. Check out rudderstack.com/dmds to get a guide that will help you build a practical data stack for every phase of your company’s journey to data maturity. The guide includes architecture and tactical advice to help you progress through four stages: Starter, Growth, Machine Learning, and Real-Time. Visit RudderStack.com/dmds today to DROP the modern data stack and USE a practical data engineering framework.
Joseph Gonzalez: How Machine Learning Became Useful - Reflecting on a Decade of Research
Everyone talked about ML, but what they needed was data
An excellent article capturing the author's observations on industrial-scale ML engineering. The author stresses that access to data and compute through simple abstractions is the key to ML. It is an excellent read on the next significant challenges of ML.
Airbnb: How Airbnb Safeguards Changes in Production - Evolution of Airbnb’s experimentation platform
Data engineering practices and A/B testing have broader applications in reliability engineering. While working on the Slack observability team, I realized that a simple Apdex score computed on HTTP error codes is a significant feedback loop for improving system reliability. Airbnb writes an excellent blog narrating how it uses A/B testing and data science to safeguard changes in production.
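For readers who have not met Apdex before, here is a minimal sketch of the idea. The formula (satisfied + tolerating / 2) / total is the standard Apdex definition. The mapping of HTTP status codes to buckets is my own assumption for illustration, not something taken from the Airbnb article or from Slack's implementation.

```python
from collections import Counter


def classify(status: int) -> str:
    """Bucket an HTTP status code. This mapping is an assumption for illustration."""
    if status >= 500:
        return "frustrated"   # server errors
    if status >= 400:
        return "tolerating"   # client errors
    return "satisfied"        # everything below 400


def apdex(status_codes) -> float:
    """Standard Apdex formula: (satisfied + tolerating / 2) / total."""
    counts = Counter(classify(s) for s in status_codes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no requests to score")
    return (counts["satisfied"] + counts["tolerating"] / 2) / total


# Hypothetical sample: 5 satisfied, 1 tolerating, 2 frustrated requests.
codes = [200, 200, 201, 404, 500, 200, 503, 200]
score = apdex(codes)  # (5 + 0.5) / 8 = 0.6875
```

Tracking this score per deploy or per experiment arm turns it into the kind of feedback loop described above.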
Criteo: Scheduling Data Pipelines at Criteo - Introducing Criteo’s BigDataFlow project
Criteo writes about its homegrown scheduler Cuttle and a self-service web application called BigDataflow as an abstraction for Cuttle. The design is centered around building a simpler task abstraction. By inferring the DAG of tasks statically, the platform implements everything expected of a workflow management system, such as task execution, data retention, and backfilling.
Part 1: https://medium.com/criteo-engineering/scheduling-data-pipelines-at-criteo-part-1-8b257c6c8e55
Part 2: https://medium.com/criteo-engineering/scheduling-data-pipelines-at-criteo-part-2-8b0da38ff3a4
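To make "inferring the DAG of tasks statically" concrete, here is a minimal sketch of one way it can work: each task declares the datasets it reads and writes, and dependencies fall out by matching inputs to producers. The task names, dataset names, and the `infer_dag` helper are all hypothetical. This is not Cuttle's or BigDataflow's actual API.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task declarations: each task names the datasets it reads and writes.
TASKS = {
    "load_clicks": {"inputs": [], "outputs": ["raw.clicks"]},
    "load_users": {"inputs": [], "outputs": ["raw.users"]},
    "join_sessions": {
        "inputs": ["raw.clicks", "raw.users"],
        "outputs": ["core.sessions"],
    },
    "daily_report": {"inputs": ["core.sessions"], "outputs": ["report.daily"]},
}


def infer_dag(tasks):
    """Return {task: set of upstream tasks} by matching inputs to their producers."""
    producer = {}
    for name, spec in tasks.items():
        for dataset in spec["outputs"]:
            if dataset in producer:
                raise ValueError(f"{dataset} is written by two tasks")
            producer[dataset] = name
    # Inputs with no known producer are treated as external sources.
    return {
        name: {producer[d] for d in spec["inputs"] if d in producer}
        for name, spec in tasks.items()
    }


dag = infer_dag(TASKS)
# A valid execution order: every task appears after the tasks it depends on.
order = list(TopologicalSorter(dag).static_order())
```

Because the graph is known before anything runs, a scheduler built this way can also work out which downstream tasks to rerun for a backfill without the user wiring dependencies by hand.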
Riskified Technology: Spark Streaming as a Service
Riskified writes about its self-serve streaming infrastructure using Spark Streaming. The first blocker for a platform is to minimize the manual-intensive tasks that can create unavoidable bureaucracy. It's admirable that the author called it out explicitly in the blog.
Sponsored: Firebolt - Firebolt is a proud sponsor of Data Engineering Weekly.
Firebolt is the cloud data warehouse for builders of next-gen analytics experiences.
Combining the benefits and ease of use of modern architecture with sub-second performance at terabyte scale, Firebolt helps data engineering and dev teams deliver data applications that end-users love.
Mikkel Dengsøe: Data teams are getting larger, faster - On the relationship between data team size and complexity
A few months back, I observed that data maturity shrinks as an organization grows. The author makes the same case and shares a few techniques to handle the complexity.
LinkedIn: Measuring marketing incremental impacts beyond last click attribution
A marketing campaign can reach the consumer via multiple channels, so last-click conversion measurement can misrepresent the influence of a marketing channel. LinkedIn writes about measuring incremental impact with a Bayesian Structural Time Series (BSTS) model, which estimates the causal effect of an intervention.
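As a rough orientation, the simplest textbook form of a structural time series model with regressors looks like the sketch below. The symbols are my own: y_t is the observed metric, x_t are control series unaffected by the campaign, t_0 is the intervention time, and the "cf" superscript marks the model's counterfactual prediction. LinkedIn's actual model specification may well differ.

```latex
\begin{aligned}
y_t &= \mu_t + \beta^{\top} x_t + \varepsilon_t,
  & \varepsilon_t &\sim \mathcal{N}(0, \sigma_\varepsilon^2) \\
\mu_{t+1} &= \mu_t + \eta_t,
  & \eta_t &\sim \mathcal{N}(0, \sigma_\eta^2) \\
\text{effect}_t &= y_t - \hat{y}_t^{\,\text{cf}},
  & t &> t_0
\end{aligned}
```

The model is fitted on pre-intervention data only, and the incremental impact is the gap between what was observed and what the model predicts would have happened without the campaign.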
Shippeo: Debezium to Snowflake - Lessons learned building data replication in production
Shippeo writes about its CDC infrastructure using Debezium, Kafka & Snowflake. The blog includes excellent insights on data format choice, Postgres reliability, and observability metrics.
Sponsored: Rudderstack - The Data Maturity Journey - Webinar July 27th at 10:30 AM PT / 1:30 ET
Join RudderStack live with the Seattle Data Guy, Ben Rogojan, and Max Werner, Owner at Obsessive Analytics Consulting, to learn about the four stages of The Data Maturity Journey. You'll come away with practical architectures you can use to drive better decision making at every stage of your company's growth.
GovTech Singapore: Towards a comparable metric for AI model interpretability
Data Engineering Weekly frequently features articles about the social impact of AI. Explainable AI (XAI) plays a vital role in bringing transparency to AI systems. The two-part blog from GovTech Singapore explains the applications of XAI and gives an overview of XAI methods.
Part 1: https://medium.com/dsaid-govtech/towards-a-comparable-metric-for-ai-model-interpretability-part-1-d55d4bae8a58
Part 2: https://medium.com/dsaid-govtech/towards-a-comparable-metric-for-ai-model-interpretability-part-2-423e4fc2b232
Policygenius: How we implemented a Tableau governance strategy
Policygenius writes about implementing a data governance policy for Tableau. The blog highlights the classic dashboard management problem and explains how a decentralized, data-driven approach helped them scale the data governance strategy.
Guang X: Lessons learned from Azure Data Factory
I hear significantly less about Azure services for data engineering. The blog is an excellent first insight into the lessons learned from Azure Data Factory.
Jellysmack Labs: How Jellysmack Pushed Data Science Jobs Orchestration to a Production-ready Level
Jellysmack Labs writes about its Airflow usage in running production-ready data science jobs. The focus on DAG quality and on uniformity in declaring DAGs and tasks makes it an exciting read.
Walmart Global Tech: DataBathing — A Framework for Transferring the Query to Spark Code
Staying with the uniformity of the jobs, Walmart writes about DataBathing, a framework to transpile SQL into Spark DataFrame calculation flow code. The blog claims performance improvements of 10 to 80% with DataBathing. The blog mentions that there will be a follow-up post on why such a transformer is required. I'm curious to read the follow-up.
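To give a feel for what "SQL to DataFrame code" means, here is a toy sketch that renders an already-parsed SELECT query as a chained PySpark expression. The query dictionary shape and the `to_dataframe_code` helper are hypothetical and far simpler than a real transpiler. This is not DataBathing's API. The emitted calls (`spark.table`, `filter`, `selectExpr`, `limit`) are regular PySpark methods, and the sketch itself only builds a string, so it runs without Spark installed.

```python
def to_dataframe_code(query: dict) -> str:
    """Render a parsed SELECT query as a chained PySpark DataFrame expression.

    Assumes simple inputs: the WHERE condition and column names must not
    contain double quotes, since no escaping is done in this toy version.
    """
    table = query["from"]
    steps = [f'spark.table("{table}")']

    where = query.get("where")
    if where:
        # WHERE is applied before the projection, mirroring SQL semantics.
        steps.append(f'.filter("{where}")')

    select = query.get("select")
    if select:
        columns = ", ".join(f'"{column}"' for column in select)
        steps.append(f".selectExpr({columns})")

    limit = query.get("limit")
    if limit is not None:
        steps.append(f".limit({limit})")

    return "".join(steps)


# Hypothetical parsed form of:
#   SELECT user_id, amount FROM orders WHERE amount > 100 LIMIT 10
parsed = {
    "select": ["user_id", "amount"],
    "from": "orders",
    "where": "amount > 100",
    "limit": 10,
}
code = to_dataframe_code(parsed)
# spark.table("orders").filter("amount > 100").selectExpr("user_id", "amount").limit(10)
```

A real framework has to handle joins, subqueries, aggregations, and window functions, which is where most of the difficulty lives.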
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.