Data Engineering Weekly #93

The Weekly Data Engineering Newsletter

Jul 18, 2022

Joseph Gonzalez: How Machine Learning Became Useful - Reflecting on a Decade of Research

Everyone talked about ML, but what they needed was data

This excellent article captures the author's observations on industrial-scale ML engineering. The author stresses that access to data and compute through simple abstractions is the key to making ML useful. It is an excellent read on the next significant challenges of ML.

https://medium.com/@profjoeyg/how-machine-learning-became-useful-5732c3419c81

Airbnb: How Airbnb Safeguards Changes in Production - Evolution of Airbnb’s experimentation platform

Data engineering practices & A/B testing have broader applications in reliability engineering. While working on the Slack observability team, I realized that a simple Apdex score computed on HTTP error codes is a significant feedback loop for improving system reliability. Airbnb writes an excellent blog narrating how it uses A/B testing & data science to safeguard changes in production.

https://medium.com/airbnb-engineering/how-airbnb-safeguards-changes-in-production-9fc9024f3446
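
To make the Apdex idea above concrete, here is a minimal sketch in Python. Apdex is conventionally defined over response times as (satisfied + tolerating / 2) / total. Applying it to HTTP status codes needs a bucketing rule, and the one used here (2xx and 3xx as satisfied, 4xx as tolerating, everything else as frustrated) is an assumption for illustration, not something taken from the Airbnb post or from Slack's implementation. The names `bucket` and `apdex_from_status_codes` are hypothetical.

```python
from collections import Counter
from typing import Iterable


def bucket(status: int) -> str:
    """Map an HTTP status code to an Apdex bucket (assumed mapping)."""
    if 200 <= status < 400:
        return "satisfied"
    if 400 <= status < 500:
        return "tolerating"
    return "frustrated"  # 5xx and anything unexpected


def apdex_from_status_codes(statuses: Iterable[int]) -> float:
    """Apdex = (satisfied + tolerating / 2) / total requests."""
    counts = Counter(bucket(s) for s in statuses)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no requests to score")
    return (counts["satisfied"] + counts["tolerating"] / 2) / total


if __name__ == "__main__":
    # 90 successes, 6 client errors, 4 server errors
    sample = [200] * 90 + [404] * 6 + [500] * 4
    print(f"{apdex_from_status_codes(sample):.2f}")  # prints 0.93
```

Tracking this single number per deploy or per experiment arm is one simple way to turn request logs into a reliability feedback loop.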

Criteo: Scheduling Data Pipelines at Criteo - Introducing Criteo’s BigDataFlow project

Criteo writes about its homegrown scheduler, Cuttle, and a self-service web application called BigDataflow that acts as an abstraction over Cuttle. The design centers on building a simpler task abstraction. By inferring the DAG of tasks statically, the platform implements everything expected of a workflow management system, such as task execution, data retention & backfilling.

Part 1: https://medium.com/criteo-engineering/scheduling-data-pipelines-at-criteo-part-1-8b257c6c8e55

Part 2: https://medium.com/criteo-engineering/scheduling-data-pipelines-at-criteo-part-2-8b0da38ff3a4
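
As a rough illustration of inferring a DAG statically, the sketch below derives task dependencies purely from the datasets each task declares as inputs and outputs, then orders them with Python's standard `graphlib` module (Python 3.9+). This is not Cuttle's or BigDataflow's API. The task names, dataset names, and the `infer_dag` and `execution_order` helpers are all hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task declarations: name -> (datasets read, datasets written)
TASKS = {
    "load_clicks": ((), ("raw.clicks",)),
    "load_sales": ((), ("raw.sales",)),
    "join_sessions": (("raw.clicks", "raw.sales"), ("core.sessions",)),
    "daily_report": (("core.sessions",), ("reports.daily",)),
}


def infer_dag(tasks):
    """Return {task: set of upstream tasks} from declared inputs and outputs."""
    producer = {}
    for name, (_, outputs) in tasks.items():
        for dataset in outputs:
            producer[dataset] = name
    # A task depends on whichever task produces each dataset it reads.
    return {
        name: {producer[d] for d in inputs if d in producer}
        for name, (inputs, _) in tasks.items()
    }


def execution_order(tasks):
    """Order tasks so that every producer runs before its consumers."""
    return list(TopologicalSorter(infer_dag(tasks)).static_order())


if __name__ == "__main__":
    print(execution_order(TASKS))
```

Because the graph is known before anything runs, the same structure can in principle drive backfills (re-run a task and everything downstream) without users wiring dependencies by hand.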

Riskified Technology: Spark Streaming as a Service

Riskified writes about its self-serve streaming infrastructure built on Spark Streaming. The first hurdle for any platform is to minimize the manual-intensive tasks that can create unavoidable bureaucracy. It's admirable that the author calls this out explicitly in the blog.

https://medium.com/riskified-technology/spark-streaming-as-a-service-53420d8a857b

Mikkel Dengsøe: Data teams are getting larger, faster - On the relationship between data team size and complexity

Ananth Packkildurai @ananthdurai

As the size of the organization grows, the data maturity shrinks. The complexity of the data outgrows the usability of the data. Has anyone seen this pattern? Curious to know data folks' thoughts on it.

A few months back, I observed that data maturity shrinks as an organization grows. The author makes the same case and shares a few techniques to handle the complexity.

https://mikkeldengsoe.substack.com/p/data-team-size

LinkedIn: Measuring marketing incremental impacts beyond last click attribution

A marketing campaign can reach the consumer via multiple channels, so last-click conversion measurement can misrepresent the influence of any single marketing channel. LinkedIn writes about measuring incremental impact using a Bayesian Structural Time Series (BSTS) model to estimate the causal effect of an intervention.

https://engineering.linkedin.com/blog/2022/measuring-marketing-incremental-impacts

Shippeo: Debezium to Snowflake - Lessons learned building data replication in production

Shippeo writes about its CDC infrastructure using Debezium, Kafka & Snowflake. The blog includes excellent insights on data format choice, Postgres reliability, and observability metrics.

https://medium.com/shippeo-tech-blog/debezium-to-snowflake-lessons-learned-building-data-replication-in-production-a5430a9fe85b

GovTech Singapore: Towards a comparable metric for AI model interpretability

Data Engineering Weekly frequently features articles about the social impact of AI. XAI (explainable AI) plays a vital role in bringing transparency to AI systems. The two-part blog from GovTech Singapore explains the applications of XAI and gives an overview of XAI methods.

Part 1: https://medium.com/dsaid-govtech/towards-a-comparable-metric-for-ai-model-interpretability-part-1-d55d4bae8a58

Part 2: https://medium.com/dsaid-govtech/towards-a-comparable-metric-for-ai-model-interpretability-part-2-423e4fc2b232

Policygenius: How we implemented a Tableau governance strategy

Policygenius writes about implementing a data governance policy for Tableau. The blog highlights the classic dashboard management problem and explains how a decentralized, data-driven approach helped the team scale its data governance strategy.

https://medium.com/policygenius-stories/how-we-implemented-a-tableau-governance-strategy-59c055727433

Guang X: Lessons learned from Azure Data Factory

I hear significantly less about Azure services for data engineering. The blog is an excellent first insight into the lessons learned from Azure Data Factory.

https://medium.com/@guangx/lessons-learned-from-azure-data-factory-4778eca0fc25

Jellysmack Labs: How Jellysmack Pushed Data Science Jobs Orchestration to a Production-ready Level

Jellysmack Labs writes about its Airflow usage in running production-ready data science jobs. The focus on DAG quality and on uniformity in declaring DAGs & tasks makes for an exciting read.

https://medium.com/jellysmacklabs/how-jellysmack-pushed-data-science-jobs-orchestration-to-a-production-ready-level-e92dc4786413

Walmart Global Tech: DataBathing — A Framework for Transferring the Query to Spark Code

Staying with the uniformity of jobs, Walmart writes about DataBathing, a framework to transpile SQL into Spark DataFrame calculation flow code. The blog claims a performance improvement of 10% to 80% with DataBathing. The blog mentions that there will be a follow-up post on why such a transformer is required. I'm curious to read the follow-up.

https://medium.com/walmartglobaltech/databathing-a-framework-for-transferring-the-query-to-spark-code-484957a7e049
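
To give a feel for what transpiling a query into DataFrame code means, here is a toy sketch that renders an already-parsed query as a PySpark method chain in text form. It is not DataBathing's API or algorithm. The dict layout and the `to_dataframe_code` helper are hypothetical, and it handles only a flat SELECT with optional WHERE and LIMIT. The emitted calls (`spark.table`, `filter`, `select`, `limit`) are standard PySpark methods, but the sketch only generates the code as a string and does not need Spark installed.

```python
def to_dataframe_code(query: dict) -> str:
    """Render a tiny parsed-query dict as PySpark DataFrame code (as text).

    Assumes the WHERE expression contains no double quotes.
    """
    steps = [f'spark.table("{query["from"]}")']
    if "where" in query:
        steps.append(f'.filter("{query["where"]}")')
    if "select" in query:
        cols = ", ".join(f'"{c}"' for c in query["select"])
        steps.append(f".select({cols})")
    if "limit" in query:
        steps.append(f".limit({query['limit']})")
    return "".join(steps)


if __name__ == "__main__":
    # Stand-in for: SELECT user_id, amount FROM orders WHERE amount > 100 LIMIT 10
    parsed = {
        "select": ["user_id", "amount"],
        "from": "orders",
        "where": "amount > 100",
        "limit": 10,
    }
    print(to_dataframe_code(parsed))
```

A real transpiler would start from a SQL parser and cover joins, aggregations, and subqueries, which is where most of the complexity lives.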

All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
