Data Engineering Weekly #131

The Weekly Data Engineering Newsletter

May 22, 2023

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Ramon Marrero: DBT Model Contracts - Importance and Pitfalls

dbt introduced model contracts with the 1.5 release. The implementation drew a few critiques, such as The False Promise of dbt Contracts. I found the argument made in that article surprising, especially the comment below.

"As a model owner, if I change the columns or types in the SQL, it's usually intentional." My immediate reaction was, hmm, not really.

However, as with the first iteration of any system, the dbt model contract implementation has pros and cons. I’m sure it will evolve as adoption increases. The author did an amazing job writing a balanced view of dbt model contracts.
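
To make the idea concrete, here is a minimal sketch of the kind of check a model contract performs: compare the columns and types a model declares against what it actually produces, and fail on any drift. This is not dbt's implementation (dbt declares contracts in model configuration and enforces them at build time); the `Column` class, the `contract_violations` function, and the sample columns are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    data_type: str


def contract_violations(declared, actual):
    """Return readable mismatches between a declared contract and the
    columns a model actually produces (both are lists of Column)."""
    declared_types = {c.name: c.data_type for c in declared}
    actual_types = {c.name: c.data_type for c in actual}
    problems = []
    for name, data_type in declared_types.items():
        if name not in actual_types:
            problems.append(f"missing column: {name}")
        elif actual_types[name] != data_type:
            problems.append(
                f"type mismatch on {name}: "
                f"expected {data_type}, got {actual_types[name]}"
            )
    for name in actual_types:
        if name not in declared_types:
            problems.append(f"unexpected column: {name}")
    return problems


# Hypothetical contract and model output, for illustration only.
contract = [Column("order_id", "int"), Column("amount", "float")]
model_output = [
    Column("order_id", "int"),
    Column("amount", "string"),  # a type change nobody intended
    Column("note", "string"),    # a column the contract does not know about
]

violations = contract_violations(contract, model_output)
for violation in violations:
    print(violation)
```

An unintended type change like the one on `amount` is exactly the case where "it's usually intentional" does not hold, and a check of this shape surfaces it before downstream consumers see it.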

https://medium.com/geekculture/dbt-model-contracts-importance-and-pitfalls-20b113358ad7

Instacart: How Instacart Ads Modularized Data Pipelines With Lakehouse Architecture and Spark

Instacart writes about its journey of building its ads measurement platform. A few things stand out for me in the blog.

The event store is moving from S3/Parquet storage to Delta Lake storage, a sign of LakeHouse format adoption across the board.
Instacart's adoption of the Databricks ecosystem alongside Snowflake.
The move to rewrite SQL into a composable Spark SQL pipeline for better readability and testing.
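
The last point, composability, can be sketched without Spark at all. The example below uses plain Python lists of dictionaries so it runs anywhere; in a Spark pipeline each step would take and return a DataFrame instead. The step names, the event fields, and `run_pipeline` are hypothetical and are not taken from Instacart's code.

```python
from functools import reduce


def only_ad_events(rows):
    """Keep rows that describe ad impressions or clicks."""
    return [r for r in rows if r["event_type"] in {"impression", "click"}]


def add_is_click(rows):
    """Add a 0/1 flag so clicks can be summed later."""
    return [{**r, "is_click": int(r["event_type"] == "click")} for r in rows]


def clicks_per_campaign(rows):
    """Aggregate the click flag per campaign, sorted by campaign id."""
    totals = {}
    for r in rows:
        totals[r["campaign_id"]] = totals.get(r["campaign_id"], 0) + r["is_click"]
    return [{"campaign_id": k, "clicks": v} for k, v in sorted(totals.items())]


def run_pipeline(rows, steps):
    """Apply each step to the output of the previous one."""
    return reduce(lambda data, step: step(data), steps, rows)


# Hypothetical events, for illustration only.
events = [
    {"campaign_id": "a", "event_type": "impression"},
    {"campaign_id": "a", "event_type": "click"},
    {"campaign_id": "b", "event_type": "click"},
    {"campaign_id": "b", "event_type": "page_view"},
    {"campaign_id": "a", "event_type": "click"},
]

result = run_pipeline(events, [only_ad_events, add_is_click, clicks_per_campaign])
print(result)
```

Because every step is a small function with one input and one output, each can be unit tested on a handful of rows, which is much harder with one long SQL statement.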

https://tech.instacart.com/how-instacart-ads-modularized-data-pipelines-with-lakehouse-architecture-and-spark-e9863e28488d

Timo Dechau: The extensive guide for Server-Side Tracking

The blog is an excellent overview of server-side event tracking. The author highlights how event tracking is always closer to the UI flow than to the business flow, and everything that can go wrong with frontend event tracking. A must-read article if you’re passionate about event tracking like me.

https://hipsterdatastack.substack.com/p/the-extensive-guide-for-server-side

Compass: Enterprise Data Platform at Compass

We are starting to see a growing pattern of minimizing data infrastructure and standardizing the data platform. I can also see a wave of migration from data warehouses to LakeHouse formats. Compass writes about one such case: its move from disparate data infrastructure to Databricks.

https://medium.com/compass-true-north/enterprise-data-platform-compass-4f96eeec1894

Whatnot: Building the Seller Analytics Dashboard

Building and serving customer-facing analytics brings full-stack data engineering complexity. Whatnot writes about its end-to-end seller analytics dashboard pipeline flow with dbt, Rockset, and Snowflake.

https://medium.com/whatnot-engineering/building-the-seller-analytics-dashboard-ccffd2a0151a

Pinterest: An ML-based approach to proactive advertiser churn prevention

Churn prediction and simulation are a vital part of business operations, and Pinterest writes about an ML approach to predicting and preventing advertiser churn. The blog narrates the choice of a Gradient Boosting Decision Tree (GBDT) architecture and the use of the SHAP library to estimate each feature's contribution to the model's probability output.
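
For readers new to feature attribution: SHAP is built on Shapley values, which split the gap between one prediction and a baseline prediction across the input features. The sketch below computes exact Shapley values by brute force for a toy scoring function. It does not use the SHAP library, which relies on far more efficient algorithms for tree models, and the scoring function, feature names, and numbers are hypothetical rather than Pinterest's.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one prediction.

    Features not yet "switched on" keep their baseline value. The cost grows
    factorially with the number of features, so this suits tiny examples only.
    """
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for ordering in orderings:
        current = dict(baseline)
        previous_output = predict(current)
        for name in ordering:
            # Switch one feature from its baseline to its actual value
            # and credit it with the change in the output.
            current[name] = instance[name]
            output = predict(current)
            totals[name] += output - previous_output
            previous_output = output
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_churn_score(features):
    """Hypothetical churn score with an interaction term."""
    score = 0.1
    score += 0.3 * features["spend_drop"]
    score += 0.1 * features["inactive_weeks"]
    score += 0.05 * features["spend_drop"] * features["inactive_weeks"]
    return score


advertiser = {"spend_drop": 1.0, "inactive_weeks": 2.0}
baseline = {"spend_drop": 0.0, "inactive_weeks": 0.0}

contributions = shapley_values(toy_churn_score, advertiser, baseline)
print(contributions)
```

The contributions always add up to the difference between the advertiser's score and the baseline score, which is what makes them usable as a per-feature explanation of a single prediction.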

https://medium.com/pinterest-engineering/an-ml-based-approach-to-proactive-advertiser-churn-prevention-3a7c0c335016

Microsoft: Using graphs to model and analyze the customer journey

Graph analytics has wide and exciting applications in data analytics. Representing analytical data in a graph rather than a relational data structure is a quest for many. Microsoft writes about customer journey analytics using Neo4j.

https://medium.com/data-science-at-microsoft/using-graphs-to-model-and-analyze-the-customer-journey-4b1f1e9f3696

Walmart: Rapid & Reliable ML Experiments Using MLOps Best Practices

Machine learning model development without a structured process is like trying to assemble Ikea furniture in the dark – prepare for chaos, confusion, and possibly a few extra screws! Walmart writes about how to avoid that chaos, with best practices built on open-source tools.

https://medium.com/walmartglobaltech/rapid-reliable-ml-experiments-using-mlops-best-practices-7f01e563cb3e

Reddit: Wrangling BigQuery at Reddit

There were a few interesting comments about BigQuery being the best tech with the worst marketing. Reddit writes about monitoring its slot-utilization-based query scheduler in real time to answer the consumers' pressing question: why is my query slow!!!

https://www.reddit.com/r/RedditEng/comments/13iat74/wrangling_bigquery_at_reddit/

LINE MAN Wongnai: How we monitor thousands of Spark data pipelines

Staying on monitoring and observability of the data pipeline, LMWN writes about monitoring Apache Spark pipelines at scale. The blog gives an overview of its batch pipeline, the challenges, and the monitoring techniques for observing Spark data pipelines.

https://medium.com/@artthananstr/how-we-monitor-thousands-of-spark-data-pipelines-a918c7c7916a

Credit Saison: Using Jira to Automate Updations and Additions of Glue Tables

This schema change could’ve been a JIRA ticket!!!

I found the article an excellent example of workflow automation on top of a familiar ticketing system, JIRA. The blog narrates the challenges with Glue Crawler and how selectively applying database change management through JIRA helps overcome the technical debt of running a custom crawler for 6+ hours.

https://medium.com/credit-saison-india/using-jira-to-automate-updations-and-additions-of-glue-tables-58d39adf9940

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
