Data Engineering Weekly #122

The Weekly Data Engineering Newsletter

Mar 13, 2023

Contribute to the RudderStack Transformations Library, Win $1,000

RudderStack Transformations lets you customize event data in real time with your own JavaScript or Python code. Now you can win $1,000 cash by contributing a Transformation to our open-source library.

https://www.rudderstack.com/blog/join-the-transformations-challenge-for-a-chance-to-win/

Editor’s Note: Data Engineering Radio

At Data Engineering Weekly, we strive to bring you the best thinking on building and operating data systems. However, a newsletter has its limitations. When I read articles for Data Engineering Weekly, I often say to myself: this is amazing work; I want to talk to the author to learn more or discuss it.

With this selfish quest to know more, Ashwin and I started a new podcast. We call it Data Engineering Weekly Radio. In each episode, we take three articles and analyze them in depth, and we hope to bring the authors on to discuss their work further.

We will publish all the podcasts on Substack. You can also listen to the podcast on Apple and Spotify.

Please share your feedback on how we can improve the show further.

Pedram Navid: dbt Reimagined

The author writes about reimagining dbt and suggests possible areas of improvement in dbt's core product:

DSLs over Templated Code
Debuggers [make it easy to debug with code editor breakpoints]
Unit tests

I can’t emphasize enough the importance of DSLs over templated code. The advantage of a DSL is that a compiler can understand it, which opens the door to productivity improvements such as type safety and debuggers. I expressed a similar thought some time back.

Please reach out if you come across any SQLish DSL for data transformation; I would love to discuss it more, and I hope that will help Pedram to sleep better 😀

https://pedram.substack.com/p/dbt-reimagined

Max Illis: On Data Products and how to describe them

There is an increasingly healthy conversation about treating data as a product, which brings product thinking to the data asset creation and lifecycle management process. Max Illis, a leading voice in this space, describes the importance of the data product approach and provides a detailed checklist of the information needed to publish, discover, and manage a data product.

https://medium.com/@maxillis/on-data-products-and-how-to-describe-them-76ae1b7abda4

Brex: Change Data Capture at Brex

Brex writes an in-depth article about the technical implementation of its CDC pipeline, combined with transaction event publishing through an outbox pattern. The blog narrates the architecture for implementing the outbox pattern with Debezium, the usage of Debezium's outbox router, and the lessons learned.

https://medium.com/brexeng/change-data-capture-at-brex-c71263616dd7
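For readers new to the outbox pattern, here is a minimal sketch using Python's built-in sqlite3 module. It is not Brex's implementation: the table names, column names, and the `record_payment` and `drain_outbox` helpers are all hypothetical, and a simple polling function stands in for the role Debezium plays by reading the database change log. The point it illustrates is that the business row and the event row are written in the same transaction, so an event is never published for a write that did not commit.

```python
import json
import sqlite3
import uuid


def create_schema(conn):
    # Hypothetical schema: one business table and one outbox table.
    conn.executescript("""
        CREATE TABLE payments (
            id TEXT PRIMARY KEY,
            amount_cents INTEGER NOT NULL
        );
        CREATE TABLE outbox (
            id TEXT PRIMARY KEY,
            aggregate_type TEXT NOT NULL,
            aggregate_id TEXT NOT NULL,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL
        );
    """)


def record_payment(conn, amount_cents):
    """Write the payment and its event in a single transaction."""
    payment_id = str(uuid.uuid4())
    payload = json.dumps({"payment_id": payment_id, "amount_cents": amount_cents})
    with conn:  # commits both inserts together, or rolls both back on error
        conn.execute(
            "INSERT INTO payments (id, amount_cents) VALUES (?, ?)",
            (payment_id, amount_cents),
        )
        conn.execute(
            "INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload) "
            "VALUES (?, ?, ?, ?, ?)",
            (str(uuid.uuid4()), "payment", payment_id, "PaymentCreated", payload),
        )
    return payment_id


def drain_outbox(conn):
    """Polling stand-in for a CDC connector: read pending events, then clear them."""
    rows = conn.execute(
        "SELECT id, aggregate_type, aggregate_id, event_type, payload "
        "FROM outbox ORDER BY rowid"
    ).fetchall()
    events = [
        {
            "aggregate_type": row[1],
            "aggregate_id": row[2],
            "event_type": row[3],
            "payload": json.loads(row[4]),
        }
        for row in rows
    ]
    with conn:
        conn.executemany("DELETE FROM outbox WHERE id = ?", [(row[0],) for row in rows])
    return events
```

In a log-based setup, the connector tails the outbox table's changes instead of polling it, so the application never talks to the message broker directly.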

Uber: D3 - An Automated System to Detect Data Drifts

No data? No problem. But partial data is a big problem.

Uber highlights how partial data caused almost half of its data issues. Uber writes about D3, an automated system to detect data drift. The blog highlights some common problems with data drift and the D3 architecture to detect and alert on data drift.

https://www.uber.com/en-US/blog/d3-an-automated-system-to-detect-data-drifts/
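To make the idea of a drift check concrete, here is a minimal sketch that flags a column metric (for example, a daily null rate or row count) when it strays too far from its recent history. This is not how D3 works internally; it is a plain z-score check, and the function names and the default threshold of 3.0 are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def drift_score(history, current):
    """Return how many standard deviations `current` sits from the history mean."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all counts as infinite drift.
        return 0.0 if current == mu else float("inf")
    return abs(current - mu) / sigma


def is_drifting(history, current, threshold=3.0):
    """Flag the metric when its z-score exceeds the (assumed) threshold."""
    return drift_score(history, current) > threshold
```

For example, feeding in the last few days of a column's null rate as `history` and today's value as `current` would flag a sudden jump caused by a partially loaded partition.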

Grab: Migrating from Role to Attribute-based Access Control

Privacy and access control are basic components of data engineering. Grab writes about switching access control from a role-based approach to an attribute-based model. The blog highlights some of the practical limitations of role-based access:

Too many roles to manage when controlling access
An increasing backlog due to the managerial approval process for granting a role
Stale group membership that gives members access they should not have

https://engineering.grab.com/migrating-to-abac
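As a rough illustration of the attribute-based model, the sketch below decides access by comparing attributes of the user with attributes of the dataset, rather than by looking up a role. The attributes (`department`, `country`, `clearance`, `sensitivity`) and the policy itself are hypothetical and are not taken from Grab's system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    department: str
    country: str
    clearance: int  # higher means access to more sensitive data


@dataclass(frozen=True)
class Dataset:
    name: str
    owning_department: str
    country: str
    sensitivity: int


def can_read(user: User, dataset: Dataset) -> bool:
    """Hypothetical policy: same department, same country, sufficient clearance."""
    return (
        user.department == dataset.owning_department
        and user.country == dataset.country
        and user.clearance >= dataset.sensitivity
    )
```

Because the decision is computed from current attributes, a user who changes departments loses access without anyone having to clean up a group membership.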

IBM: Feature Platforms — A New Paradigm in Machine Learning Operations (MLOps)

AI/ML systems have come a long way since 2017, when a McKinsey study estimated that 88% of machine learning models were never taken into production. Feature stores are on the rise, and the author narrates what a feature platform is and the components of a feature platform:

Feature design
Feature catalog
Feature computation engine
Feature governance
Feature monitoring

https://medium.com/ibm-data-ai/feature-platforms-a-new-paradigm-in-machine-learning-operations-mlops-24c1ff87b7e1

Tomas Fernandez: How to Manage Databases with CI/CD

Can we apply software engineering principles to managing databases? Though the blog narrates how to adopt CI/CD for managing operational data stores, many of its ideas apply to data pipelines too. Some of my favorites:

Commit database scripts to version control
Decouple deployment from data migrations
Keep changes small
Make migrations additive
Consider blue-green deployments

https://hackernoon.com/how-to-manage-databases-with-cicd
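The "make migrations additive" advice can be sketched with Python's built-in sqlite3 module. Instead of renaming a column in place, the migration adds a new column and backfills it, so code that still reads the old column keeps working. The `users` table and the `full_name` and `display_name` columns are hypothetical, and the example is a simplified illustration rather than something taken from the article.

```python
import sqlite3


def column_exists(conn, table, column):
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return any(row[1] == column for row in rows)


def add_display_name(conn):
    """Additive migration: add a new column and backfill it, leaving the old one alone."""
    if not column_exists(conn, "users", "display_name"):
        conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    with conn:
        # Backfill only rows that have not been populated yet, so reruns are safe.
        conn.execute(
            "UPDATE users SET display_name = full_name WHERE display_name IS NULL"
        )
```

The old column can be dropped in a later, separate change once no reader depends on it, which is what keeps deployment decoupled from the data migration.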

LinkedIn: Reducing Apache Spark Application Dependencies Upload by 99%

A fat jar artifact adds latency every time a Spark job is deployed. The obvious choice is to cache the libraries to reduce the upload time. Should it be user-level caching or cluster-level caching? LinkedIn writes about dependency caching solutions and why they adopted user-level caching instead of cluster-level caching.

https://engineering.linkedin.com/blog/2023/reducing-apache-spark-application-dependencies-upload-by-99-

Microsoft: Speeding up “Reverse ETL”

Reverse ETL is an approach to bring data from a central warehouse/lake/lakehouse into operational systems such as Salesforce, Marketo, or Zendesk. The access pattern for Reverse ETL is mostly bulk fetch and insert. The blog narrates how to optimize SQL Server to support Reverse ETL workloads.

https://medium.com/data-science-at-microsoft/speeding-up-reverse-etl-3af04e069fd1

DoubleVerify Engineering: Modernizing Data Pipelines with DBT

DoubleVerify writes about its dbt adoption story and how dbt helped modernize its data pipelines. TIL about Data Mock Tool (DMT), and I am looking forward to playing with it.

https://medium.com/doubleverify-engineering/modernizing-data-pipelines-with-dbt-c2941be74b13

Max Halford: Online gradient descent written in SQL

How far can we go with SQL? The answer is as far as possible. The author demonstrates how to implement online gradient descent in SQL.

https://maxhalford.github.io/blog/ogd-in-sql/
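For reference, here is what online gradient descent looks like outside SQL: a minimal Python sketch for linear regression with squared loss that updates the weights one sample at a time. This is not the author's SQL implementation; the function name, the learning rate, and the choice of squared loss are assumptions made for illustration.

```python
def ogd_linear_regression(stream, n_features, learning_rate=0.01):
    """Fit y ~ bias + weights . x by updating after every (x, y) pair."""
    weights = [0.0] * n_features
    bias = 0.0
    for x, y in stream:
        prediction = bias + sum(w * xi for w, xi in zip(weights, x))
        error = prediction - y
        # Gradient of 0.5 * error**2 with respect to each weight is error * x_i,
        # and with respect to the bias it is error.
        weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
        bias -= learning_rate * error
    return weights, bias
```

Each step depends on the weights produced by the previous step, which is why a SQL version of the same loop leans on a recursive query.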

If you wonder how recursion works in SQL, the article below walks through a few examples of implementing recursive queries.

https://medium.com/swlh/recursion-in-sql-explained-graphically-679f6a0f143b
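As a small taste of recursive SQL, the sketch below runs a `WITH RECURSIVE` query through Python's built-in sqlite3 module to walk a manager hierarchy. The `employees` table and the `org_chart` helper are hypothetical and are not taken from the linked article.

```python
import sqlite3

# The anchor member selects the root (no manager); the recursive member
# repeatedly joins employees to the rows found so far, one level at a time.
ORG_CHART_QUERY = """
WITH RECURSIVE reports(id, name, depth) AS (
    SELECT id, name, 0
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.name, r.depth + 1
    FROM employees AS e
    JOIN reports AS r ON e.manager_id = r.id
)
SELECT name, depth FROM reports ORDER BY depth, name
"""


def org_chart(conn):
    """Return (name, depth) pairs for everyone reachable from the root."""
    return conn.execute(ORG_CHART_QUERY).fetchall()
```

The recursion stops on its own once a level produces no new rows, which is the same termination rule that any recursive common table expression relies on.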

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
