Contribute to the RudderStack Transformations Library, Win $1,000
RudderStack Transformations lets you customize event data in real time with your own JavaScript or Python code. Now you can win $1,000 cash by contributing a Transformation to our open-source library.
https://www.rudderstack.com/blog/join-the-transformations-challenge-for-a-chance-to-win/
Editor’s Note: Data Engineering Radio
At Data Engineering Weekly, we strive to bring you the best thinking on building and operating data systems. However, the newsletter format has its limitations. When I read articles for Data Engineering Weekly, I often say to myself: this is amazing work; I want to talk to the author to learn more or discuss it.
With this selfish quest to know more, Ashwin and I started a new podcast. We call it Data Engineering Weekly Radio. We take three articles and analyze them in depth, and we hope to bring the authors of the blogs on to discuss them further.
We will publish all the podcasts on Substack. You can also listen to the podcast on Apple and Spotify.
Please share your feedback on how we can improve the show further.
Pedram Navid: dbt Reimagined
The author writes about reimagining dbt and suggests possible areas of improvement in dbt's core product:
DSLs over Templated Code
Debuggers [make it easy to debug with code editor breakpoints]
Unit tests
I can’t emphasize enough the importance of DSLs over templated code. The advantage of a DSL is that a compiler can deliver further productivity improvements, such as type safety and debugger support. I expressed a similar thought some time back.
Please reach out if you come across any SQLish DSL for data transformation; I would love to discuss it more, and I hope that will help Pedram to sleep better 😀
https://pedram.substack.com/p/dbt-reimagined
Max Illis: On Data Products and how to describe them
There is an increasingly healthy conversation about treating Data as a Product to bring product thinking to the data asset creation and lifecycle management process. Max Illis, a leading voice in this space, describes the importance of the Data Product approach and a detailed checklist of the information needed to publish, discover, and manage a data product.
https://medium.com/@maxillis/on-data-products-and-how-to-describe-them-76ae1b7abda4
Brex: Change Data Capture at Brex
Brex writes an in-depth article about the technical implementation of its CDC pipeline combined with transaction event publishing with an outbox pattern. The blog narrates the architecture to implement an outbox pattern with Debezium, the usage of the outbox router in Debezium, and lessons learned.
https://medium.com/brexeng/change-data-capture-at-brex-c71263616dd7
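The core idea of the outbox pattern can be sketched in a few lines: the business row and the event row are written in the same database transaction, so a CDC connector such as Debezium never sees one without the other. The sketch below uses SQLite only so that it runs anywhere; the table names, columns, and event type are hypothetical and are not taken from the Brex article.

```python
import json
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical tables: "transactions" holds business state, "outbox"
    # holds events that a CDC connector would stream to a message broker.
    conn.execute(
        "CREATE TABLE transactions ("
        "id INTEGER PRIMARY KEY, amount_cents INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "aggregate_id INTEGER NOT NULL, "
        "event_type TEXT NOT NULL, "
        "payload TEXT NOT NULL)"
    )


def record_transaction(
    conn: sqlite3.Connection,
    tx_id: int,
    amount_cents: int,
    event_type: str = "TransactionCreated",
) -> None:
    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so the state change and its event are persisted together or not at all.
    with conn:
        conn.execute(
            "INSERT INTO transactions (id, amount_cents) VALUES (?, ?)",
            (tx_id, amount_cents),
        )
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
            (tx_id, event_type, json.dumps({"id": tx_id, "amount_cents": amount_cents})),
        )


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_schema(connection)
    record_transaction(connection, 1, 2500)
    print(connection.execute("SELECT event_type, payload FROM outbox").fetchall())
```

The application never publishes to the broker directly; publishing is left to the connector that tails the outbox table, which avoids the dual-write problem.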
Uber: D3- An Automated System to Detect Data Drifts
No data, no problem. But partial data is a big problem.
Uber highlights how partial data caused almost half of its data issues, and writes about D3, an automated system to detect data drift. The blog covers some common problems with data drift and the D3 architecture used to detect and alert on it.
https://www.uber.com/en-US/blog/d3-an-automated-system-to-detect-data-drifts/
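To make the partial-data problem concrete, here is a minimal sketch of one simple way to flag a suspicious load: compare the latest daily row count against the recent history with a z-score. This is not Uber's D3 algorithm; the function name `is_drifting`, the threshold, and the sample counts are all assumptions made for illustration.

```python
from statistics import mean, stdev


def is_drifting(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Return True when `latest` deviates strongly from the recent history.

    Illustrative z-score check only; a production system would likely
    account for seasonality and trend as well.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is a deviation.
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


if __name__ == "__main__":
    # Hypothetical daily row counts for one table.
    daily_row_counts = [1000, 1010, 990, 1005, 995]
    print(is_drifting(daily_row_counts, 520))   # a partial load
    print(is_drifting(daily_row_counts, 1003))  # an ordinary day
```

A row-count check like this catches the "half the partitions are missing" case cheaply, which is why it is a common first monitor before column-level metrics.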
Sponsored: [New Guide] The Ultimate Guide to Data Mesh Architecture
If implementing data mesh is high on your list of priorities, you’re not alone. As organizations scale their use of data, centralized architectures can prevent data teams from keeping pace with stakeholder demands and system needs. In this guide, learn through strategies deployed by leading data teams that have successfully implemented data mesh.
Get The Guide
Grab: Migrating from Role to Attribute-based Access Control
Privacy and access control are basic components of data engineering. Grab writes about switching access control from a role-based approach to an attribute-based model. The blog highlights some of the practical limitations of role-based access:
Too many roles are needed to control access.
The backlog grows because granting a role requires managerial approval.
Stale group membership gives members access that they should not have.
https://engineering.grab.com/migrating-to-abac
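The difference between the two models is easier to see in code. In an attribute-based check, the decision is computed from properties of the user and the resource at request time, so there is no role to request, approve, or forget to revoke. The sketch below is illustrative only; the attribute names (`department`, `clearance`, `sensitivity`) and the policy itself are assumptions, not Grab's actual design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    department: str
    clearance: int  # higher means access to more sensitive data


@dataclass(frozen=True)
class Dataset:
    name: str
    owning_department: str
    sensitivity: int


def can_read(user: User, dataset: Dataset) -> bool:
    """Hypothetical policy: same department and sufficient clearance."""
    return (
        user.department == dataset.owning_department
        and user.clearance >= dataset.sensitivity
    )


if __name__ == "__main__":
    payments_table = Dataset("payments", owning_department="finance", sensitivity=2)
    analyst = User("asha", department="finance", clearance=2)
    print(can_read(analyst, payments_table))

    # When the analyst moves teams, only the attribute changes; access
    # follows automatically, with no stale group membership to clean up.
    moved = User("asha", department="marketing", clearance=2)
    print(can_read(moved, payments_table))
```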
IBM: Feature Platforms — A New Paradigm in Machine Learning Operations (MLOps)
AI/ML systems have come a long way since the McKinsey study that estimated 88% of machine learning models were never taken into production in 2017. Feature stores are on the rise, and the author explains what a feature platform is and describes its components:
Feature design
Feature catalog
Feature computation engine
Feature governance
Feature monitoring
Sponsored: RudderStack Transformations - Move Faster and Build Data Trust
With Device Mode Transformations, you can transform data sent to downstream integrations running in device mode. When destination integrations are set up in device mode, RudderStack loads that tool's native SDK asynchronously and sends event data directly to the destination from the device itself (i.e., from the browser or mobile app).
RudderStack Product Manager Badri Veeraragavan details a few big updates to RudderStack's beloved data transformation feature. New features include Python Transformations (including Libraries and Transformations API), Transformation Templates, and Device Mode Transformations. 75% of RudderStack users already leverage Transformations, and now they're even more powerful.
https://www.rudderstack.com/blog/transformations-move-faster-and-build-data-trust/
Tomas Fernandez: How to Manage Databases with CI/CD
Can we apply software engineering principles to managing databases? Though the blog describes how to adopt CI/CD for operational data stores, many of its ideas apply to data pipelines as well. Some of my favorites:
Commit database scripts to version control
Decouple deployment from data migrations
Keep changes small
Make migrations additive
Consider blue-green deployments
https://hackernoon.com/how-to-manage-databases-with-cicd
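As a small illustration of "make migrations additive" and "decouple deployment from data migrations", the sketch below adds a nullable column instead of changing an existing one, so code deployed before the migration keeps working after it. It uses SQLite to stay runnable; the `users` table, the `email` column, and the function name are hypothetical.

```python
import sqlite3

# Query used by the already-deployed application version (hypothetical).
OLD_READER_SQL = "SELECT id, name FROM users ORDER BY id"


def migrate_add_email(conn: sqlite3.Connection) -> None:
    # Additive change: a new nullable column. Nothing is renamed or dropped,
    # so the migration can ship before the code that uses the column.
    conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    connection.execute("INSERT INTO users (name) VALUES ('ada')")
    connection.commit()

    migrate_add_email(connection)

    # The old reader is unaffected; new code sees NULL for existing rows.
    print(connection.execute(OLD_READER_SQL).fetchall())
    print(connection.execute("SELECT id, email FROM users").fetchall())
```

Destructive steps, such as dropping the column an old version still reads, can then be scheduled as a separate, later migration once no deployed code depends on it.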
LinkedIn: Reducing Apache Spark Application Dependencies Upload by 99%
A fat JAR artifact adds upload latency every time a Spark job is deployed. The obvious choice is to cache the libraries to reduce the upload time. Should it be user-level caching or cluster-level caching? LinkedIn writes about dependency caching solutions and why they adopted user-level caching instead of cluster-level.
Microsoft: Speeding up “Reverse ETL”
Reverse ETL is an approach to bring data from a central warehouse/lake/lakehouse into operational systems such as Salesforce, Marketo, or Zendesk. The access pattern for Reverse ETL is mostly bulk fetch and insert. The blog describes how to optimize SQL Server to support reverse ETL workloads.
https://medium.com/data-science-at-microsoft/speeding-up-reverse-etl-3af04e069fd1
DoubleVerify Engineering: Modernizing Data Pipelines with DBT
DoubleVerify writes about its dbt adoption story and how it helped to modernize its data pipelines. TIL about Data Mock Tool (DMT), and I'm looking forward to playing with it.
https://medium.com/doubleverify-engineering/modernizing-data-pipelines-with-dbt-c2941be74b13
Max Halford: Online gradient descent written in SQL
How far can we go with SQL? The answer is as far as possible. The author demonstrated how to implement online gradient descent in SQL.
https://maxhalford.github.io/blog/ogd-in-sql/
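For readers who have not met online gradient descent before, here is a minimal Python sketch of the update rule for a linear model with squared loss: predict, observe the target, then nudge each weight against the error, one sample at a time. This is not the SQL from the article; the function name, learning rate, and synthetic stream are assumptions chosen for illustration.

```python
from itertools import cycle, islice


def ogd_step(
    weights: list[float],
    bias: float,
    features: list[float],
    target: float,
    learning_rate: float,
) -> tuple[list[float], float]:
    """Apply one online gradient descent update for squared loss."""
    prediction = bias + sum(w * x for w, x in zip(weights, features))
    error = prediction - target
    # Gradient of 0.5 * error**2 with respect to each weight is error * x.
    new_weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias


def fit_stream(samples, learning_rate: float = 0.1) -> tuple[list[float], float]:
    """Consume (features, target) pairs one at a time, keeping only the state."""
    weights: list[float] = []
    bias = 0.0
    for features, target in samples:
        if not weights:
            weights = [0.0] * len(features)
        weights, bias = ogd_step(weights, bias, features, target, learning_rate)
    return weights, bias


if __name__ == "__main__":
    # Synthetic noiseless stream generated from y = 2x + 1.
    points = [([x], 2.0 * x + 1.0) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
    learned_weights, learned_bias = fit_stream(islice(cycle(points), 5000))
    print(learned_weights, learned_bias)
```

Because each step depends only on the previous weights and the current row, the loop maps naturally onto a query that carries state from one row to the next.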
If you are wondering how recursion works in SQL, the article below walks through a few examples of implementing it.
https://medium.com/swlh/recursion-in-sql-explained-graphically-679f6a0f143b
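A recursive query has two parts: an anchor row that seeds the result, and a recursive member that derives the next row from the previous one until a stop condition fails. The sketch below runs a small recursive common table expression through Python's built-in sqlite3 module; the CTE name `counter` and the running-total example are made up for illustration and do not come from either article.

```python
import sqlite3

RUNNING_TOTAL_SQL = """
WITH RECURSIVE counter(n, running_total) AS (
    SELECT 1, 1                                   -- anchor row
    UNION ALL
    SELECT n + 1, running_total + n + 1           -- derive next row from previous
    FROM counter
    WHERE n < ?                                   -- stop condition
)
SELECT n, running_total FROM counter ORDER BY n
"""


def running_totals(limit: int) -> list[tuple[int, int]]:
    """Return (n, 1 + 2 + ... + n) for n from 1 to `limit`."""
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(RUNNING_TOTAL_SQL, (limit,)).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(running_totals(5))  # [(1, 1), (2, 3), (3, 6), (4, 10), (5, 15)]
```

Each recursive step sees only the row produced by the previous step, which is the same state-carrying trick that makes iterative algorithms expressible in SQL.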
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
Speaking of a SQLish DSL for data transformation, I'd recommend checking out PRQL (https://prql-lang.org), which is an interesting Rust-based project in this space.