Data Engineering Weekly

Data Engineering Weekly #122

www.dataengineeringweekly.com

The Weekly Data Engineering Newsletter

Ananth Packkildurai
Mar 13

Contribute to the RudderStack Transformations Library, Win $1,000

RudderStack Transformations lets you customize event data in real time with your own JavaScript or Python code. Now you can win $1,000 cash by contributing a Transformation to our open-source library.

https://www.rudderstack.com/blog/join-the-transformations-challenge-for-a-chance-to-win/


Editor’s Note: Data Engineering Radio

At Data Engineering Weekly, we strive to bring you the best thinking around building and operating data. However, the newsletter format has its limitations. When I read articles for Data Engineering Weekly, I often say to myself: this is amazing work, and I want to talk to the author to learn more or discuss it.

Driven by this selfish quest to know more, Ashwin and I started a new podcast. We call it Data Engineering Weekly Radio. We take three articles and analyze them in depth, and we hope to bring in the authors of the blogs to discuss them further.

We will publish all the podcasts on Substack. You can also listen to the podcast on Apple and Spotify.

Please share your feedback on how we can improve the show further.


Pedram Navid: dbt Reimagined

The author reimagines dbt and suggests possible areas of improvement in dbt's core product:

  1. DSLs over Templated Code

  2. Debuggers [make it easy to debug with code editor breakpoints]

  3. Unit tests

I can’t emphasize enough the importance of DSLs over templated code. The advantage of a DSL is that a compiler can deliver productivity improvements such as type safety and debugger support. I expressed a similar thought some time back.

Please reach out if you come across any SQLish DSL for data transformation; I would love to discuss it more, and I hope that will help Pedram sleep better 😀

https://pedram.substack.com/p/dbt-reimagined


Max Illis: On Data Products and how to describe them

There is an increasingly healthy conversation about treating data as a product, bringing product thinking to the data asset creation and lifecycle management process. Max Illis, a thought leader in this space, describes the importance of the data product approach and a detailed checklist of the information needed to publish, discover, and manage a data product.

https://medium.com/@maxillis/on-data-products-and-how-to-describe-them-76ae1b7abda4


Brex: Change Data Capture at Brex

Brex writes an in-depth article about the technical implementation of its CDC pipeline, combined with transaction event publishing using the outbox pattern. The blog describes the architecture for implementing the outbox pattern with Debezium, the use of Debezium's outbox router, and the lessons learned.

https://medium.com/brexeng/change-data-capture-at-brex-c71263616dd7
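
For readers new to the outbox pattern, here is a minimal sketch of the write side, using SQLite in place of a production database. The table names, columns, and the `deposit` function are hypothetical and are not taken from the Brex article. The idea it illustrates is that the business row and the event row are written in the same transaction, so a CDC tool such as Debezium can later publish events by reading the outbox table.

```python
import json
import sqlite3


def create_schema(conn):
    # Hypothetical tables: one business table and one outbox table.
    conn.execute(
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "aggregate_id INTEGER NOT NULL, "
        "event_type TEXT NOT NULL, "
        "payload TEXT NOT NULL)"
    )


def deposit(conn, account_id, amount):
    """Update the balance and record the event in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, account_id),
        )
        if cur.rowcount != 1:
            raise ValueError(f"unknown account {account_id}")
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) "
            "VALUES (?, ?, ?)",
            (account_id, "deposit_made", json.dumps({"amount": amount})),
        )


conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 100)")
conn.commit()

deposit(conn, 1, 50)
print(conn.execute("SELECT event_type, payload FROM outbox").fetchall())
```

Because both writes share one transaction, a failed update never leaves an orphaned event behind, and a committed update always has its event.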


Uber: D3 - An Automated System to Detect Data Drifts

No data? No problem. But partial data is a big problem.

Uber highlights that partial data caused almost half of its data issues and writes about D3, an automated system to detect data drift. The blog covers some common problems with data drift and the D3 architecture to detect and alert on it.

https://www.uber.com/en-US/blog/d3-an-automated-system-to-detect-data-drifts/
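
To make the partial data problem concrete, here is a minimal sketch of one naive check: compare today's row count against the median of recent days. This is not how D3 works; the `is_partial` function, the `tolerance` parameter, and the sample counts are all assumptions for illustration.

```python
from statistics import median


def is_partial(history, today, tolerance=0.5):
    """Return True when today's row count is below tolerance * median(history)."""
    if not history:
        raise ValueError("need at least one historical count")
    baseline = median(history)
    return today < tolerance * baseline


# Hypothetical daily row counts for one table.
daily_counts = [1000, 1040, 980, 1010, 995]

print(is_partial(daily_counts, 990))  # a normal day
print(is_partial(daily_counts, 400))  # a suspiciously small load
```

A median baseline is used here because a single bad day in the history does not skew it the way a mean would.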


Sponsored: [New Guide] The Ultimate Guide to Data Mesh Architecture

If implementing data mesh is high on your list of priorities, you’re not alone. As organizations scale their use of data, centralized architectures can prevent data teams from keeping pace with stakeholder demands and system needs. In this guide, learn through strategies deployed by leading data teams that have successfully implemented data mesh.
Get The Guide


Grab: Migrating from Role to Attribute-based Access Control

Privacy and access control are basic components of data engineering. Grab writes about switching access control from a role-based approach to an attribute-based model. The blog highlights some practical limitations of role-based access:

  1. Too many roles to manage when controlling access

  2. A growing backlog due to the managerial approval process for granting a role

  3. Stale group membership that gives members access they should not have

https://engineering.grab.com/migrating-to-abac
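
As a rough illustration of the attribute-based idea, the sketch below decides access from attributes of the user and the dataset rather than from membership in a named role. The `User` and `Dataset` attributes and the policy in `can_read` are hypothetical and are not Grab's actual model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    department: str
    clearance: int


@dataclass(frozen=True)
class Dataset:
    owner_department: str
    sensitivity: int


def can_read(user, dataset):
    """Hypothetical policy: same department and sufficient clearance."""
    same_department = user.department == dataset.owner_department
    return same_department and user.clearance >= dataset.sensitivity


analyst = User(department="payments", clearance=2)
ledger = Dataset(owner_department="payments", sensitivity=2)
payroll = Dataset(owner_department="hr", sensitivity=3)

print(can_read(analyst, ledger))
print(can_read(analyst, payroll))
```

When a user changes department, the attribute changes and access follows automatically, which is one way to avoid the stale group membership problem listed above.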


IBM: Feature Platforms — A New Paradigm in Machine Learning Operations (MLOps)

AI/ML systems have come a long way since the McKinsey study that estimated 88% of machine learning models were never taken into production in 2017. Feature stores are on the rise, and the author describes what a feature platform is and the components of a feature platform:

  1. Feature design

  2. Feature catalog

  3. Feature computation engine

  4. Feature governance

  5. Feature monitoring

https://medium.com/ibm-data-ai/feature-platforms-a-new-paradigm-in-machine-learning-operations-mlops-24c1ff87b7e1


Sponsored: RudderStack Transformations - Move Faster and Build Data Trust

With Device Mode Transformations, you can transform data sent to downstream integrations running in device mode. When destination integrations are set up in device mode, RudderStack loads that tool's native SDK asynchronously and sends event data directly to the destination from the device itself (i.e., from the browser or mobile app).

RudderStack Product manager, Badri Veeraragavan, details a few big updates to RudderStack's beloved data transformation feature. New features include Python Transformations (including Libraries and Transformations API), Transformation Templates, and Device Mode Transformations. 75% of RudderStack users already leverage Transformations, and now they're even more powerful.


https://www.rudderstack.com/blog/transformations-move-faster-and-build-data-trust/


Tomas Fernandez: How to Manage Databases with CI/CD

Can we apply software engineering principles to managing databases? Though the blog describes how to adopt CI/CD for managing operational data stores, many of its ideas also apply to data pipelines. Some of my favorites:

  1. Commit database scripts to version control

  2. Decouple deployment from data migrations

  3. Keep changes small

  4. Make migrations additive

  5. Consider blue-green deployments

https://hackernoon.com/how-to-manage-databases-with-cicd
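
To show what "make migrations additive" can look like in practice, here is a minimal sketch using SQLite. The `users` table and the `email` column are hypothetical. The migration only adds a nullable column, so code written against the old schema keeps working while new code starts using the column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES ('ada')")

# Additive migration: a new nullable column, nothing dropped or renamed.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

# Old code path: unaware of the new column, still works unchanged.
old_rows = conn.execute("SELECT id, name FROM users").fetchall()

# New code path: starts writing and reading the new column.
conn.execute(
    "INSERT INTO users (name, email) VALUES ('grace', 'grace@example.com')"
)
new_rows = conn.execute("SELECT name, email FROM users ORDER BY id").fetchall()

print(old_rows)
print(new_rows)
```

This also illustrates why deployment can be decoupled from migration: the schema change can ship first, and the application change can follow later.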


LinkedIn: Reducing Apache Spark Application Dependencies Upload by 99%

A fat JAR artifact adds latency every time a Spark job is deployed. The obvious choice is to cache the libraries to reduce the upload time. Should it be user-level caching or cluster-level caching? LinkedIn writes about its dependency caching solutions and why it adopted user-level caching instead of cluster-level caching.

https://engineering.linkedin.com/blog/2023/reducing-apache-spark-application-dependencies-upload-by-99-


Microsoft: Speeding up “Reverse ETL”

Reverse ETL is an approach to bring data from a central warehouse/lake/lakehouse into operational systems such as Salesforce, Marketo, or Zendesk. The access pattern for Reverse ETL is mostly bulk fetch and insert. The blog describes how to optimize SQL Server to support Reverse ETL workloads.

https://medium.com/data-science-at-microsoft/speeding-up-reverse-etl-3af04e069fd1


DoubleVerify Engineering: Modernizing Data Pipelines with DBT

DoubleVerify writes about its dbt adoption story and how dbt helped modernize its data pipelines. TIL about the Data Mock Tool (DMT), and I am looking forward to playing with it.

https://medium.com/doubleverify-engineering/modernizing-data-pipelines-with-dbt-c2941be74b13


Max Halford: Online gradient descent written in SQL

How far can we go with SQL? The answer is as far as possible. The author demonstrates how to implement online gradient descent in SQL.

https://maxhalford.github.io/blog/ogd-in-sql/
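
For readers unfamiliar with the algorithm itself, here is a minimal sketch of online gradient descent for a one-feature linear model, written in plain Python rather than SQL. It is not the author's implementation; the `ogd_linear` function, the learning rate, and the synthetic data are assumptions for illustration.

```python
def ogd_linear(stream, lr=0.01):
    """Fit y = w * x + b by updating the weights after every observation."""
    w, b = 0.0, 0.0
    for x, y in stream:
        error = (w * x + b) - y  # prediction error on this single example
        w -= lr * error * x      # gradient of 0.5 * error**2 w.r.t. w
        b -= lr * error          # gradient of 0.5 * error**2 w.r.t. b
    return w, b


# Synthetic, noise-free stream generated from y = 2x + 1, repeated many times.
data = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)] * 500

w, b = ogd_linear(data, lr=0.05)
print(round(w, 3), round(b, 3))
```

Each update depends on the weights left by the previous one, and that step-to-step dependency is what makes a recursive query a natural fit when the same loop is expressed in SQL.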

If you are wondering how recursion works in SQL, the article below walks through a few examples of implementing recursive queries.

https://medium.com/swlh/recursion-in-sql-explained-graphically-679f6a0f143b
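
As a quick taste of recursion in SQL, here is a minimal sketch of a recursive common table expression, run through Python's built-in `sqlite3` module. The CTE name `running` and its columns are made up for this example. It carries a running total from one row to the next, which is the same row-to-row state passing an iterative algorithm needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

rows = conn.execute(
    """
    WITH RECURSIVE running(n, total) AS (
        SELECT 1, 1                      -- base case: the first row
        UNION ALL
        SELECT n + 1, total + n + 1      -- recursive step: build on the last row
        FROM running
        WHERE n < 5                      -- stop condition
    )
    SELECT n, total FROM running ORDER BY n
    """
).fetchall()

print(rows)  # [(1, 1), (2, 3), (3, 6), (4, 10), (5, 15)]
```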


All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.

2 Comments
Chris Redwine
Mar 13

Speaking of a SQLish DSL for data transformation, I'd recommend checking out PRQL (https://prql-lang.org), which is an interesting Rust-based project in this space.

© 2023 Ananth Packkildurai