Data Engineering Weekly Is Brought to You by RudderStack
RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles, so you can quickly ship actionable, enriched data to every downstream team. See how it works today.
Editor’s Note: DewCon.ai Registration is Now Open
Great news! We've overcome some unexpected hiccups, and guess what? Conference registration is officially OPEN! 🎉
Although we're still working on linking payment processing to our dewcon.ai domain, we don't want to keep you waiting any longer. Let's get you registered right away!
Mark your calendars: DEWCon is happening on October 12th at the luxurious Taj Hotel on MG Road. 🏨
And hold your breath for this one: Joe Reis, co-author of "Fundamentals of Data Engineering," will be giving a captivating talk at our conference! 🎤 You definitely don't want to miss this!
I can't contain my excitement; I know you can't either. So, let's meet in person at DEWCon and make this conference unforgettable! 🤝
Don't waste any more time; hit that registration button now to secure your spot. See you there! 😃🎊
Click here to Register for DEWCon - 2023 →
🤺Hot Take on The Great Orchestration Battle for Data Land ⚔️
Last week, a sequence of blog posts revealed the underlying ambitions of the data orchestration companies.
The first one that caught my eye is Astronomer's blog post, Introducing Cosmos 1.0: the best way to run dbt Core in Airflow. The blog narrates how to schedule dbt jobs in Airflow by parsing dbt's manifest.json file and auto-constructing Airflow tasks.
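For context, here is a minimal sketch of what a Cosmos-style DAG looks like, based on the Cosmos 1.0 docs at the time of writing; treat the paths, profile names, and exact parameters as placeholders, not a verified setup.

```python
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig

# Paths and profile names below are placeholders for your environment.
profile_config = ProfileConfig(
    profile_name="my_warehouse",
    target_name="dev",
    profiles_yml_filepath="/usr/local/airflow/dbt/profiles.yml",
)

# Cosmos parses the dbt project and renders each model as its own Airflow task,
# so a failed model can be retried without re-running the whole dbt project.
dbt_cosmos_dag = DbtDag(
    project_config=ProjectConfig("/usr/local/airflow/dbt/my_project"),
    profile_config=profile_config,
    dag_id="dbt_cosmos_dag",
    schedule_interval="@daily",
    start_date=datetime(2023, 8, 1),
    catchup=False,
)
```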
The second one is Dagster's Orchestrating dbt™ with Dagster, which narrates why Dagster provides a greater depth of orchestration features compared to dbt-specific tools like dbt Cloud.
At Data Engineering Weekly, I wrote an article 15 months back narrating how dbt is not a separate model but part of a larger orchestration system. [See: Bundling Vs. Unbundling: The Tale of Airflow Operator and dbt Ref]. Since then, I have wondered why dbt Labs' public product strategy did not focus on the orchestration engine. From a user's point of view, dbt Labs seemed comfortable relying on other scheduling engines; well, not anymore. I noticed an article from dbt Labs, "Seemingly easy, but in reality hard," highlighting the roadmap for its orchestration engine, though it remains unclear whether it will support workloads other than dbt.
My mental model for the dbt workload is closer to Spark's RDD [it is one of the reasons I am excited about the dbt-duckdb integration]. Just as the Spark RDD framework aims to minimize data shuffles, dbt can potentially minimize the number of database calls.
The orchestration layer is the key to making this a reality, and I'm thrilled that more players in this space are working to provide a better orchestration engine than what the industry has now.
Hootsuite: Automating DBT + Airflow - Deploying DBT Models at Hootsuite 🚀
Hootsuite exactly captures the current state of orchestrating dbt models with Airflow!!
The solution is, obviously, a custom parser of dbt's manifest.json 😊 (a minimal sketch of the pattern follows the link below).
https://medium.com/hootsuite-engineering/automating-dbt-airflow-9efac348059d
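For readers who want the gist without the full post, here is a minimal sketch of the manifest-parsing pattern: read manifest.json, create one Airflow task per model, and wire dependencies from each node's depends_on list. The manifest path and the dbt invocation are assumptions for illustration, not Hootsuite's exact code.

```python
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Assumption: dbt has already compiled and written its manifest to this path.
with open("/usr/local/airflow/dbt/target/manifest.json") as f:
    manifest = json.load(f)

# manifest["nodes"] also contains tests, seeds, etc.; keep only models.
models = {
    node_id: node
    for node_id, node in manifest["nodes"].items()
    if node["resource_type"] == "model"
}

with DAG(
    dag_id="dbt_from_manifest",
    start_date=datetime(2023, 8, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # One task per model, each running a single dbt selection.
    tasks = {
        node_id: BashOperator(
            task_id=node["name"],
            bash_command=f"dbt run --select {node['name']}",
        )
        for node_id, node in models.items()
    }

    # Recreate the dbt DAG edges inside Airflow.
    for node_id, node in models.items():
        for upstream_id in node["depends_on"]["nodes"]:
            if upstream_id in models:
                tasks[upstream_id] >> tasks[node_id]
```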
Zendesk: dbt at Zendesk — Part 2: supercharging dbt with Dynamic Stage
While we are on the topic of dbt, Zendesk writes about its dynamic staging layer and the auto-generation of dbt models. One of the standard challenges in a data pipeline is de-duplication, and the author narrates how Zendesk built a generic, configurable framework to de-duplicate (the classic window-function version of the pattern is sketched after the link).
https://zendesk.engineering/dbt-at-zendesk-part-2-supercharging-dbt-with-dynamic-stage-4703a49d1c30
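Zendesk's framework is its own, but the underlying de-duplication trick is the classic window-function pattern: keep only the latest row per key. A minimal sketch with DuckDB, where the table and column names are made up for illustration:

```python
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE raw_events AS
    SELECT * FROM (VALUES
        (1, 'open',  '2023-08-01 10:00:00'),
        (1, 'open',  '2023-08-01 10:05:00'),  -- later duplicate of ticket 1
        (2, 'close', '2023-08-01 11:00:00')
    ) AS t(ticket_id, status, updated_at)
""")

# Keep only the most recent record per ticket_id.
deduped = con.sql("""
    SELECT ticket_id, status, updated_at
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY ticket_id
                   ORDER BY updated_at DESC
               ) AS rn
        FROM raw_events
    )
    WHERE rn = 1
""")
print(deduped)
```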
Sponsored: Great Data Debate–The State of Data Mesh
Since 2019, the data mesh has woven itself into every blog post, event presentation, and webinar. But 4 years later, in 2023 — where has the data mesh gotten us? Does its promise of a decentralized dreamland hold true?
Atlan is bringing together data leaders like Abhinav Sivasailam (CEO, Levers Labs), Barr Moses (Co-founder & CEO, Monte Carlo), Scott Hirleman (Founder & CEO, Data Mesh Understanding), Teresa Tung (Cloud First Chief Technologist, Accenture), Tristan Handy (Founder & CEO, dbt Labs), Prukalpa Sankar (Co-founder, Atlan), and more at the next edition of the Great Data Debate to discuss the state of data mesh – tech toolkit and cultural shift required to implement data mesh.
Learn more and sign up to join the Great Data Debate on August 16 →
DuckDB: DuckDB ADBC - Zero-Copy data transfer via Arrow Database Connectivity
DuckDB writes about its support for the Arrow Database Connectivity (ADBC) protocol, highlighting the performance benefit over ODBC. The main difference between ADBC and ODBC/JDBC is that result sets do not need to be transformed into a row-wise format. The performance results for DuckDB's ADBC integration are impressive: 38x faster than ODBC.
https://duckdb.org/2023/08/04/adbc.html
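To make the ADBC point concrete, here is a rough sketch of a round trip through the ADBC driver manager in Python: ingest a pyarrow table and fetch results back columnar, with no row-wise conversion. The driver library path and the "path" option are assumptions for your install; check the DuckDB post for the exact setup.

```python
import pyarrow as pa
import adbc_driver_manager.dbapi

# Assumption: libduckdb on your machine ships the ADBC entrypoint at this path.
conn = adbc_driver_manager.dbapi.connect(
    driver="/usr/local/lib/libduckdb.so",
    entrypoint="duckdb_adbc_init",
    db_kwargs={"path": "demo.duckdb"},
)

data = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

cur = conn.cursor()
# Bulk-ingest the Arrow table; columns are handed over without row-wise conversion.
cur.adbc_ingest("people", data, mode="create")
conn.commit()

cur.execute("SELECT * FROM people")
print(cur.fetch_arrow_table())  # results come back as a pyarrow.Table

cur.close()
conn.close()
```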
Square: How To Measure the Value of Internal Tools
Though the blog is not directly related to data engineering, it tackles one of the key questions we always encounter in an organization:
How do you measure the impact of the Data Team in an organization?
Internal stakeholders are the largest user base for the data team, and the blog narrates how to measure the value of internal tools.
https://developer.squareup.com/blog/how-to-measure-the-value-of-internal-tools/
Sponsored: [New eBook] Gartner Innovation Insight: Data Observability Enables Proactive Data Quality
In 2023, data observability is a must-have for companies seeking to reduce time, resources, and budget spent firefighting unreliable or anomalous data while unlocking new opportunities to cut costs and drive growth. Not sure where to start or what to look for in a data observability tool? Look no further than Gartner's latest report.
AWS: A side-by-side comparison of Apache Spark and Apache Flink for common streaming use cases
Is Flink a better choice for streaming than Spark Streaming? AWS writes an excellent blog comparing the features of Apache Flink and Apache Spark for streaming use cases.
Red Hat: How to use Kafka Cruise Control for cluster optimization
Operating a stateful application like Apache Kafka is no easy feat, and you need to automate as much as possible. Cruise Control from LinkedIn is one of my favorite tools for managing Apache Kafka clusters. Red Hat narrates some useful automations you can build with Cruise Control for Kafka cluster optimization. If you're managing Kafka in-house, I highly recommend Cruise Control if you're not using it already!!!
https://developers.redhat.com/articles/2023/07/05/how-use-kafka-cruise-control-cluster-optimization
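As a taste of the automation, Cruise Control exposes a REST API you can script against. A minimal sketch follows; the port and response fields below follow Cruise Control's documented defaults, but verify them against your deployment.

```python
import requests

# Cruise Control's default REST port; adjust for your deployment.
BASE = "http://localhost:9090/kafkacruisecontrol"

# Check whether the analyzer has a valid optimization proposal ready.
state = requests.get(f"{BASE}/state", params={"json": "true"})
print(state.json()["AnalyzerState"])

# Always dry-run a rebalance first: it returns the proposal
# without actually moving any partitions.
proposal = requests.post(
    f"{BASE}/rebalance",
    params={"dryrun": "true", "json": "true"},
)
print(proposal.text)
```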
Sponsored: Real-Time Event Streaming: RudderStack vs. Apache Kafka
If your operations involve managing large-scale databases and you can dedicate resources for Kafka's administration and maintenance, then Kafka is your tool. RudderStack, on the other hand, is an excellent choice for businesses that are looking for a streamlined solution to collect, unify, and activate customer data from various sources to various destinations without Kafka's complexities.
Here’s an interesting breakdown of event streaming approaches that compare RudderStack to Apache Kafka. While it may be comparing apples and oranges, the two platforms can be used to achieve the same ends, and this piece provides a helpful framework to determine when it's appropriate to use each tool.
https://www.rudderstack.com/blog/real-time-event-streaming-rudderstack-vs-apache-kafka/
Grab: Unsupervised graph anomaly detection - Catching new fraudulent behaviors
The challenges in fraud detection arise from fraudsters continuously innovating their fraudulent behaviors, making it difficult for traditional supervised machine learning models to detect new patterns. Grab leans on an unsupervised learning model and writes about GraphBEAN, which can detect anomalous behaviors without the need for label supervision.
https://engineering.grab.com/graph-anomaly-model
Microsoft: Decoding the customer journey with graph node embeddings
Is a graph a better data structure for modeling than a table? I think so, and so does the author. The customer journey is complex; despite countless efforts, very few have successfully derived insights or predictions from a sequence of behaviors. The blog talks about modeling customer journeys with graph node embeddings (a toy sketch of the idea follows).
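To make the idea concrete, here is a toy DeepWalk-style sketch, not Microsoft's exact method: random walks over an event-transition graph become "sentences," and skip-gram Word2Vec learns an embedding per journey step. The graph and event names are invented for illustration.

```python
import random

from gensim.models import Word2Vec

# Toy journey graph: each edge is an observed "event -> next event" transition.
graph = {
    "visit_pricing": ["start_trial", "contact_sales"],
    "contact_sales": ["start_trial"],
    "start_trial":   ["invite_team", "churn"],
    "invite_team":   ["upgrade"],
    "upgrade":       [],
    "churn":         [],
}

def random_walk(node: str, length: int = 5) -> list[str]:
    """Walk the journey graph until a dead end or the length limit."""
    walk = [node]
    for _ in range(length):
        neighbors = graph.get(walk[-1], [])
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk

# Treat walks as sentences so skip-gram learns one vector per journey step.
walks = [random_walk(node) for node in graph for _ in range(100)]
model = Word2Vec(walks, vector_size=16, window=3, min_count=1, sg=1, seed=42)

# Steps that occur in similar journey contexts end up close in embedding space.
print(model.wv.most_similar("start_trial"))
```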
Intuit: Open Source Fuzzy-Matcher - Finding Data Similarities in Records
Fuzzy matching is a complex process in a data pipeline. Intuit writes about its open-source Fuzzy-Matcher to find similarities in records.
The impact of vector databases on similarity matching, especially in Master Data Management, is something I'm excited about. Is anyone using a vector database for data-similarity work? I'm curious to hear about your architecture (a minimal Python sketch of field-level fuzzy matching follows the link).
Github: https://github.com/intuit/fuzzy-matcher
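For a sense of what field-level fuzzy matching involves, here is a minimal Python sketch using only the standard library. It is not a port of Intuit's library, and the field weights and match threshold are assumptions to tune per dataset.

```python
from difflib import SequenceMatcher

# Assumption: which fields matter, and how much, depends on your data.
WEIGHTS = {"name": 0.6, "email": 0.4}

def similarity(a: str, b: str) -> float:
    """Normalized edit-based similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def record_score(r1: dict, r2: dict) -> float:
    """Weighted blend of per-field similarities."""
    return sum(w * similarity(r1[f], r2[f]) for f, w in WEIGHTS.items())

records = [
    {"name": "Jon Smith",  "email": "jon.smith@example.com"},
    {"name": "John Smith", "email": "john.smith@example.com"},
    {"name": "Ana Gomez",  "email": "ana.gomez@example.com"},
]

# Compare every pair and flag likely duplicates.
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = record_score(records[i], records[j])
        if score > 0.8:  # threshold is an assumption; tune per dataset
            print(f"{records[i]['name']} ~ {records[j]['name']}: {score:.2f}")
```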
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.