Data Engineering Weekly Is Brought to You by RudderStack
RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles, so you can quickly ship actionable, enriched data to every downstream team. See how it works today.
Editor’s Note: DewCon.ai Registration is Now Open
Great news! We've overcome some unexpected hiccups, and guess what? Conference registration is officially OPEN! 🎉
Although we're still working on linking payment processing to our dewcon.ai domain, we don't want to keep you waiting any longer. Let's get you registered right away!
Mark your calendars: DEWCon is happening on October 12th at the luxurious Taj Hotel on MG Road. 🏨
And hold your breath for this one: Joe Reis, co-author of "Fundamentals of Data Engineering," will be giving a captivating talk at our conference! 🎤 You definitely don't want to miss this!
I can't contain my excitement; I know you can't either. So, let's meet in person at DEWCon and make this conference unforgettable! 🤝
Don't waste any more time; hit that registration button now to secure your spot. See you there! 😃🎊
Click here to Register for DEWCon - 2023 →
🤺Hot Take on The Great Orchestration Battle for Data Land ⚔️
Last week, a sequence of blog posts revealed the underlying ambitions of the data orchestration companies.
The first one that caught my eye is Astronomer's blog post, Introducing Cosmos 1.0: the best way to run dbt Core in Airflow. The blog narrates how to schedule dbt jobs in Airflow by parsing dbt's manifest.json file and auto-constructing Airflow tasks.
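For context, here is a minimal sketch of what a Cosmos-style DAG looks like, based on the Cosmos 1.0 docs at the time of writing; treat the paths, profile names, and exact parameters as placeholders, not a verified setup.

```python
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig

# Paths and profile names below are placeholders for your environment.
profile_config = ProfileConfig(
    profile_name="my_warehouse",
    target_name="dev",
    profiles_yml_filepath="/usr/local/airflow/dbt/profiles.yml",
)

# Cosmos parses the dbt project and renders each model as its own Airflow task,
# so a failed model can be retried without re-running the whole dbt project.
dbt_cosmos_dag = DbtDag(
    project_config=ProjectConfig("/usr/local/airflow/dbt/my_project"),
    profile_config=profile_config,
    dag_id="dbt_cosmos_dag",
    schedule_interval="@daily",
    start_date=datetime(2023, 8, 1),
    catchup=False,
)
```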
The second one is Dagster's Orchestrating dbt™ with Dagster, which narrates why Dagster provides a greater depth of orchestration features compared to dbt-specific tools like dbt Cloud.
At Data Engineering Weekly, I wrote an article 15 months back narrating how dbt is not a separate model but part of a larger orchestration system. [See: Bundling Vs. Unbundling: The Tale of Airflow Operator and dbt Ref]. Since then, I have wondered why dbt Labs' public product strategy did not focus on the orchestration engine. From a user's point of view, dbt Labs seemed comfortable relying on other scheduling engines; well, not anymore. I noticed an article from dbt Labs, "Seemingly easy, but in reality hard," highlighting the roadmap for its orchestration engine, though it remains unclear whether it will support workloads other than dbt.
My mental model for the dbt workload is closer to Spark's RDD [it is one of the reasons I am excited about the dbt-duckdb integration]. Just as the Spark RDD framework aims to minimize data shuffles, dbt can potentially minimize the number of database calls.
The orchestration layer is the key to making this a reality, and I'm thrilled that more players in this space are working to provide a better orchestration engine than what the industry has now.
Hootsuite: Automating DBT + Airflow - Deploying DBT Models at Hootsuite 🚀
Hootsuite exactly captures the current state of orchestrating dbt models with Airflow!!
The solution is, obviously, a custom parser of dbt's manifest.json 😊 (a minimal sketch of the pattern follows the link below).
https://medium.com/hootsuite-engineering/automating-dbt-airflow-9efac348059d
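For readers who want the gist without the full post, here is a minimal sketch of the manifest-parsing pattern: read manifest.json, create one Airflow task per model, and wire dependencies from each node's depends_on list. The manifest path and the dbt invocation are assumptions for illustration, not Hootsuite's exact code.

```python
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Assumption: dbt has already compiled and written its manifest to this path.
with open("/usr/local/airflow/dbt/target/manifest.json") as f:
    manifest = json.load(f)

# manifest["nodes"] also contains tests, seeds, etc.; keep only models.
models = {
    node_id: node
    for node_id, node in manifest["nodes"].items()
    if node["resource_type"] == "model"
}

with DAG(
    dag_id="dbt_from_manifest",
    start_date=datetime(2023, 8, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # One task per model, each running a single dbt selection.
    tasks = {
        node_id: BashOperator(
            task_id=node["name"],
            bash_command=f"dbt run --select {node['name']}",
        )
        for node_id, node in models.items()
    }

    # Recreate the dbt DAG edges inside Airflow.
    for node_id, node in models.items():
        for upstream_id in node["depends_on"]["nodes"]:
            if upstream_id in models:
                tasks[upstream_id] >> tasks[node_id]
```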
Zendesk: dbt at Zendesk — Part 2: supercharging dbt with Dynamic Stage
While we are on the topic of dbt, Zendesk writes about its dynamic staging layer and the auto-generation of dbt models. One of the standard challenges in a data pipeline is de-duplication, and the author narrates how Zendesk built a generic, configurable framework to de-duplicate (the classic window-function version of the pattern is sketched after the link).
https://zendesk.engineering/dbt-at-zendesk-part-2-supercharging-dbt-with-dynamic-stage-4703a49d1c30
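Zendesk's framework is its own, but the underlying de-duplication trick is the classic window-function pattern: keep only the latest row per key. A minimal sketch with DuckDB, where the table and column names are made up for illustration:

```python
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE raw_events AS
    SELECT * FROM (VALUES
        (1, 'open',  '2023-08-01 10:00:00'),
        (1, 'open',  '2023-08-01 10:05:00'),  -- later duplicate of ticket 1
        (2, 'close', '2023-08-01 11:00:00')
    ) AS t(ticket_id, status, updated_at)
""")

# Keep only the most recent record per ticket_id.
deduped = con.sql("""
    SELECT ticket_id, status, updated_at
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY ticket_id
                   ORDER BY updated_at DESC
               ) AS rn
        FROM raw_events
    )
    WHERE rn = 1
""")
print(deduped)
```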
Sponsored: Great Data Debate–The State of Data Mesh
Since 2019, the data mesh has woven itself into every blog post, event presentation, and webinar. But 4 years later, in 2023 — where has the data mesh gotten us? Does its promise of a decentralized dreamland hold true?
Atlan is bringing together data leaders like Abhinav Sivasailam (CEO, Levers Labs), Barr Moses (Co-founder & CEO, Monte Carlo), Scott Hirleman (Founder & CEO, Data Mesh Understanding), Teresa Tung (Cloud First Chief Technologist, Accenture), Tristan Handy (Founder & CEO, dbt Labs), Prukalpa Sankar (Co-founder, Atlan), and more at the next edition of the Great Data Debate to discuss the state of data mesh – tech toolkit and cultural shift required to implement data mesh.
Learn more and sign up to join the Great Data Debate on August 16 →
DuckDB: DuckDB ADBC - Zero-Copy data transfer via Arrow Database Connectivity
DuckDB writes about its support for the Arrow Database Connectivity (ADBC) protocol, highlighting the performance benefit over ODBC. The main difference between ADBC and ODBC/JDBC is that result sets do not need to be transformed into a row-wise format. The performance results for DuckDB's ADBC integration are impressive: 38x faster than ODBC.
https://duckdb.org/2023/08/04/adbc.html
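To make the ADBC point concrete, here is a rough sketch of a round trip through the ADBC driver manager in Python: ingest a pyarrow table and fetch results back columnar, with no row-wise conversion. The driver library path and the "path" option are assumptions for your install; check the DuckDB post for the exact setup.

```python
import pyarrow as pa
import adbc_driver_manager.dbapi

# Assumption: libduckdb on your machine ships the ADBC entrypoint at this path.
conn = adbc_driver_manager.dbapi.connect(
    driver="/usr/local/lib/libduckdb.so",
    entrypoint="duckdb_adbc_init",
    db_kwargs={"path": "demo.duckdb"},
)

data = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

cur = conn.cursor()
# Bulk-ingest the Arrow table; columns are handed over without row-wise conversion.
cur.adbc_ingest("people", data, mode="create")
conn.commit()

cur.execute("SELECT * FROM people")
print(cur.fetch_arrow_table())  # results come back as a pyarrow.Table

cur.close()
conn.close()
```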
Square: How To Measure the Value of Internal Tools
Though the blog is not directly related to data engineering, it tackles one of the key questions we always encounter in an organization:
How do you measure the impact of the Data Team in an organization?
Internal stakeholders are the largest user base for the data team, and the blog narrates how to measure the value of internal tools.
https://developer.squareup.com/blog/how-to-measure-the-value-of-internal-tools/
Sponsored: [New eBook] Gartner Innovation Insight: Data Observability Enables Proactive Data Quality
In 2023, data observability is a must-have for companies seeking to reduce time, resources, and budget spent firefighting unreliable or anomalous data while unlocking new opportunities to cut costs and drive growth. Not sure where to start or what to look for in a data observability tool? Look no further than Gartner's latest report.
AWS: A side-by-side comparison of Apache Spark and Apache Flink for common streaming use cases
Is Flink a better choice for streaming than Spark Streaming? AWS writes an excellent blog comparing the features of Apache Flink and Apache Spark for streaming use cases.
Red Hat: How to use Kafka Cruise Control for cluster optimization
Operating a stateful application like Apache Kafka is no easy feat, and you need to automate as much as possible. Cruise Control from LinkedIn is one of my favorite tools for managing Apache Kafka clusters. Red Hat narrates some useful automations you can build with Cruise Control for Kafka cluster optimization. If you're managing Kafka in-house, I highly recommend Cruise Control if you're not using it already!!!
https://developers.redhat.com/articles/2023/07/05/how-use-kafka-cruise-control-cluster-optimization
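As a taste of the automation, Cruise Control exposes a REST API you can script against. A minimal sketch follows; the port and response fields below follow Cruise Control's documented defaults, but verify them against your deployment.

```python
import requests

# Cruise Control's default REST port; adjust for your deployment.
BASE = "http://localhost:9090/kafkacruisecontrol"

# Check whether the analyzer has a valid optimization proposal ready.
state = requests.get(f"{BASE}/state", params={"json": "true"})
print(state.json()["AnalyzerState"])

# Always dry-run a rebalance first: it returns the proposal
# without actually moving any partitions.
proposal = requests.post(
    f"{BASE}/rebalance",
    params={"dryrun": "true", "json": "true"},
)
print(proposal.text)
```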
Sponsored: Real-Time Event Streaming: RudderStack vs. Apache Kafka
If your operations involve managing large-scale databases and you can dedicate resources for Kafka's administration and maintenance, then Kafka is your tool. RudderStack, on the other hand, is an excellent choice for businesses that are looking for a streamlined solution to collect, unify, and activate customer data from various sources to various destinations without Kafka's complexities.
Here’s an interesting breakdown of event streaming approaches that compare RudderStack to Apache Kafka. While it may be comparing apples and oranges, the two platforms can be used to achieve the same ends, and this piece provides a helpful framework to determine when it's appropriate to use each tool.
https://www.rudderstack.com/blog/real-time-event-streaming-rudderstack-vs-apache-kafka/
Grab: Unsupervised graph anomaly detection - Catching new fraudulent behaviors
The challenges in fraud detection arise from fraudsters continuously innovating their fraudulent behaviors, making it difficult for traditional supervised machine learning models to detect new patterns. Grab leans on an unsupervised learning model and writes about GraphBEAN, which can detect anomalous behaviors without the need for label supervision.
https://engineering.grab.com/graph-anomaly-model
Microsoft: Decoding the customer journey with graph node embeddings
Is a graph a better data structure for modeling than a table? I think so, and so does the author. The customer journey is complex; despite countless efforts, very few have successfully derived insights or predictions from a sequence of behaviors. The blog talks about modeling customer journeys with graph node embeddings (a toy sketch of the idea follows).
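To make the idea concrete, here is a toy DeepWalk-style sketch, not Microsoft's exact method: random walks over an event-transition graph become "sentences," and skip-gram Word2Vec learns an embedding per journey step. The graph and event names are invented for illustration.

```python
import random

from gensim.models import Word2Vec

# Toy journey graph: each edge is an observed "event -> next event" transition.
graph = {
    "visit_pricing": ["start_trial", "contact_sales"],
    "contact_sales": ["start_trial"],
    "start_trial":   ["invite_team", "churn"],
    "invite_team":   ["upgrade"],
    "upgrade":       [],
    "churn":         [],
}

def random_walk(node: str, length: int = 5) -> list[str]:
    """Walk the journey graph until a dead end or the length limit."""
    walk = [node]
    for _ in range(length):
        neighbors = graph.get(walk[-1], [])
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk

# Treat walks as sentences so skip-gram learns one vector per journey step.
walks = [random_walk(node) for node in graph for _ in range(100)]
model = Word2Vec(walks, vector_size=16, window=3, min_count=1, sg=1, seed=42)

# Steps that occur in similar journey contexts end up close in embedding space.
print(model.wv.most_similar("start_trial"))
```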
Intuit: Open Source Fuzzy-Matcher - Finding Data Similarities in Records
Fuzzy matching is a complex process in a data pipeline. Intuit writes about its open-source Fuzzy-Matcher to find similarities in records.
The impact of vector databases on similarity matching, especially in Master Data Management, is something I'm excited about. Is anyone using a vector database for data-similarity work? I'm curious to hear about your architecture (a minimal Python sketch of field-level fuzzy matching follows the link).
Github: https://github.com/intuit/fuzzy-matcher
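For a sense of what field-level fuzzy matching involves, here is a minimal Python sketch using only the standard library. It is not a port of Intuit's library, and the field weights and match threshold are assumptions to tune per dataset.

```python
from difflib import SequenceMatcher

# Assumption: which fields matter, and how much, depends on your data.
WEIGHTS = {"name": 0.6, "email": 0.4}

def similarity(a: str, b: str) -> float:
    """Normalized edit-based similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def record_score(r1: dict, r2: dict) -> float:
    """Weighted blend of per-field similarities."""
    return sum(w * similarity(r1[f], r2[f]) for f, w in WEIGHTS.items())

records = [
    {"name": "Jon Smith",  "email": "jon.smith@example.com"},
    {"name": "John Smith", "email": "john.smith@example.com"},
    {"name": "Ana Gomez",  "email": "ana.gomez@example.com"},
]

# Compare every pair and flag likely duplicates.
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = record_score(records[i], records[j])
        if score > 0.8:  # threshold is an assumption; tune per dataset
            print(f"{records[i]['name']} ~ {records[j]['name']}: {score:.2f}")
```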
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.