Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
Saurav Jain: Seven Python Libraries for Machine Learning
This week, let’s start with this handy mini-introduction to seven Python libraries for machine learning.
Atlan: The Great Data Debate - Unbundling vs. Bundling of the Modern Data Stack
Atlan's Great Data Debate weighed unbundling against bundling of the modern data stack, and here is the exciting poll result.
Stanford HAI: The State of AI in 9 Charts
Stanford HAI writes about the current state of AI, highlighting the diversity challenges, booming private investment, and AI patent filings.
Spotify: Why We Switched Our Data Orchestration Service
Luigi, one of the pioneering data orchestration engines, is leaving its home. Spotify writes about its plan to move from Luigi to Flyte.
Is that the official EOL for Luigi?
LinkedIn: Opal - Building a mutable dataset in data lake
Handling data mutability is a critical aspect of data infrastructure. Lakehouse frameworks adopt incremental updates as a core design principle. LinkedIn writes about the design of Opal and how it facilitates building a mutable dataset in the data lake.
Twitter: Graph machine learning with missing node features
Partial data unavailability is a common cause of ML model failure. Twitter writes about an efficient, scalable approach for handling missing features in graph machine learning applications.
Grab: Real-time data ingestion in Grab
Change Data Capture with the transactional outbox pattern and Debezium has become a standard approach for event sourcing. Grab writes about its design of real-time data ingestion.
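For readers new to the transactional outbox pattern, here is a minimal sketch of the idea, not Grab's implementation. It assumes a SQLite database and hypothetical `orders` and `outbox` tables. The business row and the event row are written in the same transaction, and a simple polling relay stands in for Debezium, which would instead read the database's change log.

```python
import json
import sqlite3


def create_schema(conn):
    # Hypothetical tables for illustration only.
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "event_type TEXT NOT NULL, "
        "payload TEXT NOT NULL, "
        "published INTEGER NOT NULL DEFAULT 0)"
    )


def place_order(conn, order_id, item):
    # One transaction: the order row and its event commit together or not at all.
    with conn:
        conn.execute("INSERT INTO orders (id, item) VALUES (?, ?)", (order_id, item))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("order_placed", json.dumps({"order_id": order_id, "item": item})),
        )


def relay_once(conn, publish):
    # Polling relay (a stand-in for log-based CDC): publish pending events in order.
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Because an event is marked as published only after `publish` returns, a crash between the two steps can deliver the same event twice, so consumers of an outbox stream are usually written to be idempotent.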
Sponsored: Making Data Engineering Easier - Operational Analytics With Event Streaming and Reverse ETL
When it comes to Reverse ETL, business use cases typically get all the attention. Here, RudderStack focuses on how reverse ETL makes data engineering easier. They drive the point home with an example from their own data engineering team that involved using the Google Click ID (gclid) to get enriched conversions into Google Ads.
Stitch Fix: Migrating Spark from EMR on EC2 to EMR on EKS
I wrote about the curious case of the AWS EMR pricing model. I think it is time to rethink YARN for the cloud. I don't think running one scheduler on top of another is an optimal solution.
AWS: Best practices for right-sizing your Apache Kafka clusters to optimize performance and cost
Cost optimization is top of mind for many companies. AWS writes about approaching performance and cost optimization, with tips for right-sizing Kafka clusters, planning network throughput, and monitoring for continuous optimization.
Whatnot: Building a Modern Data Stack at Whatnot
Whatnot writes about its adoption of the modern data stack. We highlighted the tale of Airflow operator vs. dbt, and the Whatnot article reflects the same.
Second, for our data transformation layer, we rely on two tools: Apache Airflow and DBT. For any data replication work (where we transform data before it is loaded into the data warehouse), we write the logic in Python that transforms and loads the data. We also use Airflow to orchestrate training machine-learning models.
It will be exciting to see how orchestration and data transformation co-exist or merge at some point. One thing I'm not clear about in the blog is the statement "No dimensional data model for now." Does that mean throwing JSON data into the data warehouse, or not adopting conformed-dimension-style data modeling?
ManyPets: How ManyPets Implemented The Modern Data Stack
Staying on the modern data stack, ManyPets writes about its data stack, highlighting its choice of tools. A couple of key themes:
Airflow + dbt as two orchestration engines for data transformation is a common approach.
Before dbt, we stored our modeling queries as SQL files and ran them as tasks in an Airflow job. We had to manually add dependencies between the tasks and we used Python’s string formatting to allow some basic code reuse. Initially, I’m now embarrassed to say, I didn’t think we needed dbt to help manage this. As we grew further though it became clear that we did and now I can’t imagine life without it!
Data lineage & discovery tooling doesn’t play a significant role in the beginning stage of the modern data stack.
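To make the pre-dbt workflow in the ManyPets quote concrete, here is a rough sketch of what hand-managed SQL models can look like. It is not ManyPets' code. The model names, the `{prefix}` placeholder, and the `DEPENDS_ON` mapping are all hypothetical, and SQLite stands in for the warehouse. Dependencies are declared by hand and Python string formatting provides the only code reuse.

```python
import sqlite3
from graphlib import TopologicalSorter

# Hypothetical model queries; in the setup described above these would be .sql files.
MODELS = {
    "stg_orders": (
        "CREATE TABLE {prefix}stg_orders AS "
        "SELECT id, amount FROM {prefix}raw_orders WHERE amount > 0"
    ),
    "order_totals": (
        "CREATE TABLE {prefix}order_totals AS "
        "SELECT SUM(amount) AS total FROM {prefix}stg_orders"
    ),
}

# Maintained by hand: each model maps to the models it reads from.
DEPENDS_ON = {
    "stg_orders": set(),
    "order_totals": {"stg_orders"},
}


def run_models(conn, prefix=""):
    # Run every model after the models it depends on.
    order = list(TopologicalSorter(DEPENDS_ON).static_order())
    for name in order:
        # Basic reuse via string formatting, as in the quote.
        conn.execute(MODELS[name].format(prefix=prefix))
    return order
```

Every new model means editing both mappings and keeping them in sync. dbt removes that chore by inferring the dependency graph from `ref()` calls inside the models themselves.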
Emily Thompson: Growing Data Teams from Reactive to Influential
We sometimes joke about data engineering as the backend of the backend. The author narrates three stages of the data team’s maturity model—a highly recommended read on how to mature the data function in an organization.
Madison Schott: Prevent Data Loss With This Free dbt Package
TIL about re-data, an open-source data reliability framework for observing dbt projects. The author walks through an introduction to using re-data with a dbt project.
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.