Data Engineering Weekly #79

Weekly Data Engineering Newsletter

Mar 21, 2022

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

Saurav Jain: Seven Python Libraries for Machine Learning

This week, let’s start with this handy mini-introduction to seven Python libraries for machine learning.

Saurav Jain @Sauain

Small introduction of SEVEN Python Libraries for Machine Learning 🤖 Thread 🧵👇

Atlan: The Great Data Debate - Unbundling vs. Bundling of the Modern Data Stack

The data community had an interesting conversation around bundling vs. unbundling. I had a fun conversation with Gorkem Yurtseven, Prukalpa, and Nick Schrock, and here is the exciting poll result.

Stanford HAI: The State of AI in 9 Charts

Stanford HAI writes about the current state of AI, highlighting the diversity challenges, booming private investment, and AI patent filings.

https://hai.stanford.edu/news/state-ai-9-charts

Spotify: Why We Switched Our Data Orchestration Service

Luigi, one of the pioneering data orchestration engines, is leaving its home. Spotify writes about its plan to move from Luigi to Flyte.

Is that the official EOL for Luigi?

https://engineering.atspotify.com/2022/03/why-we-switched-our-data-orchestration-service/

LinkedIn: Opal - Building a mutable dataset in data lake

Handling data mutability is a critical aspect of data infrastructure. Lakehouse frameworks adopt incremental updates as a core design principle. LinkedIn writes about Opal's design and how it facilitates building a mutable dataset in the data lake.

https://engineering.linkedin.com/blog/2022/opal--building-a-mutable-dataset-in-data-lake

Twitter: Graph machine learning with missing node features

Partially unavailable data is a common cause of ML model failure. Twitter writes about an efficient, scalable approach for handling missing features in graph machine learning applications.

https://blog.twitter.com/engineering/en_us/topics/insights/2022/graph-machine-learning-with-missing-node-features

Grab: Real-time data ingestion in Grab

Change Data Capture with the transactional outbox pattern and Debezium is becoming a standard approach for event sourcing. Grab writes about its design of real-time data ingestion.

https://engineering.grab.com/real-time-data-ingestion
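For readers new to the transactional outbox pattern mentioned in the Grab item above, here is a minimal sketch using only Python's standard library. It is not Grab's implementation. The table names, column names, and function names are hypothetical, and SQLite stands in for the production database. The idea is that the business row and the event row are written in the same transaction, so an event exists if and only if the change was committed. In a real setup, a CDC tool such as Debezium would read the outbox table from the database log and publish the events; `pending_events` below is only a stand-in for that step.

```python
import json
import sqlite3


def create_schema(conn):
    # Hypothetical tables: one business table plus the outbox table.
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "aggregate_id INTEGER NOT NULL, "
        "event_type TEXT NOT NULL, "
        "payload TEXT NOT NULL)"
    )


def place_order(conn, order_id, item):
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so the order row and its outbox event are stored atomically.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, item) VALUES (?, ?)", (order_id, item)
        )
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
            (order_id, "OrderPlaced", json.dumps({"id": order_id, "item": item})),
        )


def pending_events(conn):
    # Stand-in for the CDC reader: return outbox events in insertion order.
    rows = conn.execute(
        "SELECT event_type, payload FROM outbox ORDER BY id"
    ).fetchall()
    return [(event_type, json.loads(payload)) for event_type, payload in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    place_order(conn, 1, "shoes")
    print(pending_events(conn))
```

The design choice worth noting is that the application never publishes to the message bus directly, which avoids the dual-write problem where the database commit succeeds but the publish fails, or the reverse.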

Stitch Fix: Migrating Spark from EMR on EC2 to EMR on EKS

I wrote about the curious case of the AWS EMR pricing model, highlighting the EMR surcharge. Stitch Fix writes about other operational issues such as non-adaptive provisioning, Spark multi-version support, scattered observability, and the lack of configuration agility. Stitch Fix is mitigating these issues by running EMR on EKS.

I think it is time to rethink YARN for the cloud. I don't think running one scheduler on top of another is an optimal solution.

https://multithreaded.stitchfix.com/blog/2022/03/14/spark-eks/

AWS: Best practices for right-sizing your Apache Kafka clusters to optimize performance and cost

Cost optimization is top of mind for many companies. AWS writes about approaching performance and cost optimization, with tips on right-sizing Kafka clusters, network throughput, and monitoring for continuous optimization.

https://aws.amazon.com/blogs/big-data/best-practices-for-right-sizing-your-apache-kafka-clusters-to-optimize-performance-and-cost/

Whatnot: Building a Modern Data Stack at Whatnot

Whatnot writes about its adoption of the modern data stack. We highlighted the tale of Airflow operator vs. dbt, and the Whatnot article reflects the same.

Second, for our data transformation layer, we rely on two tools: Apache Airflow and DBT. For any data replication work (where we transform data before it is loaded into the data warehouse), we write the logic in Python that transforms and loads the data. We also use Airflow to orchestrate training machine-learning models.

It will be exciting to see how orchestration and data transformation co-exist or merge at some point. One thing I'm not clear about in the blog is the statement "No dimensional data model for now." Does that mean throwing JSON data into the data warehouse, or not adopting conformed-dimension-style data modeling?

https://medium.com/whatnot-engineering/building-a-modern-data-stack-at-whatnot-afc1d03c3f9

ManyPets: How ManyPets Implemented The Modern Data Stack

Staying on the modern data stack, ManyPets writes about its data stack, highlighting its choice of tools. A couple of key themes:

Airflow + dbt as two orchestration engines for data transformation is a common approach.

Before dbt, we stored our modeling queries as SQL files and ran them as tasks in an Airflow job. We had to manually add dependencies between the tasks and we used Python’s string formatting to allow some basic code reuse. Initially, I’m now embarrassed to say, I didn’t think we needed dbt to help manage this. As we grew further though it became clear that we did and now I can’t imagine life without it!

Data lineage & discovery tooling doesn’t play a significant role in the beginning stage of the modern data stack.

https://medium.com/data-manypets/how-manypets-implemented-the-modern-data-stack-35877715c0da

Emily Thompson: Growing Data Teams from Reactive to Influential

We sometimes joke about data engineering as the backend of the backend. The author narrates three stages of a data team's maturity model. It is a highly recommended read on how to mature the data function in an organization.

Reactive stage
Proactive stage
Influential stage

https://scientistemily.substack.com/p/reactive-proactive-influential

Madison Schott: Prevent Data Loss With This Free dbt Package

TIL about re-data, an open-source data reliability framework for observing dbt projects. The author walks through an introduction to using re-data with a dbt project.

GitHub: https://github.com/re-data/re-data

https://towardsdatascience.com/prevent-data-loss-with-this-free-dbt-package-a676c2e59c97

All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
