Data Engineering Weekly #56

Weekly Data Engineering Newsletter

Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.

Event: Join Impact 2021 on November 3, 2021: The First-Ever Data Observability Summit. Join Today's Leading Data Pioneers

Hear from data leaders pioneering the technologies and processes shaping data engineering. Featuring the first Chief Data Scientist of the U.S., the founder of Data Mesh, and many more!

Click To Get Your Free Ticket For All Data Engineering Weekly Readers

Benn Stancil: The Data OS

Y Combinator, an incubator of both startups and the Silicon Valley zeitgeist, previously funded 15 analytics, data engineering, and AI and ML companies. In 2021, it funded 100. Does the modern data stack bring too many tools to the table to solve the data problem? Benn Stancil discusses the idea of a data OS.

Data Engineering - UC Berkeley, Spring 2021

UC Berkeley published its spring 2021 data engineering course slides and resources. It is excellent learning material for data engineering practitioners.

Airbnb: Automating Data Protection at Scale

Data protection and privacy monitoring are critical aspects of a data management platform. They are also among the most challenging, since data can travel through multiple data stores, which makes it hard to track manually. Airbnb writes about Madoka, a metadata system for data protection that maintains the security and privacy-related metadata for all data assets on the Airbnb platform.

Uber: YAML Generator for Funnel YAML Files: Streamlining the Mobile Data Workflow Process

Funnel analysis is a critical analytical feature built on click-tracking events. Uber writes an exciting blog about its YAML generator, followed by a simple UI workflow engine for developing funnel analyses. It raises an interesting data pipeline debate: no-code or code-only data pipelines? IMO, the answer is to know your audience and their workflow so you can make them productive.
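For readers new to funnel analysis, here is a minimal sketch of the underlying computation: counting how many users reach each step of an ordered funnel. It is not Uber's YAML schema or implementation; the function name, the event tuple shape, and the event names are all assumptions made for illustration.

```python
from collections import defaultdict


def funnel_counts(events, steps):
    """Count how many users reach each funnel step, in order.

    events: iterable of (user_id, event_name, timestamp) tuples (hypothetical shape).
    steps: ordered list of event names that define the funnel.
    """
    # Group every user's events so they can be replayed in time order.
    by_user = defaultdict(list)
    for user_id, name, ts in events:
        by_user[user_id].append((ts, name))

    reached = [0] * len(steps)
    for user_events in by_user.values():
        user_events.sort()
        next_step = 0
        for _, name in user_events:
            # A user only advances when the event matches the next expected step.
            if next_step < len(steps) and name == steps[next_step]:
                reached[next_step] += 1
                next_step += 1
    return dict(zip(steps, reached))


# Hypothetical click-tracking events: (user_id, event_name, timestamp).
sample_events = [
    ("u1", "open_app", 1),
    ("u1", "request_ride", 2),
    ("u1", "complete_trip", 3),
    ("u2", "open_app", 1),
    ("u2", "complete_trip", 2),
    ("u3", "request_ride", 1),
]
sample_steps = ["open_app", "request_ride", "complete_trip"]
print(funnel_counts(sample_events, sample_steps))
```

A config-driven approach, whether YAML or a UI, essentially lets users declare the `steps` list without writing the counting logic themselves.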

Intuit: A Paved Road for Data Pipelines

Intuit gives a general overview of its data infrastructure, emphasizing that a lack of standardization can lead to fragmentation and islands of computing. The blog describes Intuit's developer portal and its UI-driven pipeline lifecycle management platform.

Sponsored: RudderStack - Churn Prediction With BigQueryML to Increase Mobile Game Revenue

Here’s an interesting case study on how machine learning can directly impact the bottom line. RudderStack outlines how the app developer Torpedo Labs uses BigQuery ML to identify high-value mobile game players who are dangerously close to churning.

Pinterest: Faster Flink adoption with self-service diagnosis tool at Pinterest

Self-service diagnostic tooling is a vital part of a data platform for democratizing adoption. Pinterest writes about Dr. Squirrel, a Flink log aggregator that performs job health checks, flags unhealthy jobs explicitly, and provides root cause analysis and actionable steps to help fix the issues.

Cloudera: Operating Apache Kafka with Cruise Control

Cruise Control is one of my favorite tools for operating Apache Kafka at scale. Cloudera writes an exciting blog giving an overview of Cruise Control and its use cases.

AutoTrader: Auto-generating an Airflow DAG using the dbt manifest

It is always challenging to integrate Airflow, a task dependency system, with dbt, a model dependency system. AutoTrader writes an exciting blog about its DbtTaskGenerator, which auto-generates Airflow DAGs from dbt's manifest files.
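To make the idea concrete, here is a minimal sketch of the first half of that integration: reading model-to-model dependencies out of a dbt manifest. It is not AutoTrader's DbtTaskGenerator; the function names and the sample model ids are hypothetical, and the sketch deliberately stops short of creating Airflow tasks so it runs with only the standard library. It relies on the `nodes`, `resource_type`, and `depends_on.nodes` fields of dbt's `manifest.json`.

```python
import json


def load_manifest(path="target/manifest.json"):
    """Load a dbt manifest from disk (dbt writes it to the target directory)."""
    with open(path) as f:
        return json.load(f)


def model_dependencies(manifest):
    """Map each dbt model's unique_id to the model ids it depends on."""
    nodes = manifest.get("nodes", {})
    models = {
        uid for uid, node in nodes.items()
        if node.get("resource_type") == "model"
    }
    deps = {}
    for uid in sorted(models):
        parents = nodes[uid].get("depends_on", {}).get("nodes", [])
        # Keep only model-to-model edges; sources, seeds, and tests are skipped here.
        deps[uid] = [p for p in parents if p in models]
    return deps


# Hypothetical, trimmed-down manifest for illustration.
sample_manifest = {
    "nodes": {
        "model.proj.stg_orders": {
            "resource_type": "model",
            "depends_on": {"nodes": ["source.proj.raw.orders"]},
        },
        "model.proj.fct_orders": {
            "resource_type": "model",
            "depends_on": {"nodes": ["model.proj.stg_orders"]},
        },
        "test.proj.not_null_fct_orders_id": {
            "resource_type": "test",
            "depends_on": {"nodes": ["model.proj.fct_orders"]},
        },
    }
}
print(model_dependencies(sample_manifest))
```

In an Airflow DAG, each key could become one task that runs that single model, and each listed parent could become an upstream dependency of that task.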

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.