Data Engineering Weekly #70

Weekly Data Engineering Newsletter

Jan 17, 2022

Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.

Let’s start this week’s edition with some excellent SQL tips.

Ergest Xheblati 🦊 @ergestx

I’ve been writing SQL for ~15 years. I’ve seen hundreds of thousands of lines of code. Over time I developed a set of patterns and best practices I always come back to when writing queries. This is my attempt to decode them 👇👇👇

Galen B: Why Google Treats SQL Like Code and You Should Too

Analytics engineering practices are becoming standard across the analytics world. The author summarizes why we should treat SQL as code and the need for a version control system for analytics.

https://blog.devgenius.io/why-google-treats-sql-like-code-and-you-should-too-53f97925037e

Vicki Boykis: Git, SQL, CLI

Continuing on the analytics engineering topic, the author writes about some of the fundamental tools you’ll need in any technical job. It's nice to see SQL listed as a requirement, not just for analytics but for all technical jobs.

https://vickiboykis.com/2022/01/09/git-sql-cli/

You can see a similar sentiment in this tweet.

Alex Watt @alexcwatt

SQL is one of those things I wish I'd learned earlier. I underestimated how useful it is in general.

Netflix: Auto-Diagnosis and Remediation in Netflix Data Platform

Efficient feedback from an auto-remediation system can save many on-call hours. Netflix writes about the regex-based rule engine it uses to diagnose the most common errors in its batch and real-time systems.

https://netflixtechblog.com/auto-diagnosis-and-remediation-in-netflix-data-platform-5bcc52d853d1
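
To make the regex rule engine idea concrete, here is a minimal sketch in Python. It is not Netflix's implementation: the rule names, patterns, categories, and remediation hints below are all hypothetical, chosen only to show how an ordered list of regex rules can classify a failure log.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    """One diagnosis rule. All fields are illustrative assumptions."""
    name: str
    pattern: "re.Pattern[str]"
    category: str      # hypothetical labels: "user_error" or "platform_error"
    remediation: str   # hypothetical hint shown to the job owner


# Rules are checked in order; the first match wins.
RULES: List[Rule] = [
    Rule("oom", re.compile(r"java\.lang\.OutOfMemoryError"),
         "user_error", "increase memory and retry"),
    Rule("missing_table", re.compile(r"Table or view not found"),
         "user_error", "check the table name"),
    Rule("timeout", re.compile(r"Connection timed out"),
         "platform_error", "retry with backoff"),
]


def diagnose(log_text: str, rules: List[Rule] = RULES) -> Optional[Rule]:
    """Return the first rule whose pattern appears in the log, else None."""
    for rule in rules:
        if rule.pattern.search(log_text):
            return rule
    return None


if __name__ == "__main__":
    hit = diagnose("Job failed: java.lang.OutOfMemoryError: Java heap space")
    if hit is not None:
        print(hit.name, hit.category, hit.remediation)
```

Keeping the rules as data rather than code means that a new failure mode can be covered by appending one entry to the list.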

LinkedIn: A closer look at how LinkedIn integrates fairness into its AI products

LinkedIn writes about its algorithmic fairness and explainability design to measure and mitigate unfair bias at scale. The Fairness training toolkit and the continuous feedback loop to measure the success of the Fair model analyzer are exciting reads.

https://engineering.linkedin.com/blog/2022/a-closer-look-at-how-linkedin-integrates-fairness-into-its-ai-pr

DoorDash: Introducing Fabricator - A Declarative Feature Engineering Framework

We've seen a growing trend of adopting declarative DSLs for end-to-end feature engineering in real-time and batch mode. Airbnb has written about its declarative feature engineering system in the past. DoorDash writes an exciting blog describing Fabricator, its declarative feature engineering framework.

https://doordash.engineering/2022/01/11/introducing-fabricator-a-declarative-feature-engineering-framework/

Metaphor: The Modern Metadata Platform - What, Why, and How?

Support for broad integration patterns like push-pull and, most importantly, analytics on top of the metadata is vital for the modern metadata platform. Answering questions like "show me all the datasets that contain PII, accessed directly or indirectly via lineage within the last three months" is key to gaining insight into the data management system. Metaphor writes about DataHub and how it supports the modern metadata platform capabilities.

https://metaphor.io/blog/the-modern-metadata-platform
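
As a rough illustration of the kind of metadata analytics described above, the sketch below answers a PII question over a toy lineage graph in Python. It does not use DataHub's API. The dataset names, the dictionary-based lineage, and the access timestamps are all made up, and it assumes one reading of the question: find datasets that hold PII directly or inherit it through lineage, and that were accessed within the last 90 days.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Set

# Hypothetical lineage: each dataset maps to its direct downstream datasets.
LINEAGE: Dict[str, List[str]] = {
    "raw.users": ["staging.users"],
    "staging.users": ["marts.user_summary"],
    "raw.orders": ["marts.order_summary"],
}

# Hypothetical set of datasets tagged as containing PII at the source.
PII_SOURCES: Set[str] = {"raw.users"}

# Hypothetical last-access timestamps per dataset.
LAST_ACCESSED: Dict[str, datetime] = {
    "raw.users": datetime(2021, 6, 1),
    "staging.users": datetime(2022, 1, 10),
    "marts.user_summary": datetime(2021, 12, 1),
    "marts.order_summary": datetime(2022, 1, 15),
}


def pii_tainted(lineage: Dict[str, List[str]], sources: Iterable[str]) -> Set[str]:
    """Breadth-first walk: PII sources plus everything downstream of them."""
    seen = set(sources)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def recently_accessed_pii(lineage, sources, last_accessed, now, window_days=90):
    """PII-tainted datasets accessed on or after now - window_days, sorted by name."""
    cutoff = now - timedelta(days=window_days)
    return sorted(
        name for name in pii_tainted(lineage, sources)
        if name in last_accessed and last_accessed[name] >= cutoff
    )


if __name__ == "__main__":
    print(recently_accessed_pii(LINEAGE, PII_SOURCES, LAST_ACCESSED,
                                now=datetime(2022, 1, 17)))
```

The point of the sketch is that the question needs both the lineage graph and usage metadata in one place, which is why analytics on top of metadata matters.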

Pedram Navid: Airflow, Prefect, and Dagster: An Inside Look

Airflow was a breakthrough system that brought the pipeline-as-code pattern to data engineering. Since then, orchestration engines like Prefect & Dagster have taken the concept to the next level, building on the lessons learned from Airflow. The author takes some of the pain points of running Airflow and compares it with Dagster & Prefect.

https://towardsdatascience.com/airflow-prefect-and-dagster-an-inside-look-6074781c9b77

I shared my thoughts on why the DAG model of data pipelines is obsolete here.

Ananth Packkildurai @ananthdurai

The future of DAG is NO DAG!!! The DAG approach for the data pipeline taking the focus off from data asset & asset lifecycle. Currently we are mitigating the asset mgmt with auxiliary systems like lineage & discovery which runs on asset model, not DAGs.

Trip.com: StarRocks efficiently supports high concurrent queries, dramatically reducing labor and hardware costs.

Trip.com writes about its experience switching from ClickHouse to StarRocks for its real-time analytical database. TIL about StarRocks, and it seems like an exciting system. I’m continuing to hear more about performance issues with ClickHouse, and am curious to know folks' experience with ClickHouse. Please DM @data_weekly if you’re using ClickHouse in production.

https://starrocks.medium.com/trip-com-starrocks-efficiently-supports-high-concurrent-queries-dramatically-reduces-labor-and-1e1921dd6bf8

CIDR: CIDR Conference 2022

Lastly, I enjoyed attending CIDR 2022 last week and was delighted to see the latest research on the data ecosystem. All the CIDR talks and papers are published here.

http://cidrdb.org/cidr2022/program.html

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.

Data Engineering Weekly