Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers
RudderStack
Provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.
Let’s start this week’s edition with some excellent SQL tips.
Galen B: Why Google Treats SQL Like Code and You Should Too
Analytics engineering practices are becoming standard across the analytics world. The author summarizes why we should treat SQL as code and the need for a version control system for analytics.
https://blog.devgenius.io/why-google-treats-sql-like-code-and-you-should-too-53f97925037e
Vicki Boykis: Git, SQL, CLI
Continuing on the analytics engineering topic, the author writes about some of the fundamental tools you’ll need in any technical job. It's nice to see SQL listed there as a requirement, not just for analytics but for all technical jobs.
https://vickiboykis.com/2022/01/09/git-sql-cli/
You can see a similar sentiment in this tweet.
Netflix: Auto-Diagnosis and Remediation in Netflix Data Platform
An efficient feedback loop from an auto-remediation system can save many on-call hours. Netflix writes about its regex-based rule engine for diagnosing the most common batch and real-time system errors.
https://netflixtechblog.com/auto-diagnosis-and-remediation-in-netflix-data-platform-5bcc52d853d1
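To make the idea of a regex rule engine concrete, here is a minimal sketch. It is not Netflix's implementation: the rule names, patterns, categories, and remediation labels are all hypothetical, chosen only to show how log text can be matched against an ordered list of rules.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """One diagnosis rule (all fields are illustrative assumptions)."""
    name: str
    pattern: re.Pattern
    category: str       # e.g. "memory", "user_error", "transient"
    remediation: str    # label for the action a remediation system might take


# Hypothetical rules; the first matching rule wins.
RULES = [
    Rule("oom",
         re.compile(r"java\.lang\.OutOfMemoryError|Container killed.*exceeding memory limits", re.I),
         "memory", "retry_with_more_memory"),
    Rule("missing_table",
         re.compile(r"Table or view not found", re.I),
         "user_error", "notify_owner"),
    Rule("timeout",
         re.compile(r"(connection|read) timed out", re.I),
         "transient", "retry"),
]


def diagnose(log_text: str) -> Optional[Rule]:
    """Return the first rule whose pattern appears in the log text, else None."""
    for rule in RULES:
        if rule.pattern.search(log_text):
            return rule
    return None


if __name__ == "__main__":
    result = diagnose("Caused by: java.lang.OutOfMemoryError: Java heap space")
    print(result.name if result else "undiagnosed")
```

Keeping the rules as plain data means new failure signatures can be added without touching the matching logic.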
LinkedIn: A closer look at how LinkedIn integrates fairness into its AI products
LinkedIn writes about its algorithmic fairness and explainability design to measure and mitigate unfair bias at scale. The Fairness training toolkit and the continuous feedback loop to measure the success of the Fair model analyzer are exciting reads.
DoorDash: Introducing Fabricator - A Declarative Feature Engineering Framework
We've seen an increasing pattern of adopting declarative DSL patterns for end-to-end feature engineering in real-time and batch mode. Airbnb has written about its declarative feature engineering system in the past. DoorDash writes an exciting blog describing Fabricator, its declarative feature engineering framework.
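As a rough illustration of what "declarative" means here, the sketch below describes features as data and lets a generic engine compute them. It is not Fabricator's API; `FeatureSpec`, `materialize`, and the column names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Any, Dict, List

# Supported aggregations, looked up by the name used in a spec.
AGGREGATIONS = {"sum": sum, "avg": mean, "count": len}


@dataclass(frozen=True)
class FeatureSpec:
    """Declares WHAT to compute; the engine below decides HOW."""
    name: str
    entity_key: str      # column identifying the entity, e.g. a store
    source_column: str   # column to aggregate
    aggregation: str     # key into AGGREGATIONS


# Hypothetical feature definitions.
SPECS = [
    FeatureSpec("store_order_count", "store_id", "order_total", "count"),
    FeatureSpec("store_avg_order_total", "store_id", "order_total", "avg"),
]


def materialize(specs: List[FeatureSpec], rows: List[Dict[str, Any]]) -> Dict[Any, Dict[str, Any]]:
    """Compute every declared feature per entity from in-memory rows."""
    features: Dict[Any, Dict[str, Any]] = {}
    for spec in specs:
        # Group the source column's values by entity.
        grouped: Dict[Any, list] = {}
        for row in rows:
            grouped.setdefault(row[spec.entity_key], []).append(row[spec.source_column])
        aggregate = AGGREGATIONS[spec.aggregation]
        for entity, values in grouped.items():
            features.setdefault(entity, {})[spec.name] = aggregate(values)
    return features
```

Adding a feature then means adding one more `FeatureSpec` entry rather than writing a new pipeline.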
Metaphor: The Modern Metadata Platform - What, Why, and How?
Support for broad integration patterns like push and pull and, most importantly, analytics on top of the metadata is vital for the modern metadata platform. Answering questions like "show me all the datasets that contain PII, accessed directly or indirectly via lineage within the last three months" is key to gaining insight into the data management system. Metaphor writes about DataHub and how it supports the modern metadata platform capabilities.
https://metaphor.io/blog/the-modern-metadata-platform
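The PII question above can be read as a graph traversal plus a time filter. The sketch below is a toy, in-memory version under that reading: a dataset "contains PII" if it is tagged or derived from a tagged dataset. It does not use DataHub's or Metaphor's API, and the dataset names, tags, and access dates are made up.

```python
from collections import deque
from datetime import date, timedelta

# Hypothetical metadata: PII tags, lineage edges (upstream -> downstream),
# and the last access date per dataset.
PII_TAGGED = {"raw.users"}
LINEAGE = {
    "raw.users": ["staging.users"],
    "staging.users": ["marts.customer_360"],
    "raw.orders": ["marts.customer_360", "marts.revenue"],
}
LAST_ACCESSED = {
    "raw.users": date(2021, 6, 1),
    "staging.users": date(2021, 12, 20),
    "marts.customer_360": date(2022, 1, 10),
    "marts.revenue": date(2022, 1, 12),
}


def datasets_with_pii(tagged, lineage):
    """Breadth-first walk downstream from every PII-tagged dataset."""
    seen = set(tagged)
    queue = deque(tagged)
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def recently_accessed_pii(as_of, days=90):
    """PII datasets (direct or via lineage) accessed within the last `days` days."""
    cutoff = as_of - timedelta(days=days)
    return sorted(
        name
        for name in datasets_with_pii(PII_TAGGED, LINEAGE)
        if LAST_ACCESSED.get(name, date.min) >= cutoff
    )
```

A real metadata platform would answer this from its lineage graph and access logs; the point is only that the query needs both lineage and usage metadata in one place.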
Sponsored: New Year, Better Event Data with Avo & Rudderstack
Join RudderStack and Avo for a live webinar on January 27 @ 9am PT to learn how you can increase your event data quality and streamline your behavioral data pipelines.
https://www.avo.app/event-driven-infrastructure-webinar
Pedram Navid: Airflow, Prefect, and Dagster: An Inside Look
Airflow was a breakthrough system that brought the pipeline-as-code pattern to data engineering. Since then, orchestration engines like Prefect & Dagster have taken the concept to the next level, building on the lessons learned from Airflow. The author walks through some of the pain points of running Airflow and compares it with Dagster & Prefect.
https://towardsdatascience.com/airflow-prefect-and-dagster-an-inside-look-6074781c9b77
I shared my thoughts on why the DAG model of data pipelines is obsolete here.
Trip.com: StarRocks efficiently supports high concurrent queries, dramatically reducing labor and hardware costs.
Trip.com writes about its experience switching from ClickHouse to StarRocks for its real-time analytical database. TIL about StarRocks, and it seems like an exciting system. I’m continuing to hear more about performance issues with ClickHouse, and I'm curious to know folks' experience with it. Please DM @data_weekly if you’re using ClickHouse in production.
CIDR: CIDR Conference 2022
Lastly, I enjoyed attending CIDR 2022 last week and was delighted to see the latest research on the data ecosystem. All the CIDR talks and papers are published here:
http://cidrdb.org/cidr2022/program.html
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.