Data Engineering Weekly #58

Weekly Data Engineering Newsletter

Oct 04, 2021

Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.

Event: Join Impact 2021 on November 3, 2021: The First-Ever Data Observability Summit. Join Today's Leading Data Pioneers

Hear from data leaders pioneering the technologies & processes shaping data engineering. Featuring the first Chief Data Scientist of the U.S., the founder of Data Mesh, and many more!

Click To Get Your Free Ticket For All Data Engineering Weekly Readers

O'Reilly: 2021 Data/AI Salary Survey

O'Reilly published a comprehensive salary report for data & AI professionals. The sad trend continues: women's salaries were sharply lower than men's, averaging $126,000 annually, or 84% of the average salary for men ($150,000). Python, SQL & JavaScript are the top 3 most popular programming languages for data & AI.

https://www.oreilly.com/radar/2021-data-ai-salary-survey/

Vivian Guo: Make Machine Learning Work for Your Company - A Primer

Software-driven industrialization is moving from process-based workflows to ML/AI-driven workflows. But how do you build a machine learning team? And what does this mean for software companies? The author walks through how to start a machine learning team, hire for it, and track its impact.

https://medium.com/iconiq-growth/make-machine-learning-work-for-your-company-a-primer-f68ad0b1cd40

Matt Turck: Red Hot - The 2021 Machine Learning, AI and Data (MAD) Landscape

Matt Turck published a comprehensive Machine Learning, AI & Data (MAD!!!) landscape. One interesting fact I noticed in the landscape is that Jupyter notebooks still own the collaboration space. Data engineering is inherently social & collaborative work across the org, and I see this collaboration space as still wide open.

https://mattturck.com/data2021/

Pinterest: Ensuring High Availability of Ads Realtime Streaming Services

Pinterest writes about its highly available ads real-time streaming services built on Apache Flink & Kafka. The hot-hot primary & standby pipeline for each service is an exciting design to read about.

https://medium.com/pinterest-engineering/ensuring-high-availability-of-ads-realtime-streaming-services-ea3889420490

LinkedIn: Distributed tier merge: How LinkedIn tackles stragglers in search index build

LinkedIn writes about distributed tier merge for building its offline search index using Apache Spark. The migration from MapReduce to Spark, together with distributed tier merge, improved the build time by 40% across the product!

https://engineering.linkedin.com/blog/2021/distributed-tier-merge

DoorDash: How to Run Apache Airflow on Kubernetes at Scale

DoorDash writes an exciting blog narrating its migration of Airflow from a single-instance infrastructure to Kubernetes using the KubernetesPodOperator. The blog reports that the Airflow scheduler has more memory available after offloading the operator workloads to Kubernetes.

https://doordash.engineering/2021/09/28/how-to-run-apache-airflow-on-kubernetes-at-scale/

Airbnb: The Airflow Smart Sensor Service

An Airflow poking sensor is a resource-intensive operator that keeps running until the specified condition is satisfied. Airbnb writes about the impact of smart sensors on its Airflow infrastructure. With deduplication, smart sensors cut the load on the Hive metastore by 40%.

https://medium.com/airbnb-engineering/the-airflow-smart-sensor-service-221f96227bcb
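To illustrate the deduplication idea in the Airbnb item above: when several sensor tasks wait on the same condition, a central service can group them by what they check and run each distinct check once. The following is a minimal Python sketch of that idea only. It is not Airbnb's or Airflow's implementation, and `SensorTask`, `poke_deduplicated`, and the `partition_exists` callback are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class SensorTask:
    """Hypothetical description of one sensor waiting on a table partition."""
    task_id: str
    table: str
    partition: str


def poke_deduplicated(
    tasks: Iterable[SensorTask],
    partition_exists: Callable[[str, str], bool],
) -> Dict[str, bool]:
    """Run each distinct (table, partition) check once and fan the result out.

    `partition_exists` stands in for a metastore lookup (an assumption here).
    """
    # Group task ids by the condition they are waiting on.
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for task in tasks:
        groups[(task.table, task.partition)].append(task.task_id)

    results: Dict[str, bool] = {}
    for (table, partition), task_ids in groups.items():
        satisfied = partition_exists(table, partition)  # one call per group
        for task_id in task_ids:
            results[task_id] = satisfied
    return results
```

With three tasks where two wait on the same partition, the lookup callback is invoked twice instead of three times; the saving grows with the number of tasks sharing a condition.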

Storyblocks: Blue-Green ETLs with Airflow Task Groups

Storyblocks writes about adopting the Blue-Green ETL model with Airflow on its Redshift data warehouse. Load-and-swap in a mutable pipeline is always a challenge, and it's great to see the Blue-Green deployment pattern adopted.

https://medium.com/storyblocks-engineering/blue-green-etls-with-airflow-task-groups-71c36d120c2e
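As a rough illustration of the load-and-swap step mentioned in the Storyblocks item above, the sketch below loads fresh data into a staging table, checks it, and then swaps it with the live table inside one transaction. It uses Python's built-in `sqlite3` as a stand-in for Redshift, and the function name, table suffixes, and two-column layout are assumptions made for the example rather than details from the Storyblocks post.

```python
import sqlite3
from typing import Iterable, Tuple


def blue_green_load(
    conn: sqlite3.Connection, table: str, rows: Iterable[Tuple[int, str]]
) -> None:
    """Load `rows` into a staging table, then swap it in for `table`.

    Assumes `conn` was opened with isolation_level=None so that the
    explicit BEGIN/COMMIT below controls the transaction.
    """
    staging = f"{table}_staging"  # hypothetical naming convention
    previous = f"{table}_old"

    # Build the new ("green") copy next to the live ("blue") table.
    conn.execute(f"DROP TABLE IF EXISTS {staging}")
    conn.execute(f"CREATE TABLE {staging} (id INTEGER PRIMARY KEY, value TEXT)")
    conn.executemany(f"INSERT INTO {staging} (id, value) VALUES (?, ?)", list(rows))

    # Validate before touching the live table.
    count = conn.execute(f"SELECT COUNT(*) FROM {staging}").fetchone()[0]
    if count == 0:
        raise ValueError("refusing to swap in an empty table")

    # Swap atomically: readers see either the old table or the new one.
    conn.execute("BEGIN")
    try:
        conn.execute(f"DROP TABLE IF EXISTS {previous}")
        conn.execute(f"ALTER TABLE {table} RENAME TO {previous}")
        conn.execute(f"ALTER TABLE {staging} RENAME TO {table}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Keeping the previous table around under a separate name gives a cheap rollback path if a problem is found after the swap.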

Wealthfront: Automating Data Quality Checks on External Data

Building data pipelines on top of external, uncontrolled datasets can be challenging. Wealthfront writes about its data quality approach: persist the raw data, transform it to a confirmed schema, validate it, and handle the anomalies.

https://eng.wealthfront.com/2021/09/28/automating-data-quality-checks-on-external-data/
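The four steps in the Wealthfront item above (persist raw, transform, validate, handle anomalies) can be sketched in a few lines of Python. This is only an illustration of the general pattern under simple assumptions: the JSON payloads, the `account_id` and `balance` fields, the validation rules, and the in-memory lists standing in for storage are all hypothetical and do not come from the Wealthfront post.

```python
import json
from typing import Any, Dict, List, Tuple


def to_schema(record: Dict[str, Any]) -> Dict[str, Any]:
    """Transform an external record into a fixed, typed shape (assumed fields)."""
    return {
        "account_id": str(record["id"]).strip(),
        "balance": float(record["balance"]),
    }


def validate(row: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the row is acceptable."""
    problems = []
    if not row["account_id"]:
        problems.append("missing account_id")
    if row["balance"] < 0:
        problems.append("negative balance")
    return problems


def process(payloads: List[str]) -> Tuple[List[str], List[dict], List[tuple]]:
    raw_store: List[str] = []      # step 1: every payload is kept as received
    clean: List[dict] = []         # rows that passed validation
    quarantined: List[tuple] = []  # step 4: anomalies set aside with reasons
    for payload in payloads:
        raw_store.append(payload)
        try:
            row = to_schema(json.loads(payload))  # step 2: transform
        except (ValueError, KeyError, TypeError) as exc:
            quarantined.append((payload, [f"unparseable: {exc!r}"]))
            continue
        problems = validate(row)                  # step 3: validate
        if problems:
            quarantined.append((payload, problems))
        else:
            clean.append(row)
    return raw_store, clean, quarantined
```

Persisting the raw payload before any parsing means a bad transform or an overly strict rule can be fixed later and the data replayed, which is the main reason to keep that step first.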

Teads: Managing a BigQuery data warehouse at scale

Teads published helpful tips and tools for managing BigQuery: resolving slow-running queries and improving slot usage and table size. The BqVisualiser looks like an exciting tool to visualize and optimize query performance.

https://medium.com/teads-engineering/managing-a-bigquery-data-warehouse-at-scale-e6ec9a8406b2

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.
