Data Engineering Weekly - Brought to You by RudderStack - the Customer Data Platform for Developers
RudderStack
Provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools.
Event: Join Impact 2021 on November 3, 2021: The First-Ever Data Observability Summit. Join Today's Leading Data Pioneers
Hear from data leaders pioneering the technologies & processes shaping data engineering. Featuring the first Chief Data Scientist of the U.S., the founder of the Data Mesh, and many more!
Click To Get Your Free Ticket For All Data Engineering Weekly Readers
O'Reilly: 2021 Data/AI Salary Survey
O'Reilly published a comprehensive salary report for data & AI professionals. The sad trend continues: women's salaries were sharply lower than men's, averaging $126,000 annually, or 84% of the average salary for men ($150,000). Python, SQL & JavaScript are the three most popular programming languages among data & AI professionals.
https://www.oreilly.com/radar/2021-data-ai-salary-survey/
Vivian Guo: Make Machine Learning Work for Your Company - A Primer
Software-driven industrialization is moving from process-based workflows to ML/AI-driven workflows. But how do you build a machine learning team? And what does this mean for software companies? The author walks through how to start a machine learning team, hire for it, and track its impact.
https://medium.com/iconiq-growth/make-machine-learning-work-for-your-company-a-primer-f68ad0b1cd40
Matt Turck: Red Hot - The 2021 Machine Learning, AI and Data (MAD) Landscape
Matt Turck published a comprehensive map of the Machine Learning, AI & Data (MAD!!!) landscape. One interesting fact I noticed in the landscape is that Jupyter notebooks still own the collaboration space. Data engineering is inherently social & collaborative work across the org, and I can see this collaboration space is still wide open.
https://mattturck.com/data2021/
Pinterest: Ensuring High Availability of Ads Realtime Streaming Services
Pinterest writes about its highly available real-time ads streaming services built on Apache Flink & Kafka Streams. The hot-hot primary & standby pipeline for each service is an exciting design to read.
LinkedIn: Distributed tier merge: How LinkedIn tackles stragglers in search index build
LinkedIn writes about distributed tier merge in building its offline search index with Apache Spark. The migration from MapReduce to Spark, together with distributed tier merge, improved the build time by 40% across the product!!
https://engineering.linkedin.com/blog/2021/distributed-tier-merge
Sponsored: RudderStack - Churn Prediction With BigQueryML to Increase Mobile Game Revenue
Here’s an interesting case study on how machine learning can directly impact the bottom line. RudderStack outlines how the app developer Torpedo Labs uses BigQuery ML to identify high-value mobile game players who are dangerously close to churning.
https://rudderstack.com/blog/churn-prediction-with-bigqueryml
DoorDash: How to Run Apache Airflow on Kubernetes at Scale
DoorDash writes an exciting blog narrating its migration of Airflow from a single-instance infrastructure to Kubernetes, running tasks with the KubernetesPodOperator. The blog reports that the Airflow scheduler has more memory available after offloading the operator workloads to Kubernetes.
https://doordash.engineering/2021/09/28/how-to-run-apache-airflow-on-kubernetes-at-scale/
Airbnb: The Airflow Smart Sensor Service
Airflow's poking sensor is a resource-intensive operator that keeps running until the specified condition is satisfied. Airbnb writes about the impact of smart sensors on its Airflow infrastructure. With deduplication, smart sensors cut the load on the Hive metastore by 40%.
https://medium.com/airbnb-engineering/the-airflow-smart-sensor-service-221f96227bcb
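The deduplication idea can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not Airbnb's or Airflow's implementation: the function name `poke_deduplicated`, the `(task_id, poke_key)` request shape, and the `check` callback are all hypothetical. Sensor tasks that wait on the same condition share one check instead of each hitting the metastore.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple


def poke_deduplicated(
    requests: Iterable[Tuple[str, Hashable]],
    check: Callable[[Hashable], bool],
) -> Dict[str, bool]:
    """Run `check` once per distinct poke key and share the result.

    `requests` holds (task_id, poke_key) pairs; a poke key could be,
    for example, a (table, partition) tuple. Both names are hypothetical.
    """
    # Group the waiting tasks by the condition they are waiting on.
    tasks_by_key: Dict[Hashable, List[str]] = defaultdict(list)
    for task_id, poke_key in requests:
        tasks_by_key[poke_key].append(task_id)

    # One external check per distinct key, fanned out to every waiting task.
    results: Dict[str, bool] = {}
    for poke_key, task_ids in tasks_by_key.items():
        satisfied = check(poke_key)
        for task_id in task_ids:
            results[task_id] = satisfied
    return results
```

With three tasks waiting on two distinct partitions, `check` runs twice rather than three times; the saving grows with the number of DAGs that depend on the same upstream tables.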
Storyblocks: Blue-Green ETLs with Airflow Task Groups
Storyblocks writes about adopting the Blue-Green ETL model with Airflow on its Redshift data warehouse. The load-and-swap step in a mutable pipeline is always a challenge, and it's great to see the Blue-Green deployment pattern adopted here.
https://medium.com/storyblocks-engineering/blue-green-etls-with-airflow-task-groups-71c36d120c2e
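For readers new to the pattern, here is a minimal load-and-swap sketch. It uses SQLite from the Python standard library as a stand-in for Redshift, so it is not the Storyblocks pipeline; the function name `blue_green_load`, the `_green` / `_old` suffixes, and the two-column schema are assumptions made for illustration. New data lands in a staging ("green") table, gets a basic check, and only then replaces the live ("blue") table inside one transaction.

```python
import sqlite3
from typing import Iterable, Tuple


def blue_green_load(
    conn: sqlite3.Connection, table: str, rows: Iterable[Tuple[int, float]]
) -> None:
    """Load `rows` into a staging table, validate, then swap it in.

    Assumes `conn` was opened with isolation_level=None (autocommit),
    so the explicit BEGIN/COMMIT below controls the swap transaction.
    Table names are trusted identifiers here, not user input.
    """
    staging = f"{table}_green"   # hypothetical naming convention
    retired = f"{table}_old"

    # 1. Build the green table from scratch; readers still see the live one.
    conn.execute(f"DROP TABLE IF EXISTS {staging}")
    conn.execute(f"CREATE TABLE {staging} (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany(f"INSERT INTO {staging} (id, amount) VALUES (?, ?)", rows)

    # 2. Validate before swapping; a real pipeline would check much more.
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {staging}").fetchone()
    if count == 0:
        raise ValueError("refusing to swap in an empty table")

    # 3. Swap atomically: retire the live table, promote the staging table.
    conn.execute("BEGIN")
    try:
        conn.execute(f"DROP TABLE IF EXISTS {retired}")
        live_exists = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            (table,),
        ).fetchone()
        if live_exists:
            conn.execute(f"ALTER TABLE {table} RENAME TO {retired}")
        conn.execute(f"ALTER TABLE {staging} RENAME TO {table}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Keeping the retired table around for one cycle gives a cheap rollback path; if the validation fails, the live table is never touched.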
Wealthfront: Automating Data Quality Checks on External Data
Building data pipelines on top of external, uncontrolled datasets can be challenging. Wealthfront writes about its data quality approach: persist the raw data, transform it to a confirmed schema and validate it, and handle the anomalies.
https://eng.wealthfront.com/2021/09/28/automating-data-quality-checks-on-external-data/
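The three steps can be sketched as follows. This is a toy illustration under stated assumptions, not Wealthfront's code: the `Price` record, its fields, the validation rules, and the JSON-lines input format are all hypothetical.

```python
import json
from dataclasses import dataclass
from datetime import date
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Price:
    """Hypothetical target schema for an external price feed."""
    symbol: str
    day: date
    close: float


def to_price(raw: dict) -> Price:
    # Transform: coerce loosely typed vendor fields into the schema.
    return Price(
        symbol=str(raw["symbol"]).upper(),
        day=date.fromisoformat(raw["day"]),
        close=float(raw["close"]),
    )


def validate(price: Price) -> List[str]:
    # Validate: return a list of problems; an empty list means the row is good.
    problems = []
    if not price.symbol:
        problems.append("empty symbol")
    if price.close <= 0:
        problems.append("non-positive close")
    return problems


def process(
    raw_lines: Sequence[str],
) -> Tuple[List[str], List[Price], List[Tuple[str, List[str]]]]:
    # Step 1: keep the raw input untouched so any run can be replayed.
    raw_archive = list(raw_lines)
    clean: List[Price] = []
    quarantined: List[Tuple[str, List[str]]] = []
    for line in raw_archive:
        # Step 2: transform to the schema; malformed rows are quarantined.
        try:
            price = to_price(json.loads(line))
        except (ValueError, KeyError, TypeError) as exc:
            quarantined.append((line, [f"unparseable: {exc}"]))
            continue
        # Step 3: validate and route anomalies away from the clean output.
        problems = validate(price)
        if problems:
            quarantined.append((line, problems))
        else:
            clean.append(price)
    return raw_archive, clean, quarantined
```

Quarantining bad rows with their reasons, rather than failing the whole batch, lets the good rows flow downstream while someone investigates the anomalies against the archived raw input.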
Teads: Managing a BigQuery data warehouse at scale
Teads published helpful tips and tools to manage BigQuery to resolve slow-running queries and improve slot usage and table size. The BqVisualiser looks like an exciting tool to visualize and optimize the query performance.
https://medium.com/teads-engineering/managing-a-bigquery-data-warehouse-at-scale-e6ec9a8406b2
Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.