Data Engineering Weekly #43

Weekly Data Engineering Newsletter

Welcome to the 43rd edition of the data engineering newsletter. This week's edition covers Pedram Navid's take on building the modern data team, Nvidia's introduction to explainable AI, LinkedIn's update on responsible AI, Airbnb's metrics computation at scale, Wrike TechClub's data quality roadmap, Shopify's in-context analytics, Spotify's visual analytics, Groupon's Airflow adoption, Mapbox's migration from Airflow to Dagster, and Databricks' top 10 announcements from the Data + AI Summit.


Pedram Navid: Building The Modern Data Team

Modern data tools like dbt are maturing and can process and manage complex pipelines; however, building a modern data team remains challenging. In addition, prioritizing what the team should work on is key to minimizing team dysfunction. In the blog, the author shares views on data as a product and on the good and bad of agile and scrum adoption for data teams.

https://pedram.substack.com/p/modern-data-team


Nvidia: What Is Explainable AI?

AI has been adopted across industries as part of core decision-making frameworks, from radiology and credit checks to public policymaking. Hence, Explainable AI (XAI) is a vital aspect of AI development. What is XAI? How does it work? Nvidia writes an exciting blog introducing XAI.

https://blogs.nvidia.com/blog/2021/05/24/what-is-explainable-ai/


LinkedIn: An update on Responsible AI at LinkedIn

Along similar lines, LinkedIn shares an update on responsible AI and how it has embedded the principles in its design and engineering process. LinkedIn's approach follows Microsoft's responsible AI principles, and the blog discusses AI fairness, privacy, and the future roadmap.

https://engineering.linkedin.com/blog/2021/responsible-ai-update


Airbnb: How Airbnb Standardized Metric Computation at Scale - Part 2 - The six design principles of Minerva compute infrastructure

Airbnb writes the second part of its series on Minerva, the platform it uses to standardize metrics computation at scale. It's an exciting system design read covering a declarative SDK to manage datasets, data versioning to maintain metric consistency, a self-healing pipeline with batched backfilling, and data quality integrations.

https://medium.com/airbnb-engineering/airbnb-metric-computation-with-minerva-part-2-9afe6695b486


Wrike TechClub: Data Quality Roadmap

Data quality is a vital aspect of data engineering, and many companies have talked about their internal implementations and data quality approaches. However, how should one start the data quality journey? What does the roadmap look like, and what are the consequences of lacking certain engineering practices? The blog is an excellent narration of the data quality roadmap, with reference articles to support data quality efforts.

Part-1: https://medium.com/wriketechclub/data-quality-roadmap-part-i-61332d5be7a

Part-2: https://medium.com/wriketechclub/data-quality-roadmap-part-ii-case-studies-614e85906178


Shopify: How Shopify Built An In-Context Analytics Experience

Though dashboard visualization is a great way to get data into customers' hands, integrating analytics into the workflow makes data products far more powerful. Shopify writes an exciting blog on how it approaches the in-context analytics experience through metrics-driven product design.

https://shopifyengineering.myshopify.com/blogs/engineering/shopify-in-context-analytics


Spotify: Visual Analytics at Spotify

Visualization is a quick and meaningful way to interpret data, and visualization tools are often quick to start with but hard to master. Spotify writes an exciting blog on how hiring an expert visualization engineer to build core dashboards, along with templates and guides to standardize the dashboards, improves the quality of data analytics.

https://medium.com/spotify-insights/visual-analytics-at-spotify-3d4221d8686


Groupon: Managing Billions of Data Points - Evolution of Workflow Management at Groupon

Groupon writes about its usage of Apache Airflow and the decision to move away from the cron scheduler. The blog contains a comprehensive functional comparison chart of Apache Airflow, Oozie, Azkaban, and cron schedulers.

https://medium.com/groupon-eng/managing-billions-of-data-points-evolution-of-workflow-management-at-groupon-dab000a3440d


Mapbox / Dagster: Incrementally Adopting Dagster at Mapbox

Mapbox shares its migration journey from Airflow to Dagster, with the claim that Dagster reduced the core process time from days or weeks to 1-2 hours. The blog describes how Dagster's Airflow compatibility enables incremental migration, along with Dagster's tooling support for testing and local development.

https://medium.com/dagster-io/incrementally-adopting-dagster-at-mapbox-b635b1118594


Databricks: Top 10 Announcements From Data + AI Summit

Databricks writes a quick recap of the top 10 announcements from the Data + AI Summit. Delta Sharing (an open protocol to share data securely), the data catalog, and the Koalas merge into Apache Spark are some of the exciting developments to watch in the near future.

https://databricks.com/blog/2021/06/04/dont-miss-these-top-10-announcements-from-data-ai-summit.html


Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.