Data Engineering Weekly #29

Weekly Data Engineering Newsletter

Feb 14, 2021

Welcome to the 29th edition of the data engineering newsletter. This week's release is a new set of articles that focus on Google’s research paper on Data Cascades in High-Stakes AI, Fiddler Labs' debugging of ML model performance, Monte Carlo’s Data Observability Using SQL, Airbnb’s Superset adoption, Apache Kylin’s Evolution of Precomputation, Spotify’s Sorted Merge Bucket implementation, Doordash’s effective data science communication, Funding Societies' data governance journey, QueryClick’s self-serve analytics journey, and Databricks' Delta Lake 0.8.

Google: "Everyone wants to do the model work, not the data work" - Data Cascades in High-Stakes AI

Data quality has an enormous effect on the results and efficiency of AI. It carries elevated significance in high-stakes AI because of its heightened downstream impact on predictions such as cancer detection, wildlife poaching, and loan allocation. For instance, poor data practices reduced the accuracy of IBM’s cancer treatment AI and led to Google Flu Trends missing the flu peak by 140%.

What We Can Learn From the Epic Failure of Google Flu Trends

Google Research published a report on data practices in high-stakes AI, based on interviews with 53 AI practitioners in India, East and West African countries, and the USA. The paper captures the data cascade effect, in which data issues cause adverse downstream effects that result in negative social impact.

It is a disturbing read: 92% of AI practitioners reported experiencing one or more cascades, and 45.3% reported two or more, in a given project. I highly encourage data engineers to read the report. I believe there is a potential social enterprise opportunity.

https://research.google/pubs/pub49953/

Fiddler Labs: Debug Machine Learning model performance issue

The Twitter thread is an exciting read in which the author shares the experience of debugging machine learning model performance while working on Facebook's News Feed ranking platform. The thread emphasizes that most machine learning model performance issues stem from data pipeline issues, and it highlights the importance of explainable AI.

Krishna Gade @krishnagade

I was an eng leader on Facebook’s NewsFeed and my team was responsible for the feed ranking platform. Every few days an engineer would get paged that a metric e.g., “likes” or “comments” is down. It usually translated to a Machine Learning model performance issue. /thread

Monte Carlo: Data Observability in Practice Using SQL

The previous two articles talked about the importance of data quality and the impact of inadequate data pipeline observability. How can we establish the simplest form of data pipeline monitoring? Databases have traditionally relied on constraints declared as part of the DDL to ensure integrity. The modern data pipeline requires many more options than simple constraints. Monte Carlo writes an exciting two-part blog narrating how one can use SQL to measure critical aspects of data pipeline reliability.

https://www.montecarlodata.com/data-observability-in-practice-using-sql-1/

https://towardsdatascience.com/data-observability-in-practice-using-sql-part-ii-schema-lineage-5ca6c8f4f56a
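To make the idea of SQL-based pipeline monitoring concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. It is not taken from the Monte Carlo posts: the `events` table, its columns, the sample rows, and the 0.5 threshold are all hypothetical. It computes a per-day null rate for one column and flags days where that rate looks suspicious.

```python
import sqlite3


def daily_null_rate(conn, table, column, date_column):
    """Return [(day, row_count, null_rate)] for `column`, grouped by day.

    Identifiers are interpolated directly into the SQL, so pass only
    trusted, hard-coded table and column names (assumption of this sketch).
    """
    query = f"""
        SELECT DATE({date_column}) AS day,
               COUNT(*) AS row_count,
               CAST(SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) AS FLOAT)
                   / COUNT(*) AS null_rate
        FROM {table}
        GROUP BY day
        ORDER BY day
    """
    return conn.execute(query).fetchall()


def flag_days(rows, max_null_rate=0.5):
    """Return the days whose null rate exceeds the (hypothetical) threshold."""
    return [day for day, _, rate in rows if rate > max_null_rate]


def build_demo_connection():
    """Create an in-memory database with a small, made-up `events` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (loaded_at TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [
            ("2021-02-10", 10.0), ("2021-02-10", 12.5),
            ("2021-02-11", None), ("2021-02-11", None), ("2021-02-11", 7.0),
            ("2021-02-12", 3.0),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_connection()
    rows = daily_null_rate(conn, "events", "amount", "loaded_at")
    print(flag_days(rows))
```

The same query shape can be pointed at a warehouse table on a schedule; a fixed threshold is the simplest choice, and comparing against a rolling average is a natural next step.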

Airbnb: Supercharging Apache Superset

Airbnb writes about its Apache Superset adoption growth and performance improvement strategy. It's impressive to see that Airbnb's data ecosystem now comprises more than 100,000 tables and virtual datasets backing over 200,000 charts and 14,000 dashboards. The dashboard performance optimization strategies, including predictive cache warm-up, domain sharding for high concurrency, and query rate-limiting, make for an exciting read.

https://medium.com/airbnb-engineering/supercharging-apache-superset-b1a2393278bd

Apache Kylin: The Evolution of Precomputation Technology and its Role in Data Analytics

Precomputation is a common technique in information retrieval and analysis; examples include indexes, materialized views, and OLAP cubes. The blog narrates the evolution of precomputation, its future, and the role AI and automation technology play in shaping it. Airbnb applied a similar strategy in the previous article on supercharging Apache Superset.

https://www.infoq.com/articles/evolution-precomputation-technology-data-analytics/
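As a rough illustration of the OLAP-cube flavor of precomputation, the sketch below aggregates a measure once for every combination of dimensions, so later queries become dictionary lookups instead of scans. It is a toy under stated assumptions, not how Apache Kylin is implemented; the function names, the dimensions, and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Precompute SUM(measure) for every subset of `dimensions`.

    Returns {dims_tuple: {key_tuple: total}}. The number of subsets grows
    as 2 ** len(dimensions), which is the classic cost of full cubing.
    """
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube


def query(cube, dimensions, **filters):
    """Answer an equality-filtered SUM from the precomputed cube."""
    # Keep the filter dimensions in the same order used when building.
    dims = tuple(d for d in dimensions if d in filters)
    key = tuple(filters[d] for d in dims)
    return cube[dims].get(key, 0.0)


if __name__ == "__main__":
    dims = ("country", "device")
    rows = [
        {"country": "US", "device": "web", "sales": 10.0},
        {"country": "US", "device": "ios", "sales": 5.0},
        {"country": "IN", "device": "web", "sales": 7.0},
    ]
    cube = build_cube(rows, dims, "sales")
    print(query(cube, dims, country="US"))
```

The trade-off is the usual one: more storage and build time up front in exchange for cheap, predictable query latency.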

Spotify: How Spotify Optimized the Largest Dataflow Job Ever for Wrapped 2020

Data skew and shuffle are the twin curses of data processing. Spotify writes an exciting post on its high-impact use of the Sorted Merge Bucket (SMB) join to optimize its data pipeline. The use of SortedBucketSink, SortedBucketSource, and the file-handle iterator reminds me of Slack’s batch search infrastructure implementation, and it is great to see a framework abstraction for SMB.

https://engineering.atspotify.com/2021/02/11/how-spotify-optimized-the-largest-dataflow-job-ever-for-wrapped-2020/
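The core idea behind a sorted merge bucket join can be sketched in a few lines of plain Python. This is a conceptual illustration only and does not use the Scio or Beam API; every name below is hypothetical. It assumes both datasets were written with the same bucket count and the same hash function, and that each bucket is sorted by key, so matching buckets can be joined with a linear merge and no shuffle.

```python
import zlib
from operator import itemgetter


def bucket_id(key, num_buckets):
    """Deterministic bucket assignment (CRC32 is an arbitrary choice here)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def write_sorted_buckets(records, num_buckets):
    """Partition (key, value) pairs into buckets, each sorted by key."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in records:
        buckets[bucket_id(key, num_buckets)].append((key, value))
    for bucket in buckets:
        bucket.sort(key=itemgetter(0))  # stable sort keeps input order per key
    return buckets


def merge_join_bucket(left, right):
    """Inner-join two key-sorted lists with a single forward pass."""
    i = j = 0
    joined = []
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of equal keys on each side, then cross them.
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            for _, left_value in left[i:i_end]:
                for _, right_value in right[j:j_end]:
                    joined.append((left_key, left_value, right_value))
            i, j = i_end, j_end
    return joined


def smb_join(left_buckets, right_buckets):
    """Join bucket N of the left side only with bucket N of the right side."""
    if len(left_buckets) != len(right_buckets):
        raise ValueError("both sides must use the same number of buckets")
    result = []
    for left, right in zip(left_buckets, right_buckets):
        result.extend(merge_join_bucket(left, right))
    return result
```

Because the bucketing and sorting cost is paid once at write time, every downstream job that joins on the same key can reuse it, which is where the savings on repeated large joins come from.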

Doordash: How to Drive Effective Data Science Communication with Cross-Functional Teams

A data analytics team's vital responsibility is to communicate actionable insights to key stakeholders, not just to identify and measure them. Clear communication with key stakeholders ensures clear strategic direction and actionable business insight. Doordash's analytics team writes an exciting post emphasizing the need for an established communication framework and detailing some of the best practices it follows.

https://doordash.engineering/2021/02/11/how-to-drive-effective-data-science-communication/

Funding Societies: Data governance journey at SEA’s largest digital P2P lending platform

Comprehensive data governance and data management are essential for a financial system, not only for business growth but also to meet strict regulatory requirements. Funding Societies writes an in-depth account of its data governance journey, covering executive buy-in, defining a data governance policy, a data and access management policy, and data-domain-driven design.

https://medium.com/fsmk-engineering/data-governance-journey-at-seas-largest-digital-p2p-lending-platform-part-1-7a7e8f07b7f

https://medium.com/fsmk-engineering/data-governance-journey-at-seas-largest-digital-p2p-lending-platform-part-2-ebaa098b6acf

QueryClick: Our (Bumpy) Road To Self Service Analytics

Self-serve analytical infrastructure is a north-star design goal for any data infrastructure. It requires both cultural and technological changes, which the architecture should account for. Along similar lines, QueryClick shares its self-serve analytics journey.

https://medium.com/queryclick-tech-blog/queryclicks-bumpy-road-to-self-service-analytics-664a154de6a2

Databricks: Automatically Evolve Your Nested Column Schema, Stream From a Delta Table Version, and Check Your Constraints

Databricks writes about some of the key features released as part of Delta Lake 0.8. It's exciting to read about new features such as schema evolution for nested columns with auto-merge, support for constraints, and the ability to start a Delta stream from a specific table version.

https://databricks.com/blog/2021/02/10/automatically-evolve-your-nested-column-schema-stream-from-a-delta-table-version-and-check-your-constraints.html

Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.