Data Engineering Weekly #92

The Weekly Data Engineering Newsletter

Jul 11, 2022

Google Research: "Because AI is 100% right and safe": User Attitudes and Sources of AI Authority in India

Authors: Shivani Kapania, Oliver Siy, Gabe Clapper, Azhagu SP, Nithya Sambasivan

An exciting weekend read about AI authority and how society perceives it. For example, 79.2% of people in India prefer AI over a human to validate their loan applications. People assign AI a legitimized power to influence their actions without adequate evidence of the system's capability.

AI impacts the socio-economic factors of a large population, yet the data community rarely talks about explainable AI and its user experience. A few internet companies talk about explainable AI, but it feels like marketing. I have never seen their products reflect it or give me an explainable AI experience.

The question I want to leave with the data engineering community: we are busy talking about zombie dashboards, yet we never talk about the social impact of the systems we build. Are we on the wrong side of history?

https://research.google/pubs/pub51146/

If the paper is a bit long, the tech talk is an excellent summary of the paper.

Machine Learning Operations (MLOps): Overview, Definition, and Architecture

Authors: Dominik Kreuzberger, Niklas Kühl, Sebastian Hirschl

Why do ML projects never take off from the prototype stage? Why is it operationally challenging to run an ML infrastructure? The paper is an excellent summary of the current state of MLOps.

https://arxiv.org/pdf/2205.02302.pdf

Neelesh Salian: Reflecting on DataAISummit 2022

Conferences, reading, and interactions with data practitioners are great ways to question our perspectives. The author writes an excellent, perspective-changing summary of DataAISummit, tracing the shift from the Hadoop world to cloud data warehouses.

https://hysterical.substack.com/p/reflecting-on-dataaisummit2022

Databricks: Introducing Spark Connect – The Power of Apache Spark, Everywhere

One of the curious announcements at DataAISummit was Spark Connect. The idea of Spark everywhere is interesting. Databricks published more details on Spark Connect this week.

https://databricks.com/blog/2022/07/07/introducing-spark-connect-the-power-of-apache-spark-everywhere.html

You can find the proposal document for Spark Connect here:

https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#

James Le: What I Learned From Tecton’s apply() 2022 Conference

A conference write-up is an excellent way to capture the essence of a conference from the data practitioner's perspective. The author writes about Tecton's apply() 2022 Conference, focusing on the industry trends, production use cases, and open-source libraries that came out of it.

https://data-notes.co/what-i-learned-from-tectons-apply-2022-conference-86f3bf87f80e

Barr Moses: Implementing Data Contracts: 7 Key Learnings

Data Contracts are an emerging need to manage large-scale data infrastructure. The author shared vital learnings from implementing Data Contracts with GoCardless.

Over the last ten years, we have significantly commoditized data asset creation by simplifying data processing tools. However, we have yet to democratize the exchange of the business context that the data represents among various stakeholders. Schemata.app is an attempt to solve the data contract problem, and we recently added dbt support. More about Schemata is coming soon in a series of blog posts.

https://barrmoses.medium.com/implementing-data-contracts-7-key-learnings-d214a5947d5e

Uber: Uber’s Highly Scalable and Distributed Shuffle as a Service

Data shuffle in the MapReduce paradigm brings reliability, efficiency, and scalability issues. LinkedIn, in the past, wrote about its shuffle service Magnet, which was merged into the Spark 3.2.0 release. [SPARK-30602 & SPARK-33235]

Uber writes about its Spark shuffle service, Remote Shuffle Service (RSS). The blog narrates how RSS works and how effectively it solves reliability and scalability issues.

https://eng.uber.com/ubers-highly-scalable-and-distributed-shuffle-as-a-service/

MyHeritage: Guide to - StreamingQueryListener in PySpark Streaming

An excellent guide on using Spark's streaming query listener to find the discrepancy between the Spark checkpoint and Kafka commit offset.

https://medium.com/myheritage-engineering/guide-to-streamingquerylistener-in-pyspark-streaming-f3bbfe56a774

Criteo: Using AI and NLP to Tackle Advertisement on Websites Spreading Disinformation on the Topic of Conflict in Ukraine

Criteo writes about its disinformation model to tackle advertisements on websites spreading disinformation about the ongoing conflict in Ukraine. TIL about the Global Disinformation Index [https://www.disinformationindex.org]

https://medium.com/criteo-engineering/using-ai-and-nlp-to-tackle-advertisement-spreading-disinformation-on-the-conflict-in-ukraine-59b6745a0c43

Square: Success Metrics for Product Analytics - Metrics are not a replacement for strategy

Possibly one of the most thought-provoking articles I read this week in data analytics. The author questions the role and responsibility of a product data scientist, highlighting the grey areas of A/B testing and neutral results.

https://developer.squareup.com/blog/success-metrics-for-product-analytics/

Han Yu: 5 Best Open Source Data Lineage Tools in 2022

A great compilation of open-source data lineage tools. I wasn't aware of half of the tools in the article :-) Time to read more about them.

https://blog.devgenius.io/5-best-open-source-data-lineage-tools-in-2022-f8ef39a7d5f6

All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
