Sponsored: Rudderstack - DROP the Modern Data Stack
It’s time to make sense of today’s data tooling ecosystem. Check out rudderstack.com/dmds to get a guide that will help you build a practical data stack for every phase of your company’s journey to data maturity. The guide includes architecture and tactical advice to help you progress through four stages: Starter, Growth, Machine Learning, and Real-Time. Visit RudderStack.com/dmds today to DROP the modern data stack and USE a practical data engineering framework.
Google Research: "Because AI is 100% right and safe": User Attitudes and Sources of AI Authority in India
Authors: Shivani Kapania, Oliver Siy, Gabe Clapper, Azhagu SP, Nithya Sambasivan
An exciting weekend read about AI authority and how society perceives it. For example, 79.2% of people in India prefer AI to validate their loan applications over human validation. People assign AI a legitimized power to influence their actions without adequate evidence of the system's capability.
AI impacts the socioeconomic conditions of a large population, yet the data community rarely talks about explainable AI and its user experience. A few internet companies talk about explainable AI, but it feels like marketing. I have never seen their products reflect it or give me an explainable AI experience.
The question I want to leave with the data engineering community: we are busy talking about zombie dashboards, yet we never talk about their social impact. Are we on the wrong side of history?
https://research.google/pubs/pub51146/
If the paper is a bit long, the tech talk is an excellent summary of the paper.
Machine Learning Operations (MLOps): Overview, Definition, and Architecture
Authors: Dominik Kreuzberger, Niklas Kühl, Sebastian Hirschl
Why do ML projects never take off beyond the prototype? Why is it operationally challenging to run an ML infrastructure? The paper is an excellent summary of the current state of MLOps.
https://arxiv.org/pdf/2205.02302.pdf
Neelesh Salian: Reflecting on DataAISummit 2022
Conferences, reading, and interactions with data practitioners are great ways to question our perspectives. The author writes an excellent, perspective-changing summary of the shift from the Hadoop world to cloud data warehouses, as seen at DataAISummit.
https://hysterical.substack.com/p/reflecting-on-dataaisummit2022
Databricks: Introducing Spark Connect – The Power of Apache Spark, Everywhere
One of the curious announcements at the DataAISummit was Spark Connect. The idea of Spark everywhere is interesting. Databricks published more details on Spark Connect this week.
You can find the proposal document for Spark Connect here:
https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#
Sponsored: Firebolt - Firebolt is a proud sponsor of Data Engineering Weekly.
Firebolt is the cloud data warehouse for builders of next-gen analytics experiences. Combining the benefits and ease of use of modern architecture with sub-second performance at terabyte scale, Firebolt helps data engineering and dev teams deliver data applications that end-users love.
https://www.firebolt.io/
James Le: What I Learned From Tecton’s apply() 2022 Conference
A conference recap is an excellent way to capture the essence of a conference from a data practitioner's perspective. The author writes about Tecton's apply() 2022 conference, focusing on the industry trends, production use cases, and open-source libraries that came out of it.
https://data-notes.co/what-i-learned-from-tectons-apply-2022-conference-86f3bf87f80e
Barr Moses: Implementing Data Contracts: 7 Key Learnings
Data Contracts are an emerging need for managing large-scale data infrastructure. The author shares vital learnings from implementing Data Contracts with GoCardless.
Over the last ten years, we significantly commoditized data asset creation by simplifying the data processing tools. However, we have yet to democratize the exchange of the business context that the data represents among various stakeholders. Schemata.app is an attempt to solve the data contract problem, and we recently added dbt support. More about Schemata is coming soon in a series of blog posts.
https://barrmoses.medium.com/implementing-data-contracts-7-key-learnings-d214a5947d5e
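To make the idea of a data contract concrete, here is a minimal Python sketch of a producer-side check. It is not the format used by GoCardless or Schemata; the CONTRACT layout, the field names, and the contract_violations helper are all hypothetical and exist only for illustration.

```python
from typing import Any, Dict, List

# Hypothetical contract: field name -> expected Python type and whether it is required.
CONTRACT: Dict[str, Dict[str, Any]] = {
    "payment_id": {"type": str, "required": True},
    "amount_cents": {"type": int, "required": True},
    "currency": {"type": str, "required": True},
    "description": {"type": str, "required": False},
}


def contract_violations(record: Dict[str, Any],
                        contract: Dict[str, Dict[str, Any]] = CONTRACT) -> List[str]:
    """Return human-readable violations; an empty list means the record conforms."""
    problems: List[str] = []
    for name, rule in contract.items():
        value = record.get(name)
        if value is None:
            # Absent (or null) fields only matter when the contract requires them.
            if rule["required"]:
                problems.append(f"missing required field: {name}")
            continue
        if not isinstance(value, rule["type"]):
            problems.append(
                f"wrong type for {name}: expected {rule['type'].__name__}, "
                f"got {type(value).__name__}"
            )
    # Fields the contract does not know about are flagged rather than silently passed on.
    for name in record:
        if name not in contract:
            problems.append(f"unexpected field: {name}")
    return problems
```

A producer could run a check like this before publishing an event and reject or quarantine records that return a non-empty list. A real contract would also carry ownership, semantics, and versioning, which this sketch leaves out.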
Sponsored: Rudderstack - What is the Machine Learning Stack?
A detailed guide to building the Machine Learning Stack—an architecture to help you take your first steps into the world of ML and move from historical analytics to predictive analysis. The ML stack is phase three of RudderStack's Data Maturity Journey framework.
https://www.rudderstack.com/blog/what-is-the-ml-stack
Uber: Uber’s Highly Scalable and Distributed Shuffle as a Service
Data shuffle in the MapReduce paradigm brings reliability, efficiency, and scalability issues. LinkedIn, in the past, wrote about its shuffle service Magnet, which was merged into the Spark 3.2.0 release. [SPARK-30602 & SPARK-33235]
Uber writes about its Spark shuffle service, Remote Shuffle Service (RSS). The blog narrates how RSS works and how it addresses the reliability and scalability issues.
https://eng.uber.com/ubers-highly-scalable-and-distributed-shuffle-as-a-service/
MyHeritage: Guide to - StreamingQueryListener in PySpark Streaming
An excellent guide on using Spark's streaming query listener to find discrepancies between the offsets in the Spark checkpoint and the offsets committed to Kafka.
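The comparison at the heart of that guide can be sketched in plain Python, without a Spark session. This is not the article's code. It assumes the listener's progress event exposes the Kafka source's end offsets as a JSON string shaped like {"topic": {"partition": offset}}, and that the committed offsets were fetched separately from Kafka. The offset_discrepancy helper and the sample values are hypothetical.

```python
import json
from typing import Dict, Tuple

TopicPartition = Tuple[str, int]


def offset_discrepancy(end_offset_json: str,
                       committed: Dict[TopicPartition, int]) -> Dict[TopicPartition, int]:
    """Return (topic, partition) -> Spark end offset minus Kafka committed offset.

    Only partitions where the two offsets differ are included.
    """
    end_offsets = json.loads(end_offset_json)
    gaps: Dict[TopicPartition, int] = {}
    for topic, partitions in end_offsets.items():
        for partition, spark_offset in partitions.items():
            key = (topic, int(partition))  # JSON object keys are always strings
            # Assumption: a partition with no committed offset is treated as offset 0.
            kafka_offset = committed.get(key, 0)
            diff = spark_offset - kafka_offset
            if diff != 0:
                gaps[key] = diff
    return gaps


# Hypothetical values: partition 1 is five records ahead in the Spark checkpoint.
example = offset_discrepancy(
    '{"events": {"0": 120, "1": 80}}',
    {("events", 0): 120, ("events", 1): 75},
)
```

In a real job, a function like this would be called from the listener's progress callback, and a non-empty result would be logged or sent to monitoring.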
Criteo: Using AI and NLP to Tackle Advertisement on Websites Spreading Disinformation on the Topic of Conflict in Ukraine
Criteo writes about its disinformation model to tackle advertisements on websites spreading disinformation about the ongoing conflict in Ukraine. TIL about the Global Disinformation Index [https://www.disinformationindex.org]
Square: Success Metrics for Product Analytics - Metrics are not a replacement for strategy
Possibly one of the most thought-provoking articles I read this week in data analytics. The author questions the role and responsibility of a product data scientist, highlighting the grey areas of A/B testing and neutral results.
https://developer.squareup.com/blog/success-metrics-for-product-analytics/
Han Yu: 5 Best Open Source Data Lineage Tools in 2022
A great compilation of open-source data lineage tools. I wasn't aware of half of the tools in the article :-) Time to read more about these open-source lineage tools.
https://blog.devgenius.io/5-best-open-source-data-lineage-tools-in-2022-f8ef39a7d5f6
All rights reserved Pixel Impex Inc, India. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.