Data Engineering Weekly #170

The Weekly Data Engineering Newsletter

May 06, 2024

Ken Liu: Machine Unlearning in 2024

One of this week's most insightful articles covers the growing adoption of large language models and the challenge it poses for machine unlearning. Machine unlearning matters for privacy, and also for model correction, fixing outdated knowledge, and revoking access to training data.

The most thought-provoking moment for me while reading the article was this quote.

In an ideal world, data should be thought of as “borrowed” (possibly unpermitted) and thus can be “returned,” and unlearning should enable such revocation.

https://ai.stanford.edu/~kzliu/blog/unlearning

Uber: From Predictive to Generative – How Michelangelo Accelerates Uber’s AI Journey

Continually adopting new technology within an existing system is a sign of efficient engineering. Uber wrote an in-depth article about the evolution of its centralized ML platform, Michelangelo.

https://www.uber.com/blog/from-predictive-to-generative-ai/

LinkedIn: LakeChime - A Data Trigger Service for Modern Data Lakes

LinkedIn points out two critical flaws in a partitioned approach to data management.

The granularity of partition creation constrains data consumption. For instance, if partitions are created daily, consumers can only schedule daily jobs to consume new partitions.
Partitions are created once, but their data can be updated continually.

LinkedIn writes about LakeChime, a data trigger service that signals downstream jobs when data changes and supports both partitioned and snapshot-based table layouts.
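To make the idea of a data trigger concrete, here is a minimal in-memory sketch. It is not LakeChime's API: the names `ChangeEvent`, `TriggerService`, `publish`, and `poll`, along with the event fields, are hypothetical and exist only for illustration. The point it shows is that a consumer tracking a checkpoint can pick up snapshot-level changes, not just newly created partitions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ChangeEvent:
    """One data-change signal for a table (hypothetical schema)."""
    table: str
    kind: str      # "partition" or "snapshot"
    ref: str       # partition spec or snapshot id
    sequence: int  # increases by one per event, per table


@dataclass
class TriggerService:
    """In-memory stand-in for a data trigger service (illustration only)."""
    _events: Dict[str, List[ChangeEvent]] = field(default_factory=dict)

    def publish(self, table: str, kind: str, ref: str) -> ChangeEvent:
        # Append the event to the table's log and assign the next sequence number.
        log = self._events.setdefault(table, [])
        event = ChangeEvent(table, kind, ref, sequence=len(log) + 1)
        log.append(event)
        return event

    def poll(self, table: str, after_sequence: int = 0) -> List[ChangeEvent]:
        # Return only the events newer than the consumer's checkpoint.
        return [e for e in self._events.get(table, []) if e.sequence > after_sequence]


service = TriggerService()
service.publish("events", "partition", "date=2024-05-06")  # partition created once
service.publish("events", "snapshot", "snap-001")          # later update to its data
service.publish("events", "snapshot", "snap-002")          # another update

# A consumer that already processed the first event sees only the later changes.
checkpoint = 1
pending = service.poll("events", after_sequence=checkpoint)
```

A consumer built this way can run as often as it likes, since it reacts to change events rather than waiting for the next daily partition.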

Figure 1: LakeChime data trigger ecosystem architecture diagram

https://www.linkedin.com/blog/engineering/data-management/lakechime-a-data-trigger-service-for-modern-data-lakes

Booking.com: Lessons in adopting Airflow on Google Cloud

Booking.com writes about the lessons learned from adopting Airflow on Google Cloud. The lessons focus on:

Setting up the local development environment
Performance tuning with Celery workers
Embedding documentation in Airflow DAGs

https://medium.com/booking-com-development/lessons-in-adopting-airflow-51821709cba4

PayPal: Scaling PayPal’s AI Capabilities with PayPal Cosmos.AI Platform

PayPal writes about its internal AI platform, Cosmos.AI, which provides MLOps capabilities that streamline model training, deployment, and monitoring, significantly reducing complexity and cost. The platform also emphasizes extensibility and future-proofing against rapid technology changes, focusing on responsible AI usage, multi-tenancy, self-service capabilities, and seamless integration with existing systems.

https://medium.com/paypal-tech/scaling-paypals-ai-capabilities-with-paypal-cosmos-ai-platform-e67a48e04691

Iswarya Murali: Trustworthiness of Generative AI: Reducing hallucinations and increasing transparency

As promising as Gen AI's potential is, mistrust and skepticism could hinder its adoption. Hallucinations and a lack of explainability are the primary reasons for that mistrust. The author highlights key strategies to reduce hallucinations and increase transparency.

https://medium.com/data-science-at-microsoft/trustworthiness-of-generative-ai-reducing-hallucinations-and-increasing-transparency-a53dfe190ee1

Sanjeev Mohan: Untangling the Streaming Landscape: The Rise of Unified Real-time Platforms

Stream processing comes in different forms: event stream processing, streaming databases, and stream-enabled analytical databases. The larger question is whether one should consider a unified real-time platform. The author expands on the possibility of unified data platforms.

https://sanjmo.medium.com/untangling-the-streaming-landscape-the-rise-of-unified-real-time-platforms-528f49318632

Ergest Xheblati: Transforming a Data Culture

Every data team has this burning question: How do we pivot away from answering questions and building dashboards and toward being a strategic partner who has a real impact on the business? The author narrates a path toward achieving a predictable organizational outcome.

It’s no surprise that everything starts with getting buy-in from the executive team. There are essentially two types of companies: those that believe in data and those that don’t. It’s an uphill battle for the data team if you end up in an organization where the executives don’t believe in data for the decision-making process.

https://sqlpatterns.com/p/transforming-a-data-culture

Miles McBain: Patterns and anti-patterns of data analysis reuse

We practice and discuss reusability in software engineering, but I never thought deeply about reuse in data analytics. For me, reuse has always meant adding dimensions to a dashboard, but that does not work for ad hoc analytics. The author discusses the common patterns of data analysis reuse, the anti-patterns of each, and best practices to mitigate technical debt.

https://www.milesmcbain.com/posts/data-analysis-reuse/

Daniel Beach: Delta Lake - Map and Array data types

Having a well-structured data model is always great, but we often handle semi-structured data. Event sourcing mostly deals with JSON structures, which adds more complexity. The LakeHouse formats' built-in support for Map and Array types gives the flexibility to handle semi-structured data.

However, Map and Array types come at a cost. There is no proper indexing support for these complex data types, which leads to complicated queries and unhappy customers. The LakeHouse formats should move beyond primitive types and offer inherent indexing support for complex data types for faster query performance.
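The cost can be illustrated with a small plain-Python simulation. This is not Delta Lake's API. It assumes a simplified model in which each data file carries min/max statistics for a primitive column (`user_id`) but nothing for a map column (`attrs`). All names here are hypothetical, and the statistics a real engine collects may differ.

```python
def make_file(rows):
    """Bundle rows with min/max stats for the primitive column only."""
    ids = [r["user_id"] for r in rows]
    return {"rows": rows, "min_user_id": min(ids), "max_user_id": max(ids)}


files = [
    make_file([{"user_id": 1, "attrs": {"plan": "free"}},
               {"user_id": 2, "attrs": {"plan": "pro"}}]),
    make_file([{"user_id": 10, "attrs": {"plan": "pro"}},
               {"user_id": 11, "attrs": {}}]),
    make_file([{"user_id": 20, "attrs": {"plan": "free"}},
               {"user_id": 21, "attrs": {"plan": "free"}}]),
]


def find_by_user_id(files, user_id):
    """Filter on a primitive column: files outside the min/max range are skipped."""
    hits, scanned = [], 0
    for f in files:
        if not (f["min_user_id"] <= user_id <= f["max_user_id"]):
            continue  # stats prove the value cannot be in this file
        scanned += 1
        hits.extend(r for r in f["rows"] if r["user_id"] == user_id)
    return hits, scanned


def find_by_map_value(files, key, value):
    """Filter on a map entry: with no stats to consult, every file is read."""
    hits, scanned = [], 0
    for f in files:
        scanned += 1
        hits.extend(r for r in f["rows"] if r["attrs"].get(key) == value)
    return hits, scanned


id_hits, id_files_scanned = find_by_user_id(files, 10)
map_hits, map_files_scanned = find_by_map_value(files, "plan", "pro")
```

In this toy model the lookup on `user_id` reads one file out of three, while the lookup on the map key reads all three. Closing that gap is what indexing support for complex types would do.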

https://dataengineeringcentral.substack.com/p/delta-lake-map-and-array-data-types

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.
