Data Engineering Weekly #191

The Weekly Data Engineering Newsletter

Sep 30, 2024

Airbnb: Sandcastle - data/AI apps for everyone

Product ideas powered by data and AI must go through rapid iteration on shareable, lightweight live prototypes instead of static proposals. However, hosting an internal application for fast prototyping is always a challenging platform to build and maintain. Airbnb writes about Sandcastle, an Airbnb-internal prototyping platform that enables data scientists, engineers, and product managers to bring data/AI ideas to life.

https://medium.com/airbnb-engineering/sandcastle-data-ai-apps-for-everyone-439f3b78b223

Grab: Enabling conversational data discovery with LLMs at Grab.

Grab confirms my observation that users abandoned their searches in 18% of sessions without clicking on any dataset. This suggested that the search results were not meeting their needs, which led to redesigning the system to focus on search accessible where data consumers hang out the most: Slack.

The LLM and vector search will significantly lead to headless data catalogs or merge into technical catalogs such as Unity Catalogs. I doubt that data catalogs, as standalone tools, will have their user experience as part of it.

https://engineering.grab.com/hubble-data-discovery

ngrok: How we built ngrok's data platform

Ngrok shares its experience building a real-time data platform with an event-driven architecture, ensuring scalability and efficient data management. The focus on handling high throughput while maintaining real-time analytics offers valuable insights into modern data platform design, complementing ideas we've explored in past discussions on scaling data platforms.

https://ngrok.com/blog-post/how-we-built-ngroks-data-platform

Uber: Preon - Presto Query Analysis for Intelligent and Efficient Analytics

Query analytics is essential in data platform engineering to understand the popular tables, index utilization, types of queries, and predicate windows of the queries. Uber writes about Preon, a query analytical infrastructure for analyzing Presto Queries.

https://www.uber.com/blog/preon/

Presto: Query Optimization with Historical-Based Optimization Framework in Presto

The recent Presto technical blog goes one step further on query analytics, using it to optimize queries with Historical-Based optimization frameworks. The blog narrates the query planning caching on Redis on historical queries to compare the current execution plan and the historical plan to derive the optimal query execution plan.

https://prestodb.io/blog/2024/09/26/query-optimization-with-historical-based-optimization-framework-in-presto/

Google: SQL Has Problems - We Can Fix Them - Pipe Syntax In SQL

It was a good weekend read about the proposed pipe syntax in SQL, which is more similar to Unix pipes in terms of its core concept—sequential data flow and transformation. Unix pipes typically represent a physical flow of data between processes. While visually suggesting a sequence, Pipe SQL maintains SQL's declarative nature. I like the addition of experimental debug operators such as DEBUG and ASSERT. What do you think about Pipe syntax in SQL?

https://research.google/pubs/sql-has-problems-we-can-fix-them-pipe-syntax-in-sql/

Yelp: Boosting ML Pipeline Efficiency: Direct Cassandra Ingestion from Spark

Yelp’s latest technical blog explores enhancing ML pipeline efficiency by enabling direct Cassandra ingestion from Spark. The article details how bypassing intermediate storage steps reduces latency and improves data processing speed. The approach highlights the importance of streamlining data workflows for faster machine learning model training and deployment.

https://engineeringblog.yelp.com/2024/09/boosting-ml-pipeline-efficiency-direct-cassandra-ingestion-from-spark.html

Pinterest: Feature Caching for Recommender Systems w/ Cachelib

Pinterest’s blog discusses improving recommender systems' performance through feature caching using CacheLib. By caching frequently used features, Pinterest reduces latency and boosts system efficiency. The method underscores the value of optimizing data retrieval in real-time applications, especially for large-scale recommendation systems.

https://medium.com/pinterest-engineering/feature-caching-for-recommender-systems-w-cachelib-8fb7bacc2762

Frank Sommers: Document Similarity Search with ColPali

Colpali is gaining popularity because it significantly improves document retrieval efficiency and accuracy by leveraging vision-language models and bypassing the need for OCR, making it ideal for handling large and visually rich documents. The author narrates an in-depth overview of document similarity search with ColPali.

https://huggingface.co/blog/fsommers/document-similarity-colpali

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employer” opinions.

Data Engineering Weekly