Airbnb: Sandcastle - data/AI apps for everyone
Product ideas powered by data and AI must go through rapid iteration on shareable, lightweight live prototypes instead of static proposals. However, hosting an internal application for fast prototyping is always a challenging platform to build and maintain. Airbnb writes about Sandcastle, an Airbnb-internal prototyping platform that enables data scientists, engineers, and product managers to bring data/AI ideas to life.
https://medium.com/airbnb-engineering/sandcastle-data-ai-apps-for-everyone-439f3b78b223
Grab: Enabling conversational data discovery with LLMs at Grab.
Grab confirms my observation that users abandoned their searches in 18% of sessions without clicking on any dataset. This suggested that the search results were not meeting their needs, which led to redesigning the system to focus on search accessible where data consumers hang out the most: Slack.
The LLM and vector search will significantly lead to headless data catalogs or merge into technical catalogs such as Unity Catalogs. I doubt that data catalogs, as standalone tools, will have their user experience as part of it.
https://engineering.grab.com/hubble-data-discovery
ngrok: How we built ngrok's data platform
Ngrok shares its experience building a real-time data platform with an event-driven architecture, ensuring scalability and efficient data management. The focus on handling high throughput while maintaining real-time analytics offers valuable insights into modern data platform design, complementing ideas we've explored in past discussions on scaling data platforms.
https://ngrok.com/blog-post/how-we-built-ngroks-data-platform
Sponsored: Monte Carlo Demo - October
Reliable data is the backbone of useful business insights—and keeping bad data out of production is step one.
For most analysts, delivering reliable insights means writing data quality rules –and lots of them – to monitor for issues and data quality dimensions. But, every second you spend writing and managing a data quality rule is another second you can’t spend delivering new value for stakeholders.
But what if—instead of writing all those rules—you could automate them instead? How might that impact your ability to deliver trusted business insights?
Join us as we explore the challenges of data quality management and how AI-powered data quality features can accelerate incident detection workflows to provide more visibility into the health of your data products.
Uber: Preon - Presto Query Analysis for Intelligent and Efficient Analytics
Query analytics is essential in data platform engineering to understand the popular tables, index utilization, types of queries, and predicate windows of the queries. Uber writes about Preon, a query analytical infrastructure for analyzing Presto Queries.
https://www.uber.com/blog/preon/
Presto: Query Optimization with Historical-Based Optimization Framework in Presto
The recent Presto technical blog goes one step further on query analytics, using it to optimize queries with Historical-Based optimization frameworks. The blog narrates the query planning caching on Redis on historical queries to compare the current execution plan and the historical plan to derive the optimal query execution plan.
Google: SQL Has Problems - We Can Fix Them - Pipe Syntax In SQL
It was a good weekend read about the proposed pipe syntax in SQL, which is more similar to Unix pipes in terms of its core concept—sequential data flow and transformation. Unix pipes typically represent a physical flow of data between processes. While visually suggesting a sequence, Pipe SQL maintains SQL's declarative nature. I like the addition of experimental debug operators such as DEBUG and ASSERT. What do you think about Pipe syntax in SQL?
https://research.google/pubs/sql-has-problems-we-can-fix-them-pipe-syntax-in-sql/
Yelp: Boosting ML Pipeline Efficiency: Direct Cassandra Ingestion from Spark
Yelp’s latest technical blog explores enhancing ML pipeline efficiency by enabling direct Cassandra ingestion from Spark. The article details how bypassing intermediate storage steps reduces latency and improves data processing speed. The approach highlights the importance of streamlining data workflows for faster machine learning model training and deployment.
Pinterest: Feature Caching for Recommender Systems w/ Cachelib
Pinterest’s blog discusses improving recommender systems' performance through feature caching using CacheLib. By caching frequently used features, Pinterest reduces latency and boosts system efficiency. The method underscores the value of optimizing data retrieval in real-time applications, especially for large-scale recommendation systems.
Frank Sommers: Document Similarity Search with ColPali
Colpali is gaining popularity because it significantly improves document retrieval efficiency and accuracy by leveraging vision-language models and bypassing the need for OCR, making it ideal for handling large and visually rich documents. The author narrates an in-depth overview of document similarity search with ColPali.
https://huggingface.co/blog/fsommers/document-similarity-colpali
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employer” opinions.