Astasia Myers: The three components of the unstructured data stack
LLMs and vector databases have significantly improved our ability to process and understand unstructured data. I never thought of a PDF as a self-contained document database, but that seems to be a reality we can’t deny. The blog is an excellent summary of the existing unstructured data landscape.
https://www.felicis.com/insight/unstructured-data-stack
Figma: The infrastructure behind AI search in Figma
Figma writes about the challenges of building vector search at scale. The lessons mostly involve understanding the nature of the data, how frequently it is processed, and the cost of compute. It is exciting to read what is probably the first blog on building a vector search infrastructure at scale.
https://www.figma.com/blog/the-infrastructure-behind-ai-search-in-figma/
Meta: IPLS - Privacy-preserving storage for your WhatsApp contacts
I spent quality time earlier this year assisting with India’s DPDP law requirements, which piqued my curiosity about building privacy-preserving computing. We recently published a comprehensive engineering guide to building a privacy-first design. The blog from Meta discusses how it designed privacy-preserving storage for WhatsApp contacts.
Event Alert: IMPACT Summit
If you haven't registered for the IMPACT Summit yet, now's the perfect time 🔈
Here’s what we’ve got in store:
- A half-day virtual event created to elevate your 2025 data strategy
- Sessions jam-packed with industry experts sharing how they're driving data and AI adoption
- Practical tips and best practices from Monte Carlo customers
- Opportunities to connect and network with other data professionals
- Giveaways and raffles for attendees, including three All-Access subscriptions to DataExpert.io!
- And more!
What are you waiting for? Register for IMPACT today!
Uber: Streamlining Financial Precision - Uber’s Advanced Settlement Accounting System
The financial reconciliation engine is possibly one of the most complicated pipelines to build. At the last DEWCon summit, Flipkart, India’s leading e-commerce company, talked about its reconciliation pipeline. Along similar lines, Uber writes about its comprehensive settlement accounting system, designed to efficiently handle the immense volume of transactions processed each month.
https://www.uber.com/blog/ubers-advanced-settlement-accounting-system/
Pinterest: Resource Management with Apache YuniKorn™ for Apache Spark™ on AWS EKS at Pinterest
Apache YARN has limitations, including a lack of application isolation, high engineering effort to upgrade, and a lack of feature compatibility with the capacity and fair schedulers. Pinterest writes about adopting Apache YuniKorn, a YARN alternative that provides a Kubernetes-compatible resource scheduler for container orchestration systems.
Event Alert: MLOps World/ Gen AI World - Austin, TX - Nov 7-8
The Gen AI Summit, consisting of a wider group of 20,000 engineers, AI entrepreneurs, and scientists, will host 1,000 AI teams in Austin, TX, November 7-8. Join for two days of sessions, socials, case studies, and workshop tutorials. Passes include app-brain-date networking, birds of a feather sessions, post-event parties, etc. 60+ speakers from LinkedIn, Shopify, Amazon, Lyft, Grammarly, Mistral, et al.
Data Engineering Weekly readers get a 15% discount by registering through the following link:
Jack Vanlightly: The curse of Conway and the data space
Bringing data engineering development close to software engineering practices is a dream we all strive for. The blog argues that separating software development and data analytics teams within organizations is harmful, citing Conway’s Law to illustrate how this organizational structure negatively impacts software design. The author proposes three trends: data engineering as a software engineering discipline, data contracts and data products, and Shift Left as ways to address this problem.
https://jack-vanlightly.com/blog/2024/10/21/the-curse-of-conway-and-the-data-space
Alibaba: Evolution of Flink 2.0 State Management Storage-computing Separation Architecture
In last week’s newsletter, we highlighted the problems with Flink’s state management.
Ephemeral local storage is a blessing, but persistent local storage is a curse.
Luckily, the Flink community is actively innovating on this. Alibaba's blog gives an in-depth overview of Flink’s state management and what it takes to build a storage-compute separation architecture.
Wix: SageMaker Batch Transform Unleashed: My Journey at Wix to Achieve Scalable ML
Wix writes about implementing AWS SageMaker Batch Transform to enhance the efficiency of their machine learning model operations. Wix's system utilizes over 200 models daily, necessitating a scalable and robust solution. The implementation includes a sophisticated retry mechanism that addresses failed input files by isolating problematic rows and rerunning the Batch Transform job with an optimized configuration.
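To make the idea of isolating problematic rows concrete, here is a minimal sketch. It is not Wix's implementation: the summary above does not say how the rows are isolated, so the bisection strategy, the `isolate_bad_rows` helper, and the `fake_batch_job` stand-in for a Batch Transform run are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def isolate_bad_rows(
    rows: Sequence, run_batch: Callable[[Sequence], None]
) -> Tuple[List, List]:
    """Return (good_rows, bad_rows) by bisecting any batch that fails.

    `run_batch` is a hypothetical callable that raises when a batch
    cannot be processed; a real job submission would go in its place.
    """
    if not rows:
        return [], []
    try:
        run_batch(rows)
        return list(rows), []
    except Exception:
        if len(rows) == 1:
            # A single failing row cannot be split further.
            return [], list(rows)
        mid = len(rows) // 2
        good_left, bad_left = isolate_bad_rows(rows[:mid], run_batch)
        good_right, bad_right = isolate_bad_rows(rows[mid:], run_batch)
        return good_left + good_right, bad_left + bad_right


def fake_batch_job(rows: Sequence) -> None:
    """Stand-in job: fails if any row is missing (None)."""
    if any(row is None for row in rows):
        raise ValueError("batch contains an unprocessable row")


good, bad = isolate_bad_rows([1, None, 3, 4], fake_batch_job)
print(good, bad)
```

Bisection keeps the number of reruns small when only a few rows are bad, at the cost of reprocessing good rows that share a failing batch.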
Grab: LLM-assisted vector similarity search
Grab writes about integrating large language models (LLMs) with vector similarity search to improve the accuracy and relevance of search results. The authors propose a two-step approach: initially, a vector similarity search is performed to narrow down potential matches, followed by an LLM-based ranking process that leverages natural language understanding to refine the results. The benchmark demonstrates that this LLM-assisted search outperforms traditional vector similarity searches in handling complex and nuanced queries.
https://engineering.grab.com/llm-assisted-vector-similarity-search
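The two-step approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Grab's code: the toy vectors and documents are made up, and `fake_llm_rerank` is a hypothetical stand-in that scores word overlap where a real system would prompt an LLM to order the shortlisted candidates.

```python
import math
from typing import Callable, Dict, List, Sequence

# Toy corpus and embeddings, invented for this example.
DOC_TEXT: Dict[str, str] = {
    "d1": "vegetarian noodle soup",
    "d2": "spicy chicken noodles",
    "d3": "chocolate cake",
}
DOC_VECS: Dict[str, List[float]] = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.8, 0.3, 0.0],
    "d3": [0.0, 0.1, 0.9],
}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def shortlist(query_vec: Sequence[float], k: int) -> List[str]:
    """Step 1: narrow the corpus to the k nearest documents by cosine similarity."""
    ranked = sorted(DOC_VECS, key=lambda d: cosine(query_vec, DOC_VECS[d]), reverse=True)
    return ranked[:k]


def fake_llm_rerank(query: str, candidates: List[str]) -> List[str]:
    """Step 2 stand-in: order candidates by words shared with the query.

    A real reranker would send the query and candidate texts to an LLM.
    """
    words = set(query.split())
    return sorted(
        candidates,
        key=lambda d: len(words & set(DOC_TEXT[d].split())),
        reverse=True,
    )


def llm_assisted_search(
    query: str,
    query_vec: Sequence[float],
    rerank: Callable[[str, List[str]], List[str]],
    k: int = 2,
) -> List[str]:
    return rerank(query, shortlist(query_vec, k))


print(llm_assisted_search("spicy noodles", [1.0, 0.2, 0.0], fake_llm_rerank))
```

Keeping the reranker as a plain callable means the vector search step can be tested on its own, and the LLM call can be swapped in without touching the shortlist logic.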
Expedia: Gateways, Guardrails, and GenAI Models
Expedia writes about the "GenAI Toolkit" for implementing and controlling access to generative AI models within enterprise environments. It highlights the challenges of using platforms like ChatGPT and Azure AI Service, which lack sufficient control over account usage and cost tracking across multiple applications. The GenAI Toolkit, which comprises the GenerativeAI Proxy (GAP) and EG-Guardrails service, provides a secure and flexible solution by addressing data protection, content filtering, authentication, and resource management issues.
https://medium.com/expedia-group-tech/gateways-guardrails-and-genai-models-aa606379164d
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.