Data Engineering Weekly #190

The Weekly Data Engineering Newsletter

Sep 23, 2024

Editor’s Note: Coming Next in the Comparison Matrix Series - Data LakeHouse

Our mission is to empower data professionals and organizations to make informed, data-driven decisions by providing a comprehensive buyer's guide and comparison matrix for selecting the best data tools. We have already published comparison matrices for CDC and Data Observability. Next, we will publish a comparison matrix for LakeHouse.

Every comparison matrix builds on collective, interactive sessions with the Data Heros community. Please watch the community LinkedIn page, where we will post the link for the discussion.

In the true spirit of community and open knowledge-sharing, we won't record this conversation.

Uber: QueryGPT – Natural Language to SQL Using Generative AI

Companies are increasingly adopting natural language interfaces to interact with enterprise data.

I’m fairly confident that a framework that supports a deterministic way of building a data pipeline and a conversational way of building business logic is the future of data pipeline engineering.

Uber writes about its journey in building a natural language query interface for the data warehouse, including the lessons learned from the first iteration and the adoption of a multi-agent approach to improve accuracy.

https://www.uber.com/blog/query-gpt/

Dropbox: How we use Lakera Guard to secure our LLMs

As LLM adoption increases, applications need guards against prompt injection, hallucination, and other security and compliance risks. Dropbox writes a case study on integrating Lakera Guard into its prompt engineering to secure the LLM infrastructure.

https://dropbox.tech/security/how-we-use-lakera-guard-to-secure-our-llms

Flipkart: Flipkart Enhances AI Safety in E-Commerce - Implementing NVIDIA NeMo Guardrails

Flipkart is leveraging NVIDIA's NeMo Guardrails to strengthen AI safety in its e-commerce platform, ensuring compliance and safeguarding customer interactions. This move demonstrates the growing importance of AI governance in real-world applications, especially as data-driven businesses like Flipkart scale their machine-learning operations. This highlights a crucial trend for data engineers: AI safety mechanisms are becoming an integral part of ML pipelines, reinforcing the need for robust monitoring and control frameworks in the data ecosystem.

https://blog.flipkart.tech/flipkart-enhances-ai-safety-in-e-commerce-implementing-nvidia-nemo-guardrails-cb2f293b29c0

Thomson Reuters Labs: Better Customer Support Using Retrieval-Augmented Generation (RAG) at Thomson Reuters

Thomson Reuters enhances customer support by using Retrieval-Augmented Generation (RAG), which integrates large language models with real-time information retrieval. The combination allows customer service teams to provide accurate and context-aware responses by pulling up-to-date information from company resources. The key takeaway is that RAG adoption is increasing, and the results show in the improved quality and relevance of automated responses, which make customer interactions more efficient and effective.

https://medium.com/tr-labs-ml-engineering-blog/better-customer-support-using-retrieval-augmented-generation-rag-at-thomson-reuters-4d140a6044c3
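
To make the retrieve-then-generate idea concrete, here is a minimal sketch in Python. It is not Thomson Reuters' implementation: the knowledge base, the word-overlap scoring, and the function names are all hypothetical, and a production system would typically use embeddings and a vector index instead. The sketch only shows the two steps, retrieving relevant snippets and placing them in the prompt that would be sent to an LLM.

```python
import re
from collections import Counter

# Hypothetical support snippets; a real system would index company resources.
KNOWLEDGE_BASE = [
    "To reset your password, open Settings and choose Reset Password.",
    "Invoices are available under Billing within the account dashboard.",
    "Contact support by phone for urgent license activation problems.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, doc):
    """Count overlapping tokens (a toy stand-in for embedding similarity)."""
    q = Counter(tokenize(query))
    d = Counter(tokenize(doc))
    return sum(min(q[t], d[t]) for t in q)


def retrieve(query, docs, k=2):
    """Return the k documents with the highest overlap score."""
    ranked = sorted(docs, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:k]


def build_prompt(query, docs, k=2):
    """Assemble the augmented prompt that would be passed to an LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

Calling build_prompt("How do I reset my password?", KNOWLEDGE_BASE) yields a prompt whose context section leads with the password snippet. The generation step itself is left out because it depends on whichever model API is in use.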

Grab: LLM-powered data classification for data entities at scale

Grab has implemented a powerful solution for data classification by leveraging large language models (LLMs) to tag and categorize their data automatically. Traditionally, this task was labor-intensive and prone to inconsistencies. By integrating LLMs, Grab streamlined the classification process, allowing the system to efficiently generate column-level tags for sensitive data and business metrics. The LLM-powered approach improves accuracy and speed, providing a more scalable way to manage metadata for their diverse and rapidly growing data.

https://engineering.grab.com/llm-powered-data-classification

Wix: Customizing LLMs for Enterprise Data Using Domain Adaptation: The Wix Journey

The logical next step for enterprises adopting LLMs is to train the model on domain-specific data. Wix writes about customizing large language models (LLMs) for enterprise data using domain adaptation techniques. This process allows them to fine-tune LLMs to better understand and respond to their specific data, enhancing performance for tasks like customer service and internal operations.

https://www.wix.engineering/post/customizing-llms-for-enterprise-data-using-domain-adaptation-the-wix-journey

Pinterest: Pinterest Tiered Storage for Apache Kafka®️: A Broker-Decoupled Approach

Decoupling storage from compute (the brokers) significantly reduces the total cost of ownership, a trend we see across stream processing systems. Pinterest writes about its adoption of tiered storage for Apache Kafka. The system handles large-scale data by offloading older, less frequently accessed data to cheaper storage tiers while keeping active data on high-performance storage.

https://medium.com/pinterest-engineering/pinterest-tiered-storage-for-apache-kafka-%EF%B8%8F-a-broker-decoupled-approach-c33c69e9958b
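
The age-based offloading idea can be sketched in a few lines of Python. This is an illustration only, not Pinterest's design or a Kafka API: the Segment type, the plan_tiering function, and the hot_retention_s threshold are hypothetical names chosen for the example. It simply splits log segments into those that stay on fast local storage and those that are old enough to move to a cheaper tier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A hypothetical log segment with the timestamp of its newest record."""
    name: str
    last_modified_s: float  # epoch seconds


def plan_tiering(segments, now_s, hot_retention_s):
    """Split segments into (hot, cold) lists based on their age.

    Segments older than hot_retention_s are candidates for the cheaper
    tier; everything else stays on high-performance storage.
    """
    hot, cold = [], []
    for seg in segments:
        age_s = now_s - seg.last_modified_s
        if age_s > hot_retention_s:
            cold.append(seg)
        else:
            hot.append(seg)
    return hot, cold
```

For example, with a retention of 500 seconds and the current time at 1000, a segment last written at time 0 lands in the cold list while one written at time 900 stays hot. A real system would also need to handle the upload, the read path for offloaded data, and failure cases, all of which are omitted here.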

Brittany: Lessons from Human Data Analysts to Improve the AI Variety

Can AI replace human data analysts? The author shares a perspective on how human analysts can use AI. Human intuition, experience, and contextual understanding remain key to making sense of complex data beyond what automated tools can achieve. At the same time, the author stresses the importance of combining human expertise with AI data tools to foster better collaboration between business teams and data analysts.

https://datafordoers.substack.com/p/lessons-from-human-data-analysts

Yuno: How Apache Hudi transformed Yuno’s Data Lake

Yuno writes about transforming its data infrastructure by implementing Apache Hudi, optimizing data lake performance, and reducing costs by 70%. Hudi's features, such as time travel, indexing, and automated file management, enabled real-time data insights and improved efficiency. Yuno utilized Hudi’s flexibility across different use cases and integrated it with AWS Glue and Airflow for orchestration.

https://www.y.uno/post/how-apache-hudi-transformed-yunos-data-lake

HomeToGo: HomeToGo’s North Star Metric for our Data Domain

Can you measure the effectiveness of your data ecosystem? The HomeToGo team writes about a set of "North Star Metrics" to align its data domain efforts with business objectives. The metrics are a guiding measure to track the performance and impact of data initiatives across the organization. The key takeaway is that having a clear, measurable objective allows teams to maintain focus and ensure their work contributes directly to the company's overall goals.

https://engineering.hometogo.com/hometogos-north-star-metric-for-our-data-domain-7f0f5fb96e30

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
