Data Engineering Weekly #161

The Weekly Data Engineering Newsletter

Mar 03, 2024

RudderStack is the Warehouse Native CDP, built to help data teams deliver value across the entire data activation lifecycle, from collection to unification and activation. Visit rudderstack.com to learn more.

Editor’s Note: Chennai, India Meetup - March-08 Update

We are thankful to Ideas2IT to host our first Data Hero’s meetup. There will be food, networking, and real-world talks around data engineering. Here is the agenda,

1) Data Application Lifecycle Management - Harish Kumar( Paypal)

Hear from the team in PayPal on how they build the data product lifecycle management (DPLM) systems.

2) Why High-Quality Data Products Beats Complexity in Building LLM Apps - Ananth Packildurai

I will walk through the evolution of model-centric to data-centric AI and how data products and DPLM (Data Product Lifecycle Management) systems are vital for an organization's system.

3) DataOPS at AstraZeneca

The AstraZeneca team talks about data ops best practices internally established and what worked and what didn’t work!!!

4) Building Data Products and why should you? by Aswin James Christy( Qlik/Talend)

Aswin establishes a case for why you should start thinking about building data products.

SignUp Link: https://nas.io/data-heroes/data-hereos-chennai-meetup

Pinterest: Unlocking AI-Assisted Development Safely: From Idea to GA

Pinterest detailed its journey of implementing AI-assisted development, highlighting its initial caution due to legal and security concerns and its eventual decision to adopt GitHub Copilot for its seamless IDE integration and compatibility with its tooling ecosystem. Through an extensive trial involving 200 developers, with an established feedback loop, ensuring Copilot's effectiveness and security. This approach led to a successful expansion of Copilot access across the engineering team, resulting in a significant increase in productivity and adoption, demonstrating a commitment to enhancing developer experience while maintaining safety and security standards.

https://medium.com/pinterest-engineering/unlocking-ai-assisted-development-safely-from-idea-to-ga-4d68679161ef

Microsoft: GraphRAG - Unlocking LLM discovery on narrative private data

The article introduces GraphRAG, a Microsoft Research innovation enhancing Large Language Models' (LLMs) capabilities in analyzing narrative private data through knowledge graphs. GraphRAG significantly improves question-and-answer performance over traditional vector similarity techniques using LLM-generated knowledge graphs for document analysis. This approach enables deeper insights into complex datasets that LLMs have not been trained on, demonstrating substantial improvements in data understanding and thematic discovery.

https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/

Nvidia: What Is Sovereign AI?

To open or not to open the OpenAI is a hotly debated topic in the industry—the era where people started considering AI development a national asset. The NVIDIA blog on Sovereign AI emphasizes the importance of countries developing artificial intelligence capabilities using local infrastructure, data, and workforce. The article discusses the role of "AI factories" and national efforts worldwide to advance sovereign AI through partnerships, investments, and infrastructure development.

https://blogs.nvidia.com/blog/what-is-sovereign-ai/

Databricks: A Deep Dive into the Latest Performance Improvements of Stateful Pipelines in Apache Spark Structured Streaming

The Databricks blog article delves into the latest enhancements for stateful pipelines in Apache Spark Structured Streaming, focusing on improving performance through optimized memory management, database write/flush performance, and changelog checkpointing. These improvements, part of Project Lightspeed, aim to reduce latency and increase throughput for stateful streaming queries, making them more efficient and cost-effective. The article also highlights sink-specific improvements and operator-specific enhancements that contribute to the overall performance boost.

https://www.databricks.com/blog/deep-dive-latest-performance-improvements-stateful-pipelines-apache-spark-structured-streaming

Zendesk: dbt at Zendesk

The Zendesk team shares their journey of migrating legacy data pipelines to dbt, focusing on making them more reliable, efficient, and scalable. They tackled the challenge by starting with pipelines requiring frequent updates, implementing initial lift and shift to dbt with minimal changes, and then proceeding with basic and advanced refactoring for optimization. The migration enhanced data quality, lineage visibility, performance improvements, cost reductions, and better reliability and scalability, setting a robust foundation for future expansions and onboarding.

dbt at Zendesk — Part I: Setting Foundations for Scalability

dbt at Zendesk — Part 2: supercharging dbt with Dynamic Stage

dbt at Zendesk — Part III: lift-and-shift playbook

Grab: Enabling near real-time data analytics on the data lake

Grab discusses its near real-time data analytics implementation in their data lake, highlighting the transition from traditional data storage to the Hudi format for efficient processing. The article explains the challenges with frequent updates in the vanilla data lake setup and how Hudi allows for faster writes and guarantees atomicity, consistency, isolation, and durability (ACID).

https://engineering.grab.com/enabling-near-realtime-data-analytics

Rippling: Engineering a SIEM: Rippling's security data lakehouse and modular design

The two-part series from Rippling details their journey of building a custom Security Information and Event Management (SIEM) system, emphasizing an engineering-first approach for enhanced security and operational efficiency. Part 1 covers the need for a bespoke SIEM, focusing on scalability, efficient log management, and seamless AWS integration, among other requirements.

Part 2 focuses on its security data lakehouse and its modular design and highlights the implementation of a security data lakehouse using Snowflake, emphasizing efficient log ingestion, modularization with Terraform, and custom log pullers.

Part 1: Why did we need to build our own SIEM?

Part 2: Rippling's security data lakehouse and modular design

DoorDash: Introducing DoorDash’s In-House Search Engine

DoorDash writes about its in-house search engine to address scalability and efficiency issues identified with Elasticsearch's previous search architecture. The new system, built on Apache Lucene, features a segment-replication model, separates indexing from searching traffic, and significantly reduces latency and hardware costs. The system design focuses on a horizontally scalable, general-purpose engine with enhanced query understanding and planning, aiming to support DoorDash's rapid growth and complex document relationships more effectively.

https://doordash.engineering/2024/02/27/introducing-doordashs-in-house-search-engine/

AWS: Data governance in the age of generative AI

The AWS Big Data Blog discusses the importance of data governance in the age of generative AI, emphasizing the need for robust data management strategies to ensure data quality, privacy, and security across structured and unstructured data sources. The article highlights how enterprises can leverage AWS services to manage data governance effectively, ensuring responsible AI use by incorporating comprehensive data governance steps in data pipelines and user interaction workflows. This approach helps maintain accuracy, relevance, and compliance in generative AI applications.

https://aws.amazon.com/blogs/big-data/data-governance-in-the-age-of-generative-ai/

Martin Chesbrough: How to Build a Modern Data Team? Seven tips for success

The article outlines seven tips for building a modern data team: emphasizing the importance of the team's composition and adaptability, advocating for a blend of centralized and decentralized approaches, promoting a full-stack data team concept rather than seeking full-stack engineers, valuing mindset and principles over mere certifications, ensuring the team enjoys their work, and aligning data architecture with organizational structure as per Conway's Law. It stresses the need to understand the technological and human aspects of data team dynamics, including fostering skill liquidity and focusing on first principles for problem-solving over tool-specific knowledge.

https://medium.com/everestengineering/how-to-build-a-modern-data-team-seven-tips-for-success-a4d97e427d45

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employer” opinions.

Data Engineering Weekly

Discussion about this post

Ready for more?