The State of Data Engineering in 2024: Key Insights and Trends
A Look Back at the Year's Defining Patterns in Data Engineering
As we reflect on 2024, the data engineering landscape has undergone significant transformations driven by technological advancements, changing business needs, and the meteoric rise of artificial intelligence. This comprehensive analysis examines the key trends and patterns that shaped data engineering practices throughout the year.
The GenAI Revolution in Data Engineering
Integrating Generative AI (GenAI) and Large Language Models (LLMs) into data platforms emerged as the most transformative trend of 2024. Organizations across industries moved beyond experimental phases to implement production-ready GenAI solutions within their data infrastructure.
Natural Language Interfaces
Companies like Uber, Pinterest, and Intuit adopted sophisticated text-to-SQL interfaces, democratizing data access across their organizations. Tools like Uber’s QueryGPT and Pinterest’s text-to-SQL solution bridge the gap between business users and data by allowing natural language queries. These solutions go beyond basic query generation to prioritize accuracy, security, and compliance.
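The core pattern behind these interfaces can be sketched in a few lines: schema context plus a natural-language question goes to an LLM, which returns SQL for review before anything touches the warehouse. The snippet below is a minimal illustration using the OpenAI Python SDK; the model name, table schema, and guardrail prompt are assumptions, not details of QueryGPT or Pinterest's system.

```python
# Minimal text-to-SQL sketch (illustrative; not QueryGPT or Pinterest's implementation).
# Assumes the `openai` package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

SCHEMA = """
-- hypothetical schema supplied as context
CREATE TABLE trips (trip_id BIGINT, city VARCHAR, fare_usd DECIMAL, started_at TIMESTAMP);
"""

def text_to_sql(question: str) -> str:
    """Ask the model for a single read-only SQL statement answering the question."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any capable LLM works here
        messages=[
            {"role": "system",
             "content": "You translate questions into ANSI SQL. "
                        "Use only the tables in the provided schema. "
                        "Return exactly one SELECT statement and nothing else."},
            {"role": "user", "content": f"Schema:\n{SCHEMA}\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content.strip()

print(text_to_sql("What was the average fare per city last week?"))
```

In production, the generated SQL would still pass through validation, access-control, and cost checks before execution, which is where most of the engineering effort behind these tools actually goes.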
Automated Data Classification and Governance
LLMs are reshaping governance practices. Grab’s Metasense, Uber’s DataK9, and Meta’s classification systems use AI to automatically categorize vast data sets, reducing manual efforts and improving accuracy. Beyond classification, organizations now use AI for automated metadata generation and data lineage tracking, creating more intelligent data infrastructures.
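The underlying idea is easy to sketch: column metadata and sample values are sent to an LLM with a fixed label taxonomy, and the structured response tags the dataset. The tier names and prompt below are assumptions for illustration, not how Metasense, DataK9, or Meta's classifiers are built.

```python
# Illustrative LLM-based column classification (not Grab's, Uber's, or Meta's implementation).
import json
from openai import OpenAI

client = OpenAI()
TIERS = ["PII", "FINANCIAL", "INTERNAL", "PUBLIC"]  # assumed taxonomy

def classify_column(column_name: str, sample_values: list[str]) -> dict:
    """Classify one column into a fixed taxonomy and return the parsed JSON verdict."""
    prompt = (
        f"Classify the column into one of {TIERS}.\n"
        f"Column name: {column_name}\nSample values: {sample_values}\n"
        'Respond as JSON: {"tier": "...", "reason": "..."}'
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

print(classify_column("customer_email", ["a@example.com", "b@example.com"]))
```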
Development and Security Frameworks
Structured frameworks have become essential to ensure effective GenAI implementation. Companies like Uber and Grab have developed toolkits like the Prompt Engineering Toolkit and LLM-Kit, which focus on:
Prompt management and version control
Security and compliance guardrails
Performance monitoring and cost optimization
These frameworks address critical challenges, standardizing how LLMs integrate into data workflows while mitigating risks like high costs or compliance breaches.
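What "prompt management and version control" means in practice can be pictured with a small registry: prompts are treated like code artifacts with named versions, templated variables, and an audit trail. The class below is a minimal illustration of that idea, not the design of Uber's Prompt Engineering Toolkit or Grab's LLM-Kit.

```python
# Minimal prompt-registry sketch: versioned, templated prompts treated as artifacts.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: int
    template: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class PromptRegistry:
    """In-memory stand-in for a versioned prompt store with an audit trail."""
    def __init__(self) -> None:
        self._store: dict[str, list[PromptVersion]] = {}

    def register(self, name: str, template: str) -> PromptVersion:
        versions = self._store.setdefault(name, [])
        pv = PromptVersion(name=name, version=len(versions) + 1, template=template)
        versions.append(pv)
        return pv

    def latest(self, name: str) -> PromptVersion:
        return self._store[name][-1]

registry = PromptRegistry()
registry.register("sql_generator", "Translate to SQL using schema: {schema}\nQuestion: {question}")
prompt = registry.latest("sql_generator").template.format(schema="...", question="top cities by fare")
```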
Evolution of Data Lake Technologies
The data lake ecosystem has matured significantly in 2024, particularly in table formats and storage technologies.
S3 Tables and Cloud Integration
AWS’s introduction of S3 Tables marked a pivotal shift, enabling faster queries and easier management. Building on Apache Iceberg’s foundation, S3 Tables integrates storage and compute layers, yielding up to 3x performance improvements through optimized query planning and compaction strategies.
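Under the hood these are still Iceberg tables, so the developer experience looks like any other Iceberg catalog. The PySpark sketch below creates and queries an Iceberg table on S3; the catalog name, warehouse path, and namespace are assumptions, and S3 Tables itself exposes a managed catalog rather than the generic Hadoop catalog used here.

```python
# Illustrative Iceberg-on-S3 setup with PySpark (generic catalog, not the S3 Tables API).
# Assumes the matching iceberg-spark-runtime jar is on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")                         # assumed catalog type
    .config("spark.sql.catalog.demo.warehouse", "s3://my-bucket/warehouse")  # assumed bucket
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.analytics")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.analytics.trips (
        trip_id BIGINT, city STRING, fare_usd DECIMAL(10, 2), started_at TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(started_at))
""")

spark.sql("SELECT city, avg(fare_usd) FROM demo.analytics.trips GROUP BY city").show()
```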
Table Formats Standardization
Delta Lake, Apache Hudi, and Apache Iceberg have competed fiercely in 2024, each offering unique strengths:
Delta Lake: ACID compliance and cloud optimization
Apache Hudi: Real-time ingestion and upsert capabilities
Apache Iceberg: A wide vendor ecosystem and scalable warehouse design
Organizations like Flipkart and Grab have shared implementation insights, helping others navigate these options and make informed adoption decisions.
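One reason the upsert story matters is that all three formats expose it through very similar SQL. The sketch below shows a MERGE-based upsert in Spark SQL against the illustrative Iceberg table from the earlier snippet; Delta Lake and Hudi support equivalent patterns, and the staging view and column names are assumptions.

```python
# Illustrative MERGE-based upsert in Spark SQL (similar syntax on Iceberg and Delta Lake).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg-enabled session as configured above

spark.sql("""
    MERGE INTO demo.analytics.trips AS t
    USING updates AS u            -- `updates` is an assumed staging view of changed rows
    ON t.trip_id = u.trip_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```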
The Battle for Catalog Supremacy
2024 witnessed intense competition in the catalog space, highlighting the strategic importance of metadata management in modern data architectures. Databricks' acquisition of Tabular and the subsequent open-sourcing of Unity Catalog, followed by Snowflake's release of the open-source Polaris Catalog, marked a significant shift in the industry's approach to data governance and discovery. Despite their "open-source" nature, these catalogs often remain tightly coupled with their respective commercial platforms, challenging the fundamental promise of open table formats. While vendors strive to provide optimal integrated experiences for their customers, this fragmentation increases operational complexity and business costs, making it harder to scale operations or adopt new technologies.
Vector Search and Unstructured Data Processing
Advancements in Search Architecture
In 2024, organizations redefined search technology by adopting hybrid architectures that combine traditional keyword-based methods with advanced vector-based approaches. These systems address the increasing complexity of search queries, blending semantic understanding with precise ranking processes to deliver highly relevant results. LinkedIn, for example, implemented a two-layer search engine that integrates a retrieval layer capable of selecting thousands of candidate posts from billions of options with a multi-stage ranking layer that scores results with remarkable precision. This architecture incorporates real-time query processing and semantic search capabilities, enabling faster and more accurate content discovery. Similarly, Instacart engineered a hybrid retrieval system leveraging pgvector within PostgreSQL, striking a balance between precision and query coverage. This system enhances product discovery by combining the strengths of traditional keyword searches and vector-based techniques, optimizing search relevance for customer queries.
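The pgvector-style hybrid pattern is straightforward to sketch: a keyword score from PostgreSQL full-text search is blended with a vector-similarity score in a single query. The table, weights, and embedding size below are assumptions for illustration, not Instacart's schema.

```python
# Hedged hybrid-retrieval sketch combining pgvector and PostgreSQL full-text search.
# Assumes `CREATE EXTENSION vector;` and a table products(id BIGINT, title TEXT, embedding VECTOR(384)).
import psycopg2

HYBRID_QUERY = """
SELECT id,
       title,
       0.5 * ts_rank(to_tsvector('english', title), plainto_tsquery('english', %(q)s))
     + 0.5 * (1 - (embedding <=> %(qvec)s::vector)) AS hybrid_score   -- <=> is cosine distance
FROM products
WHERE to_tsvector('english', title) @@ plainto_tsquery('english', %(q)s)
   OR embedding <=> %(qvec)s::vector < 0.5
ORDER BY hybrid_score DESC
LIMIT 20;
"""

def hybrid_search(conn, query_text: str, query_embedding: list[float]):
    with conn.cursor() as cur:
        cur.execute(HYBRID_QUERY, {"q": query_text, "qvec": str(query_embedding)})
        return cur.fetchall()

conn = psycopg2.connect("dbname=shop")                        # assumed connection string
results = hybrid_search(conn, "organic oat milk", [0.0] * 384)  # embedding from any encoder
```

The equal weighting here is arbitrary; in practice the balance between keyword and vector scores is tuned per query class, which is precisely the precision-versus-coverage trade-off these hybrid systems are built to manage.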
Other organizations, like Grab and Figma, have pushed search technology boundaries, focusing on performance and scalability. Grab’s search infrastructure employs a two-step vector similarity approach enhanced with large language model (LLM)-based ranking and natural language understanding, enabling superior accuracy and relevance for user queries. Meanwhile, Figma designed a scalable vector search infrastructure to manage high-dimensional data while minimizing computing costs. This system prioritizes real-time processing, ensuring seamless interaction for users navigating complex design workflows. Across industries, these advancements reflect a growing trend toward hybrid systems emphasizing precision, real-time capabilities, and cost-efficient scalability—making them indispensable in an era of increasing data complexity.
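The common shape of these systems is a cheap first-pass retrieval followed by a more expensive ranking step. The sketch below uses a cross-encoder from sentence-transformers as a stand-in for the LLM-based ranking Grab describes; the model name and candidate source are assumptions.

```python
# Two-stage retrieve-then-rerank sketch (a cross-encoder standing in for an LLM ranker).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed model

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[tuple[str, float]]:
    """Score (query, candidate) pairs and keep the best top_k."""
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

# `candidates` would come from a first-stage ANN or keyword retrieval over millions of items.
candidates = ["late night food delivery", "grocery pickup near me", "24h pharmacy"]
print(rerank("where can I order food at 2am", candidates, top_k=2))
```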
Innovations in Unstructured Data Processing
Processing unstructured data at scale remains one of the biggest challenges for modern organizations, prompting innovative solutions in 2024 that blend efficiency, scalability, and accuracy. Companies like Thomson Reuters Labs have revolutionized document processing by leveraging modern formats like Parquet and Arrow, which optimize data storage and retrieval for high volumes of unstructured information. These innovations enable faster document parsing and reduce processing overhead, ensuring pipelines can scale without performance bottlenecks. Thomson Reuters Labs also adopted advanced parallel processing techniques to streamline workflows, drastically improving system efficiency. By focusing on scalable document understanding pipelines, the company demonstrated how to manage large-scale unstructured data efficiently while reducing the computational burden.
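The format choice matters because Parquet and Arrow let documents and their extracted metadata be streamed in column-pruned batches rather than loaded whole. A minimal pyarrow sketch, with an assumed file layout:

```python
# Streaming, column-pruned reads from Parquet with pyarrow (illustrative file and columns).
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile("documents.parquet")  # assumed: one row per parsed document

total_rows = 0
for batch in parquet_file.iter_batches(batch_size=65_536,
                                       columns=["doc_id", "language", "body_text"]):
    # Each batch is a pyarrow.RecordBatch; process it without materializing the whole file.
    total_rows += batch.num_rows

print(f"processed {total_rows} documents in bounded memory")
```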
In addition to file format optimizations, organizations have increasingly adopted machine learning and vision-language models to bypass traditional methods like OCR (optical character recognition). Tools like ColPali now leverage these models to analyze visual documents directly, improving document similarity search accuracy and enabling faster processing of data that combines text and images. These advancements are particularly transformative for industries dealing with large volumes of unstructured visual content, such as legal documents, receipts, and research papers. Together, these innovations reflect a growing industry-wide focus on tools and frameworks that process unstructured data more intelligently and cost-effectively, opening new possibilities for analyzing complex, unformatted datasets at unprecedented scales.
Data Quality and Governance Evolution
In 2024, the data quality and governance landscape has transformed as organizations prioritize automation, decentralization, and proactive frameworks to ensure data reliability and compliance. While some have expressed less optimistic views about data mesh and data contracts, many real-world success stories emerged this year that demonstrate their value. This evolution reflects a broader shift toward scalability, agility, and enhanced governance across data ecosystems.
Automated Quality Monitoring Systems
Expedia has developed a Service-Level Objective (SLO) platform powered by Kafka for event streaming and PostgreSQL for efficient data storage, delivering near real-time insights through APIs and leveraging DataDog integration to identify data quality issues and minimize disruptions proactively. Similarly, Adevinta has introduced its decentralized Artemis system, which enables individual teams to define custom data quality checks, establish automated alerting mechanisms, and resolve issues proactively. This approach fosters team autonomy while maintaining high-quality standards.
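A stripped-down version of this kind of check is easy to picture: measure a freshness SLO against the source database and push the result to a metrics backend that alerting hooks into. The table, threshold, and metric names below are assumptions, and the DogStatsD call stands in for whichever integration a team actually uses.

```python
# Illustrative freshness SLO check: PostgreSQL lag measurement pushed to DogStatsD.
# Assumes the `datadog` package and a locally running Datadog agent.
import psycopg2
from datadog import statsd

FRESHNESS_SLO_SECONDS = 15 * 60  # assumed 15-minute freshness objective

def check_freshness(conn) -> float:
    """Measure how stale the newest event is and emit the lag plus an SLO-breach flag."""
    with conn.cursor() as cur:
        cur.execute("SELECT EXTRACT(EPOCH FROM now() - max(event_time)) FROM bookings_events;")
        lag_seconds = cur.fetchone()[0]
    statsd.gauge("data_quality.freshness_lag_seconds", lag_seconds, tags=["dataset:bookings"])
    statsd.gauge("data_quality.slo_breached",
                 1 if lag_seconds > FRESHNESS_SLO_SECONDS else 0, tags=["dataset:bookings"])
    return lag_seconds

lag = check_freshness(psycopg2.connect("dbname=events"))  # assumed connection
```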
Swiggy ensures consistent mobile application event collection through its automated event verification framework, which validates data at the source, automates contract validation workflows, reduces pipeline errors early, and establishes a reliable foundation for data processing. Meanwhile, Yelp has extended dbt’s generic test framework to standardize quality checks for data marts, implement domain-specific validation rules, and automate quality assurance processes, ensuring consistent testing and validation across datasets.
Data Mesh, Data Products, and Data Contracts
Miro exemplifies the shift toward metadata-driven workflows by transitioning from Airflow code to DataHub YAML specifications. The YAML definitions establish explicit data contracts that clarify stakeholder responsibilities and simplify contract management. Uber demonstrates the transformative potential of Data Mesh, decentralizing data ownership while maintaining governance standards through self-serve infrastructure, standardized data contracts, and automated governance processes.
Similarly, Notion underscores the critical role of metadata management through its data catalog initiative, which integrates metadata seamlessly into workflows by leveraging a strong data platform foundation, clear ownership models, and automated metadata collection. Meanwhile, Next Insurance strengthens governance and standardization with its DAGLint implementation, a tool that enforces best practices, optimizes workflow structures, and ensures quality assurance across development patterns.
Emerging Best Practices in Data Governance
As organizations embrace innovation in data quality and governance, several best practices have emerged as industry standards:
Shift from reactive to proactive quality management
Standardization of quality assurance processes
Automation of validation and monitoring
Implementation of clear data contracts
Enhanced metadata management
Decentralized ownership with centralized governance
Cost Optimization and Performance Tuning
The relentless growth of data volumes and cloud computing costs has made cost optimization and performance tuning critical priorities in 2024. Organizations have implemented sophisticated strategies to balance performance requirements with cost efficiency.
Query Optimization and Cost Attribution
By optimizing their most expensive pipelines, Medium's engineering team demonstrated significant cost savings in their Snowflake environment. Their approach focused on:
Identifying and eliminating redundant data processing
Optimizing JOIN operations and table structures
Implementing efficient incremental processing patterns (see the sketch after this list)
Reducing unnecessary data movements
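The incremental processing bullet above usually boils down to a high-watermark pattern: persist the last timestamp (or ID) processed and only scan rows beyond it on the next run. A hedged sketch with assumed table names, using any DB-API connection with pyformat parameters (psycopg2 assumed) and hypothetical `get_watermark`/`set_watermark` helpers:

```python
# High-watermark incremental processing sketch (assumed tables, columns, and helpers).
INCREMENTAL_LOAD = """
INSERT INTO analytics.daily_story_stats
SELECT story_id, date_trunc('day', read_at) AS day, count(*) AS reads
FROM raw.story_reads
WHERE read_at > %(watermark)s          -- only rows newer than the last successful run
GROUP BY story_id, date_trunc('day', read_at);
"""

def run_incremental(conn, get_watermark, set_watermark):
    """Process only rows newer than the stored watermark, then advance the watermark."""
    watermark = get_watermark("raw.story_reads")        # hypothetical state-store lookup
    with conn.cursor() as cur:
        cur.execute(INCREMENTAL_LOAD, {"watermark": watermark})
        cur.execute("SELECT max(read_at) FROM raw.story_reads WHERE read_at > %s", (watermark,))
        new_watermark = cur.fetchone()[0]
    if new_watermark is not None:
        set_watermark("raw.story_reads", new_watermark)  # hypothetical state-store update
    conn.commit()
```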
GreyBeam's deep dive into Snowflake's query cost attribution revealed the importance of granular cost monitoring and optimization. Their analysis provided frameworks for measuring per-query costs, enabling teams to identify and optimize expensive operations proactively.
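A practical starting point is simply surfacing the heaviest recent queries per warehouse from the account usage views; true per-query credit attribution then requires prorating warehouse metering data across concurrent queries, which is the harder part such analyses dig into. A hedged sketch of the first step with the Snowflake Python connector:

```python
# Surface the heaviest recent queries per warehouse (a first step toward cost attribution).
# Assumes snowflake-connector-python and credentials in environment variables.
import os
import snowflake.connector

TOP_QUERIES = """
SELECT warehouse_name,
       query_id,
       total_elapsed_time / 1000 AS elapsed_s,
       bytes_scanned
FROM snowflake.account_usage.query_history
WHERE start_time >= dateadd('day', -7, current_timestamp())
ORDER BY total_elapsed_time DESC
LIMIT 50;
"""

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)
for warehouse, query_id, elapsed_s, bytes_scanned in conn.cursor().execute(TOP_QUERIES):
    print(f"{warehouse}\t{query_id}\t{elapsed_s:.1f}s\t{bytes_scanned:,} bytes")
```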
Infrastructure Cost Management
PayPal achieved remarkable results by leveraging Spark 3 and NVIDIA's GPUs, reducing cloud costs by up to 70% for their big data pipelines. Their implementation demonstrated how hardware acceleration could dramatically improve performance and cost efficiency.
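Mechanically, much of this is configuration: the RAPIDS Accelerator plugs into Spark's SQL engine and offloads supported operators to GPUs. The settings below are the commonly documented ones; resource amounts, jar versions, and which pipelines actually benefit are workload-dependent assumptions, not PayPal's setup.

```python
# Illustrative Spark 3 + RAPIDS Accelerator configuration (GPU offload for SQL/DataFrame ops).
# Assumes the rapids-4-spark jar and GPU resource discovery are set up on the cluster.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("gpu-etl")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")   # enables the RAPIDS Accelerator
    .config("spark.rapids.sql.enabled", "true")
    .config("spark.executor.resource.gpu.amount", "1")       # one GPU per executor (assumed)
    .config("spark.task.resource.gpu.amount", "0.25")        # four concurrent tasks per GPU (assumed)
    .getOrCreate()
)

# Supported operators in this aggregation run on the GPU; unsupported ones fall back to the CPU.
spark.read.parquet("s3://my-bucket/payments/").groupBy("merchant_id").count().show()
```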
DoorDash's implementation of Kafka multi-tenancy showcases how architectural decisions can significantly impact infrastructure costs. Their approach, illustrated with a minimal sketch after the list, includes:
Resource sharing across multiple applications
Efficient capacity planning
Automated resource management
Optimized storage utilization
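A small piece of what multi-tenancy looks like in practice is per-tenant provisioning with explicit capacity limits baked into topic configuration. The sketch below uses the confluent-kafka AdminClient; the topic naming scheme, retention, and partition counts are assumptions, not DoorDash's settings.

```python
# Illustrative per-tenant topic provisioning with capacity limits (confluent-kafka AdminClient).
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "kafka:9092"})  # assumed brokers

def provision_tenant_topic(tenant: str, partitions: int, retention_hours: int) -> None:
    """Create a tenant-prefixed topic with per-tenant retention and partition counts."""
    topic = NewTopic(
        f"{tenant}.events",                      # assumed tenant-prefixed naming scheme
        num_partitions=partitions,
        replication_factor=3,
        config={
            "retention.ms": str(retention_hours * 3600 * 1000),
            "max.message.bytes": "1048576",
        },
    )
    for future in admin.create_topics([topic]).values():
        future.result()                          # raises if creation failed

provision_tenant_topic("orders-service", partitions=12, retention_hours=48)
```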
Best Practices Emerging in Performance & Cost Optimization
Several key patterns emerged in 2024 around cost and performance optimization:
Move from reactive to proactive cost management
Implement granular monitoring and attribution
Consider cost implications during architectural design
Balance performance requirements with cost efficiency
Conclusion
2024 has been a transformative year for data engineering, with AI technologies becoming mainstream, data lake solutions maturing, and efficiency and governance taking center stage. Organizations have moved beyond theory to implement real-world solutions that address complex challenges.
What lies ahead in 2025? Later this week, we will publish DEW's predictions for 2025 and beyond. Stay tuned.
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent the opinions of my current, former, or future employers.