Data Engineering Weekly #166

The Weekly Data Engineering Newsletter

Apr 07, 2024

dbt: 2024 State of Analytics Engineering

dbt’s 2024 State of Analytics Engineering report is out. Poor data quality and unclear data ownership remain the top challenges for data teams. Data Mesh continues to gain popularity among enterprises, which is a stark difference from the Gartner report about data mesh. I guess only time will tell who wins the data mesh vs. data fabric war.

https://www.getdbt.com/resources/reports/state-of-analytics-engineering-2024

Matt Turck: Full Steam Ahead: The 2024 MAD (Machine Learning, AI & Data) Landscape

Continuing the week of insights into the data & AI landscape, the 2024 MAD landscape is out. The report points out that the rise of LLMs makes unstructured data more important than ever, that pressure on the Modern Data Stack will continue to intensify as the cost of integration remains high, and that a “Modern AI Stack” is on the rise.

https://mattturck.com/mad2024/

EvalPlus: EvalPlus Leaderboard - EvalPlus evaluates AI Coders with rigorous tests

Will AI replace coders? What will the future of software engineers be? EvalPlus builds a leaderboard to show how well the leading AI coding models perform.

https://evalplus.github.io/leaderboard.html

Chase Roberts: Data Council 2024 - The future data stack is composable, and other hot takes

The author reflects on conversations at Data Council 2024, the most popular data conference in the USA. The emergence of the composable data stack and the open data stack is certainly an interesting trend to watch. A key highlight for me:

I spoke to multiple data people stuck in legacy systems and still inching their way to the cloud. VCs have moved on from data catalogs, yet practitioners told me they look forward to solving data discovery.

https://medium.com/vvus/data-council-2024-the-future-data-stack-is-composable-and-other-hot-takes-b6c5f2429e22

Pinterest: How we built Text-to-SQL at Pinterest

Last week Intuit shared its key learnings from building Text-to-SQL, and now Pinterest publishes a technical deep dive on how its internal Text-to-SQL works. The highlight for me is:

There is an ongoing table standardization effort at Pinterest to add tiering for the tables. We index only top-tier tables, promoting the use of these higher-quality datasets.

I strongly believe the concept of the Data Product will play a bigger role in data engineering. It will become the foundation of trusted sources, which is essential to taking advantage of advancements in LLMs.
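To make the tiering idea in the quote above concrete, here is a minimal sketch of filtering a table catalog before anything is indexed for Text-to-SQL. The `TableMeta` class, the table names, and the "tier 1 is best" convention are my own assumptions for illustration, not Pinterest's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMeta:
    name: str
    tier: int          # assumed convention: 1 is the highest-quality tier
    description: str


def tables_to_index(catalog: list[TableMeta], max_tier: int = 1) -> list[TableMeta]:
    """Keep only tables at or above the quality bar (lower number = higher tier)."""
    return [t for t in catalog if t.tier <= max_tier]


# Hypothetical catalog entries, made up for this example.
catalog = [
    TableMeta("core.daily_active_users", 1, "Certified daily active user counts"),
    TableMeta("tmp.dau_backfill_v2", 3, "Ad hoc backfill scratch table"),
    TableMeta("core.ad_revenue", 1, "Certified ad revenue by day"),
]

indexed = tables_to_index(catalog)
print([t.name for t in indexed])
```

Only the two tier-1 tables survive the filter, so a retrieval step built on top of `indexed` can never suggest the scratch table to the LLM.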

https://medium.com/pinterest-engineering/how-we-built-text-to-sql-at-pinterest-30bad30dabff

Spotify: Data Platform Explained

In the ever-evolving landscape of data-driven decision-making, a well-structured data platform emerges as a critical asset. Spotify shares some of the critical triggers in an organization that lead to building a data platform.

https://engineering.atspotify.com/2024/04/data-platform-explained/

Replit: Building LLMs for Code Repair

Replit has developed a native AI model for code repair, leveraging the Language Server Protocol (LSP) diagnostics and operational transformations (OTs) to train a large language model (LLM) that fixes code errors directly within its IDE. This initiative aims to significantly reduce developers' time spent on debugging by improving the AI's understanding and interaction with the development environment. The model is trained using a dataset of code-diagnostic pairs and fine-tuned to predict line diffs that correct LSP-identified errors, showing promising results against larger models and existing benchmarks.
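As a rough illustration of what a code-diagnostic pair with a line-diff target could look like, here is a small sketch. It uses Python's standard `difflib` unified diff as a stand-in; Replit's actual data format and diff representation are described in the linked post and are not reproduced here. The `RepairExample` class and the sample diagnostic text are hypothetical.

```python
import difflib
from dataclasses import dataclass


@dataclass
class RepairExample:
    diagnostic: str   # the kind of message an LSP server might report (made up here)
    broken: str       # code that triggers the diagnostic
    fixed: str        # code after the repair

    def target_diff(self) -> str:
        """Line diff from broken to fixed code, as a unified diff (an assumption)."""
        diff = difflib.unified_diff(
            self.broken.splitlines(),
            self.fixed.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
        return "\n".join(diff)


example = RepairExample(
    diagnostic='"totl" is not defined',
    broken="total = 1 + 2\nprint(totl)",
    fixed="total = 1 + 2\nprint(total)",
)

# The model input would be (code, diagnostic); the training target is the diff.
print(example.target_diff())
```

Predicting a diff rather than the whole file keeps the target short, which fits the in-IDE setting the post describes.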

https://blog.replit.com/code-repair

Hussein Jundi: Data Engineering - Architectures & Strategies for Handling Sensitive Data

The rapid adoption of AI challenges data engineering teams to design systems that handle sensitive data. The author writes a comprehensive article on strategies for handling sensitive data, the maturity levels of organizations, and how the solution differs at each maturity level.

https://blog.det.life/data-engineering-architectures-strategies-for-handling-sensitive-data-83292b997c17

Picnic: YAML developers and the declarative data platforms

Forget the Modern Data Stack. Have you ever wondered what a Declarative Data Stack is? The blog takes SQL as evidence of the success of a declarative language. For lack of better wording, we should further classify declarative languages as dynamic and static. SQL is a dynamic declarative language in which one can express complex constraints, whereas YAML is pretty much a static rule engine.

The declarative paradigm is an abstraction on top of imperative statements performed by systems
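The quote above is easy to show in a few lines. In this sketch a plain dict stands in for a parsed YAML file (the declarative "what"), and a small function carries out the imperative "how". The spec keys (`source`, `filter`, `select`) and the sample rows are invented for the example and do not come from the Picnic post.

```python
# Declarative part: states what we want, not how to get it.
# A dict stands in for a parsed YAML file; the keys are hypothetical.
spec = {
    "source": "orders",
    "filter": {"column": "status", "equals": "shipped"},
    "select": ["order_id", "amount"],
}


def run(spec: dict, tables: dict[str, list[dict]]) -> list[dict]:
    """Imperative part: the steps a platform performs to satisfy the spec."""
    rows = tables[spec["source"]]
    rule = spec.get("filter")
    if rule:
        # Static rule: a single column-equals-value check, nothing more expressive.
        rows = [r for r in rows if r[rule["column"]] == rule["equals"]]
    return [{col: r[col] for col in spec["select"]} for r in rows]


tables = {
    "orders": [
        {"order_id": 1, "status": "shipped", "amount": 20},
        {"order_id": 2, "status": "pending", "amount": 35},
    ]
}

print(run(spec, tables))
```

The spec can only say what the engine already knows how to interpret, which is the "static" limitation I mean when comparing YAML with SQL.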

https://blog.picnic.nl/yaml-developers-and-the-declarative-data-platforms-4719b7a1311c

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.
