Data Engineering Weekly #172

The Weekly Data Engineering Newsletter

May 20, 2024

Editor’s Note: DEWCon Call for Speakers Open - September 13th, Bengaluru - India

DEWCon is back this year on a grand scale on September 13th, 2024, in Bengaluru, India. We have added a few features this year to bring the data community together:

Book a 1:1 session with experts on career, tech stack, team management, and more!
"Ideas Jam Session," where you can talk about your ideas and prototypes in a 10-minute slot

More details on DEWCon will follow in the coming weeks, and we will open registration shortly.

If you want to speak at the conference, propose the talk here: https://forms.gle/21piE4B4e9VuShQ97.

If you want to sponsor DEWCon, express interest here: https://forms.gle/NSWPL6mjJEdR5ERd9.

Vicki Boykis: We've been put in the vibe space

Many companies struggle to integrate LLMs into their business applications. The author highlights the importance of minimizing surprises and providing a uniform user experience across business categories by referring to Jakob’s law of UX.

Source: https://lawsofux.com/jakobs-law/

https://vickiboykis.com/2024/05/06/weve-been-put-in-the-vibe-space/

Abraham Thomas: How To Price A Data Asset

How do we establish the value of data and the infrastructure around it? This is a long-standing question for both internal data teams and vendors, and it is challenging enough to be a research topic. This article on data asset pricing is one of the more comprehensive pieces I have come across about pricing models, and it establishes two basic factors.

Data value depends on the users and the use cases
Data quality is multi-dimensional, and high-quality data costs more.

https://pivotal.substack.com/p/how-to-price-a-data-asset

Chris Riccomini: Nimble and Lance - The Parquet Killers

One of my burning questions is why there is not much going on in columnar formats. The paper "An Empirical Evaluation of Columnar Storage Formats" compares the ORC and Parquet formats on efficiency and design choices and points out what they lack. The blog compares two modern-day alternatives to Parquet: Nimble and Lance.

https://materializedview.io/p/nimble-and-lance-parquet-killers

Mikkel Dengsøe: Data about data from 1,000 conversations with data teams

The blog highlights some of the pressing issues in the industry.

Business-critical applications are increasingly built on data warehouses, which is driving the rapid adoption of best practices from software engineering. We have already seen talks about data contracts, data products, etc.
Data teams and their stacks are getting larger. The complexity of data engineering increases exponentially as the data size grows.
As testing becomes the core of reliable data system building, data professionals increasingly become domain experts.

https://medium.com/@mikldd/data-about-data-from-1-000-conversations-with-data-teams-bf21496dd7ea

LinkedIn: Building a Large-Scale Recommendation System: People You May Know

LinkedIn's "People You May Know" feature is designed to help users expand their professional networks by suggesting new connections based on shared contacts, interests, and professional backgrounds. Recent updates have focused on improving recommendation quality and reducing bias, ensuring that both frequent and infrequent users benefit from relevant suggestions.

Diagram of PYMK’s multi-stage ranking system

Notably, empirical research has shown that connections with acquaintances, rather than close friends, are more effective in helping users find new job opportunities, validating the "strength of weak ties" theory.

https://www.linkedin.com/blog/engineering/recommendations/building-a-large-scale-recommendation-system-people-you-may-know

DoubleCloud: Spectrio Boosts Analytics Speed and Cuts Costs with Managed ClickHouse

Whether you are struggling with query speed, infrastructure costs, or vendor lock-in, DoubleCloud is here to help you out with best-in-class managed open-source tech! Read on to discover how Spectrio reduced its query latency and infra costs while handling more than 200M rows by switching from Snowflake.

https://double.cloud/resources/case-studies/spectrio-cut-costs-and-boosted-analytics-speed-with-doublecloud/

Pinterest: HBase Deprecation at Pinterest

Pinterest writes about its multi-year journey of deprecating HBase in favor of specialized databases such as Druid, Goku (an in-house time-series db), and TiDB. The blog highlights the key factors that led to the HBase deprecation.

Complexity and Maintenance Overhead
Scalability and Performance Issues
Lack of Essential Features

https://medium.com/pinterest-engineering/hbase-deprecation-at-pinterest-8a99e6c8e6b7

Thumbtack: Unifying Machine Learning Model Inference at Thumbtack

Thumbtack writes about unifying its machine learning model inference process to streamline deployment and improve efficiency. By consolidating various models into a single inference platform, Thumbtack better manages model versions, reduces latency, and scales its infrastructure seamlessly. This approach not only enhances performance but also simplifies the maintenance and deployment pipeline for their machine learning operations.

https://medium.com/thumbtack-engineering/unifying-machine-learning-model-inference-at-thumbtack-eacb08d11680

BuzzFeed: Crafting Analytical Summaries with Chat GPT

The explainability of an analysis is one area where I think LLMs can help data practitioners. BuzzFeed writes about exactly that: using ChatGPT to automate the creation of analytical summaries, blending data from various sources like SQL databases and Google Sheets. This approach improves efficiency by generating concise and accurate summaries, helping the team gain insights and make data-driven decisions.

https://tech.buzzfeed.com/crafting-analytical-summaries-with-chat-gpt-1316ba5fbe7c

Picnic: Enhancing Search Retrieval with Large Language Models (LLMs)

Picnic uses large language models (LLMs) to enhance its search retrieval system, enabling more accurate and contextually relevant search results. By leveraging LLMs, Picnic improves both the understanding of user queries and the matching of those queries to relevant products. I’m prototyping ways to blend keyword search and semantic search, and this is an exciting domain to keep track of.
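For readers curious what blending keyword and semantic search can look like, here is a minimal sketch using reciprocal rank fusion (RRF), a common way to merge two ranked lists. This is not Picnic's implementation; the function name, the constant k=60, and the sample product lists are all assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) in every list it appears in,
    so items ranked well by both retrievers rise to the top.
    k=60 is a commonly used default, chosen here as an assumption.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword retriever and a semantic retriever.
keyword_hits = ["milk", "oat milk", "milk chocolate"]
semantic_hits = ["oat milk", "almond drink", "milk"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

RRF only needs ranks, not raw scores, which sidesteps the problem that keyword scores and embedding similarities live on different scales.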

https://blog.picnic.nl/enhancing-search-retrieval-with-large-language-models-llms-7c3748b26d72

Agoda: How We Solve Load Balancing Challenges in Apache Kafka

Agoda writes about how it solved load-balancing challenges with Apache Kafka. The solutions essentially fall into two categories.

Lag-aware producers (if there is only a finite and known set of producers in the system)
Lag-aware consumers (if there is no finite set of producers in the system)

The blog discusses various algorithmic approaches one can take for both categories and their trade-offs.
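To make the lag-aware producer idea concrete, here is a minimal sketch in which a producer picks a partition with probability inversely related to that partition's consumer lag. It is an illustration under assumptions, not Agoda's algorithm: the function names and the 1 / (1 + lag) weighting are hypothetical, and the per-partition lag values are assumed to come from a monitoring source outside this snippet.

```python
import random


def partition_weights(lags):
    """Turn per-partition consumer lag into normalized selection weights.

    Assumed weighting: 1 / (1 + lag), so a partition with no lag gets the
    largest share and heavily lagging partitions receive less traffic.
    """
    raw = {partition: 1.0 / (1 + lag) for partition, lag in lags.items()}
    total = sum(raw.values())
    return {partition: weight / total for partition, weight in raw.items()}


def choose_partition(lags, rng=random):
    """Pick one partition at random, biased toward low-lag partitions."""
    weights = partition_weights(lags)
    partitions = list(weights)
    return rng.choices(partitions, weights=[weights[p] for p in partitions], k=1)[0]


# Hypothetical lag snapshot: partition ID -> number of unconsumed messages.
lags = {0: 0, 1: 9, 2: 99}
weights = partition_weights(lags)
chosen = choose_partition(lags, random.Random(0))
```

A weighted random choice, rather than always picking the least-lagged partition, avoids having every producer pile onto the same partition at once.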

https://medium.com/agoda-engineering/how-we-solve-load-balancing-challenges-in-apache-kafka-8cd88fdad02b

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
