Data Engineering Weekly #136

The Weekly Data Engineering Newsletter

Jun 26, 2023

Data Engineering Weekly Is Brought to You by RudderStack

RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.

A16Z: Emerging Architectures for LLM Applications

LLMs are slowly making their way into enterprise architecture, and every company is looking at how to adopt LLMs with its private data. A16Z published a reference architecture for the emerging LLM stack. The design focuses on a three-layer system:

Data preprocessing & Embedding
Prompt construction & retrieval
Prompt execution & inference
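To make the three layers concrete, here is a minimal sketch in Python. It is not taken from the A16Z article: the bag-of-words `embed` function stands in for a real embedding model, the in-memory `index` stands in for a vector database, and `run_llm` is a hypothetical placeholder for a call to a hosted or self-hosted model.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words term counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Layer 1: data preprocessing & embedding (hypothetical private documents)
documents = [
    "Refunds are processed within five business days.",
    "Our support desk is open on weekdays.",
]
index = [(doc, embed(doc)) for doc in documents]


# Layer 2: prompt construction & retrieval
def build_prompt(question, top_k=1):
    """Retrieve the most similar documents and place them in the prompt."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    context = "\n".join(doc for doc, _ in ranked[:top_k])
    return f"Context:\n{context}\n\nQuestion: {question}"


# Layer 3: prompt execution & inference
def run_llm(prompt):
    """Hypothetical placeholder: a real system would call a model here."""
    return f"[model output for a prompt of {len(prompt)} characters]"


if __name__ == "__main__":
    prompt = build_prompt("How long do refunds take?")
    print(prompt)
    print(run_llm(prompt))
```

The point of the sketch is the separation of concerns: private data is embedded once, retrieval decides what goes into the prompt, and the model call stays a thin final step.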

https://a16z.com/2023/06/20/emerging-architectures-for-llm-applications/

Amelia Wattenberger: Why Chatbots Are Not the Future

All the attention in the industry is on ChatGPT. The interactive experience is amazing, but is it the future of the human-computer interface? The author thinks not and lists thought-provoking points about how these interfaces work.

My highlight from the article: ChatGPT can write code and draft a document for you, but who is responsible for the outcome? What should the human-machine interface look like if the human is responsible for the outcome? How much control should the human have? Are chatbots the right interface for this workflow?

https://wattenberger.com/thoughts/boo-chatbots

Ziheng Wang: Open Vector Data Lakes

We’ve seen the rise of vector databases in the market. Cost, vendor lock-in, and the properties of the data storage engine are frequent points of discussion in the vector database space. The author argues that data lakes should support a vector format and explains how the LanceDB data format supports it.

https://blog.lancedb.com/why-dataframe-libraries-need-to-understand-vector-embeddings-291343efd5c8

Galina Alperovich: The Secret Sauce behind 100K context window in LLMs: all tricks in one place

Several recent announcements introduced Large Language Models (LLMs) that can consume an extremely large context window, such as 65K tokens (MPT-7B-StoryWriter-65k+ by MosaicML) or even 100K tokens (Introducing 100K Context Windows by Anthropic). What is the significance of these larger context windows? The author explains the context window in LLMs, its limitations, its computational complexity, and optimization techniques.
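As a rough illustration of why long contexts are costly, the sketch below computes the memory needed for one attention score matrix in standard (unoptimized) self-attention, where every token is compared with every other token. The 2-byte value size (16-bit floats) and the token counts are assumptions chosen for illustration, and the function name is hypothetical; real models use optimizations that avoid materializing this matrix.

```python
def attention_scores_bytes(n_tokens, bytes_per_value=2):
    """Bytes for one n x n attention score matrix (one head, one layer).

    Assumes naive self-attention that stores the full matrix,
    with 2-byte (16-bit) values by default.
    """
    return n_tokens * n_tokens * bytes_per_value


if __name__ == "__main__":
    for n in (2_000, 65_000, 100_000):
        gib = attention_scores_bytes(n) / 2**30
        print(f"{n:>7} tokens -> {gib:,.2f} GiB per head")
```

Because the cost grows with the square of the token count, going from 2K to 100K tokens multiplies this matrix by 2,500, not by 50.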

https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-context-window-all-tricks-in-one-place-ffd40577b4c

Financial Times: Article Vectorisation Reloaded

Vectorization has a wide variety of applications, but which method should you use, and how do you measure its accuracy? The Financial Times writes about evaluating vectorization models for the following use cases.

Article Clustering
Breadth of Readership
Article Recommendation
Trending Topics

https://medium.com/ft-product-technology/article-vectorisation-reloaded-391084f82549

Eppo: Incremental Pipelines: Managing State at Scale

Switching gears from the world of LLMs to the data world, Eppo writes an excellent article on building and managing incremental pipelines at scale, inspired by Dagster’s asset model. The blog captures the essence of the asset model as follows.

Suppose that data asset C depends on B, and B depends on A. A planning phase may look like this:

C is out of date and needs the most recent two days of data from B
B is also out of date and needs the most recent two days of data from A
A is out of date and will need the most recent two days of data from the customer-provided data source
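The planning phase quoted above can be sketched as a walk up the dependency chain. This is an illustrative sketch, not Eppo's or Dagster's implementation: the `UPSTREAM` mapping, the `plan` and `execution_order` helpers, and the choice to run assets upstream-first are all assumptions.

```python
# Hypothetical dependency chain: each asset maps to the source it reads from.
UPSTREAM = {"C": "B", "B": "A", "A": "customer_source"}


def plan(asset, days, stale):
    """Walk upstream from `asset`, recording what each stale asset needs.

    Returns a list of (asset, source, days) tuples in planning order.
    The walk stops at the first asset that is up to date or has no upstream.
    """
    steps = []
    current = asset
    while current in UPSTREAM and current in stale:
        steps.append((current, UPSTREAM[current], days))
        current = UPSTREAM[current]
    return steps


def execution_order(steps):
    """Assumed execution order: refresh upstream assets before downstream ones."""
    return [asset for asset, _, _ in reversed(steps)]


if __name__ == "__main__":
    steps = plan("C", days=2, stale={"A", "B", "C"})
    for asset, source, days in steps:
        print(f"{asset} is out of date and needs the most recent {days} days from {source}")
    print("Run order:", execution_order(steps))
```

Planning runs from the requested asset toward the source, while execution in this sketch runs the other way, so each asset reads fresh data from its upstream.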

https://www.geteppo.com/blog/incremental-pipelines-managing-state-at-scale

Xavier Gumara Rigol: Navigating the Spectrum of Centralization vs. Decentralization in Analytics Teams

I’ve seen organizations adopt centralized analytics teams, switch to decentralization, and then go back to centralization. Deciding between centralization and decentralization is usually not a permanent choice. The author explains what each state looks like in an organization.

https://xgumara.medium.com/navigating-the-spectrum-of-centralization-vs-decentralization-in-analytics-teams-eb6eb240917b

Etsy: The Problem with Timeseries Data in Machine Learning Feature Systems

🚨 Folks, if you’re using Avro or Protobuf Timestamp data types for collecting events, especially the event timestamp, please don’t. Use the UNIX timestamp with millisecond precision.

If this statement does not convince you, please read the hard lesson learned from Etsy.
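For reference, here is a minimal sketch of the recommendation above: storing an event time as a UNIX timestamp in milliseconds. The helper names are hypothetical, and the sketch uses only Python's standard library with integer arithmetic to avoid floating-point rounding.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt):
    """Convert a timezone-aware datetime to UNIX time in whole milliseconds."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime: attach a timezone before converting")
    # Dividing one timedelta by another with // yields an exact integer.
    return (dt - EPOCH) // timedelta(milliseconds=1)


def from_epoch_millis(ms):
    """Convert UNIX milliseconds back to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


if __name__ == "__main__":
    event_time = datetime(2023, 6, 26, 12, 0, 0, 123000, tzinfo=timezone.utc)
    millis = to_epoch_millis(event_time)
    print(millis)
    print(from_epoch_millis(millis).isoformat())
```

A plain integer like this travels unchanged through serialization formats and languages, which is the property the recommendation relies on.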

https://www.etsy.com/codeascraft/the-problem-with-timeseries-data-in-machine-learning-feature-systems

Debezium: Towards Debezium exactly-once delivery

Kafka’s exactly-once delivery semantics triggered a few interesting conversations in the past. Debezium writes about its support for exactly-once semantics with Kafka and walks through what happens when a database connection breaks.

https://debezium.io/blog/2023/06/22/towards-exactly-once-delivery/

David Lexa: Optimizing costs of a Data Lakehouse

Data Lakehouses are growing by leaps and bounds, from transaction support and subsecond query latency to support for vector data structures. Cost is an obvious concern as we add more data, and the author describes how they adopted lakehouse cost-saving methods to cut their storage cost by up to 78%.

https://blog.xendit.engineer/optimizing-costs-of-a-data-lakehouse-3cb6777b2f94

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.
