Data Engineering Weekly Is Brought to You by RudderStack
RudderStack provides data pipelines that make it easy to collect data from every application, website, and SaaS platform, then activate it in your warehouse and business tools. Sign up free to test out the tool today.
A16Z: Emerging Architectures for LLM Applications
LLMs are slowly making their way into enterprise architecture, and every company is looking at how to adopt LLMs with its private data. A16Z published a reference architecture for the emerging LLM stack. The design focuses on a three-layer system.
Data preprocessing & Embedding
Prompt construction & retrieval
Prompt execution & inference
https://a16z.com/2023/06/20/emerging-architectures-for-llm-applications/
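To make the three layers concrete, here is a minimal Python sketch. It is not A16Z's reference implementation: the bag-of-words `embed` function stands in for a real embedding model, the in-memory `index` list stands in for a vector database, and `call_llm` is a hypothetical placeholder for a real model API. The sample documents are invented for illustration.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Layer 1: data preprocessing and embedding (invented sample documents).
documents = [
    "Refunds are processed within five business days.",
    "Our support team is available on weekdays.",
]
index = [(doc, embed(doc)) for doc in documents]

# Layer 2: prompt construction and retrieval.
def build_prompt(question: str, top_k: int = 1) -> str:
    query_vector = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    context = "\n".join(doc for doc, _ in ranked[:top_k])
    return f"Context:\n{context}\n\nQuestion: {question}"

# Layer 3: prompt execution and inference.
def call_llm(prompt: str) -> str:
    # Hypothetical placeholder; a real system would call a hosted or self-hosted model here.
    return f"[model response to {len(prompt)} prompt characters]"

prompt = build_prompt("How long do refunds take?")
answer = call_llm(prompt)
```

The point of the split is that private data only enters the model through the retrieved context in layer 2, so the model itself does not need to be retrained on it.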
Amelia Wattenberger: Why Chatbots Are Not the Future
All of the industry's attention is on ChatGPT. The interactive experience is amazing, but is it the future of the human-to-computer interface? The author thinks not and lists thought-provoking points about how these interfaces work.
My takeaway from the article: ChatGPT can write code and draft a document for you, but who is responsible for the outcome? What should the human-machine interface look like if the human is responsible for the outcome? How much control should the human have? Are chatbots the right interface for this workflow?
https://wattenberger.com/thoughts/boo-chatbots
Ziheng Wang: Open Vector Data Lakes
We’ve seen the rise of vector databases in the market. Cost, vendor lock-in, and the properties of the data storage engine are frequent points of discussion in the vector database space. The author argues that data lakes should support vector formats and explains how the LanceDB data format supports them.
https://blog.lancedb.com/why-dataframe-libraries-need-to-understand-vector-embeddings-291343efd5c8
Galina Alperovich: The Secret Sauce behind 100K context window in LLMs: all tricks in one place
Recently there were several announcements about new Large Language Models (LLMs) that can consume an extremely large context window, such as 65K tokens (MPT-7B-StoryWriter-65k+ by MosaicML) or even 100K tokens (Introducing 100K Context Windows by Anthropic). What is the significance of these larger context windows? The author explains the context window in LLMs, its limitations, its computational complexity, and the optimization techniques around it.
Sponsored: [dbt + Monte Carlo Webinar] 5 Critical Steps for Building Impactful Data Organizations
Building a data strategy from scratch? You're not alone! Join Monte Carlo, dbt, and Shiftkey's VP of Data & Analytics, John Steinmetz, on June 20th as he shares his experiences building the technologies, team structure, and KPIs necessary to align data and analytics success with larger business objectives.
Financial Times: Article Vectorisation Reloaded
Vectorization has a wide variety of applications, but which method should you use, and how do you measure its accuracy? Financial Times writes about evaluating vectorization models for the following use cases.
Article Clustering
Breadth of Readership
Article Recommendation
Trending Topics
https://medium.com/ft-product-technology/article-vectorisation-reloaded-391084f82549
Eppo: Incremental Pipelines: Managing State at Scale
Switching gears from the world of LLMs to the data world, Eppo writes an excellent article on building and managing incremental pipelines at scale, inspired by Dagster’s asset model. The blog narrates the essence of the asset model below.
Suppose that data asset C depends on B, and B depends on A. A planning phase may look like this:
C is out of date and needs the most recent two days of data from B
B is also out of date and needs the most recent two days of data from A
A is out of date and will need the most recent two days of data from the customer-provided data source
https://www.geteppo.com/blog/incremental-pipelines-managing-state-at-scale
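The planning phase quoted above can be sketched in a few lines of Python. This is an illustration, not Eppo's or Dagster's implementation: the `UPSTREAM` map, the `plan` function, and the dates are hypothetical, and it makes the simplifying assumption that each asset's missing window depends only on its own last-loaded date.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional, Tuple

# Hypothetical dependency chain from the example: C depends on B, B depends on A,
# and A reads from the customer-provided data source (upstream is None).
UPSTREAM: Dict[str, Optional[str]] = {"C": "B", "B": "A", "A": None}

def plan(
    target: str, last_loaded: Dict[str, date], today: date
) -> List[Tuple[str, str, List[date]]]:
    """Walk from the target back to the source, listing the days each asset is missing."""
    steps = []
    asset: Optional[str] = target
    while asset is not None:
        missing_days = (today - last_loaded[asset]).days
        if missing_days > 0:
            # Oldest missing day first, ending with today.
            days = [today - timedelta(days=offset) for offset in range(missing_days - 1, -1, -1)]
            source = UPSTREAM[asset] or "customer data source"
            steps.append((asset, source, days))
        asset = UPSTREAM[asset]
    return steps

# Assumed state: every asset was last loaded two days before "today".
today = date(2023, 6, 25)
last_loaded = {name: date(2023, 6, 23) for name in UPSTREAM}
planned = plan("C", last_loaded, today)

# Assumption: execution runs in the reverse order of planning, so A, then B, then C.
execution_order = [asset for asset, _, _ in reversed(planned)]
```

Planning walks from the target to the source because only the downstream asset knows which days it needs, while execution has to start at the source so that each asset finds its inputs ready.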
Sponsored: The Data Activation Lifecycle
"Data teams that successfully facilitate data activation do drive more business impact. They’re also able to break out of order-taking purgatory and position themselves as strategic partners within the business."
An end-to-end look at data activation and all of the data engineering work that's required to make it happen. The RudderStack team highlights important details that can help you more efficiently and effectively facilitate data activation, including why you should think of data activation as a continuous cycle, not a linear process.
https://www.rudderstack.com/blog/the-data-activation-lifecycle/
Xavier Gumara Rigol: Navigating the Spectrum of Centralization vs. Decentralization in Analytics Teams
I’ve seen organizations adopting centralized analytics teams, switching to decentralization, and then back to centralization. Deciding between centralization or decentralization is usually not a permanent choice. The author explains how each state will look in an organization.
Etsy: The Problem with Timeseries Data in Machine Learning Feature Systems
🚨 Folks, if you’re using Avro or Protobuf Timestamp data types for collecting events, especially the event timestamp, please don’t. Use the UNIX timestamp with millisecond precision.
If this statement does not convince you, please read the hard lesson learned from Etsy.
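As a small illustration of the recommendation, the sketch below stores an event timestamp as a plain integer of UNIX epoch milliseconds. It is not taken from Etsy's article: the helper names and the `listing_viewed` event are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_epoch_millis(dt: datetime) -> int:
    """Convert a timezone-aware datetime to UNIX epoch milliseconds."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime: attach a timezone before converting")
    # Integer division of timedeltas avoids floating-point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)

def from_epoch_millis(ms: int) -> datetime:
    """Convert UNIX epoch milliseconds back to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)

# Hypothetical event payload: the timestamp travels as a plain integer.
event_time = datetime(2023, 6, 25, 12, 30, 15, 123000, tzinfo=timezone.utc)
event = {
    "event_name": "listing_viewed",
    "event_timestamp_ms": to_epoch_millis(event_time),
}
```

A plain integer has a single unambiguous meaning, so every producer and consumer reads the same instant regardless of language or serialization library.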
Debezium: Towards Debezium exactly-once delivery
Kafka’s exactly-once delivery semantics have triggered a few interesting conversations in the past. Debezium writes about its exactly-once semantics support with Kafka and walks through what happens if a database connection breaks.
https://debezium.io/blog/2023/06/22/towards-exactly-once-delivery/
David Lexa: Optimizing costs of a Data Lakehouse
Data Lakehouses are growing by leaps and bounds, from transaction support and subsecond query latency to even supporting vector data structures. Cost is an obvious concern as we add more data, and the author narrates how they are adopting Lakehouse cost-saving methods to reduce their storage cost by up to 78%.
https://blog.xendit.engineer/optimizing-costs-of-a-data-lakehouse-3cb6777b2f94
All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers’ opinions.