Thoughts on Amazon S3 Express One Zone and Its Impact on Data Infrastructure
Exploring the Impact of Amazon S3 Express One Zone on Data Infrastructure Evolution
AWS S3 Express One Zone has sparked some excitement in the data infrastructure community. In case you missed it, please read the AWS announcement here.
Amazon S3 Express One Zone is a high-performance, single-Availability Zone storage class purpose-built to deliver consistent single-digit millisecond data access for your most frequently accessed data and latency-sensitive applications. S3 Express One Zone can improve data access speeds by 10x and reduce request costs by 50% compared to S3 Standard and scales to process millions of requests per minute. While you have always been able to choose a specific AWS Region to store your S3 data, with S3 Express One Zone, you can select a specific AWS Availability Zone within an AWS Region to store your data. You can co-locate your storage with your compute resources in the same Availability Zone to further optimize performance, which helps lower compute costs and run workloads faster.
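To make the "select a specific Availability Zone" part concrete, here is a minimal sketch of the request parameters for creating an S3 Express One Zone directory bucket. The base name `events-hot` and the AZ ID `use1-az5` are illustrative assumptions, not values from the announcement, and the helper `directory_bucket_request` is hypothetical. The block only builds the parameters; it does not call AWS.

```python
def directory_bucket_request(base_name: str, az_id: str) -> dict:
    """Build keyword arguments for an S3 CreateBucket call pinned to one AZ.

    Hypothetical helper for illustration only.
    """
    return {
        # Directory bucket names embed the AZ ID and end with "--x-s3".
        "Bucket": f"{base_name}--{az_id}--x-s3",
        "CreateBucketConfiguration": {
            # Pin the bucket to a single Availability Zone.
            "Location": {"Type": "AvailabilityZone", "Name": az_id},
            "Bucket": {
                "Type": "Directory",
                "DataRedundancy": "SingleAvailabilityZone",
            },
        },
    }


# Assumed example values; pick the AZ where your compute runs.
request = directory_bucket_request("events-hot", "use1-az5")
print(request["Bucket"])

# With boto3 installed and credentials configured, the dict could be used as:
#   boto3.client("s3", region_name="us-east-1").create_bucket(**request)
```

The point of the sketch is the co-location choice: the AZ ID appears both in the bucket name and in the location configuration, so the compute that reads this bucket should be scheduled in that same zone.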
Revisiting The Current State of Data Infrastructure
Let’s revisit the current state of data infrastructure before discussing S3 Express. Two properties of data warehouse access patterns are critical.
Data freshness matters a lot—the more recent the data, the more frequently it is accessed.
Human-in-the-loop decisions, application integration, and machine-driven intelligence all matter for near-real-time applications, but subsecond latency is not always necessary. Minute-level latency is often sufficient.
S3 Intelligent-Tiering strikes a fine balance between cost and the duration of data retention. However, getting real-time insight from recent data remains a big challenge. Several tools try to solve this problem, as highlighted in the current state of the data architecture.
The combination of stream processing + OLAP storage like Pinot.
The caching layer on the BI tools side, backed by in-memory databases. DuckDB is a recent attempt to build an in-process OLAP engine.
The LakeHouse tools, such as Apache Hudi, support incremental querying to reduce the latency.
There are many tools and architectural patterns that make recent data more accessible for business decisions and application integrations.
The balance among Data Freshness, Resource Cost & Query Performance
There is always a Google paper that discusses the industry's new paradigm shifts. Google published Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google. The paper discusses trade-offs among data freshness, resource cost, and query performance.
In the current state of the data infrastructure, we use a combination of multiple specialized data storage and processing engines to achieve this balance. The Total Cost of Ownership (TCO) and the operational burden are pretty high with the current state of the architecture. To keep the design simple enough to manage, many companies restrict themselves to batch processing and accept high data-access latency. Presto tried with RaptorX. Previously, we even tried to query Kafka directly using the Presto-Kafka connector. It didn’t fly. We tried ingesting all the events into Pinot/Druid systems, but that comes with its own operational cost.
A potential architectural impact with S3 Express
There is both hope and disappointment around S3 Express. Here are a few interesting reads.
I don’t think S3 Express can be a write-through cache system. AWS File Cache tried to play that role. Here is how I think S3 Express can potentially change the current state of the architecture.
S3 Express will open up serverless data architecture, bringing the separation of storage and compute into the mainstream of the data processing industry at all levels. We will see emerging patterns like:
Stream ingestion directly into S3 Express [WarpStream already does it]
Replicate S3 Express to S3 Standard for fault tolerance
Stream Processing on top of S3 Express.
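The first two patterns above can be sketched roughly as follows. This is a toy, in-memory model and not how WarpStream or any real system implements it: `InMemoryStore` and `BatchingIngester` are hypothetical names, the batch size is arbitrary, and replication is done inline where a real system would more likely do it asynchronously.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for an object store bucket (not a real AWS client)."""
    name: str
    objects: dict = field(default_factory=dict)

    def put(self, key: str, body: bytes) -> None:
        self.objects[key] = body


@dataclass
class BatchingIngester:
    """Buffers stream records and writes each full batch as one object."""
    hot: InMemoryStore       # plays the role of S3 Express
    durable: InMemoryStore   # plays the role of S3 Standard
    max_records: int = 3     # arbitrary batch size for the example
    _buffer: list = field(default_factory=list)
    _next_segment: int = 0

    def append(self, record: str) -> None:
        self._buffer.append(record)
        if len(self._buffer) >= self.max_records:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        key = f"segments/{self._next_segment:08d}.log"
        body = "\n".join(self._buffer).encode("utf-8")
        self.hot.put(key, body)      # low-latency write path
        self.durable.put(key, body)  # copy for fault tolerance (inline here)
        self._buffer.clear()
        self._next_segment += 1


hot = InMemoryStore("express")
durable = InMemoryStore("standard")
ingester = BatchingIngester(hot, durable)
for i in range(7):
    ingester.append(f"event-{i}")
ingester.flush()  # write the final partial batch
print(sorted(hot.objects))
```

Note that nothing here is a stateful broker: the only durable state is the set of objects, which is the property that makes this kind of architecture attractive operationally.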
Stream processing is an important aspect I believe S3 Express will greatly disrupt. Flink-like systems bring Data to the Query, whereas OLAP engines like Pinot bring Query to Data. I’m in favor of Pinot-like systems that bring Query to Data.
As you can see in the diagram, S3 Express significantly reduces TCO, even though its storage is 8X more expensive than S3 Standard.
I believe paying 8X the S3 Standard price for S3 Express is more tolerable than operating multiple distributed, stateful systems to achieve the same data processing capabilities.
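A quick back-of-the-envelope calculation shows why the 8X figure may be less scary than it sounds. The sketch below assumes all data keeps a copy in S3 Standard and only a recent "hot" fraction is additionally held in S3 Express at 8X the unit price. The hot fractions are made-up inputs, the function name is hypothetical, and request, transfer, and compute costs are ignored.

```python
def relative_storage_cost(hot_fraction: float, express_multiplier: float = 8.0) -> float:
    """Storage cost relative to keeping everything in S3 Standard only (= 1.0).

    Assumes every byte has a copy in Standard, and `hot_fraction` of the data
    also lives in Express at `express_multiplier` times the Standard price.
    """
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    return 1.0 + hot_fraction * express_multiplier


# Illustrative hot fractions, not measurements.
for hot_fraction in (0.01, 0.05, 0.20):
    cost = relative_storage_cost(hot_fraction)
    print(f"{hot_fraction:.0%} hot data -> {cost:.2f}x the Standard-only storage bill")
```

Under these assumptions, keeping the most recent 5% of data in S3 Express raises the storage bill by 40%, not 8X, which is the number to weigh against the cost of running separate stateful systems for recent data.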
What is Next?
Software is an abstraction layer on top of the storage, computation, and networking devices. Any disruption in the underlying infrastructure will significantly change the value proposition of the software layer. I believe S3 Express is one such change, and I don’t doubt the other major cloud services will follow.
In short, we will see LakeHouse systems implementing a Napa-like architecture that lets users choose the trade-off among data freshness, resource cost, and query performance.
It all boils down to efficient data storage formats that combine high compression with fast data access. The next level of data infrastructure software will focus on these two aspects rather than building stateful systems like Kafka. It will amplify the innovations on LakeHouse formats like Apache Hudi, Iceberg, and Delta. I like Pinot's data indexing techniques, which can play a major role as S3 Express is introduced, especially in the serving and BI layer.