A Critique of the Iceberg REST Catalog: A Classic Case of Why Semantic Specs Fail
How a Semantically Correct API Becomes Operationally Unreliable at Scale
“Latency is not just a performance characteristic; it is a fundamental part of correctness.” — Designing Data-Intensive Applications
In Designing Data-Intensive Applications, Martin Kleppmann makes a subtle but critical point: the CAP theorem omits latency, yet in real systems, latency often determines whether a system is usable at all. A system that is correct but slow is, in practice, incorrect.
This observation is directly applicable to the Apache Iceberg REST Catalog specification. While the specification achieves semantic clarity, it fails to define the operational realities that enable distributed systems to remain predictable at scale. The result is a standard that is formally correct, yet operationally fragile.
Semantic Interoperability Without Predictability
Over the past two years, the Iceberg REST Catalog specification has emerged as the de facto standard for metadata access in the Iceberg ecosystem. We have even watched a catalog war break out around the REST spec. It promises a universal interface that allows engines such as Trino, Spark, Flink, and StarRocks to interact with Iceberg tables via a common REST abstraction, independent of the underlying catalog implementation.
At the semantic level, this promise largely holds. The specification rigorously defines metadata structures: tables, schemas, snapshots, and namespace operations. A LoadTable or CreateNamespace request looks identical across implementations. This semantic interoperability has been critical to Iceberg’s rapid ecosystem adoption.
However, semantic interoperability alone is insufficient. The specification defines what metadata operations mean, but it avoids specifying how they must behave in real-world conditions, such as concurrency, latency sensitivity, and cross-catalog synchronization.
This gap—between semantic interoperability and operational interoperability—is where systems begin to fail in production.
The Core Problem: No Operational SLA, No Predictability
The Iceberg REST Catalog specification is intentionally silent on performance guarantees. There are no latency expectations, no throughput baselines, and no service-level objectives. While this flexibility lowers the barrier to implementation, it creates an ecosystem where:
Two catalogs can both be “compliant” yet differ by orders of magnitude in response time.
Clients cannot reason about metadata latency during query planning.
Synchronization behavior across catalogs becomes unpredictable.
In distributed data systems, predictability matters more than raw performance. Without a strict operational SLA—or at least defined behavioral constraints—clients are forced into defensive, retry-heavy designs that amplify load and increase tail latency.
The “List Tables” Problem: Cross-Catalog Sync Failure
The ListTables endpoint (GET /v1/namespaces/{namespace}/tables) is semantically straightforward. It allows clients to enumerate tables within a namespace and supports pagination through pageSize and pageToken.
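For reference, a client-side listing loop against this endpoint might look like the following sketch. The pageSize/pageToken protocol and the response fields come from the spec; the base URL, auth handling, and use of the `requests` library are illustrative assumptions.

```python
import requests

def list_tables(base_url: str, namespace: str, page_size: int = 100):
    """Enumerate all tables in a namespace via paginated ListTables calls.

    A single-level namespace is assumed for brevity; base_url and auth
    handling are illustrative, not part of the spec.
    """
    params = {"pageSize": page_size}
    while True:
        resp = requests.get(
            f"{base_url}/v1/namespaces/{namespace}/tables",
            params=params,
            timeout=30,
        )
        resp.raise_for_status()
        body = resp.json()
        # Each identifier looks like {"namespace": [...], "name": "orders"}.
        yield from body.get("identifiers", [])
        token = body.get("next-page-token")
        if not token:
            break
        params["pageToken"] = token
```

Note what this loop cannot express: there is no way to ask for a bounded-latency listing, and nothing tells the client how long any single page may take.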
The primary issue is not pagination itself. The real failure emerges when the same Iceberg tables are registered in multiple catalogs, a pattern that is increasingly common in hybrid and multi-platform deployments.
A Realistic Scenario
An Iceberg table is registered in Catalog A and Catalog B.
Both catalogs point to the same underlying metadata and object storage.
One catalog is used by ingestion and streaming workloads.
The other is used by analytics engines and BI tools.
The Sync Pathology
When a client connects to Catalog B and issues a metadata discovery operation—such as listing tables or syncing namespace state—the catalog must:
Enumerate all tables
Resolve metadata pointers
Validate access permissions
Reconcile the state with the underlying storage
Because the REST specification defines no operational expectations:
There is no SLA for how long this sync should take
There is no distinction between a “lightweight” listing and a fully validated listing
There is no mechanism to express intent (e.g., names only, no ACL validation)
As table counts grow into the tens of thousands, synchronization latency grows non-linearly. In practice, sync operations can take minutes—or fail—causing engines to stall, time out, or repeatedly retry.
The result is not merely slow metadata access. It is system-wide unpredictability. Query engines cannot determine whether a delay is transient, systemic, or catastrophic.
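To make the amplification concrete, here is a minimal sketch of what a full namespace sync typically reduces to: the `list_tables` loop above followed by a per-table LoadTable call, an N+1 pattern with no spec-level bound on total round trips. Only the endpoint paths come from the spec; the client structure is an assumption.

```python
def sync_namespace(base_url: str, namespace: str) -> dict:
    """Naive namespace sync: one ListTables pass plus one LoadTable per table.

    With N tables this is at least N+1 round trips; permission checks,
    storage reconciliation, and retries multiply the cost. The spec gives
    the client no way to request a cheaper, names-only view instead.
    """
    state = {}
    for ident in list_tables(base_url, namespace):  # sketch from earlier
        resp = requests.get(
            f"{base_url}/v1/namespaces/{namespace}/tables/{ident['name']}",
            timeout=30,
        )
        resp.raise_for_status()
        # LoadTable returns the full table metadata, not just the name.
        state[ident["name"]] = resp.json()["metadata"]
    return state
```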
Latency Is Treated as an Implementation Detail—But It Is a Contract
The REST Catalog specification implicitly treats latency as an implementation concern. From a standards perspective, this is understandable. But in data-intensive systems, latency is part of the correctness contract.
The specification does not define:
Upper bounds on metadata retrieval latency
Maximum metadata payload sizes
Limits on metadata fan-out operations
The number of round trips required to plan a query
As a result, a compliant catalog may require megabytes of JSON metadata and dozens of HTTP calls just to validate a single query plan. Engines appear slow and unstable, even though the root cause lies in an underspecified protocol.
This is precisely the class of problem Kleppmann warns about: correctness without latency guarantees is operationally meaningless.
Commit Semantics Under Contention: Undefined and Unfair
Iceberg relies on optimistic concurrency control. When multiple writers attempt to commit simultaneously, conflicts are expected and resolved through retries.
The REST specification defines the 409 Conflict response, but stops there. It does not define:
Backoff expectations
Retry fairness
Starvation prevention
In a multi-engine environment, this creates asymmetric outcomes. A high-frequency streaming writer with aggressive retries can permanently starve batch compaction jobs that follow conservative retry policies. Over time, table health degrades due to file explosion and unbounded metadata growth.
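The spec could close this gap without dictating implementations: even a recommended retry policy, such as capped exponential backoff with full jitter, would de-correlate competing writers. Below is a minimal client-side sketch, assuming a commit function that raises on 409 Conflict; none of the constants come from the spec.

```python
import random
import time

class CommitConflict(Exception):
    """Assumed to be raised when the catalog answers a commit with 409 Conflict."""

def commit_with_backoff(attempt_commit, max_retries: int = 8,
                        base_s: float = 0.5, cap_s: float = 30.0):
    """Capped exponential backoff with full jitter around an optimistic commit.

    Sleeping uniformly in [0, backoff] de-correlates retries across writers,
    which is what keeps a high-frequency streaming writer from starving a
    conservative batch compaction job.
    """
    for attempt in range(max_retries):
        try:
            return attempt_commit()
        except CommitConflict:
            backoff = min(cap_s, base_s * (2 ** attempt))
            time.sleep(random.uniform(0.0, backoff))
    raise RuntimeError("commit abandoned after retries; fairness is unguaranteed")
```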
Once again, the issue is not semantic correctness. It is the absence of operational guarantees.
Caching Without a Freshness Model
While HTTP caching is permitted, it is not part of the correctness model. Support for conditional requests, ETags, or freshness validation is optional.
This forces clients into a pessimistic stance: always re-fetch, always revalidate, always assume staleness. The REST protocol degenerates into a chatty, high-latency control plane that negates its own architectural benefits.
Without a standardized freshness contract, caching becomes a gamble rather than a reliability tool.
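A freshness contract could be as lightweight as mandatory ETag support on metadata reads, letting clients revalidate instead of re-fetching. Here is a sketch of the client side using standard HTTP conditional requests; nothing in it is guaranteed by the spec today.

```python
import requests

_cache: dict = {}  # url -> (etag, parsed_body)

def load_with_revalidation(url: str):
    """Conditional GET: send If-None-Match when we hold an ETag, and reuse
    the cached body on 304 Not Modified.

    This only pays off if the server reliably emits ETags -- which the
    spec leaves optional, so clients cannot build on it.
    """
    headers = {}
    if url in _cache:
        headers["If-None-Match"] = _cache[url][0]
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return _cache[url][1]  # server confirmed our copy is still fresh
    resp.raise_for_status()
    body = resp.json()
    etag = resp.headers.get("ETag")
    if etag:
        _cache[url] = (etag, body)
    return body
```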
Behavioral Conformance Is Missing
The Iceberg ecosystem has strong conformance testing for table formats. It lacks an equivalent for catalog behavior.
Today, “REST Catalog compliant” means:
The endpoints exist
The JSON schema is correct
The happy path works
It does not mean:
Predictable latency under load
Stable pagination during concurrent updates
Graceful overload signaling
Bounded retry amplification
Without behavioral conformance tests, compliance guarantees syntax, not operability.
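As a sketch of what behavioral conformance could look like, consider a test that asserts a latency budget under repeated load rather than checking schema shape. The client interface and the two-second budget below are invented placeholders, not numbers from any spec.

```python
import time

def test_list_tables_latency_under_load(client, namespace: str,
                                        n_requests: int = 200,
                                        p99_budget_s: float = 2.0):
    """Behavioral conformance sketch: enforce a p99 latency budget for ListTables.

    Unlike schema validation, this fails a catalog that is semantically
    compliant but operationally unusable. client.list_tables() is an
    assumed interface.
    """
    samples = []
    for _ in range(n_requests):
        start = time.monotonic()
        client.list_tables(namespace)
        samples.append(time.monotonic() - start)
    samples.sort()
    p99 = samples[int(0.99 * (len(samples) - 1))]
    assert p99 <= p99_budget_s, f"p99 latency {p99:.2f}s exceeds {p99_budget_s}s budget"
```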
Underspecification Is Still a Design Decision
The absence of operational constraints is not accidental. It reflects a deliberate choice to prioritize adoption and flexibility.
However, in distributed systems, underspecification pushes complexity downstream. It burdens clients, operators, and platform teams with the need to implement compensating logic. As Iceberg becomes core infrastructure rather than experimental tooling, this trade-off increasingly limits its reliability.
Semantic agreement without behavioral agreement leads to fragile systems.
Toward Operational Interoperability
Operational interoperability does not require rigid SLAs or centralized control. It requires acknowledging that latency, retries, and fairness are part of the interface.
Concrete improvements could include:
Defined operational profiles with minimum latency and concurrency expectations (see the sketch after this list)
Lightweight metadata views to avoid synchronization amplification
Standardized retry and backoff semantics for conflict scenarios
Explicit freshness and caching contracts
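To illustrate the first item, a catalog could advertise a machine-readable operational profile that clients use to set timeouts and plan retries. Every field below is hypothetical; the point is the shape, not the numbers.

```python
# Hypothetical operational profile a catalog could publish; none of these
# fields exist in the REST Catalog spec today.
OPERATIONAL_PROFILE = {
    "profile": "interactive",              # vs. e.g. "batch"
    "list_tables_p99_ms": 500,             # latency ceiling clients can plan around
    "load_table_p99_ms": 250,
    "max_metadata_payload_bytes": 1_048_576,
    "supports_names_only_listing": True,   # lightweight view, no ACL validation
    "commit_retry_policy": {
        "backoff": "exponential-full-jitter",
        "base_ms": 500,
        "cap_ms": 30_000,
    },
    "freshness": {"etags_required": True, "max_staleness_ms": 5_000},
}
```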
Semantic interoperability enabled Iceberg’s success. Operational interoperability will determine whether it remains dependable at scale.
Until then, the Iceberg REST Catalog remains a textbook example of why semantic specifications alone are not enough.
All rights reserved, Dewpeche Private Limited. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.