A Critique of the Iceberg REST Catalog: A Classic Case of Why Semantic Specs Fail
How a Semantically Correct API Becomes Operationally Unreliable at Scale
“Latency is not just a performance characteristic; it is a fundamental part of correctness.” — Designing Data-Intensive Applications
In Designing Data-Intensive Applications, Martin Kleppmann makes a subtle but critical point: the CAP theorem omits latency, yet in real systems, latency often determines whether a system is usable at all. A system that is correct but slow is, in practice, incorrect.
This observation is directly applicable to the Apache Iceberg REST Catalog specification. While the specification achieves semantic clarity, it fails to define the operational realities that enable distributed systems to remain predictable at scale. The result is a standard that is formally correct, yet operationally fragile.
Semantic Interoperability Without Predictability
Over the past two years, the Iceberg REST Catalog specification has emerged as the de facto standard for metadata access in the Iceberg ecosystem. We have even watched a catalog war break out around the REST spec. It promises a universal interface that allows engines such as Trino, Spark, Flink, and StarRocks to interact with Iceberg tables via a common REST abstraction, independent of the underlying catalog implementation.
At the semantic level, this promise largely holds. The specification rigorously defines metadata structures: tables, schemas, snapshots, and namespace operations. A LoadTable or CreateNamespace request looks identical across implementations. This semantic interoperability has been critical to Iceberg’s rapid ecosystem adoption.
However, semantic interoperability alone is insufficient. The specification defines what metadata operations mean, but it avoids specifying how they must behave in real-world conditions, such as concurrency, latency sensitivity, and cross-catalog synchronization.
This gap—between semantic interoperability and operational interoperability—is where systems begin to fail in production.
The Core Problem: No Operational SLA, No Predictability
The Iceberg REST Catalog specification is intentionally silent on performance guarantees. There are no latency expectations, no throughput baselines, and no service-level objectives. While this flexibility lowers the barrier to implementation, it creates an ecosystem where:
Two catalogs can both be “compliant” yet differ by orders of magnitude in response time.
Clients cannot reason about metadata latency during query planning.
Synchronization behavior across catalogs becomes unpredictable.
In distributed data systems, predictability matters more than raw performance. Without a strict operational SLA—or at least defined behavioral constraints—clients are forced into defensive, retry-heavy designs that amplify load and increase tail latency.
The “List Tables” Problem: Cross-Catalog Sync Failure
The ListTables endpoint (GET /v1/namespaces/{namespace}/tables) is semantically straightforward. It allows clients to enumerate tables within a namespace and supports pagination through pageSize and pageToken.
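For reference, a client-side listing loop against this endpoint might look like the following sketch. The pageSize/pageToken protocol and the response fields come from the spec; the base URL, auth handling, and use of the `requests` library are illustrative assumptions.

```python
import requests

def list_tables(base_url: str, namespace: str, page_size: int = 100):
    """Enumerate all tables in a namespace via paginated ListTables calls.

    A single-level namespace is assumed for brevity; base_url and auth
    handling are illustrative, not part of the spec.
    """
    params = {"pageSize": page_size}
    while True:
        resp = requests.get(
            f"{base_url}/v1/namespaces/{namespace}/tables",
            params=params,
            timeout=30,
        )
        resp.raise_for_status()
        body = resp.json()
        # Each identifier looks like {"namespace": [...], "name": "orders"}.
        yield from body.get("identifiers", [])
        token = body.get("next-page-token")
        if not token:
            break
        params["pageToken"] = token
```

Note what this loop cannot express: there is no way to ask for a bounded-latency listing, and nothing tells the client how long any single page may take.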
The primary issue is not pagination itself. The real failure emerges when the same Iceberg tables are registered in multiple catalogs, a pattern that is increasingly common in hybrid and multi-platform deployments.
A Realistic Scenario
An Iceberg table is registered in Catalog A and Catalog B.
Both catalogs point to the same underlying metadata and object storage.
One catalog is used by ingestion and streaming workloads.
The other is used by analytics engines and BI tools.
The Sync Pathology
When a client connects to Catalog B and issues a metadata discovery operation—such as listing tables or syncing namespace state—the catalog must:
Enumerate all tables
Resolve metadata pointers
Validate access permissions
Reconcile the state with the underlying storage
Because the REST specification defines no operational expectations:
There is no SLA for how long this sync should take
There is no distinction between a “lightweight” listing and a fully validated listing
There is no mechanism to express intent (e.g., names only, no ACL validation)
As table counts grow into the tens of thousands, synchronization latency grows non-linearly. In practice, sync operations can take minutes—or fail—causing engines to stall, time out, or repeatedly retry.
The result is not merely slow metadata access. It is system-wide unpredictability. Query engines cannot determine whether a delay is transient, systemic, or catastrophic.
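To make the amplification concrete, here is a minimal sketch of what a full namespace sync typically reduces to: the `list_tables` loop above followed by a per-table LoadTable call, an N+1 pattern with no spec-level bound on total round trips. Only the endpoint paths come from the spec; the client structure is an assumption.

```python
def sync_namespace(base_url: str, namespace: str) -> dict:
    """Naive namespace sync: one ListTables pass plus one LoadTable per table.

    With N tables this is at least N+1 round trips; permission checks,
    storage reconciliation, and retries multiply the cost. The spec gives
    the client no way to request a cheaper, names-only view instead.
    """
    state = {}
    for ident in list_tables(base_url, namespace):  # sketch from earlier
        resp = requests.get(
            f"{base_url}/v1/namespaces/{namespace}/tables/{ident['name']}",
            timeout=30,
        )
        resp.raise_for_status()
        # LoadTable returns the full table metadata, not just the name.
        state[ident["name"]] = resp.json()["metadata"]
    return state
```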
Latency Is Treated as an Implementation Detail—But It Is a Contract
The REST Catalog specification implicitly treats latency as an implementation concern. From a standards perspective, this is understandable. But in data-intensive systems, latency is part of the correctness contract.
The specification does not define:
Upper bounds on metadata retrieval latency
Maximum metadata payload sizes
Limits on metadata fan-out operations
The number of round trips required to plan a query
As a result, a compliant catalog may require megabytes of JSON metadata and dozens of HTTP calls just to validate a single query plan. Engines appear slow and unstable, even though the root cause lies in an underspecified protocol.
This is precisely the class of problem Kleppmann warns about: correctness without latency guarantees is operationally meaningless.
Commit Semantics Under Contention: Undefined and Unfair
Iceberg relies on optimistic concurrency control. When multiple writers attempt to commit simultaneously, conflicts are expected and resolved through retries.
The REST specification defines the 409 Conflict response, but stops there. It does not define:
Backoff expectations
Retry fairness
Starvation prevention
In a multi-engine environment, this creates asymmetric outcomes. A high-frequency streaming writer with aggressive retries can permanently starve batch compaction jobs that follow conservative retry policies. Over time, table health degrades due to file explosion and unbounded metadata growth.
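The spec could close this gap without dictating implementations: even a recommended retry policy, such as capped exponential backoff with full jitter, would de-correlate competing writers. Below is a minimal client-side sketch, assuming a commit function that raises on 409 Conflict; none of the constants come from the spec.

```python
import random
import time

class CommitConflict(Exception):
    """Assumed to be raised when the catalog answers a commit with 409 Conflict."""

def commit_with_backoff(attempt_commit, max_retries: int = 8,
                        base_s: float = 0.5, cap_s: float = 30.0):
    """Capped exponential backoff with full jitter around an optimistic commit.

    Sleeping uniformly in [0, backoff] de-correlates retries across writers,
    which is what keeps a high-frequency streaming writer from starving a
    conservative batch compaction job.
    """
    for attempt in range(max_retries):
        try:
            return attempt_commit()
        except CommitConflict:
            backoff = min(cap_s, base_s * (2 ** attempt))
            time.sleep(random.uniform(0.0, backoff))
    raise RuntimeError("commit abandoned after retries; fairness is unguaranteed")
```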
Once again, the issue is not semantic correctness. It is the absence of operational guarantees.
Caching Without a Freshness Model
While HTTP caching is permitted, it is not part of the correctness model. Support for conditional requests, ETags, or freshness validation is optional.
This forces clients into a pessimistic stance: always re-fetch, always revalidate, always assume staleness. The REST protocol degenerates into a chatty, high-latency control plane that negates its own architectural benefits.
Without a standardized freshness contract, caching becomes a gamble rather than a reliability tool.
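A freshness contract could be as lightweight as mandatory ETag support on metadata reads, letting clients revalidate instead of re-fetching. Here is a sketch of the client side using standard HTTP conditional requests; nothing in it is guaranteed by the spec today.

```python
import requests

_cache: dict = {}  # url -> (etag, parsed_body)

def load_with_revalidation(url: str):
    """Conditional GET: send If-None-Match when we hold an ETag, and reuse
    the cached body on 304 Not Modified.

    This only pays off if the server reliably emits ETags -- which the
    spec leaves optional, so clients cannot build on it.
    """
    headers = {}
    if url in _cache:
        headers["If-None-Match"] = _cache[url][0]
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return _cache[url][1]  # server confirmed our copy is still fresh
    resp.raise_for_status()
    body = resp.json()
    etag = resp.headers.get("ETag")
    if etag:
        _cache[url] = (etag, body)
    return body
```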
Behavioral Conformance Is Missing
The Iceberg ecosystem has strong conformance testing for table formats. It lacks an equivalent for catalog behavior.
Today, “REST Catalog compliant” means:
The endpoints exist
The JSON schema is correct
The happy path works
It does not mean:
Predictable latency under load
Stable pagination during concurrent updates
Graceful overload signaling
Bounded retry amplification
Without behavioral conformance tests, compliance guarantees syntax, not operability.
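As a sketch of what behavioral conformance could look like, consider a test that asserts a latency budget under repeated load rather than checking schema shape. The client interface and the two-second budget below are invented placeholders, not numbers from any spec.

```python
import time

def test_list_tables_latency_under_load(client, namespace: str,
                                        n_requests: int = 200,
                                        p99_budget_s: float = 2.0):
    """Behavioral conformance sketch: enforce a p99 latency budget for ListTables.

    Unlike schema validation, this fails a catalog that is semantically
    compliant but operationally unusable. client.list_tables() is an
    assumed interface.
    """
    samples = []
    for _ in range(n_requests):
        start = time.monotonic()
        client.list_tables(namespace)
        samples.append(time.monotonic() - start)
    samples.sort()
    p99 = samples[int(0.99 * (len(samples) - 1))]
    assert p99 <= p99_budget_s, f"p99 latency {p99:.2f}s exceeds {p99_budget_s}s budget"
```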
Underspecification Is Still a Design Decision
The absence of operational constraints is not accidental. It reflects a deliberate choice to prioritize adoption and flexibility.
However, in distributed systems, underspecification pushes complexity downstream. It burdens clients, operators, and platform teams with the need to implement compensating logic. As Iceberg becomes core infrastructure rather than experimental tooling, this trade-off increasingly limits its reliability.
Semantic agreement without behavioral agreement leads to fragile systems.
Toward Operational Interoperability
Operational interoperability does not require rigid SLAs or centralized control. It requires acknowledging that latency, retries, and fairness are part of the interface.
Concrete improvements could include:
Defined operational profiles with minimum latency and concurrency expectations (see the sketch after this list)
Lightweight metadata views to avoid synchronization amplification
Standardized retry and backoff semantics for conflict scenarios
Explicit freshness and caching contracts
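To illustrate the first item, a catalog could advertise a machine-readable operational profile that clients use to set timeouts and plan retries. Every field below is hypothetical; the point is the shape, not the numbers.

```python
# Hypothetical operational profile a catalog could publish; none of these
# fields exist in the REST Catalog spec today.
OPERATIONAL_PROFILE = {
    "profile": "interactive",              # vs. e.g. "batch"
    "list_tables_p99_ms": 500,             # latency ceiling clients can plan around
    "load_table_p99_ms": 250,
    "max_metadata_payload_bytes": 1_048_576,
    "supports_names_only_listing": True,   # lightweight view, no ACL validation
    "commit_retry_policy": {
        "backoff": "exponential-full-jitter",
        "base_ms": 500,
        "cap_ms": 30_000,
    },
    "freshness": {"etags_required": True, "max_staleness_ms": 5_000},
}
```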
Semantic interoperability enabled Iceberg’s success. Operational interoperability will determine whether it remains dependable at scale.
Until then, the Iceberg REST Catalog remains a textbook example of why semantic specifications alone are not enough.
All rights reserved, Dewpeche Private Limited. Links are provided for informational purposes and do not imply endorsement. All views expressed in this newsletter are my own and do not represent the opinions of current, former, or future employers.