REST Spec: Add resolve endpoint for catalog objects #15830
view:
  $ref: '#/components/schemas/LoadViewResult'

LoadRelationResult:
There was a problem hiding this comment.
A future materialized-view result would look like:
{
  "object-type": "materialized-view",
  "view": { },
  "storage-table": { }
}
There was a problem hiding this comment.
No, I don't think we would have a new type. It would just have a storage-table associated with it, which is what makes it a materialized view. I'm not sure we need a separate type.
fd3e2d9 to
9a1edfd
"GET /v1/{prefix}/namespaces/{namespace}/relations/{relation}",
"POST /v1/{prefix}/relations/batch-load"
There was a problem hiding this comment.
I have a minor objection to the use of relations here, since we want this to be a general endpoint for resolution. Objects like tables and views are considered relations, but something like a function would not be (unless you're going for a strictly relational-algebra definition, but that's not consistent with SQL usage).
We may also include other objects in the future, so a more general term like resolve, identifiers, resources, or entities might be better.
There was a problem hiding this comment.
This is related to the other discussion on identifier conflict domains, where we want to allow the same identifier for a relational object (like a table) and a function. With that assumption, the endpoint will need an object category in the path to distinguish them. Otherwise, we would require identifier uniqueness across all object types, which is not the consensus from the identifier conflict discussion.
description:
  "
  Load metadata for multiple relations in one request. Identifiers may span different namespaces.
  Each item includes a `TableIdentifier` and optional per-item parameters (`etag` and `snapshots`).
There was a problem hiding this comment.
This seems like a good time to introduce an Identifier type with the same structure as a table identifier. It seems odd to keep using an object-specific identifier type to reference multiple object types.
Since it has the same structure, you could possibly just have them extend Identifier (depending on how that affects the OpenAPI structure and generated code).
There was a problem hiding this comment.
I am actually about to share a proposal with the community. https://docs.google.com/document/d/1NTQhgNbP2dkIMuXUMA5JdwliVQKCp1TU_ux5J_AaPiw/edit?tab=t.0#heading=h.xzrfzeom8dqa
The server resolves each identifier as a table or view.

The per-item `status` in the response indicates the outcome:
There was a problem hiding this comment.
This feels awkward because we're using HTTP status codes for non-request/internal results. I don't think that makes a lot of sense and prefer we indicate result behavior in a different way.
There was a problem hiding this comment.
What about defining an enum schema in the REST spec? It would be similar to the HTTP status names, though.
BatchLoadItemResultStatus:
  type: string
  description: |
    The outcome of loading a single item in a batch load response.
  enum:
    - success
    - not-modified
    - not-found
Open to other suggestions.
There was a problem hiding this comment.
BTW, @gaborkaszab and @jbonofre previously suggested using HTTP status codes in a design doc comment.
There was a problem hiding this comment.
I did a little more research. Two patterns are common.
- Use HTTP status codes:
  - Microsoft Graph API: JSON batching where each item has its own integer HTTP status, headers, and body.
  - Facebook/Meta Graph API: each item has a code field (HTTP integer) plus optional headers and body.
  - Elasticsearch Bulk API: each item has an integer status field (e.g. 200, 201, 404, 409) plus a string result field (e.g. "created", "updated", "not_found").
- Split results into separate lists. AWS services commonly use this pattern:
  - AWS DynamoDB BatchGetItem: found items in Responses, absent items silently omitted, incomplete items in UnprocessedKeys. No status field.
  - AWS SQS SendMessageBatch / DeleteMessageBatch: results split into Successful and Failed lists. Failed entries have Code (a string error code like "InvalidParameterValue"), not HTTP status integers.
  - AWS S3 DeleteObjects: in verbose mode, successful deletes are listed in Deleted and failures in Errors with a string Code (e.g. "AccessDenied").
There was a problem hiding this comment.
I would personally prefer a more structured response than just embedding HTTP codes. I ran into this issue with GraphQL, which returns 200 (403) or something which is a bit confusing.
There was a problem hiding this comment.
Agreed. Switched the per-item shape to a discriminated union using the same object-type-style pattern already in LoadRelationResult:
BatchLoadRelationResultItem:
  oneOf:
    - $ref: '#/components/schemas/BatchLoadRelationLoaded'
    - $ref: '#/components/schemas/BatchLoadRelationNotModified'
    - $ref: '#/components/schemas/BatchLoadRelationNotFound'
  discriminator:
    propertyName: result-type
    mapping:
      loaded: '#/components/schemas/BatchLoadRelationLoaded'
      not-modified: '#/components/schemas/BatchLoadRelationNotModified'
      not-found: '#/components/schemas/BatchLoadRelationNotFound'
Each variant declares exactly its own required fields:
- `loaded` → `result` (required) + `etag` (optional, tables only)
- `not-modified` → `etag` (required)
- `not-found` → `identifier` (required)
Why domain-native names over HTTP status integers:
- Wire stays honest. HTTP codes describe transport; per-item outcomes describe domain state. Keeping them separate means caches, retries, tracing, and monitoring don't have to peek inside the body to know what actually happened.
- Presence rules become schema-enforced. `oneOf` lets each variant require its own fields instead of relying on prose in the description.
- Stronger generated clients. `result-type` generates into sealed interfaces / discriminated unions / sum types, so the compiler catches missed cases. An integer `status` generates into a plain `int` that callers switch on by hand.
- Extensible without fake codes. Adding outcomes like `skipped` or `stale-etag-mismatch` later is a new variant; there is no need to overload `429` or invent a meaning for an HTTP code that doesn't fit.
- Internally consistent with this PR. `LoadRelationResult` already uses `object-type` as a string discriminator; reusing the same style for per-item outcomes keeps the reader's mental model uniform.
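To make the client-side effect of the discriminator concrete, here is a minimal Python sketch of how a hand-written client might dispatch on `result-type`. The class names `Loaded`, `NotModified`, `NotFound` and the helper `parse_item` are hypothetical and only mirror the three variants listed above; a generated client would likely look different.

```python
from dataclasses import dataclass
from typing import Any, Optional, Union


@dataclass
class Loaded:
    result: dict           # LoadRelationResult payload (required)
    etag: Optional[str]    # optional, tables only


@dataclass
class NotModified:
    etag: str              # required


@dataclass
class NotFound:
    identifier: Any        # required; identifier shape is left opaque here


BatchItem = Union[Loaded, NotModified, NotFound]


def parse_item(raw: dict) -> BatchItem:
    """Dispatch on the `result-type` discriminator.

    Hypothetical helper: a missing required field raises KeyError,
    and an unknown discriminator value raises ValueError.
    """
    kind = raw["result-type"]
    if kind == "loaded":
        return Loaded(result=raw["result"], etag=raw.get("etag"))
    if kind == "not-modified":
        return NotModified(etag=raw["etag"])
    if kind == "not-found":
        return NotFound(identifier=raw["identifier"])
    raise ValueError(f"unknown result-type: {kind!r}")
```

An unknown `result-type` fails loudly in this sketch, which is the behavior the "compiler catches missed cases" argument relies on in typed languages.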
type: string
description: |
  The type of a catalog object.
enum:
There was a problem hiding this comment.
We may add values such as materialized-view or function in the future.
There was a problem hiding this comment.
Materialized view is just a type of view, so I'm not sure we need to distinguish.
There was a problem hiding this comment.
The response object is different for an MV: it contains both view metadata and storage table metadata. If we want to load an MV in one round trip, a specific MV type is needed so that the client knows how to parse the response.
In the Java library, we may need to define a MaterializedView type, which could be mostly just a container class with a View field and a Table field.
cebd6ab to
b093e29
items:
  $ref: '#/components/schemas/BatchLoadRelationRequestItem'

BatchLoadRelationRequestItem:
There was a problem hiding this comment.
At either this layer or the URL layer, would it make sense to honor the referenced-by parameter available to loadTable and loadView?
There was a problem hiding this comment.
I think it would have to be at the per-request level, since you need to distinguish different loads (and they may be treated differently for authorization).
"GET /v1/{prefix}/namespaces/{namespace}/views/{view}"
"GET /v1/{prefix}/namespaces/{namespace}/views/{view}",
"GET /v1/{prefix}/namespaces/{namespace}/relations/{relation}",
"POST /v1/{prefix}/relations/batch-load"
There was a problem hiding this comment.
This is a little pedantic, but I'm not entirely sold on the resource path here. I looked across many different implementations and found lots of approaches. However, I don't think we need to put the /batch-load at the end. We can just leave it as POST /v1/{prefix}/relations. I could see that you might say "what if we need to create", but we already have a transactions endpoint that is for that operation.
There was a problem hiding this comment.
I added batch-load to the path because the batch load uses POST so that the list of identifiers can be encoded in the request payload.
If we remove it, it is a bit weird to have POST /v1/{prefix}/relations serve as a batch get.
Servers MAY cap the amount of computation or response payload size per request and return
`unprocessed-identifiers` for items they did not process. Clients SHOULD retry unprocessed
identifiers in a subsequent request.
There was a problem hiding this comment.
Do we want to describe how to handle requests with too many items? I understand the server can process a subset and return the unprocessed list, but what if someone lists an entire catalog and then asks for 100K resources? What error code can or should the server return if it considers the request unreasonable? (400 might fit.)
There was a problem hiding this comment.
Good catch, thanks. Updated the spec.
unprocessed-identifiers was only meant as a cooperative soft cap — the server accepted the request, then stopped partway due to its own cost budget. It doesn't cover the "please reject this upfront" case you described.
Added a typed hard cap to CatalogConfig:
relations-batch-load-max-items:
  type: integer
  minimum: 1
Advertising the limit lets well-behaved clients chunk proactively, and servers that receive an oversized request still have a spec-supported way to reject with 400.
The batchLoadRelations description and unprocessed-identifiers field now spell out the two mechanisms as distinct:
- `relations-batch-load-max-items`: governs whether the request is accepted at all (`400` if exceeded).
- `unprocessed-identifiers`: reports partial progress within an accepted request.
Went with 400 rather than 413 since the limit is on identifier count (a request property), not body size — happy to switch if you disagree.
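For illustration, proactive chunking against the advertised limit could look roughly like the Python sketch below. The helper name `chunk_items` is hypothetical, and treating a missing `relations-batch-load-max-items` value as "send everything in one request" is an assumption of this sketch, not something the spec text above states.

```python
from typing import Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def chunk_items(items: Sequence[T], max_items: Optional[int]) -> Iterator[List[T]]:
    """Split `items` into request-sized chunks.

    `max_items` stands in for the advertised relations-batch-load-max-items
    value; None means the server advertised no limit (assumption: send all
    items in a single request in that case).
    """
    if not items:
        return
    if max_items is None:
        yield list(items)
        return
    if max_items < 1:
        # The schema declares minimum: 1, so anything lower is invalid.
        raise ValueError("max_items must be >= 1")
    for start in range(0, len(items), max_items):
        yield list(items[start:start + max_items])
```

Each yielded chunk stays within the hard cap, so a well-behaved client should never trigger the 400 path; `unprocessed-identifiers` can still appear in any individual response and needs separate handling.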
Add the CatalogObjectType string enum (table, view) as defined in apache#15830 (universal relation load). It serves as the companion discriminator for CatalogObjectIdentifier.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduces two related REST schemas:
- CatalogObjectIdentifier: a bare array of hierarchical levels that references a catalog object (table, view, or namespace). The object kind is determined by context (e.g. the endpoint or a companion CatalogObjectType discriminator), not by the identifier structure alone.
- CatalogObjectType: an enum of "table", "view", and "namespace" intended to be used as a discriminator alongside CatalogObjectIdentifier.
Also regenerates rest-catalog-open-api.py to match.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a single POST /v1/{prefix}/resolve endpoint that resolves one or more
catalog objects to their current state in one request. Stacks on top of
the CatalogObjectIdentifier + CatalogObjectType schemas introduced in the
previous commit.
Request body carries one or more typed arrays of items (currently only
`relations`; designed to be extended with sibling arrays such as
`functions` in the future). Each relation item carries a
CatalogObjectIdentifier plus optional per-item hints (`etag`, `snapshots`)
that apply when the resolved relation is a table.
Response body carries parallel typed arrays of per-item results. For
`relations`, each result is a ResolveRelationResult whose `status` field
discriminates between three outcomes:
- `loaded`: the relation exists; the typed payload is returned in `result`
as a LoadRelationResult (object-type + table/view branch). For tables,
the current `etag` MAY also be included.
- `not-modified`: the relation is a table whose ETag matches the caller's
provided `etag`; no payload is returned, only the current `etag`.
- `not-found`: no table or view exists for the identifier; the item MAY
include a structured error.
Partial progress: servers MAY return a subset of items under `unprocessed`
(a typed object mirroring the request shape, e.g. `unprocessed.relations`)
when a request exceeds internal cost/payload budgets. Each unprocessed
entry carries the identifier plus optional `code` and `reason`.
CatalogConfig gains a `resolve-max-items` field advertising the maximum
total items the server will accept in one request. Exceeding the limit
MUST cause the server to reject the whole request with 400; the
`unprocessed` mechanism is distinct and reports partial progress within an
accepted request.
Authorization failures for any requested item SHOULD fail the entire
request with 403; the error body carries `forbidden-identifiers` listing
the offending identifiers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
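As a rough illustration of the request and response shapes described in this commit message, the Python sketch below builds a resolve request body and groups the per-item results by `status`. The JSON key `identifier` on each relation item and the helper names `build_resolve_body` and `group_by_status` are assumptions made for this example; the commit message only says each item "carries a CatalogObjectIdentifier" and does not name the field.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def build_resolve_body(relations: Sequence[dict],
                       resolve_max_items: Optional[int] = None) -> dict:
    """Build a body for POST /v1/{prefix}/resolve.

    Each input dict has an "identifier" (list of levels; key name assumed)
    and optional "etag" / "snapshots" hints, which only matter for tables.
    """
    if resolve_max_items is not None and len(relations) > resolve_max_items:
        # The server would reject the whole request with 400; fail early.
        raise ValueError("request exceeds advertised resolve-max-items")
    items = []
    for rel in relations:
        item = {"identifier": list(rel["identifier"])}
        for hint in ("etag", "snapshots"):
            if rel.get(hint) is not None:
                item[hint] = rel[hint]
        items.append(item)
    return {"relations": items}


def group_by_status(response: dict) -> Dict[str, List[dict]]:
    """Group ResolveRelationResult entries by their `status` discriminator."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for result in response.get("relations", []):
        grouped[result["status"]].append(result)
    return dict(grouped)
```

Checking `resolve-max-items` on the client is optional; it only avoids a round trip that the server is required to reject.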
20207d4 to
f1ff6e7
Drop the separate UnprocessedRelation schema and its diagnostic fields (`code`, `reason`). Instead, `UnprocessedItem.relations` now echoes the original `ResolveRelationItem` entries the server didn't process, so clients retry by re-submitting exactly those items without having to reconstruct them. This keeps unprocessed entries in lockstep with the request shape (including per-item `etag` and `snapshots` hints) as the schema grows with future typed arrays.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
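Because unprocessed entries echo the original request items, a client retry loop can stay very small. The Python sketch below assumes the response carries results under `relations` and leftovers under `unprocessed.relations`, as described above; `resolve_all`, the injected `send` callable, and the `max_rounds` guard are hypothetical names introduced for this example.

```python
from typing import Callable, List


def resolve_all(items: List[dict],
                send: Callable[[dict], dict],
                max_rounds: int = 5) -> List[dict]:
    """Resolve all items, re-submitting whatever the server left unprocessed.

    `send` is a hypothetical callable that posts a request body to the
    resolve endpoint and returns the parsed JSON response.
    """
    results: List[dict] = []
    pending = list(items)
    for _ in range(max_rounds):
        if not pending:
            break
        response = send({"relations": pending})
        results.extend(response.get("relations", []))
        # Echoed ResolveRelationItem entries can be re-submitted verbatim,
        # including their per-item etag and snapshots hints.
        pending = response.get("unprocessed", {}).get("relations", [])
    if pending:
        raise RuntimeError(f"{len(pending)} items still unprocessed after {max_rounds} rounds")
    return results
```

The round limit is a client-side safety choice in this sketch so that a server that never makes progress cannot cause an endless loop.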
Design doc: https://docs.google.com/document/d/1VW5hgaaajRWtp5KbOU3s83YyoyPi5WOSvHtoJ_yXzJs/edit?tab=t.0#heading=h.e6w7vgpr8t2f
Adds `POST /v1/{prefix}/resolve`: a single endpoint that takes one or more typed arrays of catalog items (currently `relations`; extensible to e.g. `functions`) and returns their current state with per-item outcomes (`loaded`, `not-modified`, `not-found`) plus a typed `unprocessed` list for partial-progress capping.