REST Spec: Add resolve endpoint for catalog objects #15830

Open: stevenzwu wants to merge 3 commits into apache:main from stevenzwu:rest-spec-universal-load

@stevenzwu stevenzwu commented Mar 30, 2026

Design doc: https://docs.google.com/document/d/1VW5hgaaajRWtp5KbOU3s83YyoyPi5WOSvHtoJ_yXzJs/edit?tab=t.0#heading=h.e6w7vgpr8t2f

Adds POST /v1/{prefix}/resolve — a single endpoint that takes one or more typed arrays of catalog items (currently relations; extensible to e.g. functions) and returns their current state with per-item outcomes (loaded, not-modified, not-found) plus a typed unprocessed list for partial-progress capping.

view:
$ref: '#/components/schemas/LoadViewResult'

LoadRelationResult:

Contributor Author

A future materialized-view result would look like:

{
  "object-type": "materialized-view",
  "view": { },
  "storage-table": { }
}

Contributor

No, I don't think we would have a new type. It would just have a storage-table associated with it, which is what makes it a materialized view. I'm not sure we need a separate type.

@stevenzwu stevenzwu force-pushed the rest-spec-universal-load branch from fd3e2d9 to 9a1edfd on March 30, 2026 20:40
Comment thread open-api/rest-catalog-open-api.yaml Outdated
Comment on lines +2284 to +2285
"GET /v1/{prefix}/namespaces/{namespace}/relations/{relation}",
"POST /v1/{prefix}/relations/batch-load"

Contributor

I have a minor objection to the use of relations here since we want this to be a general endpoint for resolution. Objects like table/view are considered relations, but something like a function would not be (unless you're arguing for a strictly relational-algebra definition, but that's not consistent with SQL usage).

We may also include other objects in the future, so a more general term like resolve, identifiers, resources, or entities might be better.

Contributor Author

@stevenzwu stevenzwu Apr 1, 2026

This is related to the other discussion on identifier conflict domains, where we want to allow the same identifier for a relational object (like a table) and a function. With that assumption, the endpoint needs the object category in the path to distinguish them. Otherwise, we would require identifier uniqueness across all object types, which is not the consensus from the identifier conflict discussion.

Comment thread open-api/rest-catalog-open-api.yaml Outdated
description:
"
Load metadata for multiple relations in one request. Identifiers may span different namespaces.
Each item includes a `TableIdentifier` and optional per-item parameters (`etag` and `snapshots`).

Contributor

This seems like a good time to introduce an Identifier type that has the same structure as a table identifier. It seems odd to keep using an object-specific identifier type to reference multiple object types.

Since it has the same structure, you could possibly just have them extend Identifier (depending on how that affects the OpenAPI structure and generated code).

Comment thread open-api/rest-catalog-open-api.yaml Outdated
The server resolves each identifier as a table or view.


The per-item `status` in the response indicates the outcome:

Contributor

@danielcweeks danielcweeks Apr 1, 2026

This feels awkward because we're using HTTP status codes for non-request/internal results. I don't think that makes a lot of sense and prefer we indicate result behavior in a different way.

Contributor Author

What about defining an enum schema in the REST spec? It would be similar to the HTTP status names, though.

    BatchLoadItemResultStatus:
      type: string
      description: |
        The outcome of loading a single item in a batch load response.
      enum:
        - success
        - not-modified
        - not-found

Open to other suggestions.

Contributor Author

BTW, @gaborkaszab and @jbonofre previously suggested using HTTP status codes in the design doc comments.

Contributor Author

@stevenzwu stevenzwu Apr 3, 2026

I did a little more research. Two patterns are common.

  1. Use HTTP status codes.
  • Microsoft Graph API: JSON batching where each item has its own integer HTTP status, headers, and body.
  • Facebook/Meta Graph API: each item has a code field (HTTP integer) plus optional headers and body.
  • Elasticsearch Bulk API: each item has an integer status field (e.g. 200, 201, 404, 409) plus a string result field (e.g. "created", "updated", "not_found").
  2. Split results into separate lists. AWS services commonly use this pattern.
  • AWS DynamoDB BatchGetItem: found items in Responses, absent items silently omitted, incomplete items in UnprocessedKeys. No status field.
  • AWS SQS SendMessageBatch / DeleteMessageBatch: results split into Successful and Failed lists. Failed entries have Code (a string error code like "InvalidParameterValue"), not HTTP status integers.
  • AWS S3 DeleteObjects: in verbose mode, successful deletes are listed in Deleted, failures in Errors with a string Code (e.g. "AccessDenied").

Contributor

I would personally prefer a more structured response than just embedding HTTP codes. I ran into this issue with GraphQL, which returns something like 200 (403), which is a bit confusing.

Contributor Author

@stevenzwu stevenzwu Apr 21, 2026

Agreed. Switched the per-item shape to a discriminated union using the same object-type-style pattern already in LoadRelationResult:

BatchLoadRelationResultItem:
  oneOf:
    - $ref: '#/components/schemas/BatchLoadRelationLoaded'
    - $ref: '#/components/schemas/BatchLoadRelationNotModified'
    - $ref: '#/components/schemas/BatchLoadRelationNotFound'
  discriminator:
    propertyName: result-type
    mapping:
      loaded:       '#/components/schemas/BatchLoadRelationLoaded'
      not-modified: '#/components/schemas/BatchLoadRelationNotModified'
      not-found:    '#/components/schemas/BatchLoadRelationNotFound'

Each variant declares exactly its own required fields:

  • loaded: result (required) + etag (optional, tables only)
  • not-modified: etag (required)
  • not-found: identifier (required)

Why domain-native names over HTTP status integers:

  • Wire stays honest. HTTP codes describe transport; per-item outcomes describe domain state. Keeping them separate means caches, retries, tracing, and monitoring don't have to peek inside the body to know what actually happened.
  • Presence rules become schema-enforced. oneOf lets each variant require its own fields instead of relying on prose in the description.
  • Stronger generated clients. result-type generates into sealed interfaces / discriminated unions / sum types, so the compiler catches missed cases. An integer status generates into a plain int that callers switch on by hand.
  • Extensible without fake codes. Adding outcomes like skipped or stale-etag-mismatch later is a new variant — no need to overload 429 or invent a meaning for an HTTP code that doesn't fit.
  • Internally consistent with this PR. LoadRelationResult already uses object-type as a string discriminator; reusing the same style for per-item outcomes keeps the reader's mental model uniform.
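
For illustration, here is a minimal sketch of how a client might consume this shape. It assumes each item arrives as a JSON object carrying `result-type` plus the variant fields listed above. The Python class names and the `parse_item` helper are hypothetical and are not part of the spec or of any Iceberg library.

```python
from dataclasses import dataclass
from typing import Any, Optional, Union


@dataclass(frozen=True)
class Loaded:
    result: dict            # LoadRelationResult payload, kept opaque here
    etag: Optional[str]     # optional, tables only


@dataclass(frozen=True)
class NotModified:
    etag: str


@dataclass(frozen=True)
class NotFound:
    identifier: Any         # identifier shape is left opaque in this sketch


Item = Union[Loaded, NotModified, NotFound]


def parse_item(raw: dict) -> Item:
    """Map one response item to a typed variant based on `result-type`."""
    kind = raw.get("result-type")
    if kind == "loaded":
        # `result` is required; `etag` may be absent (e.g. for views).
        return Loaded(result=raw["result"], etag=raw.get("etag"))
    if kind == "not-modified":
        return NotModified(etag=raw["etag"])
    if kind == "not-found":
        return NotFound(identifier=raw["identifier"])
    # Unknown variants fail loudly instead of being silently ignored.
    raise ValueError(f"unknown result-type: {kind!r}")
```

A missing required field surfaces as a `KeyError`, which mirrors the "presence rules become schema-enforced" point: each variant only reads the fields it declares.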

Comment thread open-api/rest-catalog-open-api.yaml Outdated
type: string
description: |
The type of a catalog object.
enum:

Contributor Author

We may add values such as materialized-view or function in the future.

Contributor

Materialized view is just a type of view, so I'm not sure we need to distinguish.

Contributor Author

@stevenzwu stevenzwu Apr 21, 2026

The response object is different for an MV: it contains both view metadata and storage table metadata. If we want to load an MV in one round trip, a specific MV type is needed so that the client knows how to parse the response.

In the Java library, we may need to define a MaterializedView type, which could be mostly just a container class for View and Table fields.

Comment thread open-api/rest-catalog-open-api.yaml Outdated
items:
$ref: '#/components/schemas/BatchLoadRelationRequestItem'

BatchLoadRelationRequestItem:

Contributor

At either this layer or the URL layer, would it make sense to honor the referenced-by parameter available to loadTable and loadView?

Contributor

I think it would have to be at the per-request level, since you need to distinguish different loads (and they may be treated differently for authorization).

Comment thread open-api/rest-catalog-open-api.yaml Outdated
"GET /v1/{prefix}/namespaces/{namespace}/views/{view}"
"GET /v1/{prefix}/namespaces/{namespace}/views/{view}",
"GET /v1/{prefix}/namespaces/{namespace}/relations/{relation}",
"POST /v1/{prefix}/relations/batch-load"

Contributor

This is a little pedantic, but I'm not entirely sold on the resource path here. I looked across many different implementations and found lots of approaches. However, I don't think we need to put the /batch-load at the end. We can just leave it as POST /v1/{prefix}/relations. I could see that you might say "what if we need to create", but we already have a transactions endpoint that is for that operation.

Contributor Author

@stevenzwu stevenzwu Apr 21, 2026

I added batch-load to the path because the batch load uses POST so that the list of identifiers can be encoded in the request payload.

If we remove it, it is a bit weird to have POST /v1/{prefix}/relations serve a batch-get purpose.

Comment thread open-api/rest-catalog-open-api.yaml Outdated
Comment on lines +1933 to +1935
Servers MAY cap the amount of computation or response payload size per request and return
`unprocessed-identifiers` for items they did not process. Clients SHOULD retry unprocessed
identifiers in a subsequent request.

Contributor

Do we want to describe how to handle the case where too many items are requested? I understand the server can process a subset and return the unprocessed list, but what if someone lists an entire catalog and then asks for 100K resources? What error code can or should the server return if it considers the request unreasonable? (400 might fit.)

Contributor Author

@stevenzwu stevenzwu Apr 21, 2026

Good catch, thanks. Updated the spec.

unprocessed-identifiers was only meant as a cooperative soft cap — the server accepted the request, then stopped partway due to its own cost budget. It doesn't cover the "please reject this upfront" case you described.

Added a typed hard cap to CatalogConfig:

relations-batch-load-max-items:
  type: integer
  minimum: 1

Advertising the limit lets well-behaved clients chunk proactively, and servers that receive an oversized request still have a spec-supported way to reject with 400.

The batchLoadRelations description and unprocessed-identifiers field now spell out the two mechanisms as distinct:

  • relations-batch-load-max-items — governs whether the request is accepted at all (400 if exceeded).
  • unprocessed-identifiers — reports partial progress within an accepted request.

Went with 400 rather than 413 since the limit is on identifier count (a request property), not body size — happy to switch if you disagree.
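
To make the two mechanisms concrete, here is a rough client-side sketch. It assumes the advertised `relations-batch-load-max-items` value has already been read from the catalog config, and that each response is a dict with a `results` list and an optional `unprocessed-identifiers` list. The `results` key, the `send` callback, and the `max_rounds` guard are hypothetical choices for illustration, not spec text.

```python
from typing import Callable, Iterator, List


def chunked(items: List, size: int) -> Iterator[List]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batch_load_all(
    identifiers: List,
    max_items: int,
    send: Callable[[List], dict],
    max_rounds: int = 10,
) -> List:
    """Load every identifier, respecting the hard cap and retrying soft-capped items.

    `send` is a hypothetical callable that performs one batch-load request.
    """
    if max_items < 1:
        raise ValueError("max_items must be at least 1")

    results: List = []
    pending = list(identifiers)
    for _ in range(max_rounds):
        if not pending:
            break
        retry: List = []
        # Hard cap: never send more than the advertised limit (avoids a 400).
        for batch in chunked(pending, max_items):
            response = send(batch)
            results.extend(response.get("results", []))
            # Soft cap: the server accepted the request but stopped partway.
            retry.extend(response.get("unprocessed-identifiers", []))
        pending = retry

    if pending:
        raise RuntimeError(f"{len(pending)} identifiers still unprocessed")
    return results
```

The round limit is only a guard against a server that never makes progress; a real client would likely add backoff between rounds.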

stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Apr 28, 2026
Add the CatalogObjectType string enum (table, view) as defined in
apache#15830 (universal relation load). It serves as the companion
discriminator for CatalogObjectIdentifier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@stevenzwu stevenzwu changed the title from "REST Spec: Add single and batch endpoints for loading relational objects (table, view, and future MV)" to "REST Spec: Add resolve endpoint for catalog objects" Apr 29, 2026
stevenzwu and others added 2 commits April 29, 2026 15:33
Introduces two related REST schemas:

- CatalogObjectIdentifier: a bare array of hierarchical levels that
  references a catalog object (table, view, or namespace). The object
  kind is determined by context (e.g. the endpoint or a companion
  CatalogObjectType discriminator), not by the identifier structure
  alone.
- CatalogObjectType: an enum of "table", "view", and "namespace"
  intended to be used as a discriminator alongside
  CatalogObjectIdentifier.

Also regenerates rest-catalog-open-api.py to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a single POST /v1/{prefix}/resolve endpoint that resolves one or more
catalog objects to their current state in one request. Stacks on top of
the CatalogObjectIdentifier + CatalogObjectType schemas introduced in the
previous commit.

Request body carries one or more typed arrays of items (currently only
`relations`; designed to be extended with sibling arrays such as
`functions` in the future). Each relation item carries a
CatalogObjectIdentifier plus optional per-item hints (`etag`, `snapshots`)
that apply when the resolved relation is a table.

Response body carries parallel typed arrays of per-item results. For
`relations`, each result is a ResolveRelationResult whose `status` field
discriminates between three outcomes:

- `loaded`: the relation exists; the typed payload is returned in `result`
  as a LoadRelationResult (object-type + table/view branch). For tables,
  the current `etag` MAY also be included.
- `not-modified`: the relation is a table whose ETag matches the caller's
  provided `etag`; no payload is returned, only the current `etag`.
- `not-found`: no table or view exists for the identifier; the item MAY
  include a structured error.

Partial progress: servers MAY return a subset of items under `unprocessed`
(a typed object mirroring the request shape, e.g. `unprocessed.relations`)
when a request exceeds internal cost/payload budgets. Each unprocessed
entry carries the identifier plus optional `code` and `reason`.

CatalogConfig gains a `resolve-max-items` field advertising the maximum
total items the server will accept in one request. Exceeding the limit
MUST cause the server to reject the whole request with 400; the
`unprocessed` mechanism is distinct and reports partial progress within an
accepted request.

Authorization failures for any requested item SHOULD fail the entire
request with 403; the error body carries `forbidden-identifiers` listing
the offending identifiers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@stevenzwu stevenzwu force-pushed the rest-spec-universal-load branch from 20207d4 to f1ff6e7 on April 30, 2026 20:26
Drop the separate UnprocessedRelation schema and its diagnostic fields
(`code`, `reason`). Instead, `UnprocessedItem.relations` now echoes the
original `ResolveRelationItem` entries the server didn't process, so
clients retry by re-submitting exactly those items without having to
reconstruct them. This keeps unprocessed entries in lockstep with the
request shape (including per-item `etag` and `snapshots` hints) as the
schema grows with future typed arrays.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
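
The behavior described in the commit messages above can be sketched from the server side as follows. This is a simplified illustration under stated assumptions, not the actual implementation: the `identifier` key on each item, the `lookup` callback, the `TooManyItems` exception, and a budget counted in items are all hypothetical. Only the names `relations`, `unprocessed`, `status`, `etag`, `result`, and `resolve-max-items` come from the commit messages.

```python
from typing import Callable, Optional


class TooManyItems(Exception):
    """Hypothetical error that a server would translate into HTTP 400."""


def resolve_relations(
    request: dict,
    lookup: Callable[[tuple], Optional[dict]],
    max_items: int,
    budget: int,
) -> dict:
    """Resolve relation items, echoing back the ones that were not processed."""
    items = request.get("relations", [])

    # Hard cap (resolve-max-items): reject the whole request up front.
    if len(items) > max_items:
        raise TooManyItems(f"{len(items)} items exceeds resolve-max-items={max_items}")

    results, unprocessed = [], []
    for index, item in enumerate(items):
        # Soft cap: past the internal budget, echo the original item unchanged
        # so the client can re-submit it with its etag/snapshots hints intact.
        if index >= budget:
            unprocessed.append(item)
            continue

        found = lookup(tuple(item["identifier"]))
        if found is None:
            # Echoing the identifier here is an assumption of this sketch.
            results.append({"status": "not-found", "identifier": item["identifier"]})
        elif item.get("etag") is not None and item["etag"] == found.get("etag"):
            results.append({"status": "not-modified", "etag": found["etag"]})
        else:
            loaded = {"status": "loaded", "result": found["result"]}
            if found.get("etag") is not None:
                loaded["etag"] = found["etag"]
            results.append(loaded)

    response = {"relations": results}
    if unprocessed:
        response["unprocessed"] = {"relations": unprocessed}
    return response
```

Because unprocessed entries are the request items themselves, a client retry amounts to sending `response["unprocessed"]` back as the next request body.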
3 participants