OpenAPI: Add CatalogObjectIdentifier schema#16144

Open
stevenzwu wants to merge 2 commits into
apache:mainfrom
stevenzwu:rest-catalog-object-identifier

Conversation

Contributor

@stevenzwu stevenzwu commented Apr 28, 2026

Summary

Adds the CatalogObjectIdentifier schema to the REST catalog spec — an ordered list of hierarchical levels (["accounting", "tax", "paid"]) that works uniformly for tables, views, materialized views, and namespaces. The kind of object an identifier refers to is determined by context (the endpoint, or a companion type discriminator defined by that endpoint), not by the identifier structure itself.

Structurally the same as Namespace (a bare array of strings); the distinct name signals "any catalog object" rather than specifically a namespace path.
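
As a minimal sketch of what "a bare array of strings" means on the wire, the Python below parses a JSON array into an ordered tuple of levels and rejects any other shape. The helper name `parse_identifier` is hypothetical; it is not part of the spec or of the generated rest-catalog-open-api.py.

```python
import json
from typing import Tuple


def parse_identifier(payload: str) -> Tuple[str, ...]:
    """Parse a bare JSON array of strings into an ordered tuple of levels.

    Hypothetical helper for illustration; the spec only defines the wire shape.
    """
    levels = json.loads(payload)
    if not isinstance(levels, list) or not all(isinstance(lv, str) for lv in levels):
        raise ValueError("identifier must be a JSON array of strings")
    return tuple(levels)


ident = parse_identifier('["accounting", "tax", "paid"]')
# Nothing in `ident` says whether "paid" is a table, a view, or a namespace;
# the endpoint (or a companion discriminator) supplies that.
```

The identifier carries only the path; the same value could be passed to a table endpoint or a namespace endpoint unchanged.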

Motivation

Multiple concurrent efforts need a generic catalog-object identifier and would otherwise each introduce their own:

Introducing one shared schema avoids identifier proliferation as new object types (functions, materialized views) are added to the spec.

Scope

Intentionally minimal:

  • Add CatalogObjectIdentifier alongside the existing TableIdentifier and Namespace.
  • No changes to existing endpoints. All current references to TableIdentifier and Namespace are preserved — no breaking changes.
  • No discriminator enum is included. The originally-proposed shared CatalogObjectType was dropped during review; the resolve and events endpoints will each define their own narrower closed enums in their respective PRs, since each has a different forward-compat profile.

Test plan

  • make -C open-api lint passes (openapi-spec-validator + yamllint --strict)
  • make -C open-api generate regenerates rest-catalog-open-api.py cleanly
  • python3 -m py_compile open-api/rest-catalog-open-api.py succeeds

🤖 Generated with Claude Code

@stevenzwu stevenzwu changed the title from "OpenAPI: Add CatalogObjectIdentifier schema" to "OpenAPI: Add CatalogObjectIdentifier and CatalogObjectType schemas" on Apr 28, 2026
Introduces two related REST schemas:

- CatalogObjectIdentifier: a bare array of hierarchical levels that
  references a catalog object (table, view, or namespace). The object
  kind is determined by context (e.g. the endpoint or a companion
  CatalogObjectType discriminator), not by the identifier structure
  alone.
- CatalogObjectType: an enum of "table", "view", and "namespace"
  intended to be used as a discriminator alongside
  CatalogObjectIdentifier.

Also regenerates rest-catalog-open-api.py to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@stevenzwu stevenzwu force-pushed the rest-catalog-object-identifier branch from 0b17740 to e6a0323 on April 29, 2026 22:33
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Apr 29, 2026
Adds Java reference implementations for the REST schemas introduced
in apache#16144.

- api/org.apache.iceberg.catalog.CatalogObjectIdentifier: hand-written
  POJO mirroring Namespace — static of(String...) factory,
  null/null-byte validation, levels()/level(i)/length() accessors,
  dotted toString.
- api/org.apache.iceberg.catalog.CatalogObjectType: enum of TABLE,
  VIEW, NAMESPACE with lowercase wire strings and a fromName factory,
  mirroring PlanStatus.
- Registers a bare-array serializer and deserializer for
  CatalogObjectIdentifier in RESTSerializers, matching the way
  Namespace is wired.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread open-api/rest-catalog-open-api.yaml Outdated
Comment on lines +2282 to +2289
CatalogObjectType:
  type: string
  description: |
    The type of a catalog object.
  enum:
    - table
    - view
    - namespace
Contributor

I think there was an open question about how and where this would actually be used. Introducing a type without context leaves me unsure whether it's in line with how we want to reference types.

For example, you could have the resolve endpoint return:

[
  [ identifier, type, metadata ] 
  [ identifier, type, metadata ]
  ...
]

or:

tables: <identifier, metadata>
views: <identifier, metadata>
namespaces: <identifier, metadata>

The first approach requires type, but might impact backward compatibility. If we introduce a new type (e.g. function) then clients would break if they don't understand the type. The second approach allows you to extend the response object without modifying the original fields.

I'd like to see how it would be referenced in context.

Contributor Author

@stevenzwu stevenzwu Apr 30, 2026

I updated the design doc for the resolve endpoint and included example request and response JSON payloads:
https://docs.google.com/document/d/1VW5hgaaajRWtp5KbOU3s83YyoyPi5WOSvHtoJ_yXzJs/edit?tab=t.0#heading=h.z0wh4486aab5

Here is the usage from the events endpoint spec PR. It is used as an event filter in the request body.
https://github.com/apache/iceberg/pull/12584/changes#r3170935023

Contributor

In the events endpoint, I believe having this explicit type filter is valuable, which is why we introduced the type there.
Shouldn't clients be tolerant in both cases, of unknown enum variants as well as unknown fields?

Contributor

I don't think it's as easy as saying the clients should tolerate it. If you add new types, generated parsers will break when they encounter something that wasn't originally enumerated. That's why the type approach feels brittle.

@rdblue might have an opinion here since he originally brought it up in the discussion.

Contributor Author

@stevenzwu stevenzwu May 7, 2026

To clarify Dan's framing, here are the two response shapes he sketched:

Shape A — flat array of typed records, each carrying its own discriminator:

[
  { "identifier": [...], "type": "table", "metadata": {...} },
  { "identifier": [...], "type": "view",  "metadata": {...} }
]

Shape B — separate bucket per kind:

{
  "tables":     [ { "identifier": [...], "metadata": {...} }, ... ],
  "views":      [ ... ],
  "namespaces": [ ... ]
}

Dan's critique: Shape A's type is a closed enum, so adding a new value (e.g. function) breaks generated parsers on old clients.

The Shape B counter has its own forward-compat hole — and a worse one. An old client that doesn't know the materialized_views key never iterates that field, so the identifier just vanishes from the resolved set; the client can't even distinguish "didn't resolve" from "resolved to a kind I don't recognize." A parse error at least tells you something is wrong.
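
The difference between the two failure modes can be sketched in a few lines of Python. This is an illustration under assumptions, not the behavior of any real generated client: `KNOWN_TYPES`, `KNOWN_BUCKETS`, and both reader functions are hypothetical stand-ins for an old client built before materialized views existed.

```python
from typing import Any, Dict, List

# What a hypothetical old client was generated with.
KNOWN_TYPES = {"table", "view", "namespace"}
KNOWN_BUCKETS = ("tables", "views", "namespaces")


def read_shape_a(records: List[Dict[str, Any]]) -> List[str]:
    """Closed-enum parse of Shape A: an unknown type fails loudly."""
    names = []
    for rec in records:
        if rec["type"] not in KNOWN_TYPES:
            raise ValueError(f"unknown type: {rec['type']}")
        names.append(".".join(rec["identifier"]))
    return names


def read_shape_b(response: Dict[str, List[Dict[str, Any]]]) -> List[str]:
    """Bucketed parse of Shape B: keys the client does not know are never visited."""
    names = []
    for bucket in KNOWN_BUCKETS:
        for rec in response.get(bucket, []):
            names.append(".".join(rec["identifier"]))
    return names


shape_b = {
    "tables": [{"identifier": ["db", "t1"]}],
    "materialized_views": [{"identifier": ["db", "mv1"]}],
}
resolved = read_shape_b(shape_b)  # "db.mv1" is absent, with no error raised
```

With Shape A the same materialized view would raise a `ValueError`; with Shape B the caller cannot tell that anything was dropped.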

The design doc actually already proposes a hybrid that handles the top-level concern: per-category typed arrays (relations: [...], future functions: [...]) bucket the major categories Shape-B style, while inside each bucket the result is a flat record that carries an object-type discriminator Shape-A style:

{
  "relations": [
    {
      "identifier": ["analytics", "daily_sales_view"],
      "status": "loaded",
      "result": { "object-type": "view", "view": { /* LoadViewResult */ } }
    }
  ]
}

Adding a new top-level category (functions) is safe: generators silently ignore unknown top-level fields. The forward-compat risk only lives on result.object-type — that's where adding materialized-view later could break old codegen if it's a closed enum.

Proposal: spec result.object-type (and CatalogObjectType itself) as type: string with documented values, not a closed enum.

ResolveResult:
  type: object
  required: [object-type]
  properties:
    object-type:
      type: string
      description: |
        Object kind. Currently one of: table, view, materialized-view.
        Clients should fall through to a default handler for unrecognized values.
    table:
      $ref: '#/components/schemas/LoadTableResult'
    view:
      $ref: '#/components/schemas/LoadViewResult'

What this buys:

  • No codegen breakage. datamodel-codegen emits object_type: str, not Literal[...]. Pydantic, Jackson with default config, Go's encoding/json, etc. all accept arbitrary strings.
  • No silent data loss. Every resolved identifier still has a row with identifier, status, and result. A hand-written client switches on known object-type values and falls through unknown ones to a "skip / report unsupported" branch with full visibility.
  • Schema still documents the valid set. The description carries the enumeration; humans and IDE tooltips see it. We can validate server-side conformance in our own CI without imposing closed-enum behavior on every generated client.
  • Filter input keeps the closed enum. CatalogObjectType as enum on request bodies (e.g. the events filter) is fine — the client picks values from its own schema; the server tolerates older filter sets.
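
A hand-written client following the "fall through to a default handler" guidance might look roughly like the sketch below. The function `summarize_relations` and the identifier `daily_sales_mv` are hypothetical, and the client is assumed to predate materialized views, so it only recognizes `table` and `view`.

```python
from typing import Any, Dict, List, Tuple


def summarize_relations(response: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (dotted identifier, outcome) for every row, known kind or not."""
    outcomes = []
    for row in response.get("relations", []):
        name = ".".join(row["identifier"])
        object_type = row["result"]["object-type"]  # plain string on the wire
        if object_type == "table":
            outcomes.append((name, "table"))
        elif object_type == "view":
            outcomes.append((name, "view"))
        else:
            # Default branch: the row stays visible and is reported as unsupported.
            outcomes.append((name, f"unsupported: {object_type}"))
    return outcomes


response = {
    "relations": [
        {"identifier": ["analytics", "daily_sales_view"], "status": "loaded",
         "result": {"object-type": "view", "view": {}}},
        {"identifier": ["analytics", "daily_sales_mv"], "status": "loaded",
         "result": {"object-type": "materialized-view"}},
    ],
    "functions": [],  # unknown top-level category; this client never reads it
}
summary = summarize_relations(response)
```

Every identifier in the response produces a row in `summary`, so an unrecognized kind is reported rather than lost.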

Prior art for this pattern:

  • Iceberg REST already uses it internally. The most-evolved discriminator fields in this spec are declared as plain type: string, not closed enums, with valid values listed via discriminator.mapping:

    • MetadataUpdate.action (BaseUpdate) — the set has grown over time (add-encryption-key, remove-encryption-key, remove-schemas, remove-partition-specs, enable-row-lineage, set-partition-statistics, remove-partition-statistics, …) without breaking clients with stale schemas.
    • TableRequirement.type — same pattern.
    • ViewRequirement.type — same pattern.

    Closed enum in this spec is reserved for stable sets (FileFormat, SortDirection, NullOrder, SnapshotRefType) that aren't expected to grow casually. CatalogObjectType is clearly in the first group — materialized-view and function are already foreseen.

Contributor Author

Updated proposal: drop the shared CatalogObjectType and split it into two narrower closed enums, one per use case.

Resolve response: closed enum [table, view, materialized-view]. The relations universe is small enough to enumerate, and pre-listing materialized-view even before MV is implemented means clients ship with the value already in their schema — no breakage when a server starts returning it. Trade-off is a maintenance commitment: adding a 4th relational type later (e.g. external-table, streaming-view) becomes a coordinated spec change rather than a transparent extension.

Events filter: closed enum on the request body. Client picks values from its own schema; server rejects unknowns with a clear error. No codegen-breakage concern, since the field never carries server-originated values back to old clients.

Events response: no separate object-type discriminator needed. The events spec (#12584) already defines an operation-type enum (create-table, drop-view, etc.) on each event, which implicitly encodes the kind of object the event is about.

Naming: two independent named schemas — RelationType near the resolve endpoint, EventObjectType (or similar) near events. Decoupling is the point: each enum evolves on its own schedule, which is the property the shared CatalogObjectType couldn't deliver.
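
As a rough sketch of the split, the two enums could be modeled independently as below. The `RelationType` values come from the resolve proposal above; the `EventObjectType` values are an assumption for illustration only (the events PR defines the real set), and `parse_event_filter` is a hypothetical server-side helper.

```python
from enum import Enum
from typing import List


class RelationType(str, Enum):
    """Closed enum for the resolve response, pre-listing materialized-view."""
    TABLE = "table"
    VIEW = "view"
    MATERIALIZED_VIEW = "materialized-view"


class EventObjectType(str, Enum):
    """Closed enum for the events filter. Values assumed for illustration."""
    TABLE = "table"
    VIEW = "view"
    NAMESPACE = "namespace"


def parse_event_filter(values: List[str]) -> List[EventObjectType]:
    """Server side: reject unknown filter values with a clear error."""
    parsed = []
    for value in values:
        try:
            parsed.append(EventObjectType(value))
        except ValueError:
            raise ValueError(f"unsupported object type in filter: {value!r}") from None
    return parsed


selected = parse_event_filter(["table", "namespace"])
```

Because the two classes share no definition, adding a member to one does not change the schema that clients of the other are generated from.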

CatalogObjectType was introduced as a single shared discriminator for
two prospective consumers — the resolve endpoint response and the
events endpoint filter — but those use cases have different
forward-compat profiles and will define their own narrower closed
enums (RelationType near resolve, EventObjectType near events).
Removing the shared schema avoids cross-coupling those evolution
schedules and lets each endpoint commit only to the values it needs.

Also rewords the CatalogObjectIdentifier description to no longer
reference CatalogObjectType by name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@stevenzwu stevenzwu changed the title from "OpenAPI: Add CatalogObjectIdentifier and CatalogObjectType schemas" to "OpenAPI: Add CatalogObjectIdentifier schema" on May 11, 2026
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request May 12, 2026
Mirrors the spec change in apache#16144: the shared
CatalogObjectType schema is being removed since the resolve and events
endpoints will each define their own narrower closed enums in their
respective PRs. Also rewords the CatalogObjectIdentifier Javadoc to
no longer reference CatalogObjectType by name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request May 12, 2026
Adds Java reference implementations for the REST schemas introduced
in apache#16144.

- api/org.apache.iceberg.catalog.CatalogObjectIdentifier: hand-written
  POJO mirroring Namespace — static of(String...) factory,
  null/null-byte validation, levels()/level(i)/length() accessors,
  dotted toString.
- api/org.apache.iceberg.catalog.CatalogObjectType: enum of TABLE,
  VIEW, NAMESPACE with lowercase wire strings and a fromName factory,
  mirroring PlanStatus.
- Registers a bare-array serializer and deserializer for
  CatalogObjectIdentifier in RESTSerializers, matching the way
  Namespace is wired.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
    companion type discriminator), not by the identifier structure alone.
  type: array
  items:
    type: string
Contributor

Do we apply any constraints on table, view, or namespace names? For example, that no slash (/) is allowed. Since we didn't specify any constraints on the table identifier, this isn't a blocker for this PR; we can work on it as a follow-up. We can also discuss whether we could avoid any constraints in IRC and rely on the implementations (catalogs, engines) to make their own choices.

Member

@RussellSpitzer RussellSpitzer May 12, 2026

We have previously let folks do whatever they want in string fields and left it up to the implementation to decide whether or not that string is invalid.


6 participants