Skip to content

REST Spec: Events endpoint#12584

Open
c-thiel wants to merge 24 commits into
apache:mainfrom
c-thiel:ct/irc-events-endpoint
Open

REST Spec: Events endpoint#12584
c-thiel wants to merge 24 commits into
apache:mainfrom
c-thiel:ct/irc-events-endpoint

Conversation

@c-thiel
Copy link
Copy Markdown
Contributor

@c-thiel c-thiel commented Mar 20, 2025

@c-thiel c-thiel marked this pull request as draft March 20, 2025 09:33
Comment thread open-api/rest-catalog-open-api.yaml Outdated
@c-thiel
Copy link
Copy Markdown
Contributor Author

c-thiel commented Apr 16, 2025

There seems to be a bug in the updated datamodel-code-generator==0.28.5 that is used on main, which is why the pipeline fails.
datamodel-code-generator==0.28.4 generates the discriminator correctly as discriminator='reference-type' (with -) while the updated version on main uses _ which I believe is wrong.

@c-thiel c-thiel force-pushed the ct/irc-events-endpoint branch from c972354 to 4d67051 Compare May 7, 2025 09:07
Comment thread open-api/rest-catalog-open-api.yaml Outdated
Comment thread open-api/rest-catalog-open-api.yaml Outdated
Comment thread open-api/rest-catalog-open-api.yaml
Comment thread open-api/rest-catalog-open-api.yaml Outdated
Comment thread open-api/rest-catalog-open-api.py Outdated
@c-thiel c-thiel force-pushed the ct/irc-events-endpoint branch from 14b249e to 3de9a7c Compare May 28, 2025 17:21
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jul 1, 2025

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Jul 1, 2025
@Fokko Fokko added not-stale and removed stale labels Jul 1, 2025
Comment thread open-api/rest-catalog-open-api.yaml Outdated
Comment thread open-api/rest-catalog-open-api.yaml Outdated
Comment thread open-api/rest-catalog-open-api.yaml Outdated
List of catalog objects (namespaces, tables, views) to get events for.
If not provided, events for all objects will be returned subject to other filters.
For specified namespaces, events for the namespaces and all containing objects
(namespaces, tables, views) will be returned.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: is "will" correct if this is defining what services should or must do? I would expect this to be "must be returned" if we want to specify the service behavior.

In addition, is the requirement to return changes for child objects recursive or for one layer only?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to "must". Added hint for recursive.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main motivation for this beeing recursive is that a namespace has almost no value has it only has the update-namespace-properties event attached to it. What most folks are probably interested in is everything that happend in a Namespace.
Thinking about it again though its not nice that some filter behave differently than others. Happy to discuss this again.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussing this we ended up keeping it recursive for the following reasons:

  • Otherwise not possible to sub-select an area of the Catalog without previously crawling it
  • Otherwise not possible to efficiently listen to Create Events, as specific identifiers are not yet known
  • Consistent with the behaviour on Warehouse level which is recursive.

Comment thread open-api/rest-catalog-open-api.yaml Outdated
If not provided, all types are returned.
catalog-objects:
type: array
discriminator:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably my lack of experience with Open API, but what is this doing? Is the array of a single type or multiple types?

My expectation is that items would be a type that has a descriminator so you could request changes for a view and a table -- for instance to see changes to a view that references a table. But with the descriminator nested in the array I'm wondering if this is a single type for all items. And if so, what does that look like in JSON given that the array itself must contain objects and can't have a reference-type property.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like reference-type is included in each type definition so I'm guessing this behaving as I would expect. It just seems like a confusing way to embed this since the array can't have a property. Refactoring this may make it more clear.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your understanding of the behavior is correct. I tested it with two code generators and both worked OK. Nonetheless it is a quite exotic construct. I used it to avoid introducing another type and safe some code.

As clarity is more important, I added a new type CatalogObjectReference now.

If not provided, events for all objects will be returned subject to other filters.
For specified namespaces, events for the namespaces and all containing objects
(namespaces, tables, views) will be returned.
custom-filters:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have a sub-object of custom filters rather than setting additionalProperties: true on the request object?

I think this also allows passing any JSON structure. Do we want to limit that as we do elsewhere?

  additionalProperties:
    type: string

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to separate custom filters to avoid potential collisions with filters that we standardize later.

I added type: string

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we already have custom filter use cases in mind?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A filter on actor for example

Comment thread open-api/rest-catalog-open-api.yaml Outdated
Implementation-specific filter extensions. Implementations may define custom filter
properties beyond the standard ones defined in this specification.

NamespaceReference:
Copy link
Copy Markdown
Contributor

@rdblue rdblue Jul 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these references necessary? Namespaces, views, and tables share a common namespace so any identifier can identify only one. Could this be simpler by using a single more generic identifier like ObjectIdentifier?

I'm not sure what the value of the separation between TableIdentifier and Namespace is, so I want to avoid unnecessary complexity based on it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this too, but couldn't find another good way that:

  • Differentiates between Views, Tables and Namespaces, so that individual objects can be targeted. TableIdentifier requires name, so we can't use it to express "Give me everything in Namespace X"
  • Keeps uuids. I prefer to work with UUIDs instead of names. Keeping UUIDs here was also a wish from the community.

If you have another idea how we can make this more compact, I am happy to change!
Using inline schemas led to almost the same amount of code, but was much less readable.

properties:
next-page-token:
$ref: "#/components/schemas/PageToken"
highest-processed-timestamp-ms:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking through how catalogs will evolve to support transactions, I think it is likely that some of them will have a catalog-level sequence number for changes. This schema doesn't prevent us from either allowing that in a response or adding it in the future, right? I don't think so but I want to raise it for everyone to think about.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not prevent us from adding it in the future.
Returning it now would be a minor violation as we don't set additionalProperties, although clients should be prepared to discard unknown fields anyway.

We have the request-id also as part of the Event itself, which could be filled with the transaction id of a catalog if the catalog supports this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did I misunderstand the purpose of next-page-token? Is it only for pagination of one query checkpoint?

Is the timestamp meant for the query checkpoint? Some catalogs may implement a global catalog sequence number to order all catalog changes, which would be a great fit for the query checkpoint for continuation.

If we use a timestamp as the continuation point, it would requires server to implement a monotonically increasing clock for this to work correctly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the next-page-token can and should be used for a query checkpoint and it is only valid for the same filter combination.

We had some discussion about using a global sequence number instead, but settled in a discussion that not all catalogs offer a global sequence number, so we cannot rely on it for pagination. highest-processed-timestamp-ms is only informational for the user, so that clients get an idea on what events have been included in the current batch.

If we introduce the concept of a global sequence number elsewhere (presumably optional?) we should introdcue this as a new field in the request & response objects of the Events Endpoint as well.

Copy link
Copy Markdown
Contributor

@stevenzwu stevenzwu Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the next-page-token can and should be used for a query checkpoint

thanks for confirming that.

If we introduce the concept of a global sequence number elsewhere (presumably optional?) we should introdcue this as a new field in the request & response objects of the Events Endpoint as well.

If catalog has a global sequence number, the PageToken can be set to the sequence number for checkpoint/continuation point.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this require timestamp to be monotonic? if not, what's the value of this field? We already have the continuation-token for the next query resume point.

Comment thread open-api/rest-catalog-open-api.yaml Outdated
@c-thiel
Copy link
Copy Markdown
Contributor Author

c-thiel commented Dec 6, 2025

Summary of changes since early December 2025. Changes are motivated from discussions in this threads or learnings from the Rust request & response object implementations in apache/iceberg-rust#1907:

  • Allow JSON Values for custom filter (change additionalProperties: type: String to additionalProperties: true in 7f17a42
  • Make Actor an Object instead of a string in 3eeb95e. In most real-world szenarios an Actor will have more than one field. The new design is more extendable as we don't have to squish a JSON object into a string type, but can instead pass an extendable Object as actor.
  • Add missing updates field to CreateViewOperation
  • Rename EventsResponse to QueryEventsResponse
  • Rename CatalogObject to CatalogObjectIdentifier
  • Rename event-count in Event to request-event-count to be inline with request-id and hint a bit more that these are the events in a request and not the total events in the response
  • Rename page-token to continuation-token and make it required. For reasoning see Discussion in 178ed3e
  • Add missing type: string to OperationType - anyof - enum field in c73e200
  • Fix duplicate namespace field in CreateNamespaceOperation (via allOf and explicit) in 2f9b4db

Ping @stevenzwu @aheev

@aheev
Copy link
Copy Markdown

aheev commented Dec 6, 2025

Summary of changes since early December 2025. Changes are motivated from discussions in this threads or learnings from the Rust request & response object implementations in apache/iceberg-rust#1907:

  • Allow JSON Values for custom filter (change additionalProperties: type: String to additionalProperties: true in 7f17a42
  • Make Actor an Object instead of a string in 3eeb95e. In most real-world szenarios an Actor will have more than one field, and even if not, the Catalog can simply return {"id": "..."}, which IMO would still be clearer and more extendable than a plain string.
  • Add missing updates field to CreateViewOperation
  • Rename EventsResponse to QueryEventsResponse
  • Rename CatalogObject to CatalogObjectIdentifier
  • Rename event-count in Event to request-event-count to be inline with request-id and hint a bit more that these are the events in a request and not the total events in the response
  • Rename page-token to continuation-token and make it required. For reasoning see Discussion in 178ed3e
  • Add missing type: string to OperationType - anyof - enum field in c73e200
  • Fix duplicate namespace field in CreateNamespaceOperation (via allOf and explicit) in 2f9b4db

Ping @stevenzwu @aheev

Thanks a bunch 🙌 I was waiting for this. I will go ahead and apply in the other PR

@aheev
Copy link
Copy Markdown

aheev commented Dec 8, 2025

Make Actor an Object instead of a string in 3eeb95e. In most real-world szenarios an Actor will have more than one field, and even if not, the Catalog can simply return {"id": "..."}, which IMO would still be clearer and more extendable than a plain string.

@c-thiel do you want to allow JSON values for actor as well?

@c-thiel
Copy link
Copy Markdown
Contributor Author

c-thiel commented Dec 9, 2025

Make Actor an Object instead of a string in 3eeb95e. In most real-world szenarios an Actor will have more than one field, and even if not, the Catalog can simply return {"id": "..."}, which IMO would still be clearer and more extendable than a plain string.

@c-thiel do you want to allow JSON values for actor as well?

Yes - I don't think we should restrict this further.

and basic audit capabilities.

The server encodes all necessary state in the token to ensure
consistent filtering across pages. Clients should use the returned `continuation-token` for
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The server encodes all necessary state in the token to ensure consistent filtering across pages

maybe further remove the consistent filtering part. e.g. sth like

The server encodes all necessary state within the continuation-token. The client should treat this token as a required opaque value and pass it unchanged in subsequent requests to resume the changelog consumption.

obelix74 pushed a commit to obelix74/polaris that referenced this pull request Mar 3, 2026
Updates the Observability REST API proposal to align with the emerging
Iceberg Events API specification (apache/iceberg#12584) per review feedback.

Key changes:
- Changed Events endpoint from GET to POST with request body
- Moved Events API to Iceberg REST Catalog path (/api/catalog/v1/{prefix}/events)
- Adopted Iceberg event structure (event-id, request-id, timestamp-ms, operation)
- Added standard operation types (create-table, update-table, drop-table, etc.)
- Added Polaris custom operation types with x-polaris-* prefix convention
- Updated OpenAPI schemas for Iceberg compatibility
- Added Section 8 documenting Iceberg alignment rationale
- Added mapping table from Polaris internal events to Iceberg operations

References:
- Iceberg Events API PR: apache/iceberg#12584
- Iceberg Events API Doc: https://docs.google.com/document/d/1WtIsNGVX75-_MsQIOJhXLAWg6IbplV4-DkLllQEiFT8
$ref: '#/components/schemas/CustomOperationType'
# Common optional properties
identifier:
$ref: "#/components/schemas/TableIdentifier"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is where a generic CatalogObjectIdentifier is probably a better fit

- $ref: '#/components/schemas/CatalogObjectIdentifier'
example: [ "accounting", "tax" ]

CatalogObjectIdentifier:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My proposal differs in the structure: https://docs.google.com/document/d/1NTQhgNbP2dkIMuXUMA5JdwliVQKCp1TU_ux5J_AaPiw/edit?tab=t.0#heading=h.lrmay9r6i8ai

I suggest we maintain the same structure as TableIdentifier, which would also make the migration easier.

If not provided, events for all objects must be returned subject to other filters.
For specified namespaces, events for the namespaces and all containing objects
(namespaces, tables, views) must be returned (recursively).
catalog-objects-by-id:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

id is a bit vague in this context. if we want uuid here, let's name this as catalog-objects-by-uuid

catalog-objects-by-id:
type: array
items:
$ref: "#/components/schemas/CatalogObjectUuid"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During implementation, I found this struct a bit unnecessary. It contains two fields: (1) uuid (2) object types. I am wondering if this can be simplified as just a list of UUID strings. There is already a separate filter dimension of object-types. Are we concerned about the UUID collisions across object types (like table and function)?

This is also inconsistent with the catalog-objects-by-name. From the dev thread, we are already aligned that different object types (like table and function) can have the same name/identifier. If we are concerned about the UUID conflict across object types, do we also need per item object type for the catalog-objects-by-name array since name collision is much higher likely to happen.

It is probably simpler to just keep those filter dimensions totally independent.

If not provided, the server may choose a default page size.
Servers may return less results than requested for various reasons, such as
server side limits, payload size or processing time.
after-timestamp-ms:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a client timestamp? if yes, I assume it is for rough times range query? Then do we need the before-timestamp-ms field too to define the time range?

I assume this is not the event timestamp (highest from the last response), as we already have the continuation-token for the query resume point.

I am wondering if we should add this filter right now. Do we have specific use cases in mind?

properties:
next-page-token:
$ref: "#/components/schemas/PageToken"
highest-processed-timestamp-ms:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this require timestamp to be monotonic? if not, what's the value of this field? We already have the continuation-token for the next query resume point.

- highest-processed-timestamp-ms
- events
properties:
next-page-token:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this response field be called continuation-token as well?

@stevenzwu stevenzwu changed the title Proposal: IRC Events endpoint REST Spec: Events endpoint Apr 14, 2026

Consumers should be prepared to handle 410 Gone responses when requested sequences are
outside the server's retention window. Consumers should also de-duplicate received events based
on the event's `event-id`. Consistency guarantees vary between server implementations.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove event's

example: [ "accounting", "tax" ]

CatalogObjectIdentifier:
describe: Reference to a named object in the catalog, such as namespace, table, or view.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, it would be a UDF.

$ref: "#/components/schemas/CatalogObjectUuid"
description: >
Filter events by the list of catalog objects referenced by UUID (tables, views).
object-types:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a note for the reference link

Copy link
Copy Markdown
Contributor

@stevenzwu stevenzwu May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow up on this discussion #16144 (comment)

@c-thiel object identifier is not unique across whole catalog. e.g. a table and a function can have the same identifier. Currently, it seems that the operation-type (like create-table) covers the object type along with the operation type. Maybe they can be decoupled: object-type would be table and operation-type would be create?

Copy link
Copy Markdown
Contributor

@stevenzwu stevenzwu May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nm. split into object and operation types would complicate the discriminator field. OpenAPI only allows a single string for discriminator and we can't natively discriminate on a tuple. nested discrimination seems more complex.

But we can still add the object-type: table (etc.) as a const field on each subtype. Wire-level symmetry with the request filter is achieved, and CustomOperation becomes self-describing because it carries an explicit object-type. Cost: the value is technically derivable from operation-type for the standard ops, so it's "redundant" — but for custom ops it's load-bearing, and for clients it's a stable field they don't have to parse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.