add rfc for compute driver lifecycle extensions #1346

Draft
cheese-head wants to merge 1 commit into NVIDIA:main from cheese-head:feature/compute-driver-lifecycle

@cheese-head

Summary

Related Issue

Changes

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

Signed-off-by: Patrick Riel <priel@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Collaborator

@drew drew left a comment

This proposal looks good to me. I think the phases look correct, and it cleanly opens up compute drivers for extension.


## Motivation

The gateway's compute subsystem owns sandbox lifecycle and delegates platform-specific work to a compute driver through the `ComputeDriver` trait (`CreateSandbox`, `DeleteSandbox`, `WatchSandboxes`, `Reconcile`). Some drivers implement that trait directly in-process; others are subprocesses that the gateway reaches through a gRPC client that implements the same trait. Either way the boundary is intentionally narrow: it only covers what the driver itself must do.
Collaborator

nit: `Reconcile` isn't part of the compute driver; it is implemented by the gateway.
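
To make the boundary concrete, here is a minimal Rust sketch of the split described in the Motivation paragraph, with reconcile kept on the gateway side as the nit above suggests. Everything in it is an assumption for illustration: the method signatures, `DriverError`, `InProcessDriver`, and `Gateway` are hypothetical, the calls are synchronous, and `watch_sandboxes` is reduced to a snapshot where the real one presumably streams.

```rust
use std::collections::BTreeSet;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum DriverError {
    AlreadyExists(String),
    NotFound(String),
}

/// The narrow driver boundary: platform-specific work only.
/// Signatures are assumptions, not the RFC's actual trait.
pub trait ComputeDriver {
    fn create_sandbox(&mut self, id: &str) -> Result<(), DriverError>;
    fn delete_sandbox(&mut self, id: &str) -> Result<(), DriverError>;
    /// Simplified to a snapshot of the sandboxes the driver currently reports.
    fn watch_sandboxes(&self) -> BTreeSet<String>;
}

/// An in-process driver; a gRPC client could implement the same trait.
#[derive(Default)]
pub struct InProcessDriver {
    running: BTreeSet<String>,
}

impl ComputeDriver for InProcessDriver {
    fn create_sandbox(&mut self, id: &str) -> Result<(), DriverError> {
        if self.running.insert(id.to_string()) {
            Ok(())
        } else {
            Err(DriverError::AlreadyExists(id.to_string()))
        }
    }

    fn delete_sandbox(&mut self, id: &str) -> Result<(), DriverError> {
        if self.running.remove(id) {
            Ok(())
        } else {
            Err(DriverError::NotFound(id.to_string()))
        }
    }

    fn watch_sandboxes(&self) -> BTreeSet<String> {
        self.running.clone()
    }
}

/// Hypothetical gateway wrapper that owns the desired state.
pub struct Gateway<D: ComputeDriver> {
    pub driver: D,
    pub desired: BTreeSet<String>,
}

impl<D: ComputeDriver> Gateway<D> {
    /// Gateway-side reconcile: ids that are desired but not reported
    /// by the driver, in sorted order.
    pub fn reconcile(&self) -> Vec<String> {
        let observed = self.driver.watch_sandboxes();
        self.desired.difference(&observed).cloned().collect()
    }
}
```

In this shape the driver never sees `reconcile`; the gateway derives drift from what the driver reports.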

## Open questions

- Should `reconcile` be authoritative (mutate external resources to match state) or advisory (log and alert), or extension-by-extension? Lean: extension decides via `ReconcileOutcome`, default advisory.
- Should out-of-tree extensions be a first-class supported model (a downstream builds its own gateway binary with extra extensions linked) or strictly internal? Lean first-class, with best-effort API stability.
Collaborator

I'd start internal and open it up for extension later.


- Should we add a `dry_run` phase for previewing planned mutations without committing? Useful for the policy advisor; out of scope for v1.
Collaborator

+1, we can leave this out of scope.

- Should extensions be able to attach their own tracing spans and OCSF events beyond what the framework supplies? Lean yes; `LifecycleContext` should carry a span that extensions extend.
Collaborator

+1 to enabling this. I can see this being very useful for adding observability signals.

2. Wire the extension chain into the compute subsystem's create, delete, and reconcile paths. Empty and single-element chains are the default and exhibit no behavior change.
3. Extend the gateway sandbox-state store with a `(sandbox_id, extension_name) → ExtensionState` namespace and a startup reconcile pass over it.
4. Add the `compute.extensions` field to the gateway configuration file and the matching environment variable. Validate names against the compiled-in registry at startup.
5. Ship a reference no-op extension in `crates/openshell-extension-example` as a template for downstream authors.
Collaborator

Do we want this in the main tree, or perhaps in examples/openshell-extension-example?
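
For step 3 of the quoted plan, the `(sandbox_id, extension_name) → ExtensionState` namespace could look roughly like the sketch below. This is only an illustration under assumptions: `ExtensionStateStore`, its methods, and the opaque `String` payload are hypothetical, the store is in-memory, and the "startup reconcile pass" is interpreted here as finding entries whose sandbox no longer exists, which the RFC text does not spell out.

```rust
use std::collections::BTreeMap;

/// Opaque per-extension payload; the real shape is not specified in the RFC.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ExtensionState(pub String);

/// Hypothetical in-memory stand-in for the gateway sandbox-state store.
#[derive(Default)]
pub struct ExtensionStateStore {
    // Keyed by (sandbox_id, extension_name).
    entries: BTreeMap<(String, String), ExtensionState>,
}

impl ExtensionStateStore {
    /// Insert or overwrite the state one extension holds for one sandbox.
    pub fn put(&mut self, sandbox_id: &str, extension: &str, state: ExtensionState) {
        self.entries
            .insert((sandbox_id.to_string(), extension.to_string()), state);
    }

    /// Look up the state one extension holds for one sandbox.
    pub fn get(&self, sandbox_id: &str, extension: &str) -> Option<&ExtensionState> {
        self.entries
            .get(&(sandbox_id.to_string(), extension.to_string()))
    }

    /// Startup pass (assumed meaning): return the keys whose sandbox is
    /// not in `live_sandboxes`, in key order.
    pub fn orphaned(&self, live_sandboxes: &[&str]) -> Vec<(String, String)> {
        self.entries
            .keys()
            .filter(|(sandbox_id, _)| {
                !live_sandboxes
                    .iter()
                    .any(|live| *live == sandbox_id.as_str())
            })
            .cloned()
            .collect()
    }
}
```

Keying on the pair keeps each extension's state isolated per sandbox, so one extension cannot overwrite another's entry by accident.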

```toml
[compute]
extensions = ["fleet-labels", "workload-identity"]
```

Collaborator

How would individual extensions be configured? Maybe something like:

```toml
[compute.extension."fleet-labels"]
labels = { fleet = "prod", region = "us-west" }

[compute.extension."workload-identity"]
provider = "aws"
role_arn = "arn:aws:iam::123456789012:role/openshell-sandbox"
```
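
On the validation side of `compute.extensions`, a startup check against the compiled-in registry might be sketched as follows. The names `REGISTRY`, `ConfigError`, and `validate_extensions` are hypothetical, and the sketch assumes the configured list has already been parsed into string slices; it does not cover per-extension tables like the ones suggested above.

```rust
/// Hypothetical compiled-in registry of extension names.
pub const REGISTRY: &[&str] = &["fleet-labels", "workload-identity"];

/// Hypothetical startup configuration error.
#[derive(Debug, PartialEq, Eq)]
pub enum ConfigError {
    UnknownExtension(String),
}

/// Check every configured name against the registry and return the chain
/// in configured order, or fail on the first unknown name.
pub fn validate_extensions(
    configured: &[&str],
    registry: &[&str],
) -> Result<Vec<String>, ConfigError> {
    let mut chain = Vec::with_capacity(configured.len());
    for name in configured {
        if !registry.iter().any(|known| known == name) {
            return Err(ConfigError::UnknownExtension(name.to_string()));
        }
        chain.push(name.to_string());
    }
    Ok(chain)
}
```

An empty list validates to an empty chain, which matches the stated default of no behavior change.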


## Open questions

- Should `reconcile` be authoritative (mutate external resources to match state) or advisory (log and alert), or extension-by-extension? Lean: extension decides via `ReconcileOutcome`, default advisory.
Collaborator

Advisory seems right to start.
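
One possible reading of "extension decides via `ReconcileOutcome`, default advisory" is a trait whose default `reconcile` only reports drift, with authoritative behavior as an explicit override. The sketch below is an assumption throughout: the RFC names `ReconcileOutcome` but the variants, the `LifecycleExtension` trait, `detect_drift`, and both example extensions are invented for illustration.

```rust
/// Hypothetical variants; the RFC only names the type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ReconcileOutcome {
    /// External resources already match stored state.
    InSync,
    /// Drift was found and only reported.
    Advisory { drift: String },
    /// Drift was found and the extension corrected it.
    Repaired { drift: String },
}

pub trait LifecycleExtension {
    /// Describe drift for this sandbox, if any.
    fn detect_drift(&self, sandbox_id: &str) -> Option<String>;

    /// Default is advisory: report drift without mutating anything.
    fn reconcile(&mut self, sandbox_id: &str) -> ReconcileOutcome {
        match self.detect_drift(sandbox_id) {
            None => ReconcileOutcome::InSync,
            Some(drift) => ReconcileOutcome::Advisory { drift },
        }
    }
}

/// Uses the default, advisory reconcile.
pub struct AdvisoryLabels {
    pub label_present: bool,
}

impl LifecycleExtension for AdvisoryLabels {
    fn detect_drift(&self, sandbox_id: &str) -> Option<String> {
        if self.label_present {
            None
        } else {
            Some(format!("{}: label missing", sandbox_id))
        }
    }
}

/// Opts in to authoritative behavior by overriding `reconcile`.
pub struct RepairingLabels {
    pub label_present: bool,
}

impl LifecycleExtension for RepairingLabels {
    fn detect_drift(&self, sandbox_id: &str) -> Option<String> {
        if self.label_present {
            None
        } else {
            Some(format!("{}: label missing", sandbox_id))
        }
    }

    fn reconcile(&mut self, sandbox_id: &str) -> ReconcileOutcome {
        match self.detect_drift(sandbox_id) {
            None => ReconcileOutcome::InSync,
            Some(drift) => {
                // Stand-in for mutating the external resource.
                self.label_present = true;
                ReconcileOutcome::Repaired { drift }
            }
        }
    }
}
```

With this shape an extension that does nothing special stays advisory, and mutation only happens where an author has written it explicitly.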

@drew drew self-assigned this May 14, 2026
@drew drew added the rfc label May 14, 2026
@drew drew moved this from Todo to In progress in OpenShell Roadmap May 15, 2026
