Skip to content

Add burst-proof admission control and fast resource accounting#187

Merged
sjmiller609 merged 6 commits intomainfrom
codex/disk-admission-burst-proofing
Apr 7, 2026
Merged

Add burst-proof admission control and fast resource accounting#187
sjmiller609 merged 6 commits intomainfrom
codex/disk-admission-burst-proofing

Conversation

@sjmiller609
Copy link
Copy Markdown
Collaborator

@sjmiller609 sjmiller609 commented Apr 5, 2026

Summary

  • add burst-proof admission reservations for create, start, and restore
  • enforce disk in the same aggregate admission-control path as cpu, memory, network, and disk I/O
  • speed up resource accounting with cached visible allocations plus cached image, volume, and OCI totals
  • share the same fast visible-allocation path with resource status and hypeman resources

The main thing to balance in this PR is making admission reservation consistent (i.e. burst-proof) but also not slowing down concurrent creation.

Existing Issue

  • Could oversubscribe during bursts
  • Disk was not part of admission control
  • hypeman resources command too slow

What This Changes

  • in-flight lifecycle operations now reserve capacity before slow work starts, so concurrent requests see each other
  • disk admission now uses the same logical/provisioned disk model already reported by Hypeman resources
  • admission control and resource status now use the same cached visible allocation path, so accounting is consistent and much faster

Testing

  • go test ./lib/resources
  • benchmarked current branch against dev-yul-hypeman-1 with a temporary read-only probe built from this code
  • ReserveAllocation: mean 27.6 us, p95 56.1 us
  • GetFullStatus / hypeman resources path: mean 20.0 ms, p95 26.2 ms
  • validated live resource output against the host and confirmed cpu, memory, network, disk I/O, per-instance allocations, and disk overlay/volume accounting matched; image and OCI cache totals matched direct filesystem totals on host, improving accuracy vs currently deployed code

Note

Medium Risk
Touches core admission/resource accounting paths and introduces new in-memory reservation/caching behavior, which could affect capacity enforcement under concurrency. Changes are internal but impact scheduling decisions across CPU/memory/disk/network/disk I/O.

Overview
Implements burst-proof admission control by adding pending resource reservations for in-flight create/start/restore operations, and releasing them only once the instance is marked visible to allocation tracking.

Extends aggregate admission validation to include disk bytes (overlay + volume overlays + images + OCI cache + volumes), and speeds up accounting by caching a conservative instance-allocation view in the instances manager plus cached totals for image/OCI cache bytes and volume bytes (with startup warm-up and incremental refresh on lifecycle/storage changes).

Reviewed by Cursor Bugbot for commit 9566638. Bugbot is set up for automated code reviews on this repo. Configure here.

sjmiller609

This comment was marked as resolved.

@sjmiller609 sjmiller609 requested a review from hiroTamada April 6, 2026 15:18
@sjmiller609 sjmiller609 marked this pull request as ready for review April 6, 2026 15:18
@sjmiller609
Copy link
Copy Markdown
Collaborator Author

Addressed the review follow-ups in the latest commits:

  • switched the read-only CanAllocate / ValidateAllocation paths to RLock
  • documented the intentional visible-then-finish ordering around reservations
  • documented the warmup vs runtime active-state distinction in the cached allocation path
  • hardened disk usage accounting with conservative filesystem fallbacks plus focused tests
  • fixed the restore stale-cache case after guest-network reconfigure failure by rolling the cached visible allocation back before returning the error

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 9566638. Configure here.

state := string(StateStopped)
if active {
state = string(StateCreated)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Admission allocation uses stale state for "Created" only

Low Severity

allocationFromStoredMetadata maps all active instances to state "Created" and all inactive to "Stopped". GetFullStatus in resource.go filters per-instance allocation breakdowns with a hardcoded check for "Running" || "Paused" || "Created". Since both paths now share the same ListInstanceAllocations source, instances that were previously in "Initializing" state (and excluded from the allocation breakdown) are now included as "Created". This is a subtle behavioral change to the /resources API response for any consumer relying on the previous filtering semantics.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 9566638. Configure here.

Copy link
Copy Markdown
Contributor

@hiroTamada hiroTamada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clearing pending review

Copy link
Copy Markdown
Contributor

@hiroTamada hiroTamada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

solid PR — reservation pattern is correctly and consistently applied across create/start/restore, failure rollback paths are handled, caching is a real performance win, and disk admission closes a meaningful gap. one minor comment about cache reconciliation.

@sjmiller609 sjmiller609 marked this pull request as draft April 7, 2026 13:03
@sjmiller609 sjmiller609 force-pushed the codex/disk-admission-burst-proofing branch from f7debfc to facac17 Compare April 7, 2026 13:28
@sjmiller609 sjmiller609 marked this pull request as ready for review April 7, 2026 13:28
@sjmiller609 sjmiller609 merged commit 76a8772 into main Apr 7, 2026
10 of 13 checks passed
@sjmiller609 sjmiller609 deleted the codex/disk-admission-burst-proofing branch April 7, 2026 13:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants