
Auto-standby#183

Merged
sjmiller609 merged 23 commits into main from codex/auto-standby-e2e
Apr 7, 2026

Collaborator

@sjmiller609 sjmiller609 commented Apr 4, 2026

Summary

  • add Linux-only auto-standby built around host conntrack state in a new lib/autostandby package
  • persist and expose per-instance auto_standby policy through instance metadata and API surfaces
  • start the auto-standby controller from the API process and add a default-skipped VM-level E2E test for host->guest inbound TCP activity

Testing

  • go test -count=1 ./lib/autostandby
  • go test -count=1 -run "Test(ValidateUpdateInstanceRequest|CloneStoredMetadataForFork_DeepCopiesReferenceFields)$" ./lib/instances
  • go test -count=1 -run "Test(CreateInstance_MapsAutoStandbyPolicy|UpdateInstance_MapsAutoStandbyPatch)$" ./cmd/api/api
  • go test -run "^$" ./cmd/api
  • sudo -n env PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:$PATH HYPEMAN_RUN_AUTO_STANDBY_E2E=1 go test -count=1 -run ^TestAutoStandbyCloudHypervisorActiveInboundTCP$ ./lib/instances on deft-kernel-dev
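The last command opts in to the VM-level test by setting HYPEMAN_RUN_AUTO_STANDBY_E2E=1. As a rough sketch of what a default-skipped gate like this could look like: the helper name `e2eEnabled` and the exact comparison against "1" are assumptions for illustration, not code from this branch.

```go
package main

import (
	"fmt"
	"os"
)

// e2eEnvVar is the opt-in variable used in the command above.
const e2eEnvVar = "HYPEMAN_RUN_AUTO_STANDBY_E2E"

// e2eEnabled reports whether the opt-in variable is set to "1".
// lookup is injected so the gate can be checked without touching the
// real environment. Hypothetical helper; the real test may differ.
func e2eEnabled(lookup func(string) (string, bool)) bool {
	v, ok := lookup(e2eEnvVar)
	return ok && v == "1"
}

func main() {
	if !e2eEnabled(os.LookupEnv) {
		fmt.Println("skipping: set " + e2eEnvVar + "=1 to run")
		return
	}
	fmt.Println("running auto-standby E2E")
}
```

Inside a real test function the skip branch would typically call `t.Skip` instead of printing.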

Integration test coverage

The default-skipped Linux integration test exercised a real Cloud Hypervisor VM with networking enabled and a real conntrack-backed auto-standby controller.

It verified that:

  • a host-to-guest TCP connection to nginx appears as qualifying inbound activity in conntrack
  • the instance stays Running while that inbound TCP connection remains open
  • once the final inbound TCP connection closes, the controller allows the configured idle timeout to elapse and then transitions the instance to Standby
  • the test uses the real Linux conntrack path instead of ingress state or TAP byte counters
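The behaviour verified above boils down to one decision: stay Running while any inbound connection is open, and only go to Standby once the instance has been idle for the full timeout. A minimal sketch of that rule follows; the function name `standbyDue` and its parameters are assumptions, and the real controller tracks more state than this.

```go
package main

import (
	"fmt"
	"time"
)

// standbyDue reports whether an instance should transition to Standby.
// activeInbound is the number of open qualifying inbound TCP connections,
// idleFor is the time since the last one closed. Hypothetical helper.
func standbyDue(activeInbound int, idleFor, idleTimeout time.Duration) bool {
	// Any open inbound connection keeps the instance Running.
	if activeInbound > 0 {
		return false
	}
	// With no connections, wait for the full idle timeout to elapse.
	return idleFor >= idleTimeout
}

func main() {
	timeout := 30 * time.Second
	fmt.Println(standbyDue(1, time.Hour, timeout))      // connection still open
	fmt.Println(standbyDue(0, 10*time.Second, timeout)) // idle, but not long enough
	fmt.Println(standbyDue(0, 30*time.Second, timeout)) // idle for the full timeout
}
```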

Note

Medium Risk
Introduces a new background controller that can transition VMs to Standby based on host conntrack state and persists new per-instance metadata, which could affect instance lifecycle behavior if misconfigured. Risk is mitigated by being opt-in via an auto_standby policy and having validation plus tests, but it touches core instance create/update paths and process startup.

Overview
Adds an opt-in auto-standby feature that monitors host-side IPv4 TCP conntrack activity and automatically places eligible Linux VMs into Standby after a configured idle timeout.
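To make "IPv4 TCP conntrack activity" concrete, here is a small sketch of how a flow might be classified as qualifying inbound activity for a guest. The `flow` struct, `mustFlow`, and `qualifiesInbound` are hypothetical and heavily simplified; the actual filter in lib/autostandby may apply further conditions, and the addresses are illustrative only.

```go
package main

import (
	"fmt"
	"net/netip"
)

// flow is a simplified, hypothetical view of a conntrack entry
// (original direction only).
type flow struct {
	Proto    string // "tcp", "udp", ...
	Src, Dst netip.Addr
}

// mustFlow builds a flow from string addresses and panics on bad input.
func mustFlow(proto, src, dst string) flow {
	return flow{Proto: proto, Src: netip.MustParseAddr(src), Dst: netip.MustParseAddr(dst)}
}

// qualifiesInbound reports whether f is IPv4 TCP traffic addressed to guest.
func qualifiesInbound(f flow, guest netip.Addr) bool {
	return f.Proto == "tcp" && f.Dst.Is4() && f.Dst == guest
}

func main() {
	guest := netip.MustParseAddr("10.100.0.2") // illustrative guest address
	fmt.Println(qualifiesInbound(mustFlow("tcp", "10.100.0.1", "10.100.0.2"), guest)) // host to guest TCP
	fmt.Println(qualifiesInbound(mustFlow("udp", "10.100.0.1", "10.100.0.2"), guest)) // not TCP
	fmt.Println(qualifiesInbound(mustFlow("tcp", "10.100.0.2", "192.0.2.10"), guest)) // guest-initiated
}
```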

Exposes the per-instance auto_standby policy through the instance create/update APIs (with validation and OAPI mapping), persists controller-owned runtime timestamps in metadata.json, and adds a new per-instance GetAutoStandbyStatus diagnostic endpoint.

Wires the new autostandby.Controller into the API process lifecycle (Wire provider + startup goroutine), and includes unit tests plus a default-skipped Linux E2E test that verifies host→guest TCP activity prevents standby until connections close.
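For readers unfamiliar with the pattern, a controller started from a process-level goroutine usually reduces to a loop that reacts to ticks until shutdown. The sketch below shows only that shape, assuming a context-driven shutdown; `loop` and `resync` are hypothetical names, the short intervals are for demonstration, and the real autostandby.Controller also consumes conntrack events.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// loop calls resync on every tick until done is closed, and returns how
// many resyncs ran. Hypothetical stand-in for the controller's event loop.
func loop(done <-chan struct{}, tick <-chan time.Time, resync func()) int {
	n := 0
	for {
		select {
		case <-done:
			return n
		case <-tick:
			resync()
			n++
		}
	}
}

func main() {
	// Shutdown is driven by the context; 50ms stands in for process lifetime.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	// Demo interval only; a production resync would run far less often.
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()

	finished := make(chan int)
	go func() { finished <- loop(ctx.Done(), ticker.C, func() {}) }()
	fmt.Println("resyncs before shutdown:", <-finished)
}
```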

Reviewed by Cursor Bugbot for commit 896ee1a. Bugbot is set up for automated code reviews on this repo.

@github-actions

github-actions bot commented Apr 4, 2026

✱ Stainless preview builds

This PR will update the hypeman SDKs with the following commit message.

feat: Add Linux auto-standby controller and E2E coverage
hypeman-openapi studio · code

Your SDK build had at least one "note" diagnostic.
generate ✅

hypeman-go studio · code

Your SDK build had at least one "note" diagnostic.
generate ✅ build ⏭️ lint ✅ test ✅

go get github.com/stainless-sdks/hypeman-go@211a8a5c99cd759999d46d69f290717fb6eeabf1
hypeman-typescript studio · code

Your SDK build had at least one "note" diagnostic.
generate ✅ build ✅ lint ✅ test ✅


This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-04-07 22:36:42 UTC

@sjmiller609 sjmiller609 changed the title from "Add Linux auto-standby controller and E2E coverage" to "Auto-standby" Apr 4, 2026
@sjmiller609 sjmiller609 marked this pull request as ready for review April 4, 2026 20:15
@sjmiller609 sjmiller609 requested a review from hiroTamada April 4, 2026 20:15
@sjmiller609 sjmiller609 marked this pull request as draft April 4, 2026 20:16
@sjmiller609 sjmiller609 removed the request for review from hiroTamada April 5, 2026 15:31
sjmiller609

This comment was marked as resolved.

@sjmiller609 sjmiller609 marked this pull request as ready for review April 6, 2026 15:11
@sjmiller609 sjmiller609 requested a review from hiroTamada April 6, 2026 15:11
@sjmiller609 sjmiller609 requested a review from hiroTamada April 6, 2026 15:43
@sjmiller609 sjmiller609 marked this pull request as draft April 6, 2026 17:37
@sjmiller609 sjmiller609 requested a review from hiroTamada April 6, 2026 21:53
@sjmiller609 sjmiller609 marked this pull request as ready for review April 6, 2026 21:53
Contributor

@hiroTamada hiroTamada left a comment


well-designed feature. the single-goroutine event loop with conntrack integration is solid, state machine is correct, restart recovery via persisted runtime is thoughtful, and the periodic snapshot sync provides a good safety net. code is clean, well-tested, and the README is excellent.

one minor nit inline about O(N) iteration per conntrack event — not blocking, just something to revisit if host density grows significantly.
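If the per-event scan ever does become a bottleneck, one possible direction (not something this PR does, as far as the conversation shows) is to keep a map from guest IPv4 address to instance ID so each conntrack event resolves in constant time. The types below are hypothetical.

```go
package main

import "fmt"

// ipv4 is a comparable key type for a guest's IPv4 address.
type ipv4 [4]byte

// guestIndex maps guest addresses to instance IDs for O(1) lookups.
// Hypothetical structure, not part of lib/autostandby.
type guestIndex struct {
	byIP map[ipv4]string
}

func newGuestIndex() *guestIndex {
	return &guestIndex{byIP: make(map[ipv4]string)}
}

// Set records (or replaces) the instance that owns ip.
func (g *guestIndex) Set(ip ipv4, instanceID string) { g.byIP[ip] = instanceID }

// Remove drops the entry for ip, for example when an instance is deleted.
func (g *guestIndex) Remove(ip ipv4) { delete(g.byIP, ip) }

// Lookup returns the instance that owns ip, if any.
func (g *guestIndex) Lookup(ip ipv4) (string, bool) {
	id, ok := g.byIP[ip]
	return id, ok
}

func main() {
	idx := newGuestIndex()
	idx.Set(ipv4{10, 100, 0, 2}, "inst-a") // illustrative address and ID
	id, ok := idx.Lookup(ipv4{10, 100, 0, 2})
	fmt.Println(id, ok)
}
```

The trade-off is that the index has to be kept in sync with instance create, delete, and address changes, which the plain scan avoids.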

@hiroTamada
Contributor

a few non-blocking observations from the review:

  1. lifecycle event gap for auto-standby-only updates — the early-return path in updateInstance (when len(req.Env) == 0) saves metadata and returns without emitting a lifecycle event. the controller won't learn about the policy change until the next periodic snapshot sync (5 min). if this hasn't been wired up elsewhere in the lifecycle events system, might be worth a quick follow-up to emit an InstanceEventUpdate on that path.

  2. no test for handleStandbyTimer calling StandbyInstance — the fakes are already set up to track it (fakeInstanceStore.standbyIDs), just no test exercises the timer-fires → standby-called → state-cleared path, or the retry-on-failure path. would be a good addition given it's the most critical action the controller takes.

  3. PR scope — this bundles several distinct features (auto-standby, distributed tracing refactor, admission control, image retention, WaitForState). each piece is individually clean, but the combined diff (~13k lines, 151 files) makes it a large review surface. not a blocker, just noting for future PRs.

@sjmiller609

This comment was marked as resolved.

@sjmiller609
Collaborator Author

Follow-up on Hiro's top-level notes:

  • I added focused controller tests for the standby timer path in f9800fe, covering both the successful handleStandbyTimer -> StandbyInstance -> state cleared flow and the failure/rearm path.
  • I also added a manager-level regression test proving an auto-standby-only UpdateInstance still publishes LifecycleEventUpdate through the public manager wrapper, so that path does not wait for the 5-minute snapshot sync.
  • On PR scope: agreed in principle about keeping review surfaces smaller, but I don't think the current branch includes unrelated features like admission control, image retention, or WaitForState; this PR is the auto-standby work plus closely related wiring/docs/tests.
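For context on the first bullet, the two timer paths can be illustrated with a stripped-down fake. This is only a sketch: the names `handleStandbyTimer`, `StandbyInstance`, and `fakeInstanceStore.standbyIDs` come from the review comments, but the signatures, the `failNext` knob, and the `armed` map are assumptions and do not mirror the tests added in f9800fe.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeInstanceStore records which instances were successfully put on standby.
type fakeInstanceStore struct {
	standbyIDs []string
	failNext   bool // hypothetical knob: make the next call fail once
}

func (f *fakeInstanceStore) StandbyInstance(id string) error {
	if f.failNext {
		f.failNext = false
		return errors.New("standby failed")
	}
	f.standbyIDs = append(f.standbyIDs, id)
	return nil
}

// controller is a minimal stand-in holding only the pending-timer state.
type controller struct {
	store *fakeInstanceStore
	armed map[string]bool // instances with a pending standby timer
}

// handleStandbyTimer tries to standby the instance. On failure the timer
// stays armed so a later firing retries; on success the state is cleared.
func (c *controller) handleStandbyTimer(id string) {
	if err := c.store.StandbyInstance(id); err != nil {
		c.armed[id] = true
		return
	}
	delete(c.armed, id)
}

func main() {
	store := &fakeInstanceStore{failNext: true}
	c := &controller{store: store, armed: map[string]bool{"inst-a": true}}

	c.handleStandbyTimer("inst-a") // failure path: timer stays armed
	fmt.Println(c.armed["inst-a"], store.standbyIDs)

	c.handleStandbyTimer("inst-a") // retry succeeds: state cleared
	fmt.Println(c.armed["inst-a"], store.standbyIDs)
}
```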


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


Reviewed by Cursor Bugbot for commit f9800fe.

@sjmiller609 sjmiller609 merged commit 9f6b171 into main Apr 7, 2026
9 of 11 checks passed
@sjmiller609 sjmiller609 deleted the codex/auto-standby-e2e branch April 7, 2026 22:35

2 participants