feat(mesh): event hooks, auto-reconnect, registry client, Ed25519 verify, heartbeat #2090

Merged
imran-siddique merged 8 commits into microsoft:main from pallakatos:azureclaw-meshclient-event-hooks on May 11, 2026


pallakatos commented May 11, 2026

Contributor

Summary

This PR closes 5 reliability/correctness gaps in the AgentMesh
TypeScript client + Python registry that surfaced when running
@microsoft/agent-governance-sdk against real-world workloads
(microsoft/azureclaw — a multi-agent runtime that stress-tests the
mesh with chunked file transfers, KNOCK races, and long-lived
WebSocket sessions across pod restarts).

All gaps were identified by patch-by-patch comparison against the
vendored fork of agentmesh (amitayks/agentmesh) that AzureClaw
shipped before adopting AGT. Once these land, AGT is feature-equivalent
for our use case and we can drop our vendored fork.

What changes

TypeScript SDK (agent-governance-typescript/)

  1. feat(mesh-client): event hooks: onError, onDisconnect,
    onE2EVerified. Lets host runtimes observe lifecycle events
    without hand-patching the client.

  2. feat(mesh-client): KNOCK auto-bootstrap (G1) + auto-reconnect (G2)

    • Receiver-side X3DH bootstrap: when a peer KNOCKs and we have no
      session, we resolve their bundle, run X3DH ourselves, and ACK.
      Mirrors vendored sdk patch #4b. Without this, the first
      message from a new peer is dropped.
    • Auto-reconnect with exponential backoff capped at 60s; opt-out
      via autoReconnect: false. Mirrors vendored sdk patch #9.
      Without this, a single network blip silently mesh-deafens the
      client forever.
  3. feat(mesh): G3 + G4 + G5

  4. feat(mesh): RegistryClient + auto-register on connect()

    • New registry-client.ts (~394 LoC) wrapping the Python
      registry's REST surface (POST/GET/PUT/DELETE /v1/agents,
      /lookup, /discover, /heartbeat).
    • MeshClient.connect() now optionally auto-registers
      (autoRegister: true, default), uploads prekeys, and starts a
      periodic heartbeat. Opt-out for tests.
    • 11 new tests covering registration, prekey upload, search,
      heartbeat, and timeout/error paths.
  5. fix(mesh): include Ed25519 identity_key_ed in prekey bundle

    • Soundness fix. PreKeyBundle.identityKeyEd was previously
      aliased to the X25519 identity_key, so verifyBundle() would
      accept any signed pre-key — defeating X3DH's signature check.
    • Now: registry stores both identity_key (X25519) and
      identity_key_ed (Ed25519). Bundle ships both. verifyBundle()
      rejects bundles missing identity_key_ed.
    • ⚠️ Wire-incompatible by design — receivers on this version
      will reject prekey bundles uploaded by older clients (correct
      behavior; the old bundles were unverifiable). Coordinate
      deployment.

Python registry (agent-governance-python/agent-mesh/)

  1. feat(registry): POST /v1/agents/{did}/heartbeat

    • Bumps last_seen without re-uploading the full bundle.
    • Lets long-lived clients prove liveness cheaply (the JS
      RegistryClient calls this every 30s when registered).
    • Tests: test_registry.py round-trip.
  2. Companion identity_key_ed storage: AgentRecord gets a new
    optional field; CRUD endpoints round-trip it.

Test results

  • TypeScript: npm test — 426/426 pass, 34 suites
  • Python: existing registry tests + new heartbeat assertions pass
    (pytest agent-mesh/tests/test_registry.py)
  • Integration: deployed to AzureClaw production AKS cluster
    (mesh.provider=agt); 7 sub-agents heartbeat at 30s cadence,
    cross-provider sibling discovery works, chunked file-transfer
    (which needs both G3 and G4) succeeds end-to-end.

Compatibility notes

  • All TypeScript additions are opt-in or backward-compatible via
    default options:
    • autoReconnect: true (new default — previously connections were
      one-shot). Set to false to preserve old behavior.
    • autoRegister: true (new default if keyManager is provided).
      Set to false for tests / external registry management.
    • Event hooks are no-ops if no handler is registered.
  • The Ed25519 identity_key_ed field is wire-incompatible with
    older clients (verifyBundle rejects bundles without it). Treat as
    a coordinated upgrade. We can split this commit out into a separate
    PR if you'd prefer to land the soundness fix on its own deployment
    cadence.

Commits

SHA Subject
e5f4346f feat(mesh-client): add onError, onDisconnect, onE2EVerified event hooks
d75ea37b feat(mesh-client): close gaps G1 (KNOCK auto-bootstrap) and G2 (auto-reconnect)
3a96a0f2 feat(mesh): close vendored-parity gaps G3, G4, G5
f5db53f3 feat(mesh): add RegistryClient and auto-register on MeshClient.connect()
081e8efc feat(registry): add POST /v1/agents/{did}/heartbeat to bump last_seen
dff0969c fix(mesh): include Ed25519 identity_key_ed in prekey bundle so verifyBundle() works
b8bf31eb test(mesh): set autoRegister:false in upstream knock + malformed-frame tests

(The merge commit 1c2218d8 brings in 205 commits from main that
landed during our fork lifetime; nothing of ours is in there. The
b8bf31eb test fix is for two upstream tests that became flaky after
my registry-client changes — they construct MeshClient without
autoRegister: false and now hit a real fetch.)

Asks

  1. Review the wire-incompat Ed25519 fix (dff0969c) carefully —
    we can split it off if you want to ship the rest first.
  2. Happy to break this into multiple smaller PRs (event hooks /
    reconnect / registry / Ed25519 fix / heartbeat) if reviewers
    prefer — let me know.

Pal Lakatos-Toth and others added 8 commits May 10, 2026 01:28
Adds three observer-registration methods to MeshClient to align with the
AzureClaw vendored AgentMesh SDK surface so consumers can swap providers
behind a single transport interface:

  onError(handler)        — fires on ws errors and decrypt failures
  onDisconnect(handler)   — fires on ws.close with reason+code
  onE2EVerified(handler)  — fires on first successful decrypt per peer

Pure additions; no behaviour change for existing flows. Decrypt-failure
and missing-session paths now drop the message AND notify error
observers instead of silently dropping (still safe — unhandled events
are swallowed).
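
A minimal sketch of the observer pattern described above, assuming a hypothetical ErrorHooks class and an (source, error) handler signature; the SDK's real types may differ. It shows why an unregistered or throwing handler is harmless: emit() is a no-op with no handlers and swallows handler exceptions.

```typescript
// Hypothetical sketch; not the SDK's actual implementation.
type ErrorHandler = (source: string, error: Error) => void;

class ErrorHooks {
  private handlers: ErrorHandler[] = [];

  // Register an observer; the returned function unregisters it.
  onError(handler: ErrorHandler): () => void {
    this.handlers.push(handler);
    return () => {
      this.handlers = this.handlers.filter((h) => h !== handler);
    };
  }

  // Notify every observer and return how many completed without throwing.
  // A throwing handler is swallowed so it cannot starve later handlers.
  emit(source: string, error: Error): number {
    let delivered = 0;
    for (const handler of this.handlers.slice()) {
      try {
        handler(source, error);
        delivered++;
      } catch {
        // Intentionally ignored: observers must not break the client.
      }
    }
    return delivered;
  }
}
```

Returning an unsubscribe function is a design choice of this sketch, not something the commit message states.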

Tests: 8 new in mesh-client-event-hooks.test.ts; full suite 387/387 green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…reconnect)

Two protocol-level gaps were identified during the AzureClaw vendored-SDK
audit (see Azure/kars#245 docs/agt-vs-vendored-sdk.md). Both block
moving fully off the vendored agentmesh-sdk fork onto upstream AGT.

## G1: receiver-side X3DH auto-bootstrap from KNOCK

Before: the responder side of an encrypted session was created only by
calling acceptSession(peerId, establishment) explicitly. But the
ChannelEstablishment was never carried on the wire — neither in
knock_accept nor in the first message frame — so the receiver had no way
to obtain it. Result: any fresh encrypted session failed with
"No encrypted session" the first time a ciphertext arrived.

Mirrors vendored agentmesh-sdk patch #4b. Behavior:

- Sender (establishSession): builds the SecureChannel first, then sends
  KNOCK with the ChannelEstablishment embedded as
  { ik: base64, ek: base64, otk?: number }.
- Receiver (handleKnock): if the knock contains `establishment` AND no
  prior session exists, deserialize and call acceptSession() automatically
  before responding with knock_accept. On bad establishment data, the
  knock is rejected and onError("knock", ...) fires.
- Backwards-compatible: legacy peers that don't embed establishment
  continue to work; caller must invoke acceptSession() manually as before.
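
The { ik, ek, otk? } wire shape above could be marshalled roughly as follows. This is a sketch under assumptions: the in-memory field names (identityKey, ephemeralKey, oneTimeKeyId), the reading of otk as a one-time prekey id, and the 32-byte length check are illustrative, and the global atob/btoa functions are assumed to be available (as in current Node.js and browsers).

```typescript
// Hypothetical in-memory shape; the SDK's real ChannelEstablishment may differ.
interface ChannelEstablishment {
  identityKey: Uint8Array;  // sender's X25519 identity key (32 bytes)
  ephemeralKey: Uint8Array; // sender's X25519 ephemeral key (32 bytes)
  oneTimeKeyId?: number;    // assumed meaning of "otk": consumed one-time prekey id
}

// Wire shape quoted in the commit message.
interface WireEstablishment {
  ik: string;
  ek: string;
  otk?: number;
}

function toBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

function fromBase64(text: string): Uint8Array {
  const binary = atob(text); // throws on characters outside the base64 alphabet
  const out = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) out[i] = binary.charCodeAt(i);
  return out;
}

function serializeEstablishment(e: ChannelEstablishment): WireEstablishment {
  const wire: WireEstablishment = { ik: toBase64(e.identityKey), ek: toBase64(e.ephemeralKey) };
  if (e.oneTimeKeyId !== undefined) wire.otk = e.oneTimeKeyId;
  return wire;
}

// Throws on malformed input; a receiver would reject the knock in that case.
function deserializeEstablishment(wire: WireEstablishment): ChannelEstablishment {
  const identityKey = fromBase64(wire.ik);
  const ephemeralKey = fromBase64(wire.ek);
  if (identityKey.length !== 32 || ephemeralKey.length !== 32) {
    throw new Error("establishment keys must be 32 bytes");
  }
  if (wire.otk !== undefined && !Number.isInteger(wire.otk)) {
    throw new Error("otk must be an integer");
  }
  const result: ChannelEstablishment = { identityKey, ephemeralKey };
  if (wire.otk !== undefined) result.oneTimeKeyId = wire.otk;
  return result;
}
```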

## G2: auto-reconnect on transport drop

Before: MeshClient.reconnect() existed but was never called automatically.
Network blips, relay restarts, AKS node OOM-evicts left the client
disconnected forever — agents went mesh-deaf.

Mirrors vendored agentmesh-sdk patch #9. Behavior:

- New options: autoReconnect (default true), maxReconnectAttempts
  (default Number.POSITIVE_INFINITY), reconnectBaseDelayMs (default
  1000), reconnectMaxDelayMs (default 60000).
- On non-1000 ws.onclose: schedule reconnect with exponential backoff
  capped at reconnectMaxDelayMs. Light jitter (±20%) avoids
  thundering-herd reconnects across many sandboxes.
- onDisconnect handlers fire BEFORE the reconnect is scheduled so
  observers can see the drop.
- After maxReconnectAttempts, fires onError("ws", ..., "auto-reconnect
  gave up after N attempts") and stops.
- disconnect() cancels any pending reconnect timer.
- A connect() failure inside the reconnect path schedules another retry
  via the existing onclose path (and recursively from the catch handler
  in scheduleReconnect for the case where ws.onopen never fired).
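
The backoff schedule above can be sketched as a pure function. Assumptions in this sketch: the attempt counter is zero-based, the delay doubles per attempt, and the cap is re-applied after jitter so the result never exceeds reconnectMaxDelayMs; the function names are hypothetical and the SDK may order these steps differently.

```typescript
// Hypothetical helper; defaults mirror the option defaults listed above.
function reconnectDelayMs(
  attempt: number,
  baseMs: number = 1000,
  maxMs: number = 60000,
  random: () => number = Math.random,
): number {
  // Exponential growth: base, 2x, 4x, ... capped at maxMs.
  const exponential = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  // Jitter factor in [0.8, 1.2), i.e. +/-20% around the nominal delay.
  const jitter = 1 + (random() * 2 - 1) * 0.2;
  return Math.min(maxMs, Math.round(exponential * jitter));
}

// Close code 1000 is a normal closure, so only other codes trigger a retry.
function shouldAutoReconnect(closeCode: number, autoReconnect: boolean): boolean {
  return autoReconnect && closeCode !== 1000;
}
```

Injecting the random source keeps the schedule deterministic in tests.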

## Tests

11 new tests across two files:
- tests/mesh-client-knock-bootstrap.test.ts: 5 tests covering sender-side
  embedding, receiver auto-bootstrap, legacy fallback, malformed
  establishment, and end-to-end happy-path encrypted send.
- tests/mesh-client-auto-reconnect.test.ts: 6 tests covering server-close
  reconnect, client-close no-reconnect, opt-out, give-up after max
  attempts, disconnect cancels pending timer, no duplicate scheduling.

Full AGT TS suite: 398/398 pass (was 387 before this commit). Build clean.

Both gaps were identified in the AzureClaw audit doc:
docs/agt-vs-vendored-sdk.md (Patch-by-patch audit section).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
G3 - session-desync teardown (vendored agentmesh-sdk patch #13 equiv):
when Double Ratchet decryption fails for an existing session, the local
ratchet state is irrecoverable. Previously we only fired
onError('decrypt'); now we tear down the session (delete from
this.sessions; clear knockAccepted) and fire a distinct
onError('session_desync') so callers can re-run establishSession() and
resume communication.
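
A sketch of the teardown, assuming a hypothetical SessionTable that models a session as a decrypt callback and knockAccepted as a set of peer ids. The "no_session" error kind is invented for illustration; only "session_desync" comes from the text above.

```typescript
type Decrypt = (ciphertext: Uint8Array) => Uint8Array;

// Hypothetical stand-in for the client's session bookkeeping.
class SessionTable {
  readonly sessions = new Map<string, Decrypt>();
  readonly knockAccepted = new Set<string>();
  readonly errors: string[] = []; // stands in for onError notifications

  receive(peerId: string, ciphertext: Uint8Array): Uint8Array | undefined {
    const decrypt = this.sessions.get(peerId);
    if (!decrypt) {
      this.errors.push("no_session"); // hypothetical kind for the missing-session path
      return undefined;
    }
    try {
      return decrypt(ciphertext);
    } catch {
      // The ratchet state cannot be repaired, so drop everything for this
      // peer and let the caller re-run establishSession().
      this.sessions.delete(peerId);
      this.knockAccepted.delete(peerId);
      this.errors.push("session_desync");
      return undefined;
    }
  }
}
```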

G4 - pre-KNOCK encrypted-message buffer (vendored agentmesh-sdk
patch #16 equiv): under transport reordering the relay can deliver an
encrypted 'message' frame BEFORE the matching 'knock' arrives. Without
buffering, that first frame was dropped silently. Now MeshClient buffers
encrypted frames per-peer (default cap 5, TTL 3000ms), replays them
when the corresponding knock is accepted, and drops them when rejected
or on disconnect. Disabled by setting preKnockBufferSize: 0.
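
One way to picture the buffer is the sketch below. The PreKnockBuffer class, its method names, and the overflow policy (a full queue drops the newest frame) are assumptions; the text above only fixes the per-peer cap, the TTL, and the replay/drop behaviour.

```typescript
interface BufferedFrame {
  payload: Uint8Array;
  receivedAt: number;
}

// Hypothetical per-peer buffer with a size cap and a time-to-live.
class PreKnockBuffer {
  private frames = new Map<string, BufferedFrame[]>();
  private maxPerPeer: number;
  private ttlMs: number;
  private now: () => number;

  constructor(maxPerPeer: number = 5, ttlMs: number = 3000, now: () => number = Date.now) {
    this.maxPerPeer = maxPerPeer;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns false when the frame is dropped (buffering disabled or queue full).
  push(peerId: string, payload: Uint8Array): boolean {
    if (this.maxPerPeer <= 0) return false;
    const queue = this.frames.get(peerId) ?? [];
    if (queue.length >= this.maxPerPeer) return false;
    queue.push({ payload, receivedAt: this.now() });
    this.frames.set(peerId, queue);
    return true;
  }

  // Knock accepted: hand back unexpired frames in arrival order and forget them.
  drain(peerId: string): Uint8Array[] {
    const queue = this.frames.get(peerId) ?? [];
    this.frames.delete(peerId);
    const cutoff = this.now() - this.ttlMs;
    return queue.filter((f) => f.receivedAt >= cutoff).map((f) => f.payload);
  }

  // Knock rejected: drop this peer's frames.
  discard(peerId: string): void {
    this.frames.delete(peerId);
  }

  // Disconnect: drop everything.
  clear(): void {
    this.frames.clear();
  }
}
```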

G5 - eager ghost-connection close on rebind (vendored agentmesh-relay
patch #2 equiv): when an agent reconnects with the same DID, the prior
'ghost' WebSocket is now closed eagerly with code 1000 'session_replaced'
instead of waiting for the 90s heartbeat-eviction timer. The 'finally'
cleanup now compares socket identity to avoid removing the freshly
rebound connection when the old socket's handler unwinds.
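
The relay itself is Python; the identity-compare idea is sketched here in TypeScript for consistency with the other examples. ConnectionRegistry and ClosableSocket are hypothetical names, and the real relay's data structures may look different.

```typescript
interface ClosableSocket {
  close(code: number, reason: string): void;
}

// Hypothetical DID-to-socket table illustrating eager close plus safe cleanup.
class ConnectionRegistry {
  private byDid = new Map<string, ClosableSocket>();

  // Rebinding a DID closes the previous ("ghost") socket immediately.
  bind(did: string, socket: ClosableSocket): void {
    const prior = this.byDid.get(did);
    if (prior !== undefined && prior !== socket) {
      prior.close(1000, "session_replaced");
    }
    this.byDid.set(did, socket);
  }

  // Cleanup for a closing socket: remove the entry only if it still points
  // at this exact socket, so a ghost unwinding late cannot evict its successor.
  unbind(did: string, socket: ClosableSocket): void {
    if (this.byDid.get(did) === socket) {
      this.byDid.delete(did);
    }
  }

  get(did: string): ClosableSocket | undefined {
    return this.byDid.get(did);
  }
}
```

Because the ghost is closed with code 1000, a client that only auto-reconnects on non-1000 closes will not fight the connection that replaced it.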

Tests: +5 (G4 pre-knock buffer behaviour), +2 (G3 type + lifecycle
safety), +1 (G5 ghost rebind). 405 TS tests pass; 18 relay tests pass.

NOTE: held local-only on branch azureclaw-meshclient-event-hooks; will
be coordinated upstream with the AGT team.
Closes the structural gap where MeshClientOptions declared registryUrl
but no code path ever used it. Without this, agents can connect to the
relay and exchange ciphertext but are invisible to peers — they never
appear in /v1/discover and no one can fetch their X3DH pre-key bundle.

This commit adds the missing registry-side glue, in upstream-quality
shape, so AGT MeshClient becomes a drop-in replacement for vendored
forks that wired their own registry calls (e.g. AzureClaw vendor/agentmesh-sdk).

What's new:

- src/encryption/registry-client.ts: typed HTTP client wrapping the
  AgentMesh registry's REST surface (POST /v1/agents,
  PUT/GET /v1/agents/{did}/prekeys, GET /v1/discover, GET/DELETE
  /v1/agents/{did}). base64url <-> Uint8Array marshalling, optional
  Ed25519-Timestamp Authorization signer, configurable retry/timeout.

- src/encryption/mesh-client.ts:
  * New options: registryClient, registryClientOptions, capabilities,
    registrationMetadata, oneTimePrekeyCount, autoRegister.
  * connect() now calls registerSelf() at the end of a successful WS
    handshake when autoRegister is true (default) and a registry is
    configured. Idempotent — 409 (already registered) is treated as
    success; reconnect doesn't re-register.
  * registerSelf(): generates signed-prekey + N one-time prekeys via
    keyManager, then POSTs /v1/agents (capabilities = [displayName,
    ...options.capabilities]) and PUTs the prekey bundle.
  * discover(capability): wraps registry.discover.
  * establishSessionWithPeer(peerId): fetches peer prekeys from the
    registry and calls existing establishSession.
  * getRegistry(): exposes the underlying client for advanced cases.

- tests/registry-client.test.ts: 11 new tests covering wire format,
  base64url round-trip, idempotent register (409), 5xx retry, and
  Authorization header. Also covers MeshClient auto-register on connect
  end-to-end with a fake fetch + fake WebSocket.

- tests/mesh-client-*.test.ts: existing fixtures updated to pass
  autoRegister: false (no registry available in those tests). Behavior
  unchanged for the production path.

All 71 tests pass (60 previously + 11 new).
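
The idempotent-register and 5xx-retry behaviour described above can be sketched as a status classifier plus a small retry loop. The names (classifyRegisterStatus, registerWithRetry), the injected post callback, and the default of three attempts are assumptions for illustration, not the RegistryClient API.

```typescript
type RegisterOutcome = "registered" | "already_registered" | "retry" | "fail";

// Map an HTTP status from POST /v1/agents to what the caller should do next.
function classifyRegisterStatus(status: number): RegisterOutcome {
  if (status >= 200 && status < 300) return "registered";
  if (status === 409) return "already_registered"; // treated as success
  if (status >= 500 && status < 600) return "retry";
  return "fail";
}

// `post` is a hypothetical callback that performs the request and resolves
// to the HTTP status code. Resolves to true once the agent is registered.
async function registerWithRetry(
  post: () => Promise<number>,
  maxAttempts: number = 3,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const outcome = classifyRegisterStatus(await post());
    if (outcome === "registered" || outcome === "already_registered") return true;
    if (outcome === "fail") return false;
    // outcome === "retry": fall through to the next attempt
  }
  return false;
}
```

A real client would normally wait between retries; the delay is omitted here to keep the sketch short.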

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The registry's last_seen field is set on registration but never
updated by any HTTP handler — update_last_seen() exists in
store.py but is dead code in production. As a result, every
registered agent looks 'offline' (online=false on /presence) 90s
after spawn, and any client-side discover stale-filter rejects
all live agents.

Add an unauthenticated POST /v1/agents/{did}/heartbeat that calls
store.update_last_seen(). Idempotent. Returns 404 for unknown DIDs
so callers can detect a registry restart and re-register.

Mirrors the agent-side ping cadence (30s, well within the 90s
presence threshold).
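
Seen from the client side, the cadence arithmetic and the 404 handling look roughly like this. The helper names and the inclusive threshold comparison are assumptions; only the 30s interval, the 90s threshold, and the 404-means-re-register rule come from the text above.

```typescript
const HEARTBEAT_INTERVAL_MS = 30000;
const PRESENCE_THRESHOLD_MS = 90000;

// An agent counts as online while its last heartbeat is within the threshold.
function isOnline(
  lastSeenMs: number,
  nowMs: number,
  thresholdMs: number = PRESENCE_THRESHOLD_MS,
): boolean {
  return nowMs - lastSeenMs <= thresholdMs;
}

type HeartbeatAction = "ok" | "reregister" | "retry";

// Decide what to do with the status returned by POST /v1/agents/{did}/heartbeat.
function heartbeatAction(status: number): HeartbeatAction {
  if (status >= 200 && status < 300) return "ok";
  if (status === 404) return "reregister"; // registry no longer knows this DID
  return "retry";
}

// Number of heartbeat intervals that fit inside the presence window.
const beatsPerWindow = PRESENCE_THRESHOLD_MS / HEARTBEAT_INTERVAL_MS;
```

With these numbers three intervals fit in one presence window, so a single lost heartbeat does not flip an agent to offline.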

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…Bundle() works

PreKeyBundle.identityKeyEd was previously aliased to the X25519 identity_key
because the registry didn't store the Ed25519 signing key. As a result,
verifyBundle() either passed only because the stub test never exercised it,
or quietly accepted any signed pre-key — a soundness gap in the X3DH wrapper.

This commit threads identityKeyEd through the full path:

- agent-mesh/registry/store.py (AgentRecord): add identity_key_ed field
  (X25519 long-term key already present as identity_key; Ed25519 distinct).
- agent-governance-typescript/src/encryption/x3dh.ts: expose
  X3DHKeyManager.identityKeyEd getter.
- agent-governance-typescript/src/encryption/registry-client.ts:
  uploadPrekeys() now accepts identityKeyEd (32 bytes, validated) and
  serializes it as identity_key_ed; fetchPrekeys() deserializes it back into
  PreKeyBundle.identityKeyEd. Older bundles without the field fall back to
  the X25519 key — verifyBundle() will then correctly reject (forensic
  visibility, no silent bypass).
- agent-governance-typescript/src/encryption/mesh-client.ts: registerSelf()
  passes keyManager.identityKeyEd through to the registry.
- tests/registry-client.test.ts + tests/test_registry.py: assert the new
  field is round-tripped on PUT/GET.

Wire compatibility: PUT now requires identity_key_ed; GET emits it when set.
Existing AGT clients that don't upload it will get verifyBundle() rejection
on the receiving side — peers must upgrade together. (No silent regression
on the wire — the field is new, not a breaking change to existing fields.)
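
A receiver-side check in the spirit of this change might look like the sketch below. It assumes the registry returns base64url strings (as the RegistryClient commit describes), that atob/btoa are available as globals, and that a missing field is rejected outright; requireIdentityKeyEd and WirePrekeyBundle are hypothetical names.

```typescript
function base64UrlToBytes(text: string): Uint8Array {
  let b64 = text.replace(/-/g, "+").replace(/_/g, "/");
  while (b64.length % 4 !== 0) b64 += "="; // restore stripped padding
  const binary = atob(b64);
  const out = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) out[i] = binary.charCodeAt(i);
  return out;
}

function bytesToBase64Url(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary).replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");
}

// Hypothetical subset of the prekey bundle as it appears on the wire.
interface WirePrekeyBundle {
  identity_key: string;     // X25519
  identity_key_ed?: string; // Ed25519; absent on bundles from older clients
}

// Return the Ed25519 key or throw; never substitute the X25519 key.
function requireIdentityKeyEd(bundle: WirePrekeyBundle): Uint8Array {
  if (bundle.identity_key_ed === undefined) {
    throw new Error("prekey bundle is missing identity_key_ed");
  }
  const key = base64UrlToBytes(bundle.identity_key_ed);
  if (key.length !== 32) {
    throw new Error("identity_key_ed must be exactly 32 bytes");
  }
  return key;
}
```

Ed25519 public keys are always 32 bytes, so the length check catches truncated or garbage values before signature verification.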

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…event-hooks

# Conflicts:
#	agent-governance-typescript/src/encryption/mesh-client.ts
…e tests

These two upstream tests construct MeshClient without autoRegister:false,
which now triggers a real fetch to the registry on connect() (added by
our registry-client commit). The tests have no fake registry, so they
fail with 'fetch failed'. autoRegister:false skips the auto-registration
path and matches the pattern used in our 6 sibling mesh-client-*.test.ts
files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions


Welcome to the Agent Governance Toolkit! Thanks for your first pull request.
Please ensure tests pass, code follows style (ruff check), and you have signed the CLA.
See our Contributing Guide.

@github-actions github-actions Bot added the size/XL Extra large PR (500+ lines) label May 11, 2026
@github-actions


PR Review Summary

Check | Status | Details
🔍 Code Review | ⏳ Pending | Awaiting results
🛡️ Security Scan | ⏳ Pending | Awaiting results
🔄 Breaking Changes | ⏳ Pending | Awaiting results
📝 Docs Sync | ⏳ Pending | Awaiting results
🧪 Test Coverage | ⏳ Pending | Awaiting results

Verdict: ⏳ Still running

@imran-siddique imran-siddique marked this pull request as ready for review May 11, 2026 22:24
@imran-siddique imran-siddique merged commit 6c3af4c into microsoft:main May 11, 2026
84 of 95 checks passed
imran-siddique added a commit that referenced this pull request May 11, 2026
…th (#2092)

- Wrap messageHandlers loop in try/catch to match errorHandlers/disconnectHandlers
  (prevents one throwing handler from killing subsequent handlers)
- Validate identity_key_ed is exactly 32 bytes after base64 decode
  (rejects garbage bytes that would break downstream Ed25519 verification)
- Require reporter_amid to be a registered session participant on
  reputation/session endpoint (prevents anonymous reputation manipulation)

Follow-up to #2090.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions Bot added tests agent-mesh agent-mesh package labels May 11, 2026
imran-siddique added a commit that referenced this pull request May 11, 2026
…ail-fast (#2093)

* fix(mesh): harden message handlers, key validation, and reputation auth

- Wrap messageHandlers loop in try/catch to match errorHandlers/disconnectHandlers
  (prevents one throwing handler from killing subsequent handlers)
- Validate identity_key_ed is exactly 32 bytes after base64 decode
  (rejects garbage bytes that would break downstream Ed25519 verification)
- Require reporter_amid to be a registered session participant on
  reputation/session endpoint (prevents anonymous reputation manipulation)

Follow-up to #2090.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(mesh): global buffer cap, heartbeat rate limit, identity_key_ed validation

- Add maxBufferedPeers option (default 100) to MeshClient to prevent
  memory exhaustion from adversary sending from many distinct DIDs
- Rate-limit heartbeat endpoint to 1 per 10s per agent (atomic, via
  store-level try_update_last_seen) to prevent stale agent keepalive
- Reject fetchPrekeys when peer bundle lacks identity_key_ed instead
  of silently substituting X25519 key (fail-fast on incompatible peers)
- Add tests for all three hardening measures plus identity_key_ed
  length validation and session reputation participant check

TS: 429/429 tests pass. Python: 25/25 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
pallakatos pushed a commit to Azure/kars that referenced this pull request May 12, 2026
Following AGT PR microsoft/agent-governance-toolkit#2090 merging upstream
(commit 6c3af4c on 2026-05-11), all 5 gap-closing fixes (G1–G5) plus
event hooks + RegistryClient + heartbeat + Ed25519 verify are now in
the published @microsoft/agent-governance-sdk. The vendored Rust
relay/registry and patched TS SDK fork are no longer needed.

CLI cleanup:
- dev.ts: remove vendored provider prompt, build branches, postgres
  startup, env-var branching, run dispatch; AGT is the only choice.
- dev/local-k8s.ts: deployAgentMesh() is AGT-only; manifest path,
  image build, image rewrite all collapse to single branch.
- mesh/health.ts: drop /v1/health + WS-on-/ vendored fallbacks.
- mesh/provider.ts: deleted (live vendored↔AGT switcher pointless).
- push.ts + push.test.ts: relay/registry images dropped (sandbox image
  list 6→4); tests updated; mesh images now built only via
  azureclaw push --only relay/--only registry from the AGT repo.
- up.ts + up/agentmesh_deploy.ts: vendored buildPush + postgres ACR
  import + db-credentials secret all gone; only agentmesh-agt.yaml
  manifest applied.
- sandbox-hardening.test.ts: drop /opt/azureclaw-vendored-sdk read-only
  assertion (vendored overlay no longer in Dockerfile).

Vendor cleanup:
- vendor/agentmesh-{sdk,relay,registry}/ — deleted entirely
- ci/vendored-patch-audit.sh — deleted
- deploy/agentmesh.yaml — deleted (only agentmesh-agt.yaml remains)

639/639 CLI tests pass. tsc --noEmit clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
pallakatos added a commit to Azure/kars that referenced this pull request May 12, 2026
* feat(mesh): Phase 5 — AGT as default + local-k8s mesh deploy

Flip the default mesh provider from "vendored" to "agt" across the
entire stack. The dual-provider plumbing (Provider enum, factory,
both manifests, both adapters) stays — only the default changes, so
operators can still opt back via AZURECLAW_MESH_PROVIDER=vendored or
`--mesh-provider=vendored`.

Default flipped in:
  * deploy/helm/azureclaw/values.yaml + controller-deployment.yaml
  * controller/src/mesh_peer/mod.rs (Provider::from_env)
  * controller/src/reconciler/mod.rs (sandbox env propagation)
  * sandbox-images/openclaw/entrypoint.sh
  * cli/src/commands/{dev,up,push}.ts (+ subcommands)
  * runtimes/openclaw/src/{index.ts,core/agt-heartbeat.ts}
  * mesh-plugin/src/transport-factory.ts (resolveMeshProvider)

Local-k8s mesh deployment (Phase 5.3):
The kind path previously helm-installed the controller but never
deployed agentmesh-relay/registry, so the controller looped on
`agentmesh-relay:8765` not resolving. Now `runLocalK8s` builds
the relay+registry images (AGT Python from --agt-repo, or vendored
Rust from vendor/), loads them into kind, rewrites the manifest's
ACR image refs to local tags + imagePullPolicy=Never, applies, and
waits for both rollouts before the controller check. Adds
`--no-mesh` opt-out for pure controller smoke tests.

Test updates:
  * mesh-plugin/src/transport-factory.test.ts — defaults flipped,
    vendored opt-in path covered.

Verified: cargo build --release ✓, cargo test --all ✓ (492+),
cli typecheck ✓, cli vitest 640/640 ✓, mesh-plugin 98/98 ✓,
runtimes/openclaw build ✓.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs: Phase 5 — AGT-default callouts in CHANGELOG + agt-vs-vendored

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(dev): honor --global-registry in --target local-k8s

The Phase 5.3 deployAgentMesh() helper ran unconditionally, which
would have stood up a second local relay+registry on top of an
already-reachable external one — wasteful at best, port-conflicty
at worst. Skip the in-kind deployment when --global-registry is
set (or piped through from 'azureclaw mesh promote --port-forward').

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(dev): replace regex with String.replaceAll for ACR image rewrites

CodeQL flagged the new RegExp constructors in deployAgentMesh as
'missing regular expression anchor' and 'incomplete hostname regexp'
— the dots in 'azureclawacr.azurecr.io' aren't escaped, so a
malicious hostname could match. Functionally fine for our manifest
(only ACR strings present), but switching to plain String.replaceAll
removes the smell entirely.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* refactor(mesh-plugin): drop @agentmesh/sdk dependency (Phase 5.2)

- identity.ts uses node:crypto for Ed25519 + X25519 (no sodium-native fork)
- transport-factory collapsed to AGT-only; legacy 'vendored' env ignored
- delete vendored MeshConnection adapter + type shim + dead tests
- ci.yml / ci-gates.yml / Makefile / Cargo.toml drop vendor/agentmesh refs
- sandbox Dockerfile drops vendored-SDK overlay (npm @microsoft/agent-governance-sdk only)

63 mesh-plugin vitest tests pass; vendored fork removal continues in
follow-up commits (CLI + controller + runtime + docs).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* refactor(cli): drop vendored mesh-provider branches (Phase 5.2)

Following AGT PR microsoft/agent-governance-toolkit#2090 merging upstream
(commit 6c3af4c on 2026-05-11), all 5 gap-closing fixes (G1–G5) plus
event hooks + RegistryClient + heartbeat + Ed25519 verify are now in
the published @microsoft/agent-governance-sdk. The vendored Rust
relay/registry and patched TS SDK fork are no longer needed.

CLI cleanup:
- dev.ts: remove vendored provider prompt, build branches, postgres
  startup, env-var branching, run dispatch; AGT is the only choice.
- dev/local-k8s.ts: deployAgentMesh() is AGT-only; manifest path,
  image build, image rewrite all collapse to single branch.
- mesh/health.ts: drop /v1/health + WS-on-/ vendored fallbacks.
- mesh/provider.ts: deleted (live vendored↔AGT switcher pointless).
- push.ts + push.test.ts: relay/registry images dropped (sandbox image
  list 6→4); tests updated; mesh images now built only via
  azureclaw push --only relay/--only registry from the AGT repo.
- up.ts + up/agentmesh_deploy.ts: vendored buildPush + postgres ACR
  import + db-credentials secret all gone; only agentmesh-agt.yaml
  manifest applied.
- sandbox-hardening.test.ts: drop /opt/azureclaw-vendored-sdk read-only
  assertion (vendored overlay no longer in Dockerfile).

Vendor cleanup:
- vendor/agentmesh-{sdk,relay,registry}/ — deleted entirely
- ci/vendored-patch-audit.sh — deleted
- deploy/agentmesh.yaml — deleted (only agentmesh-agt.yaml remains)

639/639 CLI tests pass. tsc --noEmit clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* refactor(controller,router): drop vendored mesh-provider plumbing (Phase 5.2)

- controller: remove Provider::Vendored enum + handle_vendored_frame
- inference-router: collapse mesh signing/routes to AGT-only
- net -397 LOC

cargo build/test/clippy/fmt all green (1361 tests).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* refactor(runtime): drop @agentmesh/sdk, use node:crypto via @azureclaw/mesh (Phase 5.2)

OpenClaw runtime no longer depends on the vendored @agentmesh/sdk.
The 3 crypto operations it actually used (identity gen, ed25519
sign/verify) now run on Node.js native crypto via helpers re-exported
from @azureclaw/mesh:
- generateIdentity()
- verifyEd25519Signature()
- (signing private key handed to createMeshTransport)

Tool-policy evaluation is now an inline ~12-row allow/deny Map; KNOCK
gate just consults that. The router-native /agt/evaluate endpoint
remains the source of truth for full policy semantics.

Dead code removed: trustStore, auditLogger, AgentMeshClient,
MemoryStorage, dual-provider swap branch.

Tests: runtimes/openclaw 118/118, mesh-plugin 63/63 green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore: scrub vendored mesh-provider refs from code + config (Phase 5.2)

- cli: drop @agentmesh/sdk from package.json + lockfile (no source ref)
- helm values + controller-deployment: collapse mesh.provider doc to
  AGT-only, vendored opt-out removed
- sandbox entrypoint: hardcode AZURECLAW_MESH_PROVIDER=agt, remove
  vendored case branch
- mesh-plugin: refresh transport-interface phase comments, drop
  vendored fallback wording in agt-transport / index
- inference-router: refresh mesh.rs / mod.rs doc comments
- patch-nemoclaw.sh: remove vendored SDK overlay step
- runtime, conformance, docker-compose: refresh historical comments

Builds: mesh-plugin + runtimes/openclaw + cli all green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs: scrub vendored mesh-provider refs from public docs (Phase 5.2)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* sandbox: auto-promote AGT file_transfers to workspace root + prompt nudge

When an agent receives a file via AGT mesh file_transfer, the runtime
already saves it to /sandbox/.openclaw/workspace/incoming/. The LLM,
however, doesn't always think to look in incoming/ and tends to fall
back to generating placeholder assets when real ones are present.

Two surgical fixes:

1. runtime: after writing to incoming/, also copy the file to
   workspace root (best-effort, only when not already present).
   Mirrors the existing handoff:workspace_inject auto-promote
   behavior (~line 1314). Inbox entries now carry an extra
   workspace_path field so the agent sees both locations.

2. sandbox system prompt: add an explicit 'Files received from
   other agents' section instructing the model to check workspace
   root + incoming/ before synthesizing placeholders.

Observed in demo: writer transferred the executive_brief.md + hero
PNG + scorecard PNG to the orchestrator via mesh; orchestrator
generated a placeholder PDF instead of using the real assets.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* cli(dev): first-run UX polish + 'connect --reset' for gateway lockout

* dev: re-show the provider picker on first run even when creds exist,
  so users can switch between Copilot / Foundry / Models without first
  having to wipe ~/.azureclaw. If picked provider matches existing creds,
  offer a reuse confirm; otherwise drop into the same prompt flow that
  'azureclaw credentials' uses, forced to the chosen provider.
* config: don't process.exit(1) on Foundry verify probe failures. The
  probe targets the classic AOAI deployments path which doesn't exist on
  project-scoped Foundry endpoints — a 404/401 there did not mean creds
  were invalid, but it left nothing saved and re-prompted forever. Now
  we warn loudly, save what we have, and let the runtime surface the real
  error at use time. Adds markFirstRunCompleted() helper.
* connect: 'azureclaw connect <name> --reset' rolling-restarts the
  openclaw deployment to clear the gateway's in-process brute-force
  lockout. The gateway-token Secret is preserved across restarts so the
  printed URL/token stays valid. Helps recover from stale browser tabs
  spamming old tokens after dev/up cycles.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* cli(dev/local-k8s): parity with docker mode + first-run-green fixes

Closes the remaining gaps that made 'azureclaw dev --target local-k8s'
fail on a fresh cluster while docker mode worked:

* AGT SDK tarball: auto-discover microsoft-agent-governance-sdk-*.tgz
  under $AGT_REPO/agent-governance-typescript and pass it as the
  AGT_SDK_TARBALL build-arg. Without this, the sandbox image installs
  the stock @microsoft/agent-governance-sdk@^3.5.0 from npm, which is
  missing MeshClient.registerSelf / autoRegister, so sub-agents never
  POST /v1/agents and mesh discovery silently fails.
* learnEgress: default ClawSandbox 'networkPolicy.learnEgress = true'
  in dev mode so the forward proxy logs new domains instead of blocking
  them. Without this, Telegram/Slack/Discord channels fail at startup
  with 'Network request for deleteMyCommands failed'. Operators promote
  learned domains via 'azureclaw policy allow' once happy.
* TELEGRAM_ALLOW_FROM: mirror docker mode's resolveChannelTokens flow
  and pull the saved allow-list from 'azureclaw credentials' into the
  '<name>-credentials' Secret. Without this, local-k8s sandboxes started
  Telegram unrestricted (any chat could DM the bot) while docker mode
  honoured the allow-list.
* governance on parent: emit 'spec.governance.enabled: true' in the
  dev YAML so the controller injects AGT_RELAY_URL / AGT_REGISTRY_URL /
  AGT_GOVERNANCE_ENABLED into both containers. Sub-agents are auto-enabled
  by the router spawn helper; the parent must be enabled in the source
  YAML because nothing else turns it on.
* ToolPolicy stub: emit a permissive default '<name>-toolpolicy' in
  the bundle. The router unconditionally injects
  governance.toolPolicyRef = '<parent>-toolpolicy' into spawned sub-agent
  CRs; without this stub every spawn lands in Degraded with
  ToolPolicyNotFound.
* FOUNDRY_PROJECT_ENDPOINT: emit via chart 'foundry.projectEndpoint'
  value (which the controller-deployment template handles) instead of
  duplicating it in 'extraEnv' — server-side apply rejected the latter
  with 'duplicate entries for key'.
* gateway token discovery in startSandboxConnect: read the
  'gateway-token' Secret instead of 'kubectl exec cat /tmp/gateway-token'.
  The exec path is blocked by the ValidatingAdmissionPolicy
  'azureclaw-sandbox-exec-ban' and silently 403s, timing out after 3 min
  even though the gateway is up. Matches how 'azureclaw connect' reads it.
* headlamp chart pinned to 0.41.0: the AzureClaw plugin is built against
  @kinvolk/headlamp-plugin ^0.13.0 and depends on a specific pluginLib
  API surface (KubeObject + SimpleTable + SectionBox + Link). 0.42+
  drifts enough to break the plugin's sidebar/list views. Bump
  intentionally after re-testing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* tools(headlamp-plugin): redesign as AzureClaw operator dashboard

Rewrites the headlamp plugin from a broken 254-line stub into a
~700-line operator dashboard for teams running AzureClaw on AKS.

Bug fix:
* 'class extends KubeObject' never wires up apiEndpoint in Headlamp
  0.13's plugin API, so every list view rendered 'Error loading
  clawsandboxes'. Switched to the documented 'makeCustomResourceClass'
  factory (same pattern used by Headlamp's own flux + karpenter
  plugins) and avoid the broken ResourceListView+TableFromResourceClass
  code path by using cls.useList() + SimpleTable directly.

Operator dashboard ('/azureclaw'):
* 11 stat tiles: total sandboxes, by phase (running / pending /
  degraded), egress mode counts (learn vs strict default-correct),
  channel count, runtime mix, inference + tool policies, memories,
  MCP servers, A2A agents.
* Sandboxes-by-phase / runtimes / channels-in-use breakdown tables.
* 'Recent Sandboxes' table with model resolution (inline or via
  InferencePolicy ref) and egress mode that matches the controller
  default ('absent block ⇒ Learn').

CRD coverage (9 CRDs, sidebar + list + detail):
* ClawSandbox, InferencePolicy, ToolPolicy, ClawMemory, McpServer,
  A2aAgent, EgressAllowlist, EgressApproval, IngressPolicy.

ClawSandbox detail extras:
* Network Policy card with controller-matched defaults
* Channels card — detects Telegram/Slack/Discord/WhatsApp from the
  '<name>-credentials' Secret in 'azureclaw-<name>' (Source: Secret)
  *and* from spec.channels (Source: Spec)
* Related Resources card — linked InferencePolicy, ToolPolicy,
  ClawMemory, McpServers
* Mesh card (governance enabled, registry mode, trust threshold)
* Deep links to Pod and Workspace ConfigMap in the sandbox pod ns

Other:
* shortModel() helper: strips provider prefix from LiteLLM-style
  identifiers so 'azure/gpt-5.4' and a plain InferencePolicy
  deployment 'gpt-5.4' both render the same.
* Sub-agent model resolution via spec.inferenceRef → InferencePolicy
  lookup (sub-agents have empty 'runtime.openclaw.config').
* Add tsconfig.json (was missing) — extends Headlamp's default
  plugins-tsconfig so JSX compiles.
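
A minimal sketch of what a shortModel()-style helper could look like. The plugin's real implementation is not shown in this commit message; the rule used here (drop everything up to the first "/") is an assumption.

```typescript
// Hypothetical sketch of a shortModel()-style helper.
function shortModel(model: string): string {
  // Assumption: the provider prefix is everything up to the first "/".
  const slash = model.indexOf("/");
  return slash === -1 ? model : model.slice(slash + 1);
}

const fromLiteLLM = shortModel("azure/gpt-5.4");
const fromPolicy = shortModel("gpt-5.4");
```

Both values come out as the bare deployment name, so the dashboard column reads the same regardless of where the model string came from.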

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Pal Lakatos-Toth <pallakatos@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
pallakatos added a commit to Azure/kars that referenced this pull request May 12, 2026
* feat(mesh): Phase 5 — AGT as default + local-k8s mesh deploy

Flip the default mesh provider from "vendored" to "agt" across the
entire stack. The dual-provider plumbing (Provider enum, factory,
both manifests, both adapters) stays — only the default changes, so
operators can still opt back via AZURECLAW_MESH_PROVIDER=vendored or
`--mesh-provider=vendored`.

Default flipped in:
  * deploy/helm/azureclaw/values.yaml + controller-deployment.yaml
  * controller/src/mesh_peer/mod.rs (Provider::from_env)
  * controller/src/reconciler/mod.rs (sandbox env propagation)
  * sandbox-images/openclaw/entrypoint.sh
  * cli/src/commands/{dev,up,push}.ts (+ subcommands)
  * runtimes/openclaw/src/{index.ts,core/agt-heartbeat.ts}
  * mesh-plugin/src/transport-factory.ts (resolveMeshProvider)

Local-k8s mesh deployment (Phase 5.3):
The kind path previously helm-installed the controller but never
deployed agentmesh-relay/registry, so the controller looped on
`agentmesh-relay:8765` not resolving. Now `runLocalK8s` builds
the relay+registry images (AGT Python from --agt-repo, or vendored
Rust from vendor/), loads them into kind, rewrites the manifest's
ACR image refs to local tags + imagePullPolicy=Never, applies, and
waits for both rollouts before the controller check. Adds
`--no-mesh` opt-out for pure controller smoke tests.

Test updates:
  * mesh-plugin/src/transport-factory.test.ts — defaults flipped,
    vendored opt-in path covered.

Verified: cargo build --release ✓, cargo test --all ✓ (492+),
cli typecheck ✓, cli vitest 640/640 ✓, mesh-plugin 98/98 ✓,
runtimes/openclaw build ✓.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs: Phase 5 — AGT-default callouts in CHANGELOG + agt-vs-vendored

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(dev): honor --global-registry in --target local-k8s

The Phase 5.3 deployAgentMesh() helper ran unconditionally, which
would have stood up a second local relay+registry on top of an
already-reachable external one — wasteful at best, port-conflicty
at worst. Skip the in-kind deployment when --global-registry is
set (or piped through from 'azureclaw mesh promote --port-forward').

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(dev): replace regex with String.replaceAll for ACR image rewrites

CodeQL flagged the new RegExp constructors in deployAgentMesh as
'missing regular expression anchor' and 'incomplete hostname regexp'
— the dots in 'azureclawacr.azurecr.io' aren't escaped, so a
malicious hostname could match. Functionally fine for our manifest
(only ACR strings present), but switching to plain String.replaceAll
removes the smell entirely.
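
To illustrate the CodeQL finding, here is a small sketch comparing an unescaped RegExp with a literal-string replacement. The image names are made up for the example, and the helper uses split/join (equivalent to String.prototype.replaceAll with string arguments) only so that it also compiles on pre-ES2021 targets.

```typescript
const ACR_HOST = "azureclawacr.azurecr.io";

// Unescaped: each "." matches any character, so a look-alike host matches too.
const unescaped = new RegExp(ACR_HOST);

// Literal replacement, equivalent to source.replaceAll(from, to) for strings.
function replaceLiteral(source: string, from: string, to: string): string {
  return source.split(from).join(to);
}

// Hypothetical manifest lines, for illustration only.
const genuine = "image: azureclawacr.azurecr.io/relay:latest";
const lookAlike = "image: azureclawacrXazurecr.io/relay:latest";

const regexMatchesLookAlike = unescaped.test(lookAlike);
const rewrittenGenuine = replaceLiteral(genuine, ACR_HOST + "/", "");
const rewrittenLookAlike = replaceLiteral(lookAlike, ACR_HOST + "/", "");
```

The regex accepts the look-alike host, while the literal replacement rewrites only the exact ACR prefix and leaves the other line untouched.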

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* refactor(mesh-plugin): drop @agentmesh/sdk dependency (Phase 5.2)

- identity.ts uses node:crypto for Ed25519 + X25519 (no sodium-native fork)
- transport-factory collapsed to AGT-only; legacy 'vendored' env ignored
- delete vendored MeshConnection adapter + type shim + dead tests
- ci.yml / ci-gates.yml / Makefile / Cargo.toml drop vendor/agentmesh refs
- sandbox Dockerfile drops vendored-SDK overlay (npm @microsoft/agent-governance-sdk only)

63 mesh-plugin vitest tests pass; vendored fork removal continues in
follow-up commits (CLI + controller + runtime + docs).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* refactor(cli): drop vendored mesh-provider branches (Phase 5.2)

Following AGT PR microsoft/agent-governance-toolkit#2090 merging upstream
(commit 6c3af4c on 2026-05-11), all 5 gap-closing fixes (G1–G5) plus
event hooks + RegistryClient + heartbeat + Ed25519 verify are now in
the published @microsoft/agent-governance-sdk. The vendored Rust
relay/registry and patched TS SDK fork are no longer needed.

CLI cleanup:
- dev.ts: remove vendored provider prompt, build branches, postgres
  startup, env-var branching, run dispatch; AGT is the only choice.
- dev/local-k8s.ts: deployAgentMesh() is AGT-only; manifest path,
  image build, image rewrite all collapse to single branch.
- mesh/health.ts: drop /v1/health + WS-on-/ vendored fallbacks.
- mesh/provider.ts: deleted (live vendored↔AGT switcher pointless).
- push.ts + push.test.ts: relay/registry images dropped (sandbox image
  list 6→4); tests updated; mesh images now built only via
  azureclaw push --only relay/--only registry from the AGT repo.
- up.ts + up/agentmesh_deploy.ts: vendored buildPush + postgres ACR
  import + db-credentials secret all gone; only agentmesh-agt.yaml
  manifest applied.
- sandbox-hardening.test.ts: drop /opt/azureclaw-vendored-sdk read-only
  assertion (vendored overlay no longer in Dockerfile).

Vendor cleanup:
- vendor/agentmesh-{sdk,relay,registry}/ — deleted entirely
- ci/vendored-patch-audit.sh — deleted
- deploy/agentmesh.yaml — deleted (only agentmesh-agt.yaml remains)

639/639 CLI tests pass. tsc --noEmit clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* refactor(controller,router): drop vendored mesh-provider plumbing (Phase 5.2)

- controller: remove Provider::Vendored enum + handle_vendored_frame
- inference-router: collapse mesh signing/routes to AGT-only
- net -397 LOC

cargo build/test/clippy/fmt all green (1361 tests).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* refactor(runtime): drop @agentmesh/sdk, use node:crypto via @azureclaw/mesh (Phase 5.2)

OpenClaw runtime no longer depends on the vendored @agentmesh/sdk.
The 3 crypto operations it actually used (identity gen, ed25519
sign/verify) now run on Node.js native crypto via helpers re-exported
from @azureclaw/mesh:
- generateIdentity()
- verifyEd25519Signature()
- (signing private key handed to createMeshTransport)
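
As a rough sketch of how these operations map onto Node's built-in crypto module: the helper names follow the list above, but their signatures and the Identity shape are assumptions, since the commit message does not show them.

```typescript
import { generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

// Hypothetical shape; the helpers re-exported from @azureclaw/mesh may differ.
interface Identity {
  publicKey: KeyObject;
  privateKey: KeyObject;
}

function generateIdentity(): Identity {
  // Ed25519 key pair from Node's built-in crypto, no native addon required.
  return generateKeyPairSync("ed25519");
}

function signMessage(identity: Identity, message: Uint8Array): Uint8Array {
  // Ed25519 takes no separate digest algorithm, hence the null.
  return sign(null, message, identity.privateKey);
}

function verifyEd25519Signature(
  publicKey: KeyObject,
  message: Uint8Array,
  signature: Uint8Array,
): boolean {
  return verify(null, message, publicKey, signature);
}

const identity = generateIdentity();
const payload = new TextEncoder().encode("knock");
const tampered = new TextEncoder().encode("knocK");
const signature = signMessage(identity, payload);
```

Verification succeeds for the original payload and fails for the tampered one, which is all the runtime needs from the three operations.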

Tool-policy evaluation is now an inline ~12-row allow/deny Map; KNOCK
gate just consults that. The router-native /agt/evaluate endpoint
remains the source of truth for full policy semantics.
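
A sketch of what an inline allow/deny table and KNOCK gate might look like. The rows, the function names, and the fail-closed default are all assumptions made for illustration; the runtime's real entries are not listed in the commit.

```typescript
type Verdict = "allow" | "deny";

// Hypothetical rows; the runtime's actual ~12 entries are not shown.
const toolPolicy = new Map<string, Verdict>([
  ["file_transfer", "allow"],
  ["workspace_inject", "allow"],
  ["shell_exec", "deny"],
]);

// Assumption: anything missing from the table is denied (fail closed).
function evaluateTool(tool: string): Verdict {
  return toolPolicy.get(tool) ?? "deny";
}

// KNOCK gate: admit a peer only if the tool it asks for is allowed.
function admitKnock(requestedTool: string): boolean {
  return evaluateTool(requestedTool) === "allow";
}
```

Anything that needs richer semantics than a table lookup would still go to the router's /agt/evaluate endpoint.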

Dead code removed: trustStore, auditLogger, AgentMeshClient,
MemoryStorage, dual-provider swap branch.

Tests: runtimes/openclaw 118/118, mesh-plugin 63/63 green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* chore: scrub vendored mesh-provider refs from code + config (Phase 5.2)

- cli: drop @agentmesh/sdk from package.json + lockfile (no source ref)
- helm values + controller-deployment: collapse mesh.provider doc to
  AGT-only, vendored opt-out removed
- sandbox entrypoint: hardcode AZURECLAW_MESH_PROVIDER=agt, remove
  vendored case branch
- mesh-plugin: refresh transport-interface phase comments, drop
  vendored fallback wording in agt-transport / index
- inference-router: refresh mesh.rs / mod.rs doc comments
- patch-nemoclaw.sh: remove vendored SDK overlay step
- runtime, conformance, docker-compose: refresh historical comments

Builds: mesh-plugin + runtimes/openclaw + cli all green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs: scrub vendored mesh-provider refs from public docs (Phase 5.2)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* sandbox: auto-promote AGT file_transfers to workspace root + prompt nudge

When an agent receives a file via AGT mesh file_transfer, the runtime
already saves it to /sandbox/.openclaw/workspace/incoming/. The LLM,
however, doesn't always think to look in incoming/ and tends to fall
back to generating placeholder assets when real ones are present.

Two surgical fixes:

1. runtime: after writing to incoming/, also copy the file to
   workspace root (best-effort, only when not already present).
   Mirrors the existing handoff:workspace_inject auto-promote
   behavior (~line 1314). Inbox entries now carry an extra
   workspace_path field so the agent sees both locations.

2. sandbox system prompt: add an explicit 'Files received from
   other agents' section instructing the model to check workspace
   root + incoming/ before synthesizing placeholders.

Observed in demo: writer transferred the executive_brief.md + hero
PNG + scorecard PNG to the orchestrator via mesh; orchestrator
generated a placeholder PDF instead of using the real assets.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* cli(dev): first-run UX polish + 'connect --reset' for gateway lockout

* dev: re-show the provider picker on first run even when creds exist,
  so users can switch between Copilot / Foundry / Models without first
  having to wipe ~/.azureclaw. If picked provider matches existing creds,
  offer a reuse confirm; otherwise drop into the same prompt flow that
  'azureclaw credentials' uses, forced to the chosen provider.
* config: don't process.exit(1) on Foundry verify probe failures. The
  probe targets the classic AOAI deployments path which doesn't exist on
  project-scoped Foundry endpoints — a 404/401 there did not mean creds
  were invalid, but it left nothing saved and re-prompted forever. Now
  we warn loudly, save what we have, and let the runtime surface the real
  error at use time. Adds markFirstRunCompleted() helper.
* connect: 'azureclaw connect <name> --reset' rolling-restarts the
  openclaw deployment to clear the gateway's in-process brute-force
  lockout. The gateway-token Secret is preserved across restarts so the
  printed URL/token stays valid. Helps recover from stale browser tabs
  spamming old tokens after dev/up cycles.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* cli(dev/local-k8s): parity with docker mode + first-run-green fixes

Closes the remaining gaps that made 'azureclaw dev --target local-k8s'
fail on a fresh cluster while docker mode worked:

* AGT SDK tarball: auto-discover microsoft-agent-governance-sdk-*.tgz
  under $AGT_REPO/agent-governance-typescript and pass it as the
  AGT_SDK_TARBALL build-arg. Without this, the sandbox image installs
  the stock @microsoft/agent-governance-sdk@^3.5.0 from npm, which is
  missing MeshClient.registerSelf / autoRegister, so sub-agents never
  POST /v1/agents and mesh discovery silently fails.
* learnEgress: default ClawSandbox 'networkPolicy.learnEgress = true'
  in dev mode so the forward proxy logs new domains instead of blocking
  them. Without this, Telegram/Slack/Discord channels fail at startup
  with 'Network request for deleteMyCommands failed'. Operators promote
  learned domains via 'azureclaw policy allow' once happy.
* TELEGRAM_ALLOW_FROM: mirror docker mode's resolveChannelTokens flow
  and pull the saved allow-list from 'azureclaw credentials' into the
  '<name>-credentials' Secret. Without this, local-k8s sandboxes started
  Telegram unrestricted (any chat could DM the bot) while docker mode
  honoured the allow-list.
* governance on parent: emit 'spec.governance.enabled: true' in the
  dev YAML so the controller injects AGT_RELAY_URL / AGT_REGISTRY_URL /
  AGT_GOVERNANCE_ENABLED into both containers. Sub-agents are auto-enabled
  by the router spawn helper; the parent must be enabled in the source
  YAML because nothing else turns it on.
* ToolPolicy stub: emit a permissive default '<name>-toolpolicy' in
  the bundle. The router unconditionally injects
  governance.toolPolicyRef = '<parent>-toolpolicy' into spawned sub-agent
  CRs; without this stub every spawn lands in Degraded with
  ToolPolicyNotFound.
* FOUNDRY_PROJECT_ENDPOINT: emit via chart 'foundry.projectEndpoint'
  value (which the controller-deployment template handles) instead of
  duplicating it in 'extraEnv' — server-side apply rejected the latter
  with 'duplicate entries for key'.
* gateway token discovery in startSandboxConnect: read the
  'gateway-token' Secret instead of 'kubectl exec cat /tmp/gateway-token'.
  The exec path is blocked by the ValidatingAdmissionPolicy
  'azureclaw-sandbox-exec-ban' and silently 403s, timing out after 3 min
  even though the gateway is up. Matches how 'azureclaw connect' reads it.
* headlamp chart pinned to 0.41.0: the AzureClaw plugin is built against
  @kinvolk/headlamp-plugin ^0.13.0 and depends on a specific pluginLib
  API surface (KubeObject + SimpleTable + SectionBox + Link). 0.42+
  drifts enough to break the plugin's sidebar/list views. Bump
  intentionally after re-testing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* tools(headlamp-plugin): redesign as AzureClaw operator dashboard

Rewrites the headlamp plugin from a broken 254-line stub into a
~700-line operator dashboard for teams running AzureClaw on AKS.

Bug fix:
* 'class extends KubeObject' never wires up apiEndpoint in Headlamp
  0.13's plugin API, so every list view rendered 'Error loading
  clawsandboxes'. Switched to the documented 'makeCustomResourceClass'
  factory (same pattern used by Headlamp's own flux + karpenter
  plugins) and avoid the broken ResourceListView+TableFromResourceClass
  code path by using cls.useList() + SimpleTable directly.

Operator dashboard ('/azureclaw'):
* 11 stat tiles: total sandboxes, by phase (running / pending /
  degraded), egress mode counts (learn vs strict default-correct),
  channel count, runtime mix, inference + tool policies, memories,
  MCP servers, A2A agents.
* Sandboxes-by-phase / runtimes / channels-in-use breakdown tables.
* 'Recent Sandboxes' table with model resolution (inline or via
  InferencePolicy ref) and egress mode that matches the controller
  default ('absent block ⇒ Learn').

CRD coverage (9 CRDs, sidebar + list + detail):
* ClawSandbox, InferencePolicy, ToolPolicy, ClawMemory, McpServer,
  A2aAgent, EgressAllowlist, EgressApproval, IngressPolicy.

ClawSandbox detail extras:
* Network Policy card with controller-matched defaults
* Channels card — detects Telegram/Slack/Discord/WhatsApp from the
  '<name>-credentials' Secret in 'azureclaw-<name>' (Source: Secret)
  *and* from spec.channels (Source: Spec)
* Related Resources card — linked InferencePolicy, ToolPolicy,
  ClawMemory, McpServers
* Mesh card (governance enabled, registry mode, trust threshold)
* Deep links to Pod and Workspace ConfigMap in the sandbox pod ns

Other:
* shortModel() helper: strips provider prefix from LiteLLM-style
  identifiers so 'azure/gpt-5.4' and a plain InferencePolicy
  deployment 'gpt-5.4' both render the same.
* Sub-agent model resolution via spec.inferenceRef → InferencePolicy
  lookup (sub-agents have empty 'runtime.openclaw.config').
* Add tsconfig.json (was missing) — extends Headlamp's default
  plugins-tsconfig so JSX compiles.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Pal Lakatos-Toth <pallakatos@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
MohammadHaroonAbuomar pushed a commit to MohammadHaroonAbuomar/agt-acs that referenced this pull request Jun 1, 2026
…ify, heartbeat (microsoft#2090)

* feat(mesh-client): add onError, onDisconnect, onE2EVerified event hooks

Adds three observer-registration methods to MeshClient to align with the
AzureClaw vendored AgentMesh SDK surface so consumers can swap providers
behind a single transport interface:

  onError(handler)        — fires on ws errors and decrypt failures
  onDisconnect(handler)   — fires on ws.close with reason+code
  onE2EVerified(handler)  — fires on first successful decrypt per peer

Pure additions; no behaviour change for existing flows. Decrypt-failure
and missing-session paths now drop the message AND notify error
observers instead of silently dropping (still safe — unhandled events
are swallowed).

Tests: 8 new in mesh-client-event-hooks.test.ts; full suite 387/387 green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(mesh-client): close gaps G1 (KNOCK auto-bootstrap) and G2 (auto-reconnect)

Two protocol-level gaps were identified during the AzureClaw vendored-SDK
audit (see Azure/kars#245 docs/agt-vs-vendored-sdk.md). Both block
moving fully off the vendored agentmesh-sdk fork onto upstream AGT.

## G1: receiver-side X3DH auto-bootstrap from KNOCK

Before: the responder side of an encrypted session was created only by
calling acceptSession(peerId, establishment) explicitly. But the
ChannelEstablishment was never carried on the wire — neither in
knock_accept nor in the first message frame — so the receiver had no way
to obtain it. Result: any fresh encrypted session failed with
"No encrypted session" the first time a ciphertext arrived.

Mirrors vendored agentmesh-sdk patch #4b. Behavior:

- Sender (establishSession): builds the SecureChannel first, then sends
  KNOCK with the ChannelEstablishment embedded as
  { ik: base64, ek: base64, otk?: number }.
- Receiver (handleKnock): if the knock contains `establishment` AND no
  prior session exists, deserialize and call acceptSession() automatically
  before responding with knock_accept. On bad establishment data, the
  knock is rejected and onError("knock", ...) fires.
- Backwards-compatible: legacy peers that don't embed establishment
  continue to work; caller must invoke acceptSession() manually as before.
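
The receiver-side validation can be sketched roughly as follows. Only the wire shape { ik, ek, otk? } comes from the commit message; the in-memory Establishment type, the function names, and the 32-byte check (X25519 public keys are 32 bytes) are assumptions about how the SDK might do it.

```typescript
// Wire shape from the commit message: { ik: base64, ek: base64, otk?: number }.
interface WireEstablishment { ik: string; ek: string; otk?: number }
// Hypothetical in-memory shape; the SDK's ChannelEstablishment may differ.
interface Establishment { identityKey: Uint8Array; ephemeralKey: Uint8Array; oneTimeKeyId?: number }

function decodeKey(b64: string, field: string): Uint8Array {
  const bin = atob(b64); // throws on malformed base64
  if (bin.length !== 32) throw new Error(`${field}: expected 32 bytes, got ${bin.length}`);
  const out = new Uint8Array(32);
  for (let i = 0; i < 32; i++) out[i] = bin.charCodeAt(i);
  return out;
}

function parseEstablishment(raw: unknown): Establishment {
  if (typeof raw !== "object" || raw === null) throw new Error("establishment must be an object");
  const { ik, ek, otk } = raw as Partial<WireEstablishment>;
  if (typeof ik !== "string" || typeof ek !== "string") throw new Error("ik and ek must be base64 strings");
  if (otk !== undefined && !Number.isInteger(otk)) throw new Error("otk must be an integer");
  return { identityKey: decodeKey(ik, "ik"), ephemeralKey: decodeKey(ek, "ek"), oneTimeKeyId: otk };
}

// Returns null and reports onError("knock", ...) when the data is bad,
// so the caller can reject the knock instead of accepting a session.
function tryBootstrap(raw: unknown, onError: (kind: string, err: Error) => void): Establishment | null {
  try {
    return parseEstablishment(raw);
  } catch (err) {
    onError("knock", err as Error);
    return null;
  }
}
```

A caller would run acceptSession() only when tryBootstrap returns a value, and answer with a rejection otherwise.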

## G2: auto-reconnect on transport drop

Before: MeshClient.reconnect() existed but was never called automatically.
Network blips, relay restarts, AKS node OOM-evicts left the client
disconnected forever — agents went mesh-deaf.

Mirrors vendored agentmesh-sdk patch #9. Behavior:

- New options: autoReconnect (default true), maxReconnectAttempts
  (default Number.POSITIVE_INFINITY), reconnectBaseDelayMs (default
  1000), reconnectMaxDelayMs (default 60000).
- On non-1000 ws.onclose: schedule reconnect with exponential backoff
  capped at reconnectMaxDelayMs. Light jitter (±20%) avoids
  thundering-herd reconnects across many sandboxes.
- onDisconnect handlers fire BEFORE the reconnect is scheduled so
  observers can see the drop.
- After maxReconnectAttempts, fires onError("ws", ..., "auto-reconnect
  gave up after N attempts") and stops.
- disconnect() cancels any pending reconnect timer.
- A connect() failure inside the reconnect path schedules another retry
  via the existing onclose path (and recursively from the catch handler
  in scheduleReconnect for the case where ws.onopen never fired).
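
The delay schedule described above could be computed along these lines. The option names and defaults come from the commit message; the function name, the zero-based attempt counter, and applying the jitter after the cap are assumptions.

```typescript
interface ReconnectOptions {
  reconnectBaseDelayMs: number; // default 1000 per the commit message
  reconnectMaxDelayMs: number;  // default 60000 per the commit message
}

// attempt is zero-based; random is injectable so the jitter can be tested.
function reconnectDelayMs(
  attempt: number,
  opts: ReconnectOptions,
  random: () => number = Math.random,
): number {
  const exponential = opts.reconnectBaseDelayMs * Math.pow(2, attempt);
  const capped = Math.min(exponential, opts.reconnectMaxDelayMs);
  // Assumption: jitter is applied after the cap, as a factor in [0.8, 1.2).
  const jitter = 0.8 + random() * 0.4;
  return Math.round(capped * jitter);
}

const defaults: ReconnectOptions = { reconnectBaseDelayMs: 1000, reconnectMaxDelayMs: 60000 };
const noJitter = () => 0.5; // jitter factor of exactly 1.0
const schedule = [0, 1, 2, 3, 6].map((n) => reconnectDelayMs(n, defaults, noJitter));
```

With the jitter pinned to 1.0 the schedule is 1s, 2s, 4s, 8s, then 60s once the cap takes over. Under this ordering a jittered delay can exceed reconnectMaxDelayMs by up to 20%; clamping after the jitter would avoid that.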

## Tests

11 new tests across two files:
- tests/mesh-client-knock-bootstrap.test.ts: 5 tests covering sender-side
  embedding, receiver auto-bootstrap, legacy fallback, malformed
  establishment, and end-to-end happy-path encrypted send.
- tests/mesh-client-auto-reconnect.test.ts: 6 tests covering server-close
  reconnect, client-close no-reconnect, opt-out, give-up after max
  attempts, disconnect cancels pending timer, no duplicate scheduling.

Full AGT TS suite: 398/398 pass (was 387 before this commit). Build clean.

Both gaps were identified in the AzureClaw audit doc:
docs/agt-vs-vendored-sdk.md (Patch-by-patch audit section).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(mesh): close vendored-parity gaps G3, G4, G5

G3 — session-desync teardown (vendored agentmesh-sdk patch #13 equiv):
when Double Ratchet decryption fails for an existing session, the local
ratchet state is irrecoverable. Previously we only fired
onError('decrypt'); now we tear down the session (delete from
this.sessions; clear knockAccepted) and fire a distinct
onError('session_desync') so callers can re-run establishSession() and
resume communication.
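
A simplified sketch of the teardown, assuming the client keeps a sessions map and a knockAccepted set keyed by peer ID. The class and method names are hypothetical, and whether the SDK also still fires onError('decrypt') is not stated, so this sketch emits only the new event.

```typescript
type ErrorKind = "decrypt" | "session_desync";

class SessionTable<S> {
  private sessions = new Map<string, S>();
  private knockAccepted = new Set<string>();
  private onError: (kind: ErrorKind, peerId: string) => void;

  constructor(onError: (kind: ErrorKind, peerId: string) => void) {
    this.onError = onError;
  }

  open(peerId: string, session: S): void {
    this.sessions.set(peerId, session);
    this.knockAccepted.add(peerId);
  }

  has(peerId: string): boolean {
    return this.sessions.has(peerId);
  }

  // Called when Double Ratchet decryption fails for an existing session.
  handleDecryptFailure(peerId: string): void {
    if (!this.sessions.has(peerId)) return;
    this.sessions.delete(peerId);       // ratchet state cannot be recovered
    this.knockAccepted.delete(peerId);  // force a fresh KNOCK next time
    this.onError("session_desync", peerId);
  }
}
```

A host runtime would react to "session_desync" by calling establishSession() for that peer again.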

G4 — pre-KNOCK encrypted-message buffer (vendored agentmesh-sdk
patch #16 equiv): under transport reordering the relay can deliver an
encrypted 'message' frame BEFORE the matching 'knock' arrives. Without
buffering, that first frame was dropped silently. Now MeshClient buffers
encrypted frames per-peer (default cap 5, TTL 3000ms), replays them
when the corresponding knock is accepted, and drops them when rejected
or on disconnect. Disabled by setting preKnockBufferSize: 0.
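
One way such a buffer could be structured, using the stated defaults (cap 5, TTL 3000ms). The class and method names are hypothetical, the clock is passed in for testability, and rejecting the newest frame when a queue is full is an assumption (the SDK may evict the oldest instead).

```typescript
interface Buffered<F> { frame: F; receivedAt: number }

class PreKnockBuffer<F> {
  private byPeer = new Map<string, Buffered<F>[]>();
  private cap: number;
  private ttlMs: number;

  constructor(cap = 5, ttlMs = 3000) {
    this.cap = cap;
    this.ttlMs = ttlMs;
  }

  // Returns false when buffering is disabled or the peer's queue is full.
  push(peerId: string, frame: F, now: number): boolean {
    if (this.cap <= 0) return false;
    const queue = (this.byPeer.get(peerId) ?? []).filter((b) => now - b.receivedAt <= this.ttlMs);
    const accepted = queue.length < this.cap;
    if (accepted) queue.push({ frame, receivedAt: now });
    this.byPeer.set(peerId, queue);
    return accepted;
  }

  // Knock accepted: return unexpired frames in arrival order and forget the peer.
  drain(peerId: string, now: number): F[] {
    const queue = this.byPeer.get(peerId) ?? [];
    this.byPeer.delete(peerId);
    return queue.filter((b) => now - b.receivedAt <= this.ttlMs).map((b) => b.frame);
  }

  // Knock rejected: drop everything held for that peer.
  drop(peerId: string): void {
    this.byPeer.delete(peerId);
  }

  // Disconnect: drop everything.
  clear(): void {
    this.byPeer.clear();
  }
}
```

Constructing it with a cap of 0 makes push() a no-op, which corresponds to the preKnockBufferSize: 0 opt-out.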

G5 — eager ghost-connection close on rebind (vendored agentmesh-relay
patch #2 equiv): when an agent reconnects with the same DID, the prior
'ghost' WebSocket is now closed eagerly with code 1000 'session_replaced'
instead of waiting for the 90s heartbeat-eviction timer. The 'finally'
cleanup now compares socket identity to avoid removing the freshly
rebound connection when the old socket's handler unwinds.
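
The identity comparison in the cleanup path is the subtle part. The sketch below shows the idea in TypeScript for consistency with the SDK examples; the relay itself is a separate service, and the class and method names here are hypothetical.

```typescript
interface Closable {
  close(code: number, reason: string): void;
}

class ConnectionRegistry<S extends Closable> {
  private byDid = new Map<string, S>();

  // Bind a DID to a socket, closing any previous socket for that DID at once.
  bind(did: string, socket: S): void {
    const previous = this.byDid.get(did);
    if (previous !== undefined && previous !== socket) {
      previous.close(1000, "session_replaced");
    }
    this.byDid.set(did, socket);
  }

  // Cleanup path of a socket handler: unbind only if this socket still owns the DID.
  release(did: string, socket: S): void {
    if (this.byDid.get(did) === socket) this.byDid.delete(did);
  }

  current(did: string): S | undefined {
    return this.byDid.get(did);
  }
}
```

Without the identity check in release(), the old socket's handler would delete the entry that now belongs to the fresh connection.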

Tests: +5 (G4 pre-knock buffer behaviour), +2 (G3 type + lifecycle
safety), +1 (G5 ghost rebind). 405 TS tests pass; 18 relay tests pass.

NOTE: held local-only on branch azureclaw-meshclient-event-hooks; will
be coordinated upstream with the AGT team.

* feat(mesh): add RegistryClient and auto-register on MeshClient.connect()

Closes the structural gap where MeshClientOptions declared registryUrl
but no code path ever used it. Without this, agents can connect to the
relay and exchange ciphertext but are invisible to peers — they never
appear in /v1/discover and no one can fetch their X3DH pre-key bundle.

This commit adds the missing registry-side glue, in upstream-quality
shape, so AGT MeshClient becomes a drop-in replacement for vendored
forks that wired their own registry calls (e.g. AzureClaw vendor/agentmesh-sdk).

What's new:

- src/encryption/registry-client.ts: typed HTTP client wrapping the
  AgentMesh registry's REST surface (POST /v1/agents,
  PUT/GET /v1/agents/{did}/prekeys, GET /v1/discover, GET/DELETE
  /v1/agents/{did}). base64url <-> Uint8Array marshalling, optional
  Ed25519-Timestamp Authorization signer, configurable retry/timeout.

- src/encryption/mesh-client.ts:
  * New options: registryClient, registryClientOptions, capabilities,
    registrationMetadata, oneTimePrekeyCount, autoRegister.
  * connect() now calls registerSelf() at the end of a successful WS
    handshake when autoRegister is true (default) and a registry is
    configured. Idempotent — 409 (already registered) is treated as
    success; reconnect doesn't re-register.
  * registerSelf(): generates signed-prekey + N one-time prekeys via
    keyManager, then POSTs /v1/agents (capabilities = [displayName,
    ...options.capabilities]) and PUTs the prekey bundle.
  * discover(capability): wraps registry.discover.
  * establishSessionWithPeer(peerId): fetches peer prekeys from the
    registry and calls existing establishSession.
  * getRegistry(): exposes the underlying client for advanced cases.

- tests/registry-client.test.ts: 11 new tests covering wire format,
  base64url round-trip, idempotent register (409), 5xx retry, and
  Authorization header. Also covers MeshClient auto-register on connect
  end-to-end with a fake fetch + fake WebSocket.

- tests/mesh-client-*.test.ts: existing fixtures updated to pass
  autoRegister: false (no registry available in those tests). Behavior
  unchanged for the production path.
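
For reference, base64url <-> Uint8Array marshalling can be done with nothing but btoa/atob. This is a sketch with hypothetical function names; whether the registry expects padded or unpadded base64url is not stated, so the unpadded form is assumed on output and both are accepted on input.

```typescript
function toBase64Url(bytes: Uint8Array): string {
  let bin = "";
  for (let i = 0; i < bytes.length; i++) bin += String.fromCharCode(bytes[i]);
  // Swap the two URL-unsafe characters and drop "=" padding.
  return btoa(bin).replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");
}

function fromBase64Url(text: string): Uint8Array {
  const b64 = text.replace(/-/g, "+").replace(/_/g, "/").replace(/=+$/, "");
  // Restore the padding that atob expects.
  const padded = b64 + "=".repeat((4 - (b64.length % 4)) % 4);
  const bin = atob(padded);
  const out = new Uint8Array(bin.length);
  for (let i = 0; i < bin.length; i++) out[i] = bin.charCodeAt(i);
  return out;
}

const encoded = toBase64Url(new Uint8Array([251, 255, 190]));
const decoded = fromBase64Url("_w");
```

The three sample bytes are chosen because they hit both substituted characters: standard base64 gives "+/++", base64url gives "-_--".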

All 71 tests pass (60 previously + 11 new).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(registry): add POST /v1/agents/{did}/heartbeat to bump last_seen

The registry's last_seen field is set on registration but never
updated by any HTTP handler — update_last_seen() exists in
store.py but is dead code in production. As a result, every
registered agent looks 'offline' (online=false on /presence) 90s
after spawn, and any client-side discover stale-filter rejects
all live agents.

Add an unauthenticated POST /v1/agents/{did}/heartbeat that calls
store.update_last_seen(). Idempotent. Returns 404 for unknown DIDs
so callers can detect a registry restart and re-register.

Mirrors the agent-side ping cadence (30s, well within the 90s
presence threshold).
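
A sketch of the client-side pieces this implies: building the heartbeat URL, deciding what to do with the response status, and the presence check. The endpoint path, the 404 meaning, and the 30s/90s figures come from the commit message; the function names, the percent-encoding of the DID, and the example host are assumptions.

```typescript
const PRESENCE_THRESHOLD_MS = 90000; // presence threshold from the commit message
const HEARTBEAT_INTERVAL_MS = 30000; // agent-side ping cadence

function heartbeatUrl(registryUrl: string, did: string): string {
  // Assumption: the DID is percent-encoded as a path segment (DIDs contain ":").
  return `${registryUrl.replace(/\/+$/, "")}/v1/agents/${encodeURIComponent(did)}/heartbeat`;
}

type HeartbeatAction = "ok" | "re-register" | "retry-later";

function actionForStatus(status: number): HeartbeatAction {
  if (status >= 200 && status < 300) return "ok";
  if (status === 404) return "re-register"; // registry no longer knows this DID
  return "retry-later";
}

function isOnline(lastSeenMs: number, nowMs: number): boolean {
  return nowMs - lastSeenMs < PRESENCE_THRESHOLD_MS;
}

// How many consecutive heartbeats can be lost before the agent looks offline.
const tolerableMisses = Math.floor(PRESENCE_THRESHOLD_MS / HEARTBEAT_INTERVAL_MS) - 1;
```

With a 30s cadence and a 90s threshold, an agent can lose two heartbeats in a row and still show as online when the third arrives.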

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(mesh): include Ed25519 identity_key_ed in prekey bundle so verifyBundle() works

PreKeyBundle.identityKeyEd was previously aliased to the X25519 identity_key
because the registry didn't store the Ed25519 signing key. As a result,
verifyBundle() either passed only because the stub test never exercised it,
or quietly accepted any signed pre-key — a soundness gap in the X3DH wrapper.

This commit threads identityKeyEd through the full path:

- agent-mesh/registry/store.py (AgentRecord): add identity_key_ed field
  (X25519 long-term key already present as identity_key; Ed25519 distinct).
- agent-governance-typescript/src/encryption/x3dh.ts: expose
  X3DHKeyManager.identityKeyEd getter.
- agent-governance-typescript/src/encryption/registry-client.ts:
  uploadPrekeys() now accepts identityKeyEd (32 bytes, validated) and
  serializes it as identity_key_ed; fetchPrekeys() deserializes it back into
  PreKeyBundle.identityKeyEd. Older bundles without the field fall back to
  the X25519 key — verifyBundle() will then correctly reject (forensic
  visibility, no silent bypass).
- agent-governance-typescript/src/encryption/mesh-client.ts: registerSelf()
  passes keyManager.identityKeyEd through to the registry.
- tests/registry-client.test.ts + tests/test_registry.py: assert the new
  field is round-tripped on PUT/GET.

Wire compatibility: PUT now requires identity_key_ed; GET emits it when set.
Existing AGT clients that don't upload it will get verifyBundle() rejection
on the receiving side — peers must upgrade together. (No silent regression
on the wire — the field is new, not a breaking change to existing fields.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* test(mesh): set autoRegister:false in upstream knock + malformed-frame tests

These two upstream tests construct MeshClient without autoRegister:false,
which now triggers a real fetch to the registry on connect() (added by
our registry-client commit). The tests have no fake registry, so they
fail with 'fetch failed'. autoRegister:false skips the auto-registration
path and matches the pattern used in our 6 sibling mesh-client-*.test.ts
files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Pal Lakatos-Toth <pallakatos@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
MohammadHaroonAbuomar pushed a commit to MohammadHaroonAbuomar/agt-acs that referenced this pull request Jun 1, 2026
…th (microsoft#2092)

- Wrap messageHandlers loop in try/catch to match errorHandlers/disconnectHandlers
  (prevents one throwing handler from killing subsequent handlers)
- Validate identity_key_ed is exactly 32 bytes after base64 decode
  (rejects garbage bytes that would break downstream Ed25519 verification)
- Require reporter_amid to be a registered session participant on
  reputation/session endpoint (prevents anonymous reputation manipulation)
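
The handler-isolation fix amounts to a per-handler try/catch. A minimal sketch, with a hypothetical dispatch() helper standing in for the loop inside MeshClient:

```typescript
type Handler<T> = (event: T) => void;

// Invoke every handler; a throwing handler is reported but does not stop the rest.
function dispatch<T>(
  handlers: Handler<T>[],
  event: T,
  onHandlerError: (err: unknown) => void,
): void {
  for (const handler of handlers) {
    try {
      handler(event);
    } catch (err) {
      onHandlerError(err);
    }
  }
}

const received: string[] = [];
const failures: unknown[] = [];
dispatch<string>(
  [
    (m) => { received.push("first:" + m); },
    () => { throw new Error("boom"); },
    (m) => { received.push("third:" + m); },
  ],
  "hello",
  (err) => { failures.push(err); },
);
```

The third handler still runs even though the second one throws, which is the behavior the error and disconnect handler loops already had.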

Follow-up to microsoft#2090.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
MohammadHaroonAbuomar pushed a commit to MohammadHaroonAbuomar/agt-acs that referenced this pull request Jun 1, 2026
…ail-fast (microsoft#2093)

* fix(mesh): harden message handlers, key validation, and reputation auth

- Wrap messageHandlers loop in try/catch to match errorHandlers/disconnectHandlers
  (prevents one throwing handler from killing subsequent handlers)
- Validate identity_key_ed is exactly 32 bytes after base64 decode
  (rejects garbage bytes that would break downstream Ed25519 verification)
- Require reporter_amid to be a registered session participant on
  reputation/session endpoint (prevents anonymous reputation manipulation)

Follow-up to microsoft#2090.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(mesh): global buffer cap, heartbeat rate limit, identity_key_ed validation

- Add maxBufferedPeers option (default 100) to MeshClient to prevent
  memory exhaustion from adversary sending from many distinct DIDs
- Rate-limit heartbeat endpoint to 1 per 10s per agent (atomic, via
  store-level try_update_last_seen) to prevent stale agent keepalive
- Reject fetchPrekeys when peer bundle lacks identity_key_ed instead
  of silently substituting X25519 key (fail-fast on incompatible peers)
- Add tests for all three hardening measures plus identity_key_ed
  length validation and session reputation participant check
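
The heartbeat rate limit can be pictured as a single check-and-set per agent. The sketch below is in TypeScript for consistency with the other examples; the registry store is Python, and the class name, the three-way return value, and counting registration as the first "seen" time are assumptions.

```typescript
type HeartbeatResult = "unknown" | "limited" | "updated";

class HeartbeatLimiter {
  private lastSeen = new Map<string, number>();
  private minIntervalMs: number;

  constructor(minIntervalMs = 10000) {
    this.minIntervalMs = minIntervalMs;
  }

  register(did: string, now: number): void {
    this.lastSeen.set(did, now);
  }

  // Check and update in one step so two near-simultaneous calls cannot both pass.
  tryUpdateLastSeen(did: string, now: number): HeartbeatResult {
    const previous = this.lastSeen.get(did);
    if (previous === undefined) return "unknown"; // maps to the 404 case
    if (now - previous < this.minIntervalMs) return "limited";
    this.lastSeen.set(did, now);
    return "updated";
  }
}
```

An HTTP handler would translate "unknown" into 404 and "limited" into a rate-limit response; the exact status code used by the registry is not given in the commit message.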

TS: 429/429 tests pass. Python: 25/25 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Labels: agent-mesh (agent-mesh package), size/XL (Extra large PR, 500+ lines), tests