
feat(mesh): Phase 5 — AGT as default + local-k8s mesh deployment#246

Merged: pallakatos merged 14 commits into dev from phase5-agt-default on May 12, 2026

@pallakatos

Collaborator

Summary

Flips the default mesh provider from vendored to agt across the entire stack, and wires AGT relay+registry deployment into the azureclaw dev --target local-k8s flow so the kind path no longer leaves the controller in a WS reconnect loop.

Both the dual-provider plumbing (factory, Provider enum, both manifests, both adapters) and the patched vendored fork in vendor/ stay in place — operators can still opt back via AZURECLAW_MESH_PROVIDER=vendored or --mesh-provider=vendored while upstream AGT catches up to our full patch set.

What's in here

Default flipped to agt in:

  • deploy/helm/azureclaw/values.yaml + controller-deployment.yaml
  • controller/src/mesh_peer/mod.rs (Provider::from_env)
  • controller/src/reconciler/mod.rs (sandbox env propagation)
  • sandbox-images/openclaw/entrypoint.sh
  • cli/src/commands/{dev,up,push}.ts (+ subcommands)
  • runtimes/openclaw/src/{index.ts,core/agt-heartbeat.ts}
  • mesh-plugin/src/transport-factory.ts (resolveMeshProvider)
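
To illustrate what "default flipped, opt-back preserved" means in practice, here is a minimal sketch of a resolver in the spirit of `resolveMeshProvider`. The signature, the `MeshProvider` type, and the fallback for unrecognised values are assumptions for illustration, not the code in `mesh-plugin/src/transport-factory.ts`.

```typescript
// Hypothetical sketch: the real resolveMeshProvider may differ in shape.
type MeshProvider = "agt" | "vendored";

function resolveMeshProvider(
  env: Record<string, string | undefined>,
): MeshProvider {
  const raw = (env.AZURECLAW_MESH_PROVIDER ?? "").trim().toLowerCase();
  // Only an explicit "vendored" opts back into the legacy path.
  // Unset, empty, or unrecognised values fall through to the new default
  // (treating unrecognised values this way is an assumption).
  return raw === "vendored" ? "vendored" : "agt";
}
```

A caller would typically pass `process.env`; taking the environment as a parameter keeps the default easy to cover in tests such as `transport-factory.test.ts`.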

Phase 5.3 — local-k8s mesh deployment

  • New deployAgentMesh() helper in cli/src/commands/dev/local-k8s.ts builds the AGT (Python from --agt-repo) or vendored (Rust from vendor/) relay + registry images, loads them into kind, rewrites the manifest's ACR image refs to local tags + imagePullPolicy=Never, applies, and waits for both rollouts before the controller check.
  • New --no-mesh flag for controller-only smoke tests.
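
The ordering described for `deployAgentMesh()` (build, load into kind, apply, wait for rollouts) can be sketched as a pure function that only plans the commands. Everything here is illustrative: `planMeshDeployment`, its option names, the image tags, the cluster name, the namespace, the `agentmesh-registry` deployment name, and the timeout are assumptions. Only the `docker build`, `kind load docker-image`, `kubectl apply`, and `kubectl rollout status` subcommands are real CLI surface.

```typescript
// Hypothetical sketch: plans the shell commands without running them.
type Command = { cmd: string; args: string[] };

interface MeshPlanOptions {
  images: { tag: string; context: string }[]; // local tag + docker build context
  clusterName: string; // kind cluster to load images into
  manifestPath: string; // manifest already rewritten to local tags
  namespace: string;
  deployments: string[]; // rollouts to wait for before the controller check
}

function planMeshDeployment(o: MeshPlanOptions): Command[] {
  const plan: Command[] = [];
  for (const img of o.images) {
    plan.push({ cmd: "docker", args: ["build", "-t", img.tag, img.context] });
    plan.push({
      cmd: "kind",
      args: ["load", "docker-image", img.tag, "--name", o.clusterName],
    });
  }
  plan.push({
    cmd: "kubectl",
    args: ["apply", "-n", o.namespace, "-f", o.manifestPath],
  });
  for (const name of o.deployments) {
    plan.push({
      cmd: "kubectl",
      args: ["rollout", "status", `deployment/${name}`, "-n", o.namespace, "--timeout=120s"],
    });
  }
  return plan;
}
```

Separating planning from execution is a design choice of this sketch; it lets the ordering be unit-tested without a running cluster.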

Tests + docs

  • mesh-plugin/src/transport-factory.test.ts — defaults flipped, vendored opt-in path covered.
  • docs/agt-vs-vendored-sdk.md — Phase 5 header.
  • CHANGELOG.md — unreleased Phase 5 entry.

Verification

  • cargo fmt --all --check
  • cargo build --release (controller + router + workspace)
  • cargo test --release --all (492+ unit + integration)
  • cli typecheck + cli npm test 640/640
  • mesh-plugin npm test 98/98
  • runtimes/openclaw npm run build

⚠️ Merge policy

Per user direction: this is dev-only. Do not promote to main without explicit sign-off.

Follow-up (deferred to Phase 5.2)

Removal of vendor/agentmesh-{sdk,relay,registry}/ is blocked on upstream AGT PR microsoft/agent-governance-toolkit#2090 merging. Until that lands, the vendored fork remains shippable so git pull && azureclaw dev still works for users on the legacy path.

Pal Lakatos-Toth and others added 2 commits May 12, 2026 07:19
Flip the default mesh provider from "vendored" to "agt" across the
entire stack. The dual-provider plumbing (Provider enum, factory,
both manifests, both adapters) stays — only the default changes, so
operators can still opt back via AZURECLAW_MESH_PROVIDER=vendored or
`--mesh-provider=vendored`.

Default flipped in:
  * deploy/helm/azureclaw/values.yaml + controller-deployment.yaml
  * controller/src/mesh_peer/mod.rs (Provider::from_env)
  * controller/src/reconciler/mod.rs (sandbox env propagation)
  * sandbox-images/openclaw/entrypoint.sh
  * cli/src/commands/{dev,up,push}.ts (+ subcommands)
  * runtimes/openclaw/src/{index.ts,core/agt-heartbeat.ts}
  * mesh-plugin/src/transport-factory.ts (resolveMeshProvider)

Local-k8s mesh deployment (Phase 5.3):
The kind path previously helm-installed the controller but never
deployed agentmesh-relay/registry, so the controller looped on
`agentmesh-relay:8765` not resolving. Now `runLocalK8s` builds
the relay+registry images (AGT Python from --agt-repo, or vendored
Rust from vendor/), loads them into kind, rewrites the manifest's
ACR image refs to local tags + imagePullPolicy=Never, applies, and
waits for both rollouts before the controller check. Adds
`--no-mesh` opt-out for pure controller smoke tests.

Test updates:
  * mesh-plugin/src/transport-factory.test.ts — defaults flipped,
    vendored opt-in path covered.

Verified: cargo build --release ✓, cargo test --all ✓ (492+),
cli typecheck ✓, cli vitest 640/640 ✓, mesh-plugin 98/98 ✓,
runtimes/openclaw build ✓.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread cli/src/commands/dev/local-k8s.ts Fixed
Comment thread cli/src/commands/dev/local-k8s.ts Fixed
Pal Lakatos-Toth and others added 12 commits May 12, 2026 09:56
The Phase 5.3 deployAgentMesh() helper ran unconditionally, which
would have stood up a second local relay+registry on top of an
already-reachable external one: wasteful at best, port-conflicting
at worst. Skip the in-kind deployment when --global-registry is
set (or piped through from 'azureclaw mesh promote --port-forward').
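
The guard described above amounts to a small predicate. This is a sketch under assumptions: the `DevMeshFlags` shape, the helper name, and the idea that `--global-registry` carries a string value are all hypothetical.

```typescript
// Hypothetical sketch of the "should we deploy an in-kind mesh?" decision.
interface DevMeshFlags {
  noMesh?: boolean; // --no-mesh: controller-only smoke test
  globalRegistry?: string; // --global-registry: an external mesh is already reachable
}

function shouldDeployLocalMesh(flags: DevMeshFlags): boolean {
  if (flags.noMesh) return false;
  // A non-empty global registry means a relay+registry already exists,
  // so standing up a second one locally would be redundant.
  if (flags.globalRegistry !== undefined && flags.globalRegistry.trim() !== "") {
    return false;
  }
  return true;
}
```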

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CodeQL flagged the new RegExp constructors in deployAgentMesh as
'missing regular expression anchor' and 'incomplete hostname regexp':
the dots in 'azureclawacr.azurecr.io' weren't escaped, so each '.'
matched any character and a look-alike hostname could match.
Functionally fine for our manifest (only ACR strings present), but
switching to plain String.replaceAll removes the smell entirely.
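
A minimal sketch of the literal-replacement approach, assuming a hypothetical `rewriteManifestForKind` helper and made-up image tags (the real code in `deployAgentMesh` is not shown in this PR text). It also demonstrates the looseness of an unescaped hostname pattern.

```typescript
// Hypothetical sketch: rewrite ACR image refs to local tags with literal
// string replacement, then force imagePullPolicy to Never for kind.
function rewriteManifestForKind(
  manifest: string,
  imageRewrites: Record<string, string>, // ACR ref -> local tag
): string {
  let out = manifest;
  for (const [acrRef, localTag] of Object.entries(imageRewrites)) {
    // String pattern, not RegExp: "." is matched literally.
    out = out.replaceAll(acrRef, localTag);
  }
  return out
    .split("\n")
    .map((line) => {
      const at = line.indexOf("imagePullPolicy:");
      // Keep the original indentation, replace only the value.
      return at === -1 ? line : `${line.slice(0, at)}imagePullPolicy: Never`;
    })
    .join("\n");
}

// An unescaped "." matches any character, so this look-alike host matches.
const looseMatch = new RegExp("azureclawacr.azurecr.io").test(
  "azureclawacrXazurecr.io",
);
```

This sketch only rewrites `imagePullPolicy` lines that already exist; a manifest that omits the field would need it added separately.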

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- identity.ts uses node:crypto for Ed25519 + X25519 (no sodium-native fork)
- transport-factory collapsed to AGT-only; legacy 'vendored' env ignored
- delete vendored MeshConnection adapter + type shim + dead tests
- ci.yml / ci-gates.yml / Makefile / Cargo.toml drop vendor/agentmesh refs
- sandbox Dockerfile drops vendored-SDK overlay (npm @microsoft/agent-governance-sdk only)

63 mesh-plugin vitest tests pass; vendored fork removal continues in
follow-up commits (CLI + controller + runtime + docs).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Following AGT PR microsoft/agent-governance-toolkit#2090 merging upstream
(commit 6c3af4c on 2026-05-11), all 5 gap-closing fixes (G1–G5) plus
event hooks + RegistryClient + heartbeat + Ed25519 verify are now in
the published @microsoft/agent-governance-sdk. The vendored Rust
relay/registry and patched TS SDK fork are no longer needed.

CLI cleanup:
- dev.ts: remove vendored provider prompt, build branches, postgres
  startup, env-var branching, run dispatch; AGT is the only choice.
- dev/local-k8s.ts: deployAgentMesh() is AGT-only; manifest path,
  image build, image rewrite all collapse to single branch.
- mesh/health.ts: drop /v1/health + WS-on-/ vendored fallbacks.
- mesh/provider.ts: deleted (live vendored↔AGT switcher pointless).
- push.ts + push.test.ts: relay/registry images dropped (sandbox image
  list 6→4); tests updated; mesh images now built only via
  azureclaw push --only relay/--only registry from the AGT repo.
- up.ts + up/agentmesh_deploy.ts: vendored buildPush + postgres ACR
  import + db-credentials secret all gone; only agentmesh-agt.yaml
  manifest applied.
- sandbox-hardening.test.ts: drop /opt/azureclaw-vendored-sdk read-only
  assertion (vendored overlay no longer in Dockerfile).

Vendor cleanup:
- vendor/agentmesh-{sdk,relay,registry}/ — deleted entirely
- ci/vendored-patch-audit.sh — deleted
- deploy/agentmesh.yaml — deleted (only agentmesh-agt.yaml remains)

639/639 CLI tests pass. tsc --noEmit clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ase 5.2)

- controller: remove the Provider::Vendored enum variant + handle_vendored_frame
- inference-router: collapse mesh signing/routes to AGT-only
- net -397 LOC

cargo build/test/clippy/fmt all green (1361 tests).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…w/mesh (Phase 5.2)

OpenClaw runtime no longer depends on the vendored @agentmesh/sdk.
The 3 crypto operations it actually used (identity gen, Ed25519
sign/verify) now run on Node.js native crypto via helpers re-exported
from @azureclaw/mesh:
- generateIdentity()
- verifyEd25519Signature()
- (signing private key handed to createMeshTransport)
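
For readers unfamiliar with Ed25519 on `node:crypto`, the three operations can be sketched as follows. The helper names `generateIdentity` and `verifyEd25519Signature` come from the commit message, but the signatures and the `Identity` shape below are assumptions, not the `@azureclaw/mesh` API; `signMessage` is a hypothetical name.

```typescript
import { generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

// Hypothetical identity shape; the real one may carry more fields.
interface Identity {
  publicKey: KeyObject;
  privateKey: KeyObject;
}

function generateIdentity(): Identity {
  const { publicKey, privateKey } = generateKeyPairSync("ed25519");
  return { publicKey, privateKey };
}

function signMessage(message: Uint8Array, privateKey: KeyObject): Buffer {
  // Ed25519 hashes internally, so the digest algorithm argument is null.
  return sign(null, message, privateKey);
}

function verifyEd25519Signature(
  message: Uint8Array,
  signature: Uint8Array,
  publicKey: KeyObject,
): boolean {
  return verify(null, message, publicKey, signature);
}
```

An Ed25519 signature is always 64 bytes, which makes a cheap sanity check before calling `verify`.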

Tool-policy evaluation is now an inline ~12-row allow/deny Map; KNOCK
gate just consults that. The router-native /agt/evaluate endpoint
remains the source of truth for full policy semantics.
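
An inline allow/deny table of this kind might look like the sketch below. The tool names, the `evaluateKnock` helper, and the fail-closed default for unknown tools are all assumptions; the commit does not list the actual rows or the default verdict.

```typescript
// Hypothetical sketch of an inline tool-policy table for the KNOCK gate.
type Verdict = "allow" | "deny";

const TOOL_POLICY: ReadonlyMap<string, Verdict> = new Map<string, Verdict>([
  ["read_file", "allow"], // illustrative rows, not the real ~12
  ["list_files", "allow"],
  ["shell_exec", "deny"],
]);

function evaluateKnock(tool: string): Verdict {
  // Unknown tools are denied in this sketch (fail closed); the real
  // default is not stated in the commit message.
  return TOOL_POLICY.get(tool) ?? "deny";
}
```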

Dead code removed: trustStore, auditLogger, AgentMeshClient,
MemoryStorage, dual-provider swap branch.

Tests: runtimes/openclaw 118/118, mesh-plugin 63/63 green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- cli: drop @agentmesh/sdk from package.json + lockfile (no source ref)
- helm values + controller-deployment: collapse mesh.provider doc to
  AGT-only, vendored opt-out removed
- sandbox entrypoint: hardcode AZURECLAW_MESH_PROVIDER=agt, remove
  vendored case branch
- mesh-plugin: refresh transport-interface phase comments, drop
  vendored fallback wording in agt-transport / index
- inference-router: refresh mesh.rs / mod.rs doc comments
- patch-nemoclaw.sh: remove vendored SDK overlay step
- runtime, conformance, docker-compose: refresh historical comments

Builds: mesh-plugin + runtimes/openclaw + cli all green.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…udge

When an agent receives a file via AGT mesh file_transfer, the runtime
already saves it to /sandbox/.openclaw/workspace/incoming/. The LLM,
however, doesn't always think to look in incoming/ and tends to fall
back to generating placeholder assets when real ones are present.

Two surgical fixes:

1. runtime: after writing to incoming/, also copy the file to
   workspace root (best-effort, only when not already present).
   Mirrors the existing handoff:workspace_inject auto-promote
   behavior (~line 1314). Inbox entries now carry an extra
   workspace_path field so the agent sees both locations.

2. sandbox system prompt: add an explicit 'Files received from
   other agents' section instructing the model to check workspace
   root + incoming/ before synthesizing placeholders.
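
Fix 1 can be sketched with an exclusive copy, which gives "only when not already present" without a separate existence check. The helper name and return convention are hypothetical, and the usage runs in a throwaway temp directory rather than `/sandbox/.openclaw/workspace`.

```typescript
import { constants, copyFileSync, mkdirSync, mkdtempSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { basename, join } from "node:path";

// Hypothetical sketch: returns the promoted path, or undefined when the
// copy was skipped or failed (best-effort, never throws).
function promoteToWorkspaceRoot(
  incomingPath: string,
  workspaceRoot: string,
): string | undefined {
  const dest = join(workspaceRoot, basename(incomingPath));
  try {
    // COPYFILE_EXCL makes the copy fail if dest exists, so an existing
    // workspace file is never overwritten.
    copyFileSync(incomingPath, dest, constants.COPYFILE_EXCL);
    return dest;
  } catch {
    return undefined;
  }
}

// Usage against a temp workspace.
const root = mkdtempSync(join(tmpdir(), "workspace-"));
mkdirSync(join(root, "incoming"));
const received = join(root, "incoming", "executive_brief.md");
writeFileSync(received, "real content");
const firstCopy = promoteToWorkspaceRoot(received, root); // copies
const secondCopy = promoteToWorkspaceRoot(received, root); // skipped
```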

Observed in demo: writer transferred the executive_brief.md + hero
PNG + scorecard PNG to the orchestrator via mesh; orchestrator
generated a placeholder PDF instead of using the real assets.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* dev: re-show the provider picker on first run even when creds exist,
  so users can switch between Copilot / Foundry / Models without first
  having to wipe ~/.azureclaw. If picked provider matches existing creds,
  offer a reuse confirm; otherwise drop into the same prompt flow that
  'azureclaw credentials' uses, forced to the chosen provider.
* config: don't process.exit(1) on Foundry verify probe failures. The
  probe targets the classic AOAI deployments path, which doesn't exist on
  project-scoped Foundry endpoints, so a 404/401 there did not mean creds
  were invalid; exiting still left nothing saved and re-prompted forever. Now
  we warn loudly, save what we have, and let the runtime surface the real
  error at use time. Adds markFirstRunCompleted() helper.
* connect: 'azureclaw connect <name> --reset' rolling-restarts the
  openclaw deployment to clear the gateway's in-process brute-force
  lockout. The gateway-token Secret is preserved across restarts so the
  printed URL/token stays valid. Helps recover from stale browser tabs
  spamming old tokens after dev/up cycles.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closes the remaining gaps that made 'azureclaw dev --target local-k8s'
fail on a fresh cluster while docker mode worked:

* AGT SDK tarball: auto-discover microsoft-agent-governance-sdk-*.tgz
  under $AGT_REPO/agent-governance-typescript and pass it as the
  AGT_SDK_TARBALL build-arg. Without this, the sandbox image installs
  the stock @microsoft/agent-governance-sdk@^3.5.0 from npm, which is
  missing MeshClient.registerSelf / autoRegister, so sub-agents never
  POST /v1/agents and mesh discovery silently fails.
* learnEgress: default ClawSandbox 'networkPolicy.learnEgress = true'
  in dev mode so the forward proxy logs new domains instead of blocking
  them. Without this, Telegram/Slack/Discord channels fail at startup
  with 'Network request for deleteMyCommands failed'. Operators promote
  learned domains via 'azureclaw policy allow' once happy.
* TELEGRAM_ALLOW_FROM: mirror docker mode's resolveChannelTokens flow
  and pull the saved allow-list from 'azureclaw credentials' into the
  '<name>-credentials' Secret. Without this, local-k8s sandboxes started
  Telegram unrestricted (any chat could DM the bot) while docker mode
  honoured the allow-list.
* governance on parent: emit 'spec.governance.enabled: true' in the
  dev YAML so the controller injects AGT_RELAY_URL / AGT_REGISTRY_URL /
  AGT_GOVERNANCE_ENABLED into both containers. Sub-agents are auto-enabled
  by the router spawn helper; the parent must be enabled in the source
  YAML because nothing else turns it on.
* ToolPolicy stub: emit a permissive default '<name>-toolpolicy' in
  the bundle. The router unconditionally injects
  governance.toolPolicyRef = '<parent>-toolpolicy' into spawned sub-agent
  CRs; without this stub every spawn lands in Degraded with
  ToolPolicyNotFound.
* FOUNDRY_PROJECT_ENDPOINT: emit via chart 'foundry.projectEndpoint'
  value (which the controller-deployment template handles) instead of
  duplicating it in 'extraEnv' — server-side apply rejected the latter
  with 'duplicate entries for key'.
* gateway token discovery in startSandboxConnect: read the
  'gateway-token' Secret instead of 'kubectl exec cat /tmp/gateway-token'.
  The exec path is blocked by the ValidatingAdmissionPolicy
  'azureclaw-sandbox-exec-ban' and silently 403s, timing out after 3 min
  even though the gateway is up. Matches how 'azureclaw connect' reads it.
* headlamp chart pinned to 0.41.0: the AzureClaw plugin is built against
  @kinvolk/headlamp-plugin ^0.13.0 and depends on a specific pluginLib
  API surface (KubeObject + SimpleTable + SectionBox + Link). 0.42+
  drifts enough to break the plugin's sidebar/list views. Bump
  intentionally after re-testing.
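
As one example from the list above, the AGT SDK tarball auto-discovery could be sketched like this. The function names and the "pick the lexicographically last match" rule are assumptions; only the file-name pattern and the `agent-governance-typescript` directory come from the commit message.

```typescript
import { readdirSync } from "node:fs";
import { join } from "node:path";

// Hypothetical sketch: choose a tarball from a directory listing.
function pickSdkTarball(fileNames: string[]): string | undefined {
  const matches = fileNames
    .filter(
      (n) =>
        n.startsWith("microsoft-agent-governance-sdk-") && n.endsWith(".tgz"),
    )
    .sort();
  // Lexicographic order is a simplification: it is not a semver comparison.
  return matches[matches.length - 1];
}

function discoverSdkTarball(agtRepo: string): string | undefined {
  const dir = join(agtRepo, "agent-governance-typescript");
  let names: string[];
  try {
    names = readdirSync(dir);
  } catch {
    return undefined; // repo path missing or unreadable
  }
  const picked = pickSdkTarball(names);
  return picked === undefined ? undefined : join(dir, picked);
}
```

When nothing is found, a caller would presumably fall back to the npm package, which is the failure mode the bullet describes.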

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rewrites the headlamp plugin from a broken 254-line stub into a
~700-line operator dashboard for teams running AzureClaw on AKS.

Bug fix:
* 'class extends KubeObject' never wires up apiEndpoint in Headlamp
  0.13's plugin API, so every list view rendered 'Error loading
  clawsandboxes'. Switched to the documented 'makeCustomResourceClass'
  factory (same pattern used by Headlamp's own flux + karpenter
  plugins) and avoid the broken ResourceListView+TableFromResourceClass
  code path by using cls.useList() + SimpleTable directly.

Operator dashboard ('/azureclaw'):
* 11 stat tiles: total sandboxes, by phase (running / pending /
  degraded), egress mode counts (learn vs strict, using the controller's default),
  channel count, runtime mix, inference + tool policies, memories,
  MCP servers, A2A agents.
* Sandboxes-by-phase / runtimes / channels-in-use breakdown tables.
* 'Recent Sandboxes' table with model resolution (inline or via
  InferencePolicy ref) and egress mode that matches the controller
  default ('absent block ⇒ Learn').

CRD coverage (9 CRDs, sidebar + list + detail):
* ClawSandbox, InferencePolicy, ToolPolicy, ClawMemory, McpServer,
  A2aAgent, EgressAllowlist, EgressApproval, IngressPolicy.

ClawSandbox detail extras:
* Network Policy card with controller-matched defaults
* Channels card — detects Telegram/Slack/Discord/WhatsApp from the
  '<name>-credentials' Secret in 'azureclaw-<name>' (Source: Secret)
  *and* from spec.channels (Source: Spec)
* Related Resources card — linked InferencePolicy, ToolPolicy,
  ClawMemory, McpServers
* Mesh card (governance enabled, registry mode, trust threshold)
* Deep links to Pod and Workspace ConfigMap in the sandbox pod ns

Other:
* shortModel() helper: strips provider prefix from LiteLLM-style
  identifiers so 'azure/gpt-5.4' and a plain InferencePolicy
  deployment 'gpt-5.4' both render the same.
* Sub-agent model resolution via spec.inferenceRef → InferencePolicy
  lookup (sub-agents have empty 'runtime.openclaw.config').
* Add tsconfig.json (was missing) — extends Headlamp's default
  plugins-tsconfig so JSX compiles.
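
A minimal sketch of what a `shortModel()` helper along these lines could do. The exact behaviour is an assumption: this version strips only the first path segment, so a nested identifier keeps everything after the provider.

```typescript
// Hypothetical sketch of shortModel(): drop a LiteLLM-style provider prefix.
function shortModel(model: string | undefined): string {
  if (!model) return "";
  const slash = model.indexOf("/");
  // No slash means there is no provider prefix to strip.
  return slash === -1 ? model : model.slice(slash + 1);
}
```

With this, `shortModel("azure/gpt-5.4")` and `shortModel("gpt-5.4")` both yield `gpt-5.4`, so inline and InferencePolicy-resolved models render identically in the tables.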

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@pallakatos pallakatos marked this pull request as ready for review May 12, 2026 15:24
@pallakatos pallakatos merged commit 2ea5a39 into dev May 12, 2026
21 checks passed
pallakatos added a commit that referenced this pull request May 12, 2026
* feat(mesh): Phase 5 — AGT as default + local-k8s mesh deploy

* docs: Phase 5 — AGT-default callouts in CHANGELOG + agt-vs-vendored

* fix(dev): honor --global-registry in --target local-k8s

* fix(dev): replace regex with String.replaceAll for ACR image rewrites

* refactor(mesh-plugin): drop @agentmesh/sdk dependency (Phase 5.2)

* refactor(cli): drop vendored mesh-provider branches (Phase 5.2)

* refactor(controller,router): drop vendored mesh-provider plumbing (Phase 5.2)

* refactor(runtime): drop @agentmesh/sdk, use node:crypto via @azureclaw/mesh (Phase 5.2)

* chore: scrub vendored mesh-provider refs from code + config (Phase 5.2)

* docs: scrub vendored mesh-provider refs from public docs (Phase 5.2)

* sandbox: auto-promote AGT file_transfers to workspace root + prompt nudge

* cli(dev): first-run UX polish + 'connect --reset' for gateway lockout

* cli(dev/local-k8s): parity with docker mode + first-run-green fixes

* tools(headlamp-plugin): redesign as AzureClaw operator dashboard

---------

Co-authored-by: Pal Lakatos-Toth <pallakatos@github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@pallakatos pallakatos deleted the phase5-agt-default branch June 1, 2026 14:39