feat: agent self-correction via validation feedback loop by nicknisi · Pull Request #57 · workos/cli

nicknisi · 2026-02-14T14:50:43Z

Summary

Add fast typecheck/build validation that runs between agent turns, giving the agent structured feedback to self-correct within the same session (up to 2 retries)
Auto-detect build systems across ecosystems: JS (package.json), Go (go.mod), Elixir (mix.exs), .NET (*.csproj), Kotlin/Java (build.gradle) — interpreted languages pass through silently
Unify eval executor with production runAgent so evals exercise the actual retry path
Track three-tier pass rates: first-attempt, with-correction, with-retry

Why

The installer ran its agent as a single-shot operation — when validation caught fixable issues, the results went to the user, not back to the agent. The agent never got a chance to fix its own mistakes.

Eval results (14 frameworks, --state=example):

Metric	Value
First-attempt pass rate	92.9%
With-correction pass rate	100%
Self-corrected scenarios	1 of 14
Quality score	4.5/5

Architecture

Agent writes code
    ↓
Typecheck (tsc --noEmit, TS only, ~5s)
    ↓
Pass? → Build (auto-detected per ecosystem)
    ↓         ↓
    ↓    Pass? → Full validation (env vars, files, patterns)
    ↓      ↓
    ↓   Format errors → yield back into same SDK conversation
    ↓      ↓
    ↓   Agent fixes (retains full context, max 2 retries)

The retry loop uses an async generator that yields follow-up user messages into the SDK's query(). The agent retains full conversation context.

Changes

Quick checks (src/lib/validation/quick-checks.ts): Typecheck + build as composable steps. Short-circuits on typecheck failure. quickCheckValidateAndFormat shared between production and evals.

Multi-ecosystem build detection (src/lib/validation/build-validator.ts): detectBuildCommand checks package.json, go.mod, mix.exs, *.csproj, build.gradle. Returns null for interpreted languages.

Retry loop (src/lib/agent-interface.ts): Async generator yields correction prompts on validation failure. Promise-based turn coordination. Exports AgentRunConfig + onMessage hook for evals.

Evals (tests/evals/agent-executor.ts): Delegates to production runAgent. Three-tier success criteria: first-attempt (80%), with-correction (90%), with-retry (95%). --no-correction flag.

Validator composability (src/lib/validation/validator.ts): Exported validatePackages, validateEnvVars, validateFiles, validateFrameworkSpecific with return-based signatures.

Notes

dotnet eval scenario disabled (broken SDK)
Quality grader JSON parsing fixed (greedy regex matched braces inside <thinking> tags)

Restructure validation into composable steps so typecheck (~5s) runs independently before full validation. Quick checks short-circuit on typecheck failure and format errors as actionable agent prompts, laying the foundation for the agent retry loop.

Extend the async generator in agent-interface to yield follow-up correction prompts when quick-checks (typecheck/build) fail. The agent retains full conversation context and gets up to 2 chances to fix its own mistakes before results surface to the user. Configurable via maxRetries option (default 2, 0 to disable).

Add retry-aware execution to AgentExecutor using the same async generator + quick-checks pattern from production. Evals now track three tiers: first-attempt, with-correction, and with-retry pass rates. Adds --no-correction flag to disable for baseline comparison.

AgentExecutor now delegates to the production runAgent instead of reimplementing the retry-aware async generator. Exports AgentRunConfig so evals can construct it directly, adds onMessage hook for latency tracking. Includes 13 tests verifying the wiring.

…rics First-attempt now means zero corrections, which is stricter than before. Lower threshold to 30% (aspirational), add withCorrectionPassRate at 90% as the primary quality gate, keep withRetryPassRate at 95%.

Two eval runs show ~21-27% first-attempt rate. The correction loop consistently brings it to 93-100%. Set threshold at 20% to catch regressions without failing on normal variance.

…hreshold detectTypecheckCommand was falling back to npx tsc --noEmit for every project including Python, Ruby, Go, etc. Now checks for tsconfig.json before falling back — no tsconfig means skip typecheck entirely. This eliminates false correction triggers on non-JS frameworks. Raises first-attempt threshold to 50% since the false positives were the main driver of the low rate.

…port Extend quick-checks to auto-detect Go (go.mod), Elixir (mix.exs), .NET (*.csproj), and Kotlin/Java (build.gradle) build commands from project files. Interpreted languages (Python, Ruby, PHP) pass through silently — no universal build command exists for them.

…parsing Raise firstAttemptPassRate from 50% to 80% now that false positives from non-TS projects are eliminated (85.7% observed in latest run). Fix quality grader parsing: the greedy regex matched braces inside <thinking> tags. Now extracts JSON only after </thinking> and uses a non-greedy pattern to avoid capturing nested objects.

…move dead code Extract passResult helper (4 identical object literals → 1 function), unify parseTypecheckErrors into single regex with Set dedup, extract quickCheckValidateAndFormat shared between agent-runner and eval executor, remove getIntegration indirection and dead continueUrl param.

…ts-skills * origin/main: (21 commits) chore(main): release 0.7.2 (#67) fix: Correct issue submission links (#66) chore(main): release 0.7.1 (#65) fix: ground AI analysis in SDK documentation (#64) chore(main): release 0.7.0 (#60) fix: improve installer skill and remove shell: true from spawn calls (#63) feat: major workos doctor overhaul — visual refresh, multi-language, AI analysis (#62) fix: replace dotenv devDependency with inline env parser in doctor (#61) feat: add environment, organization, and user management commands (#59) chore(main): release 0.6.0 (#58) feat: agent self-correction via validation feedback loop (#57) chore(main): release 0.5.4 (#56) fix: restore workflow_call and remove registry-url for OIDC chore(main): release 0.5.3 (#55) fix: trigger release.yml directly via release event for OIDC match fix: remove registry-url from setup-node to unblock OIDC auth chore(main): release 0.5.2 (#54) fix: use npm publish for OIDC trusted publishing support chore(main): release 0.5.1 (#53) fix: remove duplicate release trigger causing publish race condition ... # Conflicts: # src/lib/adapters/cli-adapter.ts

nicknisi added 14 commits February 14, 2026 07:30

fix: recalibrate success criteria thresholds for correction-aware met…

81a374e

…rics First-attempt now means zero corrections, which is stricter than before. Lower threshold to 30% (aspirational), add withCorrectionPassRate at 90% as the primary quality gate, keep withRetryPassRate at 95%.

chore: disable dotnet eval scenario (broken SDK, no runtime)

61ee472

fix: lower first-attempt threshold to 20% to match observed baseline

f891dfe

Two eval runs show ~21-27% first-attempt rate. The correction loop consistently brings it to 93-100%. Set threshold at 20% to catch regressions without failing on normal variance.

chore: formatting

b21edf7

chore: remove comment slop and dead validation:quick events

719fd6b

chore: formatting

57047fc

nicknisi merged commit 920fc87 into main Feb 14, 2026
5 checks passed

nicknisi deleted the nicknisi/cli-agent-resiliance branch February 14, 2026 21:50

github-actions bot mentioned this pull request Feb 14, 2026

chore(main): release 0.6.0 #58

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

feat: agent self-correction via validation feedback loop#57

feat: agent self-correction via validation feedback loop#57
nicknisi merged 14 commits intomainfrom
nicknisi/cli-agent-resiliance

nicknisi commented Feb 14, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

nicknisi commented Feb 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Architecture

Changes

Notes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant

nicknisi commented Feb 14, 2026 •

edited

Loading