feat: agent self-correction via validation feedback loop#57
Merged
Conversation
Restructure validation into composable steps so typecheck (~5s) runs independently before full validation. Quick checks short-circuit on typecheck failure and format errors as actionable agent prompts, laying the foundation for the agent retry loop.
Extend the async generator in agent-interface to yield follow-up correction prompts when quick-checks (typecheck/build) fail. The agent retains full conversation context and gets up to 2 chances to fix its own mistakes before results surface to the user. Configurable via maxRetries option (default 2, 0 to disable).
Add retry-aware execution to AgentExecutor using the same async generator + quick-checks pattern from production. Evals now track three tiers: first-attempt, with-correction, and with-retry pass rates. Adds --no-correction flag to disable for baseline comparison.
AgentExecutor now delegates to the production runAgent instead of reimplementing the retry-aware async generator. Exports AgentRunConfig so evals can construct it directly, adds onMessage hook for latency tracking. Includes 13 tests verifying the wiring.
…rics First-attempt now means zero corrections, which is stricter than before. Lower threshold to 30% (aspirational), add withCorrectionPassRate at 90% as the primary quality gate, keep withRetryPassRate at 95%.
Two eval runs show ~21-27% first-attempt rate. The correction loop consistently brings it to 93-100%. Set threshold at 20% to catch regressions without failing on normal variance.
…hreshold detectTypecheckCommand was falling back to npx tsc --noEmit for every project including Python, Ruby, Go, etc. Now checks for tsconfig.json before falling back — no tsconfig means skip typecheck entirely. This eliminates false correction triggers on non-JS frameworks. Raises first-attempt threshold to 50% since the false positives were the main driver of the low rate.
…port Extend quick-checks to auto-detect Go (go.mod), Elixir (mix.exs), .NET (*.csproj), and Kotlin/Java (build.gradle) build commands from project files. Interpreted languages (Python, Ruby, PHP) pass through silently — no universal build command exists for them.
…parsing Raise firstAttemptPassRate from 50% to 80% now that false positives from non-TS projects are eliminated (85.7% observed in latest run). Fix quality grader parsing: the greedy regex matched braces inside <thinking> tags. Now extracts JSON only after </thinking> and uses a non-greedy pattern to avoid capturing nested objects.
…move dead code Extract passResult helper (4 identical object literals → 1 function), unify parseTypecheckErrors into single regex with Set dedup, extract quickCheckValidateAndFormat shared between agent-runner and eval executor, remove getIntegration indirection and dead continueUrl param.
lucasmotta
added a commit
that referenced
this pull request
Feb 20, 2026
…ts-skills * origin/main: (21 commits) chore(main): release 0.7.2 (#67) fix: Correct issue submission links (#66) chore(main): release 0.7.1 (#65) fix: ground AI analysis in SDK documentation (#64) chore(main): release 0.7.0 (#60) fix: improve installer skill and remove shell: true from spawn calls (#63) feat: major workos doctor overhaul — visual refresh, multi-language, AI analysis (#62) fix: replace dotenv devDependency with inline env parser in doctor (#61) feat: add environment, organization, and user management commands (#59) chore(main): release 0.6.0 (#58) feat: agent self-correction via validation feedback loop (#57) chore(main): release 0.5.4 (#56) fix: restore workflow_call and remove registry-url for OIDC chore(main): release 0.5.3 (#55) fix: trigger release.yml directly via release event for OIDC match fix: remove registry-url from setup-node to unblock OIDC auth chore(main): release 0.5.2 (#54) fix: use npm publish for OIDC trusted publishing support chore(main): release 0.5.1 (#53) fix: remove duplicate release trigger causing publish race condition ... # Conflicts: # src/lib/adapters/cli-adapter.ts
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
runAgentso evals exercise the actual retry pathWhy
The installer ran its agent as a single-shot operation — when validation caught fixable issues, the results went to the user, not back to the agent. The agent never got a chance to fix its own mistakes.
Eval results (14 frameworks,
--state=example):Architecture
The retry loop uses an async generator that yields follow-up user messages into the SDK's
query(). The agent retains full conversation context.Changes
Quick checks (
src/lib/validation/quick-checks.ts): Typecheck + build as composable steps. Short-circuits on typecheck failure.quickCheckValidateAndFormatshared between production and evals.Multi-ecosystem build detection (
src/lib/validation/build-validator.ts):detectBuildCommandchecks package.json, go.mod, mix.exs, *.csproj, build.gradle. Returns null for interpreted languages.Retry loop (
src/lib/agent-interface.ts): Async generator yields correction prompts on validation failure. Promise-based turn coordination. ExportsAgentRunConfig+onMessagehook for evals.Evals (
tests/evals/agent-executor.ts): Delegates to productionrunAgent. Three-tier success criteria: first-attempt (80%), with-correction (90%), with-retry (95%).--no-correctionflag.Validator composability (
src/lib/validation/validator.ts): ExportedvalidatePackages,validateEnvVars,validateFiles,validateFrameworkSpecificwith return-based signatures.Notes
<thinking>tags)