test(harness): smart mock LLM provider + fake Composio backend + 21 new tests#1729
Conversation
…st utils Introduces `harness::test_support`, a #[cfg(test)] module providing: * `KeywordScriptedProvider` — a `Provider` that reacts to the rolling conversation state via keyword rules, supporting both prompt-guided (XML `<tool_call>`) and native OpenAI-style `tool_calls` surfaces. Records every turn for post-hoc assertion. * `spawn_fake_composio_backend` — a hermetic in-process axum server serving realistic Composio fixture data (gmail/notion/github/slack toolkits, connections, tools, and execute responses). Adds an initial 12-test behavioural suite (`test_support_tests`) that drives the real `run_tool_call_loop` against these utilities to cover: prompt-guided + native tool dispatch, multi-tool chaining, unknown tools, max-iteration guards, CliRpcOnly refusal, error propagation from tools, visibility whitelists, extra_tools, and a full agent -> ComposioActionTool -> fake backend round-trip.
…ases
Nine targeted tests using the new KeywordScriptedProvider to surface
behaviours that aren't covered by the existing harness suite:
* Native tool_calls with JSON-encoded string args round-trip correctly.
* DOCUMENTED QUIRK: `parse_arguments_value` silently swallows
non-JSON string args and replaces them with `{}` — no signal to the
LLM that its input was unparseable. Flagged for follow-up; the test
pins the current behaviour so a fix lands deliberately.
* Multiple <tool_call> blocks in a single assistant turn each execute.
* Same-name tool collisions resolve to the first registry entry.
* Markdown-fenced ```tool_call``` blocks parse correctly.
* Native tool_calls take precedence over XML in the same response (no
double-fire of the same logical call).
* Per-tool `max_result_size_chars` caps history payload.
* Empty response with no tool calls terminates cleanly.
* Progress sink emits TurnStarted -> IterationStarted ->
ToolCallStarted -> ToolCallCompleted -> TurnCompleted in order.
Replaces the static "Hello from e2e mock agent" stub with a smart
mock LLM that mirrors the Rust-side KeywordScriptedProvider:
* `llmKeywordRules` (JSON array of {keyword, content, toolCalls})
drives keyword-matched responses, including OpenAI-style native
`tool_calls` payloads.
* `llmForcedResponses` (JSON array) acts as a one-shot replay queue
that takes precedence over keyword rules.
* `llmFallbackContent` overrides the default final reply.
Defaults to the original "Hello from e2e mock agent" content when
nothing is configured, so existing E2E specs keep passing.
📝 WalkthroughWalkthroughRefactors mock API routing by extracting OpenAI chat-completions into a dedicated LLM handler and registers it early; adds comprehensive agent-harness test support including a keyword-scripted provider, fake Composio backend, and extensive unit and end-to-end tests for tool dispatch. ChangesMock API LLM Route Refactoring
Agent Harness Testing Infrastructure and Suites
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Warning Review ran into problems🔥 ProblemsStopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/agent/harness/bughunt_tests.rs`:
- Around line 65-437: The PR is failing CI due to rustfmt violations in this
test file (contains tests like
native_tool_call_decodes_json_encoded_arguments_string,
documents_silent_drop_of_non_json_arguments_string,
parallel_tool_calls_in_single_iteration_all_execute, etc.); fix by running
rustfmt on the crate (run cargo fmt --all) and re-running the Rust checks (cargo
check / cargo test) to ensure the formatting changes are committed, then update
the branch with the formatted file so the formatting errors are resolved.
In `@src/openhuman/agent/harness/test_support_tests.rs`:
- Around line 10-511: The file fails rustfmt checks—format the Rust test file
and re-run checks: run rustfmt (cargo fmt --all) or rustfmt on the crate, then
run cargo check (or cargo test) to ensure no formatting/type issues remain;
commit the formatted changes. Focus on this test module (symbols: RecordingTool,
CliOnlyTool, FailingTool, PanickyTool and test functions like
keyword_provider_drives_prompt_guided_tool_loop_to_completion,
keyword_provider_drives_native_tool_calls_path,
keyword_provider_chains_multiple_tools_across_iterations) so the diff only
contains whitespace/formatting fixes and CI passes.
In `@src/openhuman/agent/harness/test_support.rs`:
- Line 281: The new test_support module is not rustfmt-formatted; run rustfmt
(or cargo fmt) on the module (the test_support.rs file) to apply formatting
changes, then re-run CI checks (cargo check) to ensure the rustfmt diffs are
resolved before merging; commit the formatted file so CI passes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7cb9ee92-2c5e-411f-8206-2690568df119
📒 Files selected for processing (7)
scripts/mock-api/routes/integrations.mjsscripts/mock-api/routes/llm.mjsscripts/mock-api/server.mjssrc/openhuman/agent/harness/bughunt_tests.rssrc/openhuman/agent/harness/mod.rssrc/openhuman/agent/harness/test_support.rssrc/openhuman/agent/harness/test_support_tests.rs
Adds an end-to-end test that proves the full chain a user-logged-in-via-
Composio relies on:
1. `orchestrator::prompt::build` advertises the connected toolkit via
the collapsed `delegate_to_integrations_agent` tool.
2. Given that system prompt and a user task mentioning gmail, the mock
LLM emits a delegation tool call that satisfies the real
SkillDelegationTool schema (`{toolkit: "gmail", prompt: ...}`).
3. A `TestDelegationTool` (test stand-in for SkillDelegationTool that
skips the heavyweight sub-agent runner) runs a NESTED
`run_tool_call_loop` for the integrations side — the same code path
integrations_agent uses — with a real ComposioActionTool wired to
the hermetic fake Composio backend.
4. The fake backend records the `/execute` call with the
orchestrator-routed arguments (recipient_email, subject, body), and
the final reply propagates back to the user.
Summary
harness::test_supportmodule with two reusable test utilities:KeywordScriptedProvider— a realProviderimpl that drives the liverun_tool_call_loopby reacting to the rolling conversation (latest user/tool message) via keyword rules. Supports both prompt-guided XML and native OpenAI-styletool_calls, plus a forced-response queue.spawn_fake_composio_backend— a hermetic in-process axum server serving realistic gmail / notion / github / slack fixture data wired into a realComposioClient./openai/v1/chat/completionson the existing JS mock backend (scripts/mock-api/routes/llm.mjs) that mirrors the Rust-side mental model.parse_arguments_value(silent drop of non-JSONargumentsstrings) via a dedicateddocuments_…test pinning the current behaviour.Problem
The agent harness — system-prompt → LLM → tool-call → tool exec → result → loop — was light on tests that exercise the entire path end-to-end. The pre-existing
ScriptedProviderintool_loop_testsreplays a fixed response queue, so tests can't easily react to what tools returned in a previous iteration. Composio integration tests stopped at the HTTP boundary, leaving the agent → ComposioActionTool → backend round-trip uncovered. There was no easy way to drive realistic LLM behaviour from an E2E spec either — the OpenAI mock was a single canned reply.Solution
src/openhuman/agent/harness/test_support.rsKeywordScriptedProvider { rules, native_tools, vision, fallback, forced_queue }— implementsProvider::chat, inspects the latestuser/toolmessage, picks the first matchingKeywordRule, and emits either an XML<tool_call>block (prompt-guided) or a nativetool_callspayload, plus optional text. Returns the fallback string to terminate the loop if nothing matches. Records every turn for assertion.spawn_fake_composio_backend(ComposioFixture)boots an axum server on127.0.0.1:0serving/agent-integrations/composio/{toolkits, connections, tools, authorize, execute, connections/:id}with realistic gmail / notion / github / slack data + per-action execute responses.FakeComposioBackend::client()hands back a realComposioClientpointed at it.src/openhuman/agent/harness/test_support_tests.rs— 12 behavioural tests covering: prompt-guided + native dispatch, multi-tool chaining across iterations, unknown-tool error path,MaxIterationsExceededguard,CliRpcOnlyrefusal,ToolResult::error+anyhow::bail!propagation,visible_tool_nameswhitelist,extra_tools, and a full agent →ComposioActionTool→ fake backend round-trip.src/openhuman/agent/harness/bughunt_tests.rs— 9 targeted corner-case tests covering: JSON-encoded-stringargumentsround-trip, silent drop of non-JSONargumentsstrings (surfaced quirk — flagged for follow-up), parallel<tool_call>blocks in one iteration, same-name registry tools, markdown-fencedtool_callblocks, native-vs-XML precedence (no double-fire), per-toolmax_result_size_charscap, empty-response termination, andAgentProgresslifecycle ordering.scripts/mock-api/routes/llm.mjs— JS-side smart LLM mock at the same OpenAI completions URL, controlled via admin behavioursllmKeywordRules,llmForcedResponses,llmFallbackContent. Backwards-compatible default ("Hello from e2e mock agent") so existing E2E specs keep passing.All 21 new tests pass locally.
Submission Checklist
scripts/mock-api/routes/integrations.mjs).docs/TEST-COVERAGE-MATRIX.mddoesn't need a new row.Impact
test_supportis#[cfg(test)]-gated, ships in dev builds only.Related
parse_arguments_valueshould surface a structured error to the LLM instead of silently substituting{}when anargumentsstring fails to parse — pinned bydocuments_silent_drop_of_non_json_arguments_string.AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
Validation Run
pnpm --filter openhuman-app format:checkpnpm typecheckcargo test --lib -- test_support_tests bughunt_tests(21 passed)cargo fmt --check+cargo check --manifest-path Cargo.tomlValidation Blocked
command:N/Aerror:N/Aimpact:N/ABehavior Changes
Parity Contract
Duplicate / Superseded PR Handling
Summary by CodeRabbit
New Features
Tests
Chores