
test(harness): smart mock LLM provider + fake Composio backend + 21 new tests#1729

Merged
senamakel merged 5 commits into
tinyhumansai:mainfrom
senamakel:feat/smarter-mock-harness-tests
May 14, 2026

@senamakel
Member

@senamakel senamakel commented May 14, 2026

Summary

  • New harness::test_support module with two reusable test utilities:
    • KeywordScriptedProvider — a real Provider impl that drives the live run_tool_call_loop by reacting to the rolling conversation (latest user/tool message) via keyword rules. Supports both prompt-guided XML and native OpenAI-style tool_calls, plus a forced-response queue.
    • spawn_fake_composio_backend — a hermetic in-process axum server serving realistic gmail / notion / github / slack fixture data wired into a real ComposioClient.
  • 21 new behavioural / corner-case tests for the agent harness and tool-invocation flow (see Solution).
  • Keyword-driven smart mock at /openai/v1/chat/completions on the existing JS mock backend (scripts/mock-api/routes/llm.mjs) that mirrors the Rust-side mental model.
  • Documents one surfaced quirk in parse_arguments_value (silent drop of non-JSON arguments strings) via a dedicated documents_… test pinning the current behaviour.

Problem

The agent harness — system-prompt → LLM → tool-call → tool exec → result → loop — was light on tests that exercise the entire path end-to-end. The pre-existing ScriptedProvider in tool_loop_tests replays a fixed response queue, so tests can't easily react to what tools returned in a previous iteration. Composio integration tests stopped at the HTTP boundary, leaving the agent → ComposioActionTool → backend round-trip uncovered. There was no easy way to drive realistic LLM behaviour from an E2E spec either — the OpenAI mock was a single canned reply.

Solution

  • src/openhuman/agent/harness/test_support.rs
    • KeywordScriptedProvider { rules, native_tools, vision, fallback, forced_queue } — implements Provider::chat, inspects the latest user/tool message, picks the first matching KeywordRule, and emits either an XML <tool_call> block (prompt-guided) or a native tool_calls payload, plus optional text. Returns the fallback string to terminate the loop if nothing matches. Records every turn for assertion.
    • spawn_fake_composio_backend(ComposioFixture) boots an axum server on 127.0.0.1:0 serving /agent-integrations/composio/{toolkits, connections, tools, authorize, execute, connections/:id} with realistic gmail / notion / github / slack data + per-action execute responses. FakeComposioBackend::client() hands back a real ComposioClient pointed at it.
  • src/openhuman/agent/harness/test_support_tests.rs — 12 behavioural tests covering: prompt-guided + native dispatch, multi-tool chaining across iterations, unknown-tool error path, MaxIterationsExceeded guard, CliRpcOnly refusal, ToolResult::error + anyhow::bail! propagation, visible_tool_names whitelist, extra_tools, and a full agent → ComposioActionTool → fake backend round-trip.
  • src/openhuman/agent/harness/bughunt_tests.rs — 9 targeted corner-case tests covering: JSON-encoded-string arguments round-trip, silent drop of non-JSON arguments strings (surfaced quirk — flagged for follow-up), parallel <tool_call> blocks in one iteration, same-name registry tools, markdown-fenced tool_call blocks, native-vs-XML precedence (no double-fire), per-tool max_result_size_chars cap, empty-response termination, and AgentProgress lifecycle ordering.
  • scripts/mock-api/routes/llm.mjs — JS-side smart LLM mock at the same OpenAI completions URL, controlled via admin behaviours llmKeywordRules, llmForcedResponses, llmFallbackContent. Backwards-compatible default ("Hello from e2e mock agent") so existing E2E specs keep passing.
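
The selection order described above can be sketched in a few lines of std-only Rust. This is a minimal illustration, not the code in test_support.rs: `Rule`, `ScriptedProvider`, and the string-only `chat` signature are hypothetical simplifications, and two details are assumptions borrowed from the JS mock's description (the forced queue wins over keyword rules, and matching is a lowercased substring test).

```rust
use std::collections::VecDeque;

/// Hypothetical, simplified rule: reply with `response` when `keyword`
/// occurs in the probed message.
struct Rule {
    keyword: String,
    response: String,
}

/// Hypothetical stand-in for a keyword-scripted provider.
struct ScriptedProvider {
    forced_queue: VecDeque<String>,
    rules: Vec<Rule>,
    fallback: String,
    /// Every probed message, kept so a test can assert on the turns.
    turns: Vec<String>,
}

impl ScriptedProvider {
    /// Picks a reply for the latest user/tool message.
    fn chat(&mut self, latest_message: &str) -> String {
        self.turns.push(latest_message.to_string());

        // Tier 1 (assumed precedence): a queued forced response wins.
        if let Some(forced) = self.forced_queue.pop_front() {
            return forced;
        }

        // Tier 2: first rule whose keyword occurs in the message
        // (assumed case-insensitive substring match).
        let probe = latest_message.to_lowercase();
        for rule in &self.rules {
            if probe.contains(&rule.keyword.to_lowercase()) {
                return rule.response.clone();
            }
        }

        // Tier 3: plain-text fallback with no tool call, which is what
        // lets a tool-call loop terminate.
        self.fallback.clone()
    }
}
```

Because the reply depends on the latest message rather than on a fixed position in a queue, a rule keyed on text that only appears in a tool result fires only after that tool has run, which is what makes multi-tool chaining across iterations scriptable.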

All 21 new tests pass locally.

Submission Checklist

  • Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy
  • N/A: Diff-coverage gate not applicable — this PR is purely additive new test code (no production lines changed beyond a 10-line route stub in scripts/mock-api/routes/integrations.mjs).
  • N/A: behaviour-only change — no user-visible features added/removed/renamed, so the docs/TEST-COVERAGE-MATRIX.md doesn't need a new row.
  • N/A: no matrix feature IDs touched.
  • No new external network dependencies introduced (mock backend used per Testing Strategy)
  • N/A: no release-cut surface touched.
  • N/A: no linked issue.

Impact

  • Runtime: zero. test_support is #[cfg(test)]-gated, so it is compiled only into test builds and never into shipped binaries.
  • Mock backend: the keyword-driven LLM endpoint defaults to the original canned reply, so unrelated E2E specs are unaffected.
  • Performance / security / migration: none.

Related

  • Closes:
  • Follow-up PR(s)/TODOs: parse_arguments_value should surface a structured error to the LLM instead of silently substituting {} when an arguments string fails to parse — pinned by documents_silent_drop_of_non_json_arguments_string.
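
To make the follow-up concrete, here is a hedged sketch contrasting the pinned behaviour (substitute `{}` silently) with the proposed one (return an error that can be relayed to the LLM). It is not the real parse_arguments_value: `parse_lenient`, `parse_strict`, and `looks_like_json_object` are hypothetical names, and the validity check is a toy stand-in for a real JSON parser.

```rust
/// Toy stand-in for a real JSON parser (ASSUMPTION for illustration):
/// accepts any string that is wrapped in braces after trimming.
fn looks_like_json_object(raw: &str) -> bool {
    let trimmed = raw.trim();
    trimmed.starts_with('{') && trimmed.ends_with('}')
}

/// Pinned behaviour: an undecodable arguments string is silently
/// replaced by an empty object, so the caller cannot tell it failed.
fn parse_lenient(raw: &str) -> String {
    if looks_like_json_object(raw) {
        raw.trim().to_string()
    } else {
        "{}".to_string()
    }
}

/// Proposed behaviour: surface an error message that the loop could
/// hand back to the LLM as the tool result.
fn parse_strict(raw: &str) -> Result<String, String> {
    if looks_like_json_object(raw) {
        Ok(raw.trim().to_string())
    } else {
        Err(format!("arguments must be a JSON object, got: {raw:?}"))
    }
}
```

With the lenient variant, a tool invoked with `not json` runs with empty arguments and the model gets no hint as to why the call misbehaved; the strict variant gives it something to correct on the next iteration.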

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

  • Key: N/A
  • URL: N/A

Commit & Branch

  • Branch: feat/smarter-mock-harness-tests
  • Commit SHA: a64134b

Validation Run

  • pnpm --filter openhuman-app format:check
  • pnpm typecheck
  • Focused tests: cargo test --lib -- test_support_tests bughunt_tests (21 passed)
  • Rust fmt/check (if changed): cargo fmt --check + cargo check --manifest-path Cargo.toml
  • N/A: Tauri fmt/check — no Tauri shell changes

Validation Blocked

  • command: N/A
  • error: N/A
  • impact: N/A

Behavior Changes

  • Intended behavior change: none in production code; adds test infrastructure + tests only.
  • User-visible effect: none.

Parity Contract

  • Legacy behavior preserved: yes — JS mock default response unchanged, no harness code modified.
  • Guard/fallback/dispatch parity checks: N/A.

Duplicate / Superseded PR Handling

  • Duplicate PR(s): N/A
  • Canonical PR: this one
  • Resolution: N/A

Summary by CodeRabbit

  • New Features

    • Added a mock LLM completion handler with a forced-response queue, keyword-driven scripted replies (including simulated tool calls), and configurable fallback content.
  • Tests

    • Added extensive end-to-end and unit tests for the agent harness and tool-call loop, covering argument parsing, multi-tool chaining, truncation, error handling, permission gating, orchestration delegation, and a fake backend for integration-style tests.
    • Added test utilities and fixtures to script provider behavior and simulate backends.
  • Chores

    • Updated mock API routing so LLM completion requests are handled earlier by the new LLM handler.

Review Change Stack

senamakel added 3 commits May 14, 2026 03:25
…st utils

Introduces `harness::test_support`, a #[cfg(test)] module providing:

* `KeywordScriptedProvider` — a `Provider` that reacts to the rolling
  conversation state via keyword rules, supporting both prompt-guided
  (XML `<tool_call>`) and native OpenAI-style `tool_calls` surfaces.
  Records every turn for post-hoc assertion.
* `spawn_fake_composio_backend` — a hermetic in-process axum server
  serving realistic Composio fixture data (gmail/notion/github/slack
  toolkits, connections, tools, and execute responses).

Adds an initial 12-test behavioural suite (`test_support_tests`)
that drives the real `run_tool_call_loop` against these utilities to
cover: prompt-guided + native tool dispatch, multi-tool chaining,
unknown tools, max-iteration guards, CliRpcOnly refusal, error
propagation from tools, visibility whitelists, extra_tools, and a
full agent -> ComposioActionTool -> fake backend round-trip.
…ases

Nine targeted tests using the new KeywordScriptedProvider to surface
behaviours that aren't covered by the existing harness suite:

* Native tool_calls with JSON-encoded string args round-trip correctly.
* DOCUMENTED QUIRK: `parse_arguments_value` silently swallows
  non-JSON string args and replaces them with `{}` — no signal to the
  LLM that its input was unparseable. Flagged for follow-up; the test
  pins the current behaviour so a fix lands deliberately.
* Multiple <tool_call> blocks in a single assistant turn each execute.
* Same-name tool collisions resolve to the first registry entry.
* Markdown-fenced ```tool_call``` blocks parse correctly.
* Native tool_calls take precedence over XML in the same response (no
  double-fire of the same logical call).
* Per-tool `max_result_size_chars` caps history payload.
* Empty response with no tool calls terminates cleanly.
* Progress sink emits TurnStarted -> IterationStarted ->
  ToolCallStarted -> ToolCallCompleted -> TurnCompleted in order.
Replaces the static "Hello from e2e mock agent" stub with a smart
mock LLM that mirrors the Rust-side KeywordScriptedProvider:

* `llmKeywordRules` (JSON array of {keyword, content, toolCalls})
  drives keyword-matched responses, including OpenAI-style native
  `tool_calls` payloads.
* `llmForcedResponses` (JSON array) acts as a one-shot replay queue
  that takes precedence over keyword rules.
* `llmFallbackContent` overrides the default final reply.

Defaults to the original "Hello from e2e mock agent" content when
nothing is configured, so existing E2E specs keep passing.
@senamakel senamakel requested a review from a team May 14, 2026 11:18
@coderabbitai
Contributor

coderabbitai Bot commented May 14, 2026

Walkthrough

Refactors mock API routing by extracting OpenAI chat-completions into a dedicated LLM handler and registers it early; adds comprehensive agent-harness test support including a keyword-scripted provider, fake Composio backend, and extensive unit and end-to-end tests for tool dispatch.

Changes

Mock API LLM Route Refactoring

| Layer / File(s) | Summary |
|---|---|
| LLM chat completions handler / scripts/mock-api/routes/llm.mjs | Implements handleLlmCompletions with three routing tiers: forced-response queue, keyword-rule matching (extracting latest message text, lowercased substring matching), and fallback content. Helper functions extract probe text from messages, construct ChatCompletion responses with optional tool_calls, and build completion payloads. |
| Route registration and integrations cleanup / scripts/mock-api/server.mjs, scripts/mock-api/routes/integrations.mjs | Imports and registers handleLlmCompletions ahead of the integrations catch-all, and removes the inline POST /openai/v1/chat/completions handling from integrations, documenting the delegation. |

Agent Harness Testing Infrastructure and Suites

| Layer / File(s) | Summary |
|---|---|
| Test module declarations / src/openhuman/agent/harness/mod.rs | Adds #[cfg(test)] submodules: bughunt_tests, pub(crate) test_support, and test_support_tests. |
| KeywordScriptedProvider and test fixtures / src/openhuman/agent/harness/test_support.rs | Adds KeywordScriptedProvider, KeywordRule, ScriptedToolCall, ProviderTurn, ComposioFixture, and FakeComposioBackend, plus spawn_fake_composio_backend to start an Axum fake backend and capture requests. |
| Tool dispatch unit tests / src/openhuman/agent/harness/bughunt_tests.rs | Adds focused Tokio tests and test tools validating JSON argument decoding, non-JSON fallback, multi-tool dispatch, registry precedence, fenced markdown parsing, native tool_calls precedence, per-tool result truncation, empty assistant responses, and progress-sink lifecycle events. |
| End-to-end behavioral tests / src/openhuman/agent/harness/test_support_tests.rs | Adds many end-to-end tests exercising run_tool_call_loop: scripted-provider flows, native tool_calls, chaining, unknown-tool handling, max-iteration guards, ToolScope permission behaviors, error propagation, visible_tool_names filtering, extra_tools reachability, and Composio-backed integration with request assertions, including an orchestrator delegation-chain test. |
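
The per-tool result truncation listed above can be pictured with a small sketch. This is an assumption-laden illustration rather than the harness's logic: `cap_result` and the ` [truncated]` marker are hypothetical, and the sketch counts Unicode scalar values (not bytes) and does not count the marker toward the cap.

```rust
/// Caps `result` at `max_chars` Unicode scalar values, appending a
/// marker when anything was cut. Slicing happens on a char boundary,
/// so multi-byte characters are never split.
fn cap_result(result: &str, max_chars: usize) -> String {
    match result.char_indices().nth(max_chars) {
        // `byte_idx` is the byte offset of the first char past the cap.
        Some((byte_idx, _)) => format!("{} [truncated]", &result[..byte_idx]),
        // The result already fits within the cap.
        None => result.to_string(),
    }
}
```

Counting characters instead of bytes matters for fixture data that contains non-ASCII text; a byte-indexed slice at an arbitrary offset would panic inside a multi-byte character.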

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • tinyhumansai/openhuman#1488: Related orchestrator delegation and integrations execution tests that exercise delegation-tool routing and backend execute payloads.

Suggested reviewers

  • graycyrus

Poem

🐇 A rabbit scribbles tests by night,

Mock LLMs answer with scripted might,
Providers cue and backends fake,
Tool calls hop and results they make,
Hooray — the harness wakes!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 35.63% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main changes: adding a smart mock LLM provider, fake Composio backend, and 21 new tests for the harness. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/agent/harness/bughunt_tests.rs`:
- Around line 65-437: The PR is failing CI due to rustfmt violations in this
test file (contains tests like
native_tool_call_decodes_json_encoded_arguments_string,
documents_silent_drop_of_non_json_arguments_string,
parallel_tool_calls_in_single_iteration_all_execute, etc.); fix by running
rustfmt on the crate (run cargo fmt --all) and re-running the Rust checks (cargo
check / cargo test) to ensure the formatting changes are committed, then update
the branch with the formatted file so the formatting errors are resolved.

In `@src/openhuman/agent/harness/test_support_tests.rs`:
- Around line 10-511: The file fails rustfmt checks—format the Rust test file
and re-run checks: run rustfmt (cargo fmt --all) or rustfmt on the crate, then
run cargo check (or cargo test) to ensure no formatting/type issues remain;
commit the formatted changes. Focus on this test module (symbols: RecordingTool,
CliOnlyTool, FailingTool, PanickyTool and test functions like
keyword_provider_drives_prompt_guided_tool_loop_to_completion,
keyword_provider_drives_native_tool_calls_path,
keyword_provider_chains_multiple_tools_across_iterations) so the diff only
contains whitespace/formatting fixes and CI passes.

In `@src/openhuman/agent/harness/test_support.rs`:
- Line 281: The new test_support module is not rustfmt-formatted; run rustfmt
(or cargo fmt) on the module (the test_support.rs file) to apply formatting
changes, then re-run CI checks (cargo check) to ensure the rustfmt diffs are
resolved before merging; commit the formatted file so CI passes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7cb9ee92-2c5e-411f-8206-2690568df119

📥 Commits

Reviewing files that changed from the base of the PR and between bf9404a and a64134b.

📒 Files selected for processing (7)
  • scripts/mock-api/routes/integrations.mjs
  • scripts/mock-api/routes/llm.mjs
  • scripts/mock-api/server.mjs
  • src/openhuman/agent/harness/bughunt_tests.rs
  • src/openhuman/agent/harness/mod.rs
  • src/openhuman/agent/harness/test_support.rs
  • src/openhuman/agent/harness/test_support_tests.rs

Comment thread src/openhuman/agent/harness/bughunt_tests.rs
Comment thread src/openhuman/agent/harness/test_support_tests.rs
Comment thread src/openhuman/agent/harness/test_support.rs
coderabbitai[bot]
coderabbitai Bot previously approved these changes May 14, 2026
Adds an end-to-end test that proves the full chain that a user logged in
via Composio relies on:

  1. `orchestrator::prompt::build` advertises the connected toolkit via
     the collapsed `delegate_to_integrations_agent` tool.
  2. Given that system prompt and a user task mentioning gmail, the mock
     LLM emits a delegation tool call that satisfies the real
     SkillDelegationTool schema (`{toolkit: "gmail", prompt: ...}`).
  3. A `TestDelegationTool` (test stand-in for SkillDelegationTool that
     skips the heavyweight sub-agent runner) runs a NESTED
     `run_tool_call_loop` for the integrations side — the same code path
     integrations_agent uses — with a real ComposioActionTool wired to
     the hermetic fake Composio backend.
  4. The fake backend records the `/execute` call with the
     orchestrator-routed arguments (recipient_email, subject, body), and
     the final reply propagates back to the user.
@senamakel senamakel merged commit 1c58d47 into tinyhumansai:main May 14, 2026
25 of 27 checks passed