test(harness): smart mock LLM provider + fake Composio backend + 21 new tests by senamakel · Pull Request #1729 · tinyhumansai/openhuman

senamakel · 2026-05-14T11:18:44Z

Summary

New harness::test_support module with two reusable test utilities:
- KeywordScriptedProvider — a real Provider impl that drives the live run_tool_call_loop by reacting to the rolling conversation (latest user/tool message) via keyword rules. Supports both prompt-guided XML and native OpenAI-style tool_calls, plus a forced-response queue.
- spawn_fake_composio_backend — a hermetic in-process axum server serving realistic gmail / notion / github / slack fixture data wired into a real ComposioClient.
21 new behavioural / corner-case tests for the agent harness and tool-invocation flow (see Solution).
Keyword-driven smart mock at /openai/v1/chat/completions on the existing JS mock backend (scripts/mock-api/routes/llm.mjs) that mirrors the Rust-side mental model.
Documents one surfaced quirk in parse_arguments_value (silent drop of non-JSON arguments strings) via a dedicated documents_… test pinning the current behaviour.

Problem

The agent harness — system-prompt → LLM → tool-call → tool exec → result → loop — was light on tests that exercise the entire path end-to-end. The pre-existing ScriptedProvider in tool_loop_tests replays a fixed response queue, so tests can't easily react to what tools returned in a previous iteration. Composio integration tests stopped at the HTTP boundary, leaving the agent → ComposioActionTool → backend round-trip uncovered. There was no easy way to drive realistic LLM behaviour from an E2E spec either — the OpenAI mock was a single canned reply.

Solution

src/openhuman/agent/harness/test_support.rs
- KeywordScriptedProvider { rules, native_tools, vision, fallback, forced_queue } — implements Provider::chat, inspects the latest user/tool message, picks the first matching KeywordRule, and emits either an XML <tool_call> block (prompt-guided) or a native tool_calls payload, plus optional text. Returns the fallback string to terminate the loop if nothing matches. Records every turn for assertion.
- spawn_fake_composio_backend(ComposioFixture) boots an axum server on 127.0.0.1:0 serving /agent-integrations/composio/{toolkits, connections, tools, authorize, execute, connections/:id} with realistic gmail / notion / github / slack data + per-action execute responses. FakeComposioBackend::client() hands back a real ComposioClient pointed at it.
src/openhuman/agent/harness/test_support_tests.rs — 12 behavioural tests covering: prompt-guided + native dispatch, multi-tool chaining across iterations, unknown-tool error path, MaxIterationsExceeded guard, CliRpcOnly refusal, ToolResult::error + anyhow::bail! propagation, visible_tool_names whitelist, extra_tools, and a full agent → ComposioActionTool → fake backend round-trip.
src/openhuman/agent/harness/bughunt_tests.rs — 9 targeted corner-case tests covering: JSON-encoded-string arguments round-trip, silent drop of non-JSON arguments strings (surfaced quirk — flagged for follow-up), parallel <tool_call> blocks in one iteration, same-name registry tools, markdown-fenced tool_call blocks, native-vs-XML precedence (no double-fire), per-tool max_result_size_chars cap, empty-response termination, and AgentProgress lifecycle ordering.
scripts/mock-api/routes/llm.mjs — JS-side smart LLM mock at the same OpenAI completions URL, controlled via admin behaviours llmKeywordRules, llmForcedResponses, llmFallbackContent. Backwards-compatible default ("Hello from e2e mock agent") so existing E2E specs keep passing.

All 21 new tests pass locally.

Submission Checklist

Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy
N/A: Diff-coverage gate not applicable — this PR is purely additive new test code (no production lines changed beyond a 10-line route stub in scripts/mock-api/routes/integrations.mjs).
N/A: behaviour-only change — no user-visible features added/removed/renamed, so the docs/TEST-COVERAGE-MATRIX.md doesn't need a new row.
N/A: no matrix feature IDs touched.
No new external network dependencies introduced (mock backend used per Testing Strategy)
N/A: no release-cut surface touched.
N/A: no linked issue.

Impact

Runtime: zero — test_support is #[cfg(test)]-gated, ships in dev builds only.
Mock backend: the keyword-driven LLM endpoint defaults to the original canned reply, so unrelated E2E specs are unaffected.
Performance / security / migration: none.

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

Key: N/A
URL: N/A

Commit & Branch

Branch: feat/smarter-mock-harness-tests
Commit SHA: a64134b

Validation Run

pnpm --filter openhuman-app format:check
pnpm typecheck
Focused tests: cargo test --lib -- test_support_tests bughunt_tests (21 passed)
Rust fmt/check (if changed): cargo fmt --check + cargo check --manifest-path Cargo.toml
N/A: Tauri fmt/check — no Tauri shell changes

Validation Blocked

command: N/A
error: N/A
impact: N/A

Behavior Changes

Intended behavior change: none in production code; adds test infrastructure + tests only.
User-visible effect: none.

Parity Contract

Legacy behavior preserved: yes — JS mock default response unchanged, no harness code modified.
Guard/fallback/dispatch parity checks: N/A.

Duplicate / Superseded PR Handling

Duplicate PR(s): N/A
Canonical PR: this one
Resolution: N/A

Summary by CodeRabbit

New Features
- Added a mock LLM completion handler with a forced-response queue, keyword-driven scripted replies (including simulated tool calls), and configurable fallback content.
Tests
- Added extensive end-to-end and unit tests for the agent harness and tool-call loop, covering argument parsing, multi-tool chaining, truncation, error handling, permission gating, orchestration delegation, and a fake backend for integration-style tests.
- Added test utilities and fixtures to script provider behavior and simulate backends.
Chores
- Updated mock API routing so LLM completion requests are handled earlier by the new LLM handler.

…st utils Introduces `harness::test_support`, a #[cfg(test)] module providing: * `KeywordScriptedProvider` — a `Provider` that reacts to the rolling conversation state via keyword rules, supporting both prompt-guided (XML `<tool_call>`) and native OpenAI-style `tool_calls` surfaces. Records every turn for post-hoc assertion. * `spawn_fake_composio_backend` — a hermetic in-process axum server serving realistic Composio fixture data (gmail/notion/github/slack toolkits, connections, tools, and execute responses). Adds an initial 12-test behavioural suite (`test_support_tests`) that drives the real `run_tool_call_loop` against these utilities to cover: prompt-guided + native tool dispatch, multi-tool chaining, unknown tools, max-iteration guards, CliRpcOnly refusal, error propagation from tools, visibility whitelists, extra_tools, and a full agent -> ComposioActionTool -> fake backend round-trip.

…ases Nine targeted tests using the new KeywordScriptedProvider to surface behaviours that aren't covered by the existing harness suite: * Native tool_calls with JSON-encoded string args round-trip correctly. * DOCUMENTED QUIRK: `parse_arguments_value` silently swallows non-JSON string args and replaces them with `{}` — no signal to the LLM that its input was unparseable. Flagged for follow-up; the test pins the current behaviour so a fix lands deliberately. * Multiple <tool_call> blocks in a single assistant turn each execute. * Same-name tool collisions resolve to the first registry entry. * Markdown-fenced ```tool_call``` blocks parse correctly. * Native tool_calls take precedence over XML in the same response (no double-fire of the same logical call). * Per-tool `max_result_size_chars` caps history payload. * Empty response with no tool calls terminates cleanly. * Progress sink emits TurnStarted -> IterationStarted -> ToolCallStarted -> ToolCallCompleted -> TurnCompleted in order.

Replaces the static "Hello from e2e mock agent" stub with a smart mock LLM that mirrors the Rust-side KeywordScriptedProvider: * `llmKeywordRules` (JSON array of {keyword, content, toolCalls}) drives keyword-matched responses, including OpenAI-style native `tool_calls` payloads. * `llmForcedResponses` (JSON array) acts as a one-shot replay queue that takes precedence over keyword rules. * `llmFallbackContent` overrides the default final reply. Defaults to the original "Hello from e2e mock agent" content when nothing is configured, so existing E2E specs keep passing.

coderabbitai · 2026-05-14T11:18:57Z

📝 Walkthrough

Walkthrough

Refactors mock API routing by extracting OpenAI chat-completions into a dedicated LLM handler and registers it early; adds comprehensive agent-harness test support including a keyword-scripted provider, fake Composio backend, and extensive unit and end-to-end tests for tool dispatch.

Changes

Mock API LLM Route Refactoring

Layer / File(s)	Summary
LLM chat completions handler `scripts/mock-api/routes/llm.mjs`	Implements `handleLlmCompletions` with three routing tiers: forced-response queue, keyword-rule matching (extracting latest message text, lowercased substring matching), and fallback content. Helper functions extract probe text from messages, construct ChatCompletion responses with optional `tool_calls`, and build completion payloads.
Route registration and integrations cleanup `scripts/mock-api/server.mjs`, `scripts/mock-api/routes/integrations.mjs`	Imports and registers `handleLlmCompletions` ahead of the integrations catch-all, and removes the inline `POST /openai/v1/chat/completions` handling from integrations, documenting the delegation.

Agent Harness Testing Infrastructure and Suites

Layer / File(s)	Summary
Test module declarations `src/openhuman/agent/harness/mod.rs`	Adds `#[cfg(test)]` submodules: `bughunt_tests`, `pub(crate) test_support`, and `test_support_tests`.
KeywordScriptedProvider and test fixtures `src/openhuman/agent/harness/test_support.rs`	Adds `KeywordScriptedProvider`, `KeywordRule`, `ScriptedToolCall`, `ProviderTurn`, `ComposioFixture`, and `FakeComposioBackend`, plus `spawn_fake_composio_backend` to start an Axum fake backend and capture requests.
Tool dispatch unit tests `src/openhuman/agent/harness/bughunt_tests.rs`	Adds focused Tokio tests and test tools validating JSON argument decoding, non-JSON fallback, multi-tool dispatch, registry precedence, fenced markdown parsing, native `tool_calls` precedence, per-tool result truncation, empty assistant responses, and progress-sink lifecycle events.
End-to-end behavioral tests `src/openhuman/agent/harness/test_support_tests.rs`	Adds many end-to-end tests exercising `run_tool_call_loop`: scripted-provider flows, native `tool_calls`, chaining, unknown-tool handling, max-iteration guards, `ToolScope` permission behaviors, error propagation, `visible_tool_names` filtering, `extra_tools` reachability, and Composio-backed integration with request assertions including an orchestrator-to-delegation delegation chain test.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

tinyhumansai/openhuman#1488: Related orchestrator delegation and integrations execution tests that exercise delegation-tool routing and backend execute payloads.

Suggested reviewers

graycyrus

Poem

🐇 A rabbit scribbles tests by night,

Mock LLMs answer with scripted might,
Providers cue and backends fake,
Tool calls hop and results they make,
Hooray — the harness wakes!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 35.63% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main changes: adding a smart mock LLM provider, fake Composio backend, and 21 new tests for the harness.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/agent/harness/bughunt_tests.rs`:
- Around line 65-437: The PR is failing CI due to rustfmt violations in this
test file (contains tests like
native_tool_call_decodes_json_encoded_arguments_string,
documents_silent_drop_of_non_json_arguments_string,
parallel_tool_calls_in_single_iteration_all_execute, etc.); fix by running
rustfmt on the crate (run cargo fmt --all) and re-running the Rust checks (cargo
check / cargo test) to ensure the formatting changes are committed, then update
the branch with the formatted file so the formatting errors are resolved.

In `@src/openhuman/agent/harness/test_support_tests.rs`:
- Around line 10-511: The file fails rustfmt checks—format the Rust test file
and re-run checks: run rustfmt (cargo fmt --all) or rustfmt on the crate, then
run cargo check (or cargo test) to ensure no formatting/type issues remain;
commit the formatted changes. Focus on this test module (symbols: RecordingTool,
CliOnlyTool, FailingTool, PanickyTool and test functions like
keyword_provider_drives_prompt_guided_tool_loop_to_completion,
keyword_provider_drives_native_tool_calls_path,
keyword_provider_chains_multiple_tools_across_iterations) so the diff only
contains whitespace/formatting fixes and CI passes.

In `@src/openhuman/agent/harness/test_support.rs`:
- Line 281: The new test_support module is not rustfmt-formatted; run rustfmt
(or cargo fmt) on the module (the test_support.rs file) to apply formatting
changes, then re-run CI checks (cargo check) to ensure the rustfmt diffs are
resolved before merging; commit the formatted file so CI passes.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7cb9ee92-2c5e-411f-8206-2690568df119

📥 Commits

Reviewing files that changed from the base of the PR and between bf9404a and a64134b.

📒 Files selected for processing (7)

scripts/mock-api/routes/integrations.mjs
scripts/mock-api/routes/llm.mjs
scripts/mock-api/server.mjs
src/openhuman/agent/harness/bughunt_tests.rs
src/openhuman/agent/harness/mod.rs
src/openhuman/agent/harness/test_support.rs
src/openhuman/agent/harness/test_support_tests.rs

Adds an end-to-end test that proves the full chain a user-logged-in-via- Composio relies on: 1. `orchestrator::prompt::build` advertises the connected toolkit via the collapsed `delegate_to_integrations_agent` tool. 2. Given that system prompt and a user task mentioning gmail, the mock LLM emits a delegation tool call that satisfies the real SkillDelegationTool schema (`{toolkit: "gmail", prompt: ...}`). 3. A `TestDelegationTool` (test stand-in for SkillDelegationTool that skips the heavyweight sub-agent runner) runs a NESTED `run_tool_call_loop` for the integrations side — the same code path integrations_agent uses — with a real ComposioActionTool wired to the hermetic fake Composio backend. 4. The fake backend records the `/execute` call with the orchestrator-routed arguments (recipient_email, subject, body), and the final reply propagates back to the user.

senamakel added 3 commits May 14, 2026 03:25

senamakel requested a review from a team May 14, 2026 11:18

chore: cargo fmt new harness test files

a8ddeea

coderabbitai Bot requested changes May 14, 2026

View reviewed changes

Comment thread src/openhuman/agent/harness/bughunt_tests.rs

Comment thread src/openhuman/agent/harness/test_support_tests.rs

Comment thread src/openhuman/agent/harness/test_support.rs

coderabbitai Bot previously approved these changes May 14, 2026

View reviewed changes

senamakel dismissed coderabbitai[bot]’s stale review via 079076e May 14, 2026 11:26

coderabbitai Bot approved these changes May 14, 2026

View reviewed changes

senamakel merged commit 1c58d47 into tinyhumansai:main May 14, 2026
25 of 27 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(harness): smart mock LLM provider + fake Composio backend + 21 new tests#1729

test(harness): smart mock LLM provider + fake Composio backend + 21 new tests#1729
senamakel merged 5 commits into
tinyhumansai:mainfrom
senamakel:feat/smarter-mock-harness-tests

senamakel commented May 14, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 14, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Review ran into problems

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

senamakel commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

Submission Checklist

Impact

Related

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

Commit & Branch

Validation Run

Validation Blocked

Behavior Changes

Parity Contract

Duplicate / Superseded PR Handling

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Review ran into problems

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

senamakel commented May 14, 2026 •

edited

Loading

coderabbitai Bot commented May 14, 2026 •

edited

Loading