Skip to content

test: restructure test suite into four distinct execution tiers#128

Merged
visahak merged 6 commits into
AgentToolkit:mainfrom
visahak:test/restructure-test-suite
Apr 2, 2026
Merged

test: restructure test suite into four distinct execution tiers#128
visahak merged 6 commits into
AgentToolkit:mainfrom
visahak:test/restructure-test-suite

Conversation

@visahak
Copy link
Copy Markdown
Collaborator

@visahak visahak commented Apr 2, 2026

Background:
The existing test infrastructure suffered from "unknown mark" Pytest warnings, CLI timeout hangs during heavy testing, and a lack of explicit categorization between core logic vs. local infrastructure operations.
What Changed:

  • Tiered Architecture: Reorganized the test suite into exactly four non-overlapping tiers natively registered in pyproject.toml (no more warnings):
    • unit (246 tests) — Core logic + offline Phoenix testing.
    • platform_integrations (18 tests) — Fast filesystem local installation tests.
    • llm (62 tests) — Active remote LLM inference pipelines.
    • e2e (12 tests) — Heavy infrastructure testing simulating full workflows.
  • Infrastructure Stability: Added a robust phoenix_server fixture in the E2E suite to autonomously spin up and tear down a background observability server during tests.
  • Better Error Logging: Solved the evolve sync CLI test "phantom hangs" by capturing and aggregating subprocess logs directly, ensuring error stack traces are printed correctly.
  • Documentation: Updated README.md and docs/LOW_CODE_TRACING.md to reflect the new architecture, removing the obsolete --run-phoenix execution commands.

Summary by CodeRabbit

  • Documentation

    • Updated test-running docs to a marker/flag-based, multi-tier system (unit, platform integration, end-to-end, LLM) and updated tracing/CLI examples to the current CLI entrypoint.
  • Chores

    • Added pgvector and psycopg dev tooling dependencies; adjusted default test-select behavior to exclude LLM and E2E by marker.
  • Tests

    • Reworked test markers and flags (removed Phoenix marker, added platform integration/end-to-end flow) and improved E2E test robustness and timeouts.

visahak added 2 commits April 1, 2026 16:41
Added psycopg[binary]>=3.1 and pgvector>=0.3 to dev dependency group to ensure all unit tests can run during development and CI.

This fixes the test collection error for test_postgres_backend.py while keeping postgres support optional for end users (via the pgvector optional dependency group).
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 2, 2026

📝 Walkthrough

Walkthrough

Replaces the phoenix marker/flag with a tiered pytest marker model (unit, platform_integrations, e2e, llm), updates CLI examples to evolve ..., and enhances E2E tests with an autostarting Phoenix subprocess, stricter subprocess timeouts, and improved sync output/polling logic.

Changes

Cohort / File(s) Summary
Docs
README.md, README_phoenix_sync.md, docs/LOW_CODE_TRACING.md
Rework test-running docs to use pytest marker tiers and replace Python module CLI examples with the evolve entrypoint for Phoenix sync and guideline entity listing. Replace env-var gated E2E examples with -m e2e --run-e2e.
Project Config
pyproject.toml
Add dev deps pgvector>=0.3, psycopg[binary]>=3.1; change pytest addopts default from -m 'not phoenix and not llm' to -m 'not llm and not e2e'; remove phoenix marker, add platform_integrations.
Test Harness
tests/conftest.py
Replace --run-phoenix with --run-e2e option and update pytest_configure logic to remove not e2e from markexpr when flagged; always assign modified markexpr back to config.option.markexpr.
E2E Tests
tests/e2e/test_e2e_pipeline.py
Introduce session-scoped phoenix_server fixture that probes localhost:6006 and conditionally launches a local Phoenix subprocess; switch sync invocation to evolve sync phoenix ...; add subprocess timeouts (agent 90s, trace check 30s); improve sync stdout accumulation, polling loop behavior, and termination grace period (10s).
Unit Test Marker Migration
tests/unit/test_cli.py, tests/unit/test_extract_trajectories.py, tests/unit/test_mcp_server.py, tests/unit/test_phoenix_sync.py, tests/unit/test_tracing.py
Migrate tests previously marked phoenix to unit (module-level pytestmark or decorator updates) to align with new marker taxonomy.
Unit Fixture Adjustment
tests/unit/test_client.py
Change shared fixture to construct EvolveClient with EvolveConfig(backend="filesystem") instead of bare EvolveClient().

Sequence Diagram(s)

sequenceDiagram
  participant Runner as Test Runner
  participant Conftest as pytest conftest
  participant Phoenix as Phoenix Server
  participant CLI as evolve CLI subprocess
  participant Agent as Agent subprocess
  participant Verifier as Trace verifier subprocess

  Note over Runner,Conftest: Session start / flag parsing
  Runner->>Conftest: parse flags (--run-e2e?)
  Conftest-->>Runner: adjust markexpr (remove "not e2e" if set)

  Runner->>Phoenix: request phoenix_server fixture
  Phoenix->>Phoenix: probe http://localhost:6006/status
  alt Phoenix up
    Phoenix-->>Runner: yield base URL
  else Phoenix down
    Phoenix->>CLI: run "evolve sync phoenix ..." (subprocess)
    CLI-->>Phoenix: emit stdout lines (polled, accumulated)
    Phoenix->>CLI: poll for ready status until timeout
    CLI-->>Phoenix: yield base URL or fail on timeout
  end

  Runner->>Agent: run agent subprocess (timeout 90s)
  Agent-->>Runner: produce output / exit
  Runner->>Verifier: run trace verification (timeout 30s)
  Verifier-->>Runner: success/fail

  Runner->>Phoenix: teardown (wait 10s, kill if needed)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • vinodmut
  • gaodan-fang
  • illeatmyhat

Poem

🐰 Hopping through markers, tidy and new,
I wake Phoenix up when it hides from view,
Timeouts keep watch, outputs I store,
Syncs and traces chatter, evermore —
Tests bound forward, neat and true! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main change: restructuring the test suite into four distinct execution tiers (unit, platform_integrations, llm, e2e), which is the core objective across all modified files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (4)
docs/LOW_CODE_TRACING.md (1)

249-250: Add -v to the documented pytest commands for consistency with repo test-run conventions.

This is a docs-quality consistency improvement for test output clarity.

Proposed docs tweak
-uv run pytest -m e2e --run-e2e -s
+uv run pytest -v -m e2e --run-e2e -s

-uv run pytest tests/e2e/test_e2e_pipeline.py -k smolagents -m e2e --run-e2e -s
+uv run pytest -v tests/e2e/test_e2e_pipeline.py -k smolagents -m e2e --run-e2e -s

-uv run pytest tests/e2e/test_e2e_pipeline.py -k openai_agents -m e2e --run-e2e -s
+uv run pytest -v tests/e2e/test_e2e_pipeline.py -k openai_agents -m e2e --run-e2e -s
Based on learnings: Run pytest verbosely with the `-v` flag by default when testing.

Also applies to: 258-262

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/LOW_CODE_TRACING.md` around lines 249 - 250, Update the documented
pytest commands (e.g., the occurrences of "uv run pytest -m e2e --run-e2e -s"
and other pytest command examples in the same section) to include the -v flag so
they read "uv run pytest -v -m e2e --run-e2e -s" (and similarly add -v to the
other pytest command examples mentioned around the block) to match repository
test-run conventions and produce verbose test output.
tests/e2e/test_e2e_pipeline.py (3)

21-22: Move imports to the top of the file.

urllib.request and urllib.error are imported after the configuration block. Per Python style conventions (and Ruff enforcement), all imports should be grouped at the top of the file.

♻️ Suggested fix

Move these imports to join the other imports at the top:

 import subprocess
 import time
 import re
 import os
 import datetime
+import urllib.request
+import urllib.error
 import pytest
 from evolve.config.phoenix import phoenix_settings

Then remove lines 21-22.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/test_e2e_pipeline.py` around lines 21 - 22, Move the late imports
urllib.request and urllib.error to the module import block at the top of
tests/e2e/test_e2e_pipeline.py so they join the other imports; remove the
duplicate import statements from their current position (the lines currently
importing urllib.request and urllib.error) and run the linter to ensure import
grouping/order is correct (references: the urllib.request and urllib.error
imports).

198-200: text=True and universal_newlines=True are redundant.

These parameters are aliases for the same behavior. Having both is harmless but unnecessary.

🧹 Suggested fix
     process = subprocess.Popen(
         sync_command,
         stdout=subprocess.PIPE,
         stderr=subprocess.STDOUT,  # Merge stderr to monitor everything
         text=True,
         bufsize=1,
-        universal_newlines=True,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/test_e2e_pipeline.py` around lines 198 - 200, The call sets both
text=True and universal_newlines=True which are redundant aliases; remove
universal_newlines=True (keep text=True for clarity) in the subprocess
invocation (the call that also sets bufsize=1) so only one of these flags
remains.

11-13: Remove unused TIMESTAMP variable.

TIMESTAMP is defined here but never used—the test generates current_timestamp locally at line 102. This is dead code that should be removed.

🧹 Suggested fix
 # Configuration
 PHOENIX_URL = phoenix_settings.url
-# Use a session-scope timestamp or generate per test?
-# Per-test ensures no collisions even if run in parallel (though these should satisfy sequential)
-TIMESTAMP = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/test_e2e_pipeline.py` around lines 11 - 13, Remove the unused
top-level TIMESTAMP variable: delete the TIMESTAMP =
datetime.datetime.now().strftime("%Y%m%d_%H%M%S") declaration so the module no
longer contains dead code; tests already generate current_timestamp locally (see
current_timestamp usage in the test function), so no other changes are required.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/LOW_CODE_TRACING.md`:
- Around line 203-205: Replace the documented CLI invocation "python -m
evolve.cli" with the entry-point command "evolve" wherever it appears (e.g., the
example command lines showing "uv run python -m evolve.cli sync phoenix ...");
the project exposes an "evolve" script in pyproject.toml and there is no
evolve.cli __main__.py, so update those occurrences (also the other instance
around lines 212–213) to use "uv run evolve" instead.

In `@README.md`:
- Around line 121-131: Update the "Unit Tests (Default)" section to accurately
reflect pytest defaults: either state that `uv run pytest` runs the default
selection (`-m "not llm and not e2e"` and therefore includes
`platform_integrations`), or add/replace the unit-only example with `uv run
pytest -m unit`; edit the heading/content mentioning "Unit Tests (Default)" and
the example command strings so readers know the difference between the default
`uv run pytest` behavior and the explicit unit-only `uv run pytest -m unit`.

In `@tests/e2e/test_e2e_pipeline.py`:
- Around line 24-31: The phoenix_server fixture currently yields
"http://localhost:6006" but tests read PHOENIX_URL from phoenix_settings.url,
causing a possible mismatch; update the fixture phoenix_server so tests actually
use its URL by either removing autouse=True and having tests accept
phoenix_server as a parameter, or keep autouse and set phoenix_settings.url (or
PHOENIX_URL) to the yielded URL inside phoenix_server before yield, or update
tests to accept phoenix_server and use that returned URL instead of
phoenix_settings.url; locate the phoenix_server fixture and phoenix_settings.url
/ PHOENIX_URL references to implement one of these fixes consistently.

---

Nitpick comments:
In `@docs/LOW_CODE_TRACING.md`:
- Around line 249-250: Update the documented pytest commands (e.g., the
occurrences of "uv run pytest -m e2e --run-e2e -s" and other pytest command
examples in the same section) to include the -v flag so they read "uv run pytest
-v -m e2e --run-e2e -s" (and similarly add -v to the other pytest command
examples mentioned around the block) to match repository test-run conventions
and produce verbose test output.

In `@tests/e2e/test_e2e_pipeline.py`:
- Around line 21-22: Move the late imports urllib.request and urllib.error to
the module import block at the top of tests/e2e/test_e2e_pipeline.py so they
join the other imports; remove the duplicate import statements from their
current position (the lines currently importing urllib.request and urllib.error)
and run the linter to ensure import grouping/order is correct (references: the
urllib.request and urllib.error imports).
- Around line 198-200: The call sets both text=True and universal_newlines=True
which are redundant aliases; remove universal_newlines=True (keep text=True for
clarity) in the subprocess invocation (the call that also sets bufsize=1) so
only one of these flags remains.
- Around line 11-13: Remove the unused top-level TIMESTAMP variable: delete the
TIMESTAMP = datetime.datetime.now().strftime("%Y%m%d_%H%M%S") declaration so the
module no longer contains dead code; tests already generate current_timestamp
locally (see current_timestamp usage in the test function), so no other changes
are required.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1d353ec9-61a8-4fba-933d-d0573b8caf62

📥 Commits

Reviewing files that changed from the base of the PR and between 824e4d9 and 89b7d63.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • README.md
  • docs/LOW_CODE_TRACING.md
  • pyproject.toml
  • tests/conftest.py
  • tests/e2e/test_e2e_pipeline.py
  • tests/unit/test_cli.py
  • tests/unit/test_extract_trajectories.py
  • tests/unit/test_mcp_server.py
  • tests/unit/test_phoenix_sync.py
  • tests/unit/test_tracing.py
💤 Files with no reviewable changes (1)
  • tests/unit/test_cli.py

Comment thread docs/LOW_CODE_TRACING.md Outdated
Comment thread README.md Outdated
Comment thread tests/e2e/test_e2e_pipeline.py
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
README.md (1)

128-159: Tier count mismatch: "4 tiers" but 5 items listed.

Line 128 states "4 cleanly isolated tiers" but the numbered list contains 5 items. The "Default Local Suite" is a convenience combo (unit + platform_integrations), not a distinct tier.

Consider either:

  • Changing "4" to "5" if you want to count the default combo as its own tier, or
  • Restructuring to clarify that there are 4 markers (unit, platform_integrations, e2e, llm) and a default selection that combines two of them.
📝 Suggested fix
-The test suite is organized into 4 cleanly isolated tiers depending on infrastructure requirements:
+The test suite is organized into 4 pytest markers (`unit`, `platform_integrations`, `e2e`, `llm`) depending on infrastructure requirements:

 1. **Default Local Suite**
-   Runs both fast logic tests (`unit`) and filesystem script verifications (`platform_integrations`).
+   Runs both `unit` and `platform_integrations` tests (the default marker expression).
    ```bash
    uv run pytest
    ```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 128 - 159, The README header incorrectly says "4
cleanly isolated tiers" while the list shows five items; update the text to
either say "5 cleanly isolated tiers" or reword to state "4 primary tiers with a
default local suite that combines unit and platform_integrations" and clarify
the four markers as `unit`, `platform_integrations`, `e2e`, and `llm`; locate
the paragraph describing the test suite (the "Default Local Suite" and numbered
list) and change the count or rephrase to reflect that the default entry is a
convenience combo rather than an additional tier.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@README.md`:
- Around line 128-159: The README header incorrectly says "4 cleanly isolated
tiers" while the list shows five items; update the text to either say "5 cleanly
isolated tiers" or reword to state "4 primary tiers with a default local suite
that combines unit and platform_integrations" and clarify the four markers as
`unit`, `platform_integrations`, `e2e`, and `llm`; locate the paragraph
describing the test suite (the "Default Local Suite" and numbered list) and
change the count or rephrase to reflect that the default entry is a convenience
combo rather than an additional tier.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: df1fd93b-0ab3-4b87-a791-38e23bb7d13b

📥 Commits

Reviewing files that changed from the base of the PR and between 52f1ed3 and 5b43259.

📒 Files selected for processing (4)
  • README.md
  • README_phoenix_sync.md
  • docs/LOW_CODE_TRACING.md
  • tests/e2e/test_e2e_pipeline.py
✅ Files skipped from review due to trivial changes (1)
  • README_phoenix_sync.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/LOW_CODE_TRACING.md

@visahak visahak merged commit f2bc745 into AgentToolkit:main Apr 2, 2026
16 checks passed
@visahak visahak deleted the test/restructure-test-suite branch April 27, 2026 15:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants