test: restructure test suite into four distinct execution tiers#128
Conversation
Added psycopg[binary]>=3.1 and pgvector>=0.3 to dev dependency group to ensure all unit tests can run during development and CI. This fixes the test collection error for test_postgres_backend.py while keeping postgres support optional for end users (via the pgvector optional dependency group).
📝 WalkthroughWalkthroughReplaces the Changes
Sequence Diagram(s)sequenceDiagram
participant Runner as Test Runner
participant Conftest as pytest conftest
participant Phoenix as Phoenix Server
participant CLI as evolve CLI subprocess
participant Agent as Agent subprocess
participant Verifier as Trace verifier subprocess
Note over Runner,Conftest: Session start / flag parsing
Runner->>Conftest: parse flags (--run-e2e?)
Conftest-->>Runner: adjust markexpr (remove "not e2e" if set)
Runner->>Phoenix: request phoenix_server fixture
Phoenix->>Phoenix: probe http://localhost:6006/status
alt Phoenix up
Phoenix-->>Runner: yield base URL
else Phoenix down
Phoenix->>CLI: run "evolve sync phoenix ..." (subprocess)
CLI-->>Phoenix: emit stdout lines (polled, accumulated)
Phoenix->>CLI: poll for ready status until timeout
CLI-->>Phoenix: yield base URL or fail on timeout
end
Runner->>Agent: run agent subprocess (timeout 90s)
Agent-->>Runner: produce output / exit
Runner->>Verifier: run trace verification (timeout 30s)
Verifier-->>Runner: success/fail
Runner->>Phoenix: teardown (wait 10s, kill if needed)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (4)
docs/LOW_CODE_TRACING.md (1)
249-250: Add-vto the documented pytest commands for consistency with repo test-run conventions.This is a docs-quality consistency improvement for test output clarity.
Based on learnings: Run pytest verbosely with the `-v` flag by default when testing.Proposed docs tweak
-uv run pytest -m e2e --run-e2e -s +uv run pytest -v -m e2e --run-e2e -s -uv run pytest tests/e2e/test_e2e_pipeline.py -k smolagents -m e2e --run-e2e -s +uv run pytest -v tests/e2e/test_e2e_pipeline.py -k smolagents -m e2e --run-e2e -s -uv run pytest tests/e2e/test_e2e_pipeline.py -k openai_agents -m e2e --run-e2e -s +uv run pytest -v tests/e2e/test_e2e_pipeline.py -k openai_agents -m e2e --run-e2e -sAlso applies to: 258-262
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/LOW_CODE_TRACING.md` around lines 249 - 250, Update the documented pytest commands (e.g., the occurrences of "uv run pytest -m e2e --run-e2e -s" and other pytest command examples in the same section) to include the -v flag so they read "uv run pytest -v -m e2e --run-e2e -s" (and similarly add -v to the other pytest command examples mentioned around the block) to match repository test-run conventions and produce verbose test output.tests/e2e/test_e2e_pipeline.py (3)
21-22: Move imports to the top of the file.
urllib.requestandurllib.errorare imported after the configuration block. Per Python style conventions (and Ruff enforcement), all imports should be grouped at the top of the file.♻️ Suggested fix
Move these imports to join the other imports at the top:
import subprocess import time import re import os import datetime +import urllib.request +import urllib.error import pytest from evolve.config.phoenix import phoenix_settingsThen remove lines 21-22.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/test_e2e_pipeline.py` around lines 21 - 22, Move the late imports urllib.request and urllib.error to the module import block at the top of tests/e2e/test_e2e_pipeline.py so they join the other imports; remove the duplicate import statements from their current position (the lines currently importing urllib.request and urllib.error) and run the linter to ensure import grouping/order is correct (references: the urllib.request and urllib.error imports).
198-200:text=Trueanduniversal_newlines=Trueare redundant.These parameters are aliases for the same behavior. Having both is harmless but unnecessary.
🧹 Suggested fix
process = subprocess.Popen( sync_command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, # Merge stderr to monitor everything text=True, bufsize=1, - universal_newlines=True, )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/test_e2e_pipeline.py` around lines 198 - 200, The call sets both text=True and universal_newlines=True which are redundant aliases; remove universal_newlines=True (keep text=True for clarity) in the subprocess invocation (the call that also sets bufsize=1) so only one of these flags remains.
11-13: Remove unusedTIMESTAMPvariable.
TIMESTAMPis defined here but never used—the test generatescurrent_timestamplocally at line 102. This is dead code that should be removed.🧹 Suggested fix
# Configuration PHOENIX_URL = phoenix_settings.url -# Use a session-scope timestamp or generate per test? -# Per-test ensures no collisions even if run in parallel (though these should satisfy sequential) -TIMESTAMP = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/test_e2e_pipeline.py` around lines 11 - 13, Remove the unused top-level TIMESTAMP variable: delete the TIMESTAMP = datetime.datetime.now().strftime("%Y%m%d_%H%M%S") declaration so the module no longer contains dead code; tests already generate current_timestamp locally (see current_timestamp usage in the test function), so no other changes are required.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/LOW_CODE_TRACING.md`:
- Around line 203-205: Replace the documented CLI invocation "python -m
evolve.cli" with the entry-point command "evolve" wherever it appears (e.g., the
example command lines showing "uv run python -m evolve.cli sync phoenix ...");
the project exposes an "evolve" script in pyproject.toml and there is no
evolve.cli __main__.py, so update those occurrences (also the other instance
around lines 212–213) to use "uv run evolve" instead.
In `@README.md`:
- Around line 121-131: Update the "Unit Tests (Default)" section to accurately
reflect pytest defaults: either state that `uv run pytest` runs the default
selection (`-m "not llm and not e2e"` and therefore includes
`platform_integrations`), or add/replace the unit-only example with `uv run
pytest -m unit`; edit the heading/content mentioning "Unit Tests (Default)" and
the example command strings so readers know the difference between the default
`uv run pytest` behavior and the explicit unit-only `uv run pytest -m unit`.
In `@tests/e2e/test_e2e_pipeline.py`:
- Around line 24-31: The phoenix_server fixture currently yields
"http://localhost:6006" but tests read PHOENIX_URL from phoenix_settings.url,
causing a possible mismatch; update the fixture phoenix_server so tests actually
use its URL by either removing autouse=True and having tests accept
phoenix_server as a parameter, or keep autouse and set phoenix_settings.url (or
PHOENIX_URL) to the yielded URL inside phoenix_server before yield, or update
tests to accept phoenix_server and use that returned URL instead of
phoenix_settings.url; locate the phoenix_server fixture and phoenix_settings.url
/ PHOENIX_URL references to implement one of these fixes consistently.
---
Nitpick comments:
In `@docs/LOW_CODE_TRACING.md`:
- Around line 249-250: Update the documented pytest commands (e.g., the
occurrences of "uv run pytest -m e2e --run-e2e -s" and other pytest command
examples in the same section) to include the -v flag so they read "uv run pytest
-v -m e2e --run-e2e -s" (and similarly add -v to the other pytest command
examples mentioned around the block) to match repository test-run conventions
and produce verbose test output.
In `@tests/e2e/test_e2e_pipeline.py`:
- Around line 21-22: Move the late imports urllib.request and urllib.error to
the module import block at the top of tests/e2e/test_e2e_pipeline.py so they
join the other imports; remove the duplicate import statements from their
current position (the lines currently importing urllib.request and urllib.error)
and run the linter to ensure import grouping/order is correct (references: the
urllib.request and urllib.error imports).
- Around line 198-200: The call sets both text=True and universal_newlines=True
which are redundant aliases; remove universal_newlines=True (keep text=True for
clarity) in the subprocess invocation (the call that also sets bufsize=1) so
only one of these flags remains.
- Around line 11-13: Remove the unused top-level TIMESTAMP variable: delete the
TIMESTAMP = datetime.datetime.now().strftime("%Y%m%d_%H%M%S") declaration so the
module no longer contains dead code; tests already generate current_timestamp
locally (see current_timestamp usage in the test function), so no other changes
are required.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1d353ec9-61a8-4fba-933d-d0573b8caf62
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (10)
README.mddocs/LOW_CODE_TRACING.mdpyproject.tomltests/conftest.pytests/e2e/test_e2e_pipeline.pytests/unit/test_cli.pytests/unit/test_extract_trajectories.pytests/unit/test_mcp_server.pytests/unit/test_phoenix_sync.pytests/unit/test_tracing.py
💤 Files with no reviewable changes (1)
- tests/unit/test_cli.py
There was a problem hiding this comment.
🧹 Nitpick comments (1)
README.md (1)
128-159: Tier count mismatch: "4 tiers" but 5 items listed.Line 128 states "4 cleanly isolated tiers" but the numbered list contains 5 items. The "Default Local Suite" is a convenience combo (unit + platform_integrations), not a distinct tier.
Consider either:
- Changing "4" to "5" if you want to count the default combo as its own tier, or
- Restructuring to clarify that there are 4 markers (
unit,platform_integrations,e2e,llm) and a default selection that combines two of them.📝 Suggested fix
-The test suite is organized into 4 cleanly isolated tiers depending on infrastructure requirements: +The test suite is organized into 4 pytest markers (`unit`, `platform_integrations`, `e2e`, `llm`) depending on infrastructure requirements: 1. **Default Local Suite** - Runs both fast logic tests (`unit`) and filesystem script verifications (`platform_integrations`). + Runs both `unit` and `platform_integrations` tests (the default marker expression). ```bash uv run pytest ```🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` around lines 128 - 159, The README header incorrectly says "4 cleanly isolated tiers" while the list shows five items; update the text to either say "5 cleanly isolated tiers" or reword to state "4 primary tiers with a default local suite that combines unit and platform_integrations" and clarify the four markers as `unit`, `platform_integrations`, `e2e`, and `llm`; locate the paragraph describing the test suite (the "Default Local Suite" and numbered list) and change the count or rephrase to reflect that the default entry is a convenience combo rather than an additional tier.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@README.md`:
- Around line 128-159: The README header incorrectly says "4 cleanly isolated
tiers" while the list shows five items; update the text to either say "5 cleanly
isolated tiers" or reword to state "4 primary tiers with a default local suite
that combines unit and platform_integrations" and clarify the four markers as
`unit`, `platform_integrations`, `e2e`, and `llm`; locate the paragraph
describing the test suite (the "Default Local Suite" and numbered list) and
change the count or rephrase to reflect that the default entry is a convenience
combo rather than an additional tier.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: df1fd93b-0ab3-4b87-a791-38e23bb7d13b
📒 Files selected for processing (4)
README.mdREADME_phoenix_sync.mddocs/LOW_CODE_TRACING.mdtests/e2e/test_e2e_pipeline.py
✅ Files skipped from review due to trivial changes (1)
- README_phoenix_sync.md
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/LOW_CODE_TRACING.md
Background:
The existing test infrastructure suffered from "unknown mark" Pytest warnings, CLI timeout hangs during heavy testing, and a lack of explicit categorization between core logic vs. local infrastructure operations.
What Changed:
pyproject.toml(no more warnings):unit(246 tests) — Core logic + offline Phoenix testing.platform_integrations(18 tests) — Fast filesystem local installation tests.llm(62 tests) — Active remote LLM inference pipelines.e2e(12 tests) — Heavy infrastructure testing simulating full workflows.phoenix_serverfixture in the E2E suite to autonomously spin up and tear down a background observability server during tests.evolve syncCLI test "phantom hangs" by capturing and aggregating subprocess logs directly, ensuring error stack traces are printed correctly.README.mdanddocs/LOW_CODE_TRACING.mdto reflect the new architecture, removing the obsolete--run-phoenixexecution commands.Summary by CodeRabbit
Documentation
Chores
Tests