test(e2e): fix agent timeouts by delaying test execution until Phoenix tracing endpoints are fully bound#202
Conversation
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 19 minutes and 48 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughA 5-second delay was added to the e2e test pipeline immediately after the Phoenix Changes
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~2 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/e2e/test_e2e_pipeline.py (1)
28-56:⚠️ Potential issue | 🟠 MajorProbe OTLP readiness in both Phoenix paths to ensure trace receivers are bound before tests run.
The pre-existing server path (line 30) yields immediately after
/statussucceeds, while the locally started path (line 56) sleeps 5 seconds. Both paths have the same race:/statusresponds before the OTLP trace receiver endpoint is fully bound. Additionally, 5 seconds is arbitrary and can be insufficient on slow CI runners or wasteful on fast ones.Poll the actual OTLP HTTP endpoint (
/v1/traces) with a timeout-bounded retry loop in both paths. The documentation confirms the endpoint ishttp://localhost:6006/v1/traces.Example fix
+def _wait_for_phoenix_otlp_ready(base_url: str, timeout: float = 15.0) -> None: + deadline = time.monotonic() + timeout + last_error = None + + while time.monotonic() < deadline: + try: + request = urllib.request.Request( + f"{base_url}/v1/traces", + data=b'{"resourceSpans":[]}', + headers={"Content-Type": "application/json"}, + method="POST", + ) + with urllib.request.urlopen(request, timeout=2): + return + except urllib.error.HTTPError as exc: + if exc.code in {400, 405, 415}: + return + last_error = exc + except (urllib.error.URLError, ConnectionError) as exc: + last_error = exc + + time.sleep(0.25) + + pytest.fail(f"Phoenix OTLP receiver was not ready within {timeout}s: {last_error}") + + `@pytest.fixture`(scope="session", autouse=True) def phoenix_server(): @@ urllib.request.urlopen("http://localhost:6006/status", timeout=2) print("\nPhoenix is already running on port 6006.") + _wait_for_phoenix_otlp_ready("http://localhost:6006") yield "http://localhost:6006" return @@ urllib.request.urlopen("http://localhost:6006/status", timeout=2) print("Phoenix server is up and running.") - time.sleep(5) # Wait for OTLP receivers to fully spin up + _wait_for_phoenix_otlp_ready("http://localhost:6006") break🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/test_e2e_pipeline.py` around lines 28 - 56, The /status check can return before the OTLP trace receiver is bound; update both code paths that call urllib.request.urlopen("http://localhost:6006/status", ...) (the pre-existing-server path that yields immediately and the locally-started path that currently does time.sleep(5) after confirming /status) to instead poll the OTLP HTTP endpoint "http://localhost:6006/v1/traces" with a bounded retry loop and short per-request timeout, failing the test setup if the trace endpoint does not become responsive within the retry window; keep the existing max_retries/timeout semantics but replace the unconditional yield/sleep with the polling logic so both the pre-existing server and the proc-started server wait for OTLP readiness before proceeding.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@tests/e2e/test_e2e_pipeline.py`:
- Around line 28-56: The /status check can return before the OTLP trace receiver
is bound; update both code paths that call
urllib.request.urlopen("http://localhost:6006/status", ...) (the
pre-existing-server path that yields immediately and the locally-started path
that currently does time.sleep(5) after confirming /status) to instead poll the
OTLP HTTP endpoint "http://localhost:6006/v1/traces" with a bounded retry loop
and short per-request timeout, failing the test setup if the trace endpoint does
not become responsive within the retry window; keep the existing
max_retries/timeout semantics but replace the unconditional yield/sleep with the
polling logic so both the pre-existing server and the proc-started server wait
for OTLP readiness before proceeding.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4e65c71e-d2b9-473b-a07c-51a14ea0d1e7
📒 Files selected for processing (1)
tests/e2e/test_e2e_pipeline.py
7f910c1 to
fcd8127
Compare
|
is the e2e pipeline run in the CI? |
Context & Issue
The
e2etest pipeline was consistently timing out (and failing at the 90-second limit) specifically for thesmolagentstest, leading to cascading delays for developers executing local tests.Root Cause
A race condition occurred during the startup of the
phoenix_serverfixture. The tests aggressively proceeded the moment the Phoenix/statusweb endpoint reported a200 OK. However, the underlying OpenTelemetry tracing receivers (/v1/traces) weren't fully bound yet. Becausesmolagentsis the first agent to run, its rapid script completion (in around 1.8 seconds) resulted in immediate span exports that bounced with aConnection refusederror.To handle this, OpenTelemetry enters an aggressive asynchronous retry loop that blocked the thread from shutting down down natively, ultimately causing the script to hang until the overarching 90-second test suite boundary aggressively killed it. (Subsequent tests trivially succeeded because Phoenix had 90 extra seconds to start parsing endpoints while
smolagentswas frozen).Solution
Added a simple
time.sleep(5)immediately after detecting200 OKon/statusinsidetests/e2e/test_e2e_pipeline.py. This provides adequate breathing room for Phoenix to establish and expose its full set of gRPC/HTTP span endpoints prior tosmolagentsbeginning its trace injection. Validated fully that all 12 pipeline tests now complete efficiently.Summary by CodeRabbit