feat: Fleet client#256
Conversation
Darktex
left a comment
There was a problem hiding this comment.
Note: This is an automated review by Claude Code (alignment-reviewer agent), not a human review. The account posting this is shared with the human maintainer.
Now let me produce the comprehensive two-tier alignment review:
PR #256: Fleet Client - Alignment Review
Executive Summary
Tier 1 Status:
Tier 2 Status:
This PR introduces Fleet integration following RFC 003's MCP tool-call architecture. While the implementation demonstrates solid understanding of the dual-interface model (HTTP for orchestration, MCP for agent actions), there are critical issues that must be addressed before merge.
Tier 1: Bugs, Security, and Code Quality
🔴 Critical: Missing Dependency
Location: src/envs/fleet_env/client.py:20
Issue: The code imports HTTPEnvClient which doesn't exist in the codebase. Based on the existing pattern (EnvClient in src/openenv/core/env_client.py), this appears to be a non-existent module.
from openenv_core.http_env_client import HTTPEnvClientEvidence: Grep search for HTTPEnvClient shows no definition:
$ grep -r "class HTTPEnvClient" src/
# No resultsThe existing client is EnvClient (WebSocket-based), not HTTPEnvClient.
Impact: The code will fail at import time. This suggests the PR was not tested in the current environment structure.
Fix Required:
- Either implement
HTTPEnvClientfollowing existingEnvClientpatterns, OR - Refactor
FleetEnvClientto inherit fromEnvClientand use WebSocket, OR - Document why HTTP-only is required for Fleet and implement the base class
⚠️ Lint Check Failed
The automated lint check failed due to missing uv:
Error: 'uv' is not installed or not in PATH
Required Action: The PR author needs to run:
uv run ruff format src/envs/fleet_env/ examples/fleet_env_example.py tests/envs/test_fleet_env.py✅ Security: No Issues Found
- No credentials in code
- Proper authorization header handling
- No command injection risks
- Appropriate use of environment variables for API keys
✅ Debug Code: Acceptable
The check-debug.sh hook flagged print statements in examples/fleet_env_example.py, but these are acceptable as:
- It's in
examples/(notsrc/) - The prints are intentional user-facing output for the example
- They help users understand the flow
Tier 2: Alignment with Principles and Invariants
🚨 ALIGNMENT FLAG #1: Dual API Boundary Violation
Invariant at Risk: Architectural Invariants → Dual API boundary (INVARIANTS.md lines 41-58)
The Concern: FleetEnvClient.step() explicitly rejects MCP tool actions:
# src/envs/fleet_env/client.py:125-130
def step(self, action: Action) -> StepResult[Observation]:
# Enforce separation: agent actions are MCP-only (use FleetMCPTools).
if isinstance(action, (ListToolsAction, CallToolAction)):
raise TypeError(
"Agent tool actions are MCP-only. Use FleetMCPTools.list_tools()/call_tool()."
)Why This Violates the Invariant:
Per RFC 001 and INVARIANTS.md, the dual API boundary is:
- Agent Interface (MCP): Tools the agent uses
- Infrastructure Interface (Gym-like):
reset(),step(),state()for orchestration
However, RFC 003 explicitly defines ListToolsAction and CallToolAction as Gym-style actions that wrap MCP calls:
# From RFC 003, lines 476-486
@dataclass
class ListToolsAction(Action):
"""Request list of available tools from MCP servers."""
pass
@dataclass
class CallToolAction(Action):
"""Call a specific tool via MCP."""
tool_name: str
parameters: Dict[str, Any]The Design Intent (RFC 003, line 90):
# Training/Eval: Use step API
POST http://localhost:8000/step
{"action": {...}}
# Production/Inference: Use MCP API
POST http://localhost:8000/mcp
{"jsonrpc": "2.0", "method": "tools/list", "id": 1}The RFC shows both paths are valid: step-based (training) and direct MCP (inference).
The Problem: Fleet client forces users to use FleetMCPTools (direct MCP) and blocks the Gym-style action interface. This creates an environment that cannot be used in standard RL training loops that expect env.step(ListToolsAction()) to work.
Suggested Resolution:
-
Option A (Recommended): Support both interfaces as RFC 003 intends:
env.step(ListToolsAction())→ calls MCP internally, returns observationtools.list_tools()→ direct MCP access for inference
-
Option B: If Fleet truly cannot support Gym-style actions, document this as a known limitation and explain why Fleet environments are "inference-only" (violating OpenEnv's principle of identical training/production interfaces).
Reviewer: @Darktex (RFC 003 author)
🚨 ALIGNMENT FLAG #2: Missing Reward Computation
Principle at Risk: "Rewards inside environment" (PRINCIPLES.md line 31, RFC 002)
The Concern: FleetEnvClient._parse_result() constructs observations but doesn't compute rewards:
# src/envs/fleet_env/client.py:108-115
return StepResult(
observation=Observation(
metadata=obs_payload,
reward=payload.get("reward"), # ← Passthrough, not computed
done=payload.get("done", False),
),
reward=payload.get("reward"),
done=payload.get("done", False),
)Why This Matters:
Per PRINCIPLES.md (line 31) and INVARIANTS.md (lines 64-67):
Rewards in environment: Reward computation must stay inside environment boundary
The client is simply passing through whatever payload.get("reward") returns from Fleet. If Fleet doesn't compute rewards (or computes them incorrectly for the task), there's no mechanism to override or augment.
Questions for Clarification:
- Does the Fleet server compute task-specific rewards?
- If not, where should reward computation happen for Fleet environments?
- Should
FleetEnvClientsupport a reward transform pipeline?
Current State: The README (src/envs/fleet_env/README.md) doesn't mention reward computation at all, suggesting this is unresolved.
Suggested Resolution:
- Document Fleet's reward computation model in the README
- If Fleet doesn't compute rewards, add a
reward_fnparameter toFleetEnvClient.__init__() - Consider whether Fleet environments are suitable for RL training without reward computation
Reviewer: @Darktex
⚠️ Minor: Client-Server Separation
Pattern Compliance: The code properly separates client and server concerns:
- ✅
models.pydefines shared types - ✅ No client imports from
server/ - ✅ No server directory (Fleet is external)
However, there's a structural inconsistency: Traditional OpenEnv environments have:
my_env/
├── client.py
├── models.py
└── server/
Fleet has:
fleet_env/
├── client.py
├── mcp_tools.py
├── fleet_mcp_client.py
├── models.py
└── README.md # No server/ because it's remote
Recommendation: Add a comment in __init__.py explaining why there's no server/ directory:
"""Fleet Environment - client-side adapter for Fleet-hosted MCP environments.
Note: Unlike standard OpenEnv environments, fleet_env has no server/ directory
because the environment runs remotely on Fleet infrastructure. This package
only provides client-side adapters.
"""✅ Strengths
- Excellent documentation: The README is comprehensive and well-structured
- Type safety: Proper use of dataclasses and type hints
- Error handling: Good separation of discovery vs execution errors in MCP tools
- Testing: Thoughtful use of fakes to avoid external dependencies
- Dual-plane awareness: The code clearly understands orchestration vs agent boundaries
Summary of Required Actions
Before Merge (Blockers):
-
Fix missing
HTTPEnvClient(Tier 1 bug)- Either implement it or refactor to use
EnvClient
- Either implement it or refactor to use
-
Resolve ALIGNMENT FLAG #1 (Dual API boundary)
- Decide: Support both Gym-style actions AND direct MCP, or document as inference-only
-
Run lint checks
- Execute:
uv run ruff format ...
- Execute:
Strongly Recommended:
-
Resolve ALIGNMENT FLAG #2 (Reward computation)
- Document Fleet's reward model or implement client-side rewards
-
Add architectural note to
__init__.pyabout missingserver/
Test Results
The test file (tests/envs/test_fleet_env.py) uses appropriate faking strategies and covers:
- ✅ Provisioning from Fleet
- ✅ HTTP orchestration (reset)
- ✅ MCP tool discovery
- ✅ Tool routing across multiple endpoints
However, the tests will fail due to the missing HTTPEnvClient import.
Conclusion
This PR demonstrates strong architectural understanding and solid engineering practices. The two critical alignment flags stem from genuine design tensions between:
- RFC 003's vision of unified Gym+MCP actions
- Fleet's separate HTTP/MCP interface design
These tensions need human review to determine the right path forward. Once resolved, this will be a valuable addition to OpenEnv's runtime ecosystem.
Recommendation: Request clarification from RFC authors before merging.
Automated review by Claude Code | Learn more about OpenEnv's agentic workflow
- FleetTaskEnv wraps FleetEnvClient with task-oriented interface - Accepts task configs from export_training_tasks.py - Creates versioned environments on reset - Injects task prompt into observations - Executes verifier for reward computation on episode completion - Supports both sync and async step methods - Factory functions: make_fleet_task_env, from_json_file - Tests: 20 unit tests for init, specs, verifiers, factories 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Greptile SummaryThis PR implements Fleet runtime integration for OpenEnv, enabling remote environment execution without local Docker. The implementation maintains a clean separation between HTTP orchestration ( Key additions:
Critical issues:
Confidence Score: 3/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant Agent as Agent/Policy
participant Client as FleetEnvClient<br/>(HTTP Orchestration)
participant Tools as FleetMCPTools<br/>(MCP Agent Actions)
participant Fleet as Fleet Runtime<br/>(Remote)
participant MCP1 as MCP Endpoint 1<br/>(api/v1/mcp)
participant MCP2 as MCP Endpoint 2<br/>(mcp)
Note over Agent,Fleet: 1. Provision Environment
Agent->>Client: FleetEnvClient.from_fleet(api_key, env_key)
Client->>Fleet: Fleet.make(env_key, image_type="mcp")
Fleet-->>Client: env handle + URLs
Client->>Client: Create FleetMCPTools with MCP URLs
Client-->>Agent: (orch_client, mcp_tools)
Note over Agent,Fleet: 2. Reset Episode (HTTP)
Agent->>Client: reset()
Client->>Fleet: POST /reset
Fleet-->>Client: initial observation
Client-->>Agent: observation
Note over Agent,MCP2: 3. Discover Tools (MCP)
Agent->>Tools: list_tools()
Tools->>MCP1: list_tools() via streamable HTTP
MCP1-->>Tools: tools list A
Tools->>MCP2: list_tools() via streamable HTTP
MCP2-->>Tools: tools list B
Tools->>Tools: Union tools + convert to OpenAI format
Tools-->>Agent: ListToolsAction(tools)
Note over Agent,MCP2: 4. Execute Tool (MCP)
Agent->>Tools: call_tool(name, args)
Tools->>Tools: Check owner cache for tool
Tools->>MCP1: call_tool(name, args) via streamable HTTP
MCP1-->>Tools: tool result
Tools-->>Agent: result
Note over Agent,Fleet: 5. Cleanup
Agent->>Client: close()
Client->>Fleet: env.close()
Fleet-->>Client: terminated
|
| namespace = {} | ||
|
|
||
| # Execute the verifier code to define the function | ||
| exec(verifier_code, namespace) |
There was a problem hiding this comment.
logic: Using exec() on arbitrary code strings is a significant security vulnerability. This allows execution of untrusted Python code without sandboxing.
| exec(verifier_code, namespace) | |
| # SECURITY: exec() allows arbitrary code execution | |
| # Consider using ast.literal_eval for safe data structures | |
| # or a sandboxed execution environment like RestrictedPython | |
| exec(verifier_code, namespace) |
Is the verifier_code always from trusted sources, or could it come from user-provided task configs?
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/envs/fleet_env/task_env.py
Line: 301:301
Comment:
**logic:** Using `exec()` on arbitrary code strings is a significant security vulnerability. This allows execution of untrusted Python code without sandboxing.
```suggestion
# SECURITY: exec() allows arbitrary code execution
# Consider using ast.literal_eval for safe data structures
# or a sandboxed execution environment like RestrictedPython
exec(verifier_code, namespace)
```
Is the verifier_code always from trusted sources, or could it come from user-provided task configs?
How can I resolve this? If you propose a fix, please make it concise.| except BaseException: # noqa: BLE001 | ||
| continue |
There was a problem hiding this comment.
logic: BaseException catches system exits and keyboard interrupts. Use Exception instead to allow proper cleanup and termination signals.
| except BaseException: # noqa: BLE001 | |
| continue | |
| except Exception: # noqa: S110 | |
| continue |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/envs/fleet_env/mcp_tools.py
Line: 57:58
Comment:
**logic:** `BaseException` catches system exits and keyboard interrupts. Use `Exception` instead to allow proper cleanup and termination signals.
```suggestion
except Exception: # noqa: S110
continue
```
How can I resolve this? If you propose a fix, please make it concise.| except BaseException: | ||
| # Only suppress discovery/connection errors. | ||
| # If call_tool raised, it would have bubbled up above. | ||
| continue |
There was a problem hiding this comment.
logic: BaseException catches system exits and keyboard interrupts. Use Exception instead.
| except BaseException: | |
| # Only suppress discovery/connection errors. | |
| # If call_tool raised, it would have bubbled up above. | |
| continue | |
| except Exception: | |
| # Only suppress discovery/connection errors. | |
| # If call_tool raised, it would have bubbled up above. | |
| continue |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/envs/fleet_env/mcp_tools.py
Line: 76:79
Comment:
**logic:** `BaseException` catches system exits and keyboard interrupts. Use `Exception` instead.
```suggestion
except Exception:
# Only suppress discovery/connection errors.
# If call_tool raised, it would have bubbled up above.
continue
```
How can I resolve this? If you propose a fix, please make it concise.| env_key=env_spec, | ||
| data_key=data_spec, | ||
| ttl_seconds=self.ttl_seconds, |
There was a problem hiding this comment.
logic: data_key parameter not passed to from_fleet() - the built data_spec is unused.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/envs/fleet_env/task_env.py
Line: 138:140
Comment:
**logic:** `data_key` parameter not passed to `from_fleet()` - the built `data_spec` is unused.
How can I resolve this? If you propose a fix, please make it concise.| if self._fleet_env: | ||
| self._fleet_env.close() |
There was a problem hiding this comment.
style: Check self._fleet_env is not None instead of truthiness since the handle could be falsy but still valid.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/envs/fleet_env/client.py
Line: 136:137
Comment:
**style:** Check `self._fleet_env is not None` instead of truthiness since the handle could be falsy but still valid.
How can I resolve this? If you propose a fix, please make it concise.| import asyncio | ||
| return asyncio.get_event_loop().run_until_complete(self.step_async(action)) |
There was a problem hiding this comment.
style: Calling asyncio.get_event_loop().run_until_complete() can fail if an event loop is already running. Consider documenting that step() should only be used from sync code, or handle the RuntimeError.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/envs/fleet_env/task_env.py
Line: 191:192
Comment:
**style:** Calling `asyncio.get_event_loop().run_until_complete()` can fail if an event loop is already running. Consider documenting that `step()` should only be used from sync code, or handle the `RuntimeError`.
How can I resolve this? If you propose a fix, please make it concise.| try: | ||
| self._orch.close() | ||
| except Exception: | ||
| pass |
There was a problem hiding this comment.
style: Silently catching all exceptions during cleanup can hide issues. Consider logging the exception.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/envs/fleet_env/task_env.py
Line: 324:327
Comment:
**style:** Silently catching all exceptions during cleanup can hide issues. Consider logging the exception.
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.The MCP images don't exist for all environment versions, causing FleetVersionNotFoundError when trying to create environments. Changing the default to None allows the Fleet SDK to use standard images which are available for all versions. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
FleetEnvClient.from_fleet() was not accepting data_key/data_version parameters, causing them to be passed through **kwargs to HTTPEnvClient which doesn't accept them. - Add data_key and data_version as explicit parameters - Pass them to fleet.make() - Update task_env.py to pass them separately 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fleet SDK expects data_key in "key:version" format, not as separate parameters. Updated from_fleet() to combine them before calling fleet.make(). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
HTTPEnvClient.reset() doesn't support seed parameter yet. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Increases default timeout from 15s to 60s for Fleet API calls. This prevents timeouts during environment initialization. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously reset() did partial work and reset_async() added tool fetching. Now reset_async() does all the work (including fetching tools) and reset() is just a sync wrapper that calls it via run_until_complete(). This ensures both methods return identical results including tools. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
MCP's call_tool() returns a CallToolResult Pydantic object, not plain text. This was causing ugly repr strings to be passed to agents like: "meta=None content=[TextContent(type='text', text='...')] ..." Now properly extracts: - Text content from result.content[].text - Tries JSON parsing for structured results - Falls back to structuredContent if available - Handles isError cases 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tests for: - FleetMCPClient._extract_tool_result(): - Single text content extraction - JSON parsing from text - Multiple text contents - Error result handling - Structured content fallback - Empty result handling - FleetTaskEnv reset: - reset_async() returns tools - reset() calls reset_async() (sync wrapper) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move fleet.make() and list_tools() into FleetTaskEnv.__init__() - Tools are now fetched at env creation, not during reset - reset_async() calls _orch.reset() with error handling, returns cached tools - Use asyncio.run() for Python 3.13 compatibility - Update tests for new initialization pattern 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Log task_key and verifier code preview when verifier fails - Catch syntax errors separately with clear message - Show which functions were found if 'verify' is missing Helps debug issues like "Verifier code must define a 'verify' function" 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace custom _execute_verifier_local() with Fleet SDK's Task.verify_detailed() which properly sets up the verifier namespace with: - Environment type annotation - Helper functions (normalized_contains, etc.) - Proper function discovery (not just "verify" function) This fixes "name 'Environment' is not defined" errors during verifier execution. Changes: - _compute_reward: Create Fleet SDK Task and call verify_detailed() - Support both 'verifier_code' and 'verifier_func' field names - Add comprehensive logging for debugging - Remove broken _execute_verifier_local method Tests: - Update all verifier tests to mock Fleet SDK Task.verify_detailed() - Add tests for various edge cases (no verifier, no orch, exceptions) - Fix fixture to avoid asyncio.run() conflicts with pytest-asyncio 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add retry with exponential backoff (3 attempts, 1s/2s/4s delays) - Log errors instead of silently swallowing exceptions - Log warning when some clients fail but others succeed - Log error after all retries exhausted This fixes silent failures when MCP connections are flaky, which caused 'no tools found' errors in SkyRL training.
call_tool now retries with exponential backoff (3 attempts, 1s/2s/4s) on connection errors, similar to list_tools. ValueError (tool not found) is not retried. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
SkyRL can end trajectories early (context overflow, its own max_turns) without OpenEnv knowing. Previously the model got 0 reward even if the task was completed. Now close_async()/close() run the verifier when step_async() never computed a reward, storing the result in self.final_reward for SkyRL to read. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: Run verifier at close() for orphaned rollouts
Carlisle tasks (354 total, 8 in eval) require models to call submit_final_answer to submit results, but this tool is a harness-level synthetic injected by the orchestrator's SessionWorkflow, not an MCP tool. OpenEnv connects directly to MCP servers, so the tool was missing — causing 0% scores across all carlisle tasks in training. Changes: - Inject submit_final_answer into tool list when prompt references it - Intercept calls locally (not routed to MCP), store the answer - Pass final_answer to verifier via Fleet SDK's verify_detailed() - Run verifier in close()/close_async() for orphaned rollouts - Add unit tests for the synthetic tool Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: Add submit_final_answer synthetic tool for carlisle tasks
Runs k × m rollouts of generated tasks on Fleet environments. Given (prompt, verifier_code, env_key), creates FleetTaskEnv instances, runs agent loops with model inference, and returns structured results for reward computation (learnability variance + model separation). Used as the inner loop of the task-scaling RL pipeline in SkyRL. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of calling Anthropic directly and running a local agent loop, the evaluator now: 1. Imports the generated task via fleet.import_task() 2. Creates a harness job via fleet.create_job() 3. Polls for completion 4. Extracts per-session verifier scores from job sessions Uses real Fleet model IDs (claude-sonnet-4.5, claude-opus-4.5) instead of the broken weak/strong mapping that required ANTHROPIC_API_KEY. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fleet harness POST /v1/jobs requires model IDs in 'provider/model' format (e.g., 'anthropic/claude-sonnet-4.5'), not just the model name. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The sync time.sleep() in _poll_job blocked the asyncio event loop, preventing trajectory timeouts from cancelling evaluations. Using asyncio.sleep() allows the event loop to properly handle cancellations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fleet returns session model IDs without provider prefix (e.g., "claude-sonnet-4.5")
while we configure them with prefix ("anthropic/claude-sonnet-4.5"). Added
_match_model_id() to normalize and match by bare model name, so scores land
in the correct results_per_model bucket.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: Add TaskEvaluator for task generation inner loop
Add _tool_errors, _verifier_stdout, _verifier_error to FleetTaskEnv so SkyRL can build hints from failed rollout feedback without an LLM call. - Tool errors accumulated in step_async() on MCP errors and exceptions - Verifier stdout/error captured in _compute_reward() after verifier runs - Verifier exceptions also captured in _verifier_error (not just failures) - All feedback properties reset in reset_async() to prevent cross-episode leakage - Properties: verifier_stdout, verifier_error, tool_errors_list Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Expose verifier feedback properties for hint generation
Add describe_db() and query_db() methods (sync + async) to FleetEnvClient, delegating to the Fleet SDK's SQLiteResource. This enables querying provisioned environment databases (seed/current) from outside the container via HTTP, which is needed for task generation workflows where the model explores DB data before designing tasks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
describe_db_async/query_db_async were wrapping sync describe_db/query_db in asyncio.to_thread, but when _fleet_env is an AsyncEnv (from AsyncFleet.make), .db().describe() and .query() are async coroutines. Now detects whether the resource method is a coroutine and awaits directly instead of using to_thread. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the Fleet MCP server returns {"base64_image": null} (failed screenshot),
_extract_tool_result would create an image_url block with url=None, causing
downstream AttributeError in SkyRL's extract_images_from_conversation.
Now returns a text error message instead of propagating the null.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
browser_use tasks now get the same treatment as computer_use: - MCP image type (triggers browser sidecar, no MCP image needed) - computer tool only (screenshot, click, type, scroll) - initial screenshot on reset - 1800s TTL Follows platform change: browsers are now sidecars on base images, MCP images are deprecated. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
browser_use tasks now create a standard Fleet env (web UI only) plus a separate browser lease via the Fleet browser API. The browser navigates to the env's web UI and provides the computer tool (screenshot, click, type, scroll) via its MCP endpoint. Flow: Fleet env (standard) → browser lease → navigate → healthcheck → use browser MCP for computer tool. On close, delete the lease. MCP images are deprecated — browsers are now sidecars on base images. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FLEET_API_KEY is not accepted by the browser lease API (401). Use the shared fallback token from theseus api_token_config.py, same as the harness uses. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both create_trace_job and upload_trace were hitting orchestrator.fleetai.com — that host runs the env-instance API, not the trace API, so every prior call silently 404'd (the "Failed to create Fleet trace job — Not Found" warning we'd been ignoring). Verified by hand against the live OpenAPI spec on 2026-06-16: POST https://api.internal.fleet-platform.fleetai.com/v1/traces/jobs POST https://api.internal.fleet-platform.fleetai.com/v1/traces/logs with Authorization: Bearer <api_key>. The previous SDK call also used X-API-Key, which the right host responds to with "Authentication required. Provide either API key or JWT token with team ID"; Bearer auth works for both the research and team keys. The /v1/traces/logs body schema also moved from the legacy {"messages": [{"role": "...", "content": "..."}]} shape to {"history": [{"role": "...", "parts": [{"type": "text"|"image", ...}]}]}. upload_trace now flattens OpenAI-style content arrays into the new parts shape so image blocks still surface as inline screenshots in the UI. End-to-end probe with both functions returns a real job_id and session_id (job 5c3831b6-..., session 544bab3a-...). Existing callers (SkyRL trainer.py, main_fleet_tinker.py) need no changes; their try/except wrappers swallow the old failure and will start uploading after this fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Darktex
left a comment
There was a problem hiding this comment.
Note: This is an automated review by Claude Code, not a human review.
Review: Fleet client (PR #256)
This PR has several blocking issues before merge: (1) the pyproject.toml diff silently drops asyncio_mode = "auto" and the pytest markers table, which will break every existing async test in the repo; (2) three basic tests and the quickstart example all call FleetEnvClient.from_fleet() missing three required parameters and will TypeError on execution; (3) a UUID token that appears to be a real service credential is hardcoded as a fallback in browser_lease.py; and (4) httpx/anyio are used but not declared in the [fleet] optional dependencies.
Tier 1 — Bugs / correctness (must fix)
pyproject.toml: new[tool.pytest.ini_options]section dropsasyncio_mode = "auto",asyncio_default_fixture_loop_scope, and themarkerstable — will break existing async tests across the repo. Merge keys into one section.tests/envs/test_fleet_env.py:83,92,105andexamples/fleet_env_example.py:82: callfrom_fleet()without the requireddata_key,data_version,image_typeparams →TypeErrorat runtime. Add defaults or pass them.src/envs/fleet_env/browser_lease.py:86:_BROWSER_API_TOKEN_FALLBACKhardcodes a UUID token — security invariant violation (secrets must never be committed). Raise/point at the env var instead, or document explicitly if it's a disposable public token.pyproject.toml:httpx(top-level import inbrowser_lease.py) andanyio(used by@pytest.mark.anyiotests) are missing from the[fleet]extras →ImportErroron install.tests/envs/test_fleet_env.py:142:FleetMCPToolstest doesn't setinitial_wait=0, causing an unmocked 8-second sleep in CI.src/envs/fleet_env/models.py:38-42:normalize_schemasilently drops non-null alternatives inanyOf(takes only[0]) — lossy OpenAI schema conversion for union-typed params.src/envs/fleet_env/client.py:369-375:_step_payloadrejects PydanticBaseModelactions (only dataclass/dict) — breaks the repo-wide Pydantic wire-type convention; add amodel_dump()branch.
Tier 2 — Alignment (needs human decision, suggest @Darktex)
- Directory layout:
fleet_envlives undersrc/envs/(bundled into core dist) rather than top-levelenvs/<name>/, and has noopenenv.yaml/pyproject.toml/server//Dockerfile. Deviates from the canonical env scaffolding contract — intentional (client-only lib) but worth an explicit decision. - Rewards inside environment:
FleetTaskEnv._compute_reward()runs the verifier client-side on the orchestrator, not inside a server-sideEnvironment.step(). Deviates from the "rewards inside environment" invariant; no RFC covers the trade-off. - Protocol: extends
HTTPEnvClientwhile HTTP is being deprecated in favor of WebSocket (PR #252). Compatibility should be re-evaluated if/when #252 lands. - Type generics:
FleetTaskEnvdoesn't subclassEnvClient[ActT, ObsT, StateT]; uses raw dicts for reset/step and never exposesstate— diverges from the Gymnasium typed-interface invariant. PR_README.mdcommitted to repo root: belongs in the GitHub PR body; its TODOs should become tracked issues.
Automated review by Claude Code | Learn more
Empirically verified on 2026-06-16: Fleet's /v1/job-session-groups
aggregation reads score ONLY from sessions that have a linked
verifier_execution_id; the inline score field alone is stored but
ignored. Probe sequence:
- POST /v1/traces/logs {score: 1.0} → success, but avg_score in the
job-session-group response stays null
- POST /v1/traces/logs {score: 0.0, verifier_execution_id: <real id>}
→ group's avg_score flips from null to 0.0
So:
- FleetTaskEnv._compute_reward now stashes response.execution_id on
self._last_verifier_execution_id after every verify_detailed call.
- upload_trace gains a verifier_execution_id kwarg; forwards it to
/v1/traces/logs when present.
The SkyRL env wrapper (skyrl-fleet) reads the id from the OpenEnv handle
and threads it into upload_trace — covered by a companion commit there.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sing exec_id Root cause for all Tinker trace uploads landing with verifier_execution_id=null (symptom: Fleet dashboard showed "no grading runs"): the SDK's sync verify_detailed() spins up its own loop via asyncio.run() inside the asyncio.to_thread() worker, which collides with Event objects bound to the main loop. The verifier never completed server-side and the response we caught at except: was the loop-binding RuntimeError, so _last_verifier_execution_id was never set and the trace uploader sent no exec_id. Switching to fleet._async.tasks.Task and awaiting verify_detailed_async() in the harness's event loop makes the SDK use the async env's HTTP client directly, no thread-pool, no loop creation. Also: - 3-attempt retry with exponential backoff (3s, 6s) on 502/503/Bad Gateway/timeout from the env-instance reverse proxy. Fresh instances occasionally return 502 before they finish warming up; the next attempt succeeds. - Warn (not silently None) when execution_id comes back empty after a verifier call. Future regressions of the same symptom will be visible in logs. Verified against a real v6 bureau-tool_use task: returns success=True with populated exec_id, and trace upload links cleanly in the Fleet dashboard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Good to see this is rebooted. @adrwz lmk when to review. |
PR: Fleet environments (OpenEnv)
This PR documents and refines the Fleet runtime integration for OpenEnv.
What this enables
reset / step / statetools/list + tools/callWhat this is not
Main abstractions
FleetEnvClient(HTTP): orchestrator handle for reset/step/state.FleetMCPTools(MCP): agent handle for listing/calling tools.api/v1/mcpandmcp)convert_tool_format)Quickstart
pip install "openenv-core[fleet]"export FLEET_API_KEY="..."python examples/fleet_env_example.py <env_key>Plea see the README for a deeper discussion and TODOs.