[FEATURE] WebSocket-based Concurrency Architecture #239

Merged
burtenshaw merged 45 commits into huggingface:release from rycerzes:impl/concurrency
Dec 18, 2025

Conversation

@rycerzes

@rycerzes rycerzes commented Dec 7, 2025

Contributor

Add WebSocket support with concurrent session management

Adds WebSocket endpoints for persistent environment sessions with configurable concurrency limits #194

High-level Diff

These are the changes on the server side:

- env = MyEnvironment()
  app = create_app(
-     env,
+     MyEnvironment,              # Pass class (factory), not instance
      MyAction,
      MyObservation,
+     max_concurrent_envs=4,      # Allow 4 concurrent WebSocket sessions
  )

On the client side, it requires a change of URL scheme:

from envs.echo_env import EchoEnv, EchoAction

+ client = EchoEnv(base_url="ws://localhost:8000/ws")
- client = EchoEnv(base_url="http://localhost:8000")

result = client.reset()
result = client.step(EchoAction(message="Hello!"))

# or async with
+ result = await client.reset()
+ result = await client.step(EchoAction(message="Hello!"))

This leads to high concurrency with limited resources:

image

Changes

  • WebSocket endpoint at /ws with message protocol for reset/step/state/close
  • Factory pattern support: pass environment class instead of instance to create per-session environments.
  • ConcurrencyConfig for setting max concurrent sessions and session timeout.
  • SUPPORTS_CONCURRENT_SESSIONS flag on environments (defaults to False) with startup validation.
  • Session capacity tracking and error handling.
  • New client: WebSocketEnvClient for persistent connections.
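
As a rough illustration of the session capacity tracking listed above, here is a minimal stdlib-only sketch. `SessionRegistry` and its methods are hypothetical names and not the PR's implementation; only `SessionCapacityError` and `max_concurrent_envs` are taken from this thread.

```python
import threading
import uuid


class SessionCapacityError(RuntimeError):
    """Raised when a new session would exceed the configured limit."""


class SessionRegistry:
    """Hypothetical tracker: one environment per session, bounded by a limit."""

    def __init__(self, env_factory, max_concurrent_envs=1):
        self._env_factory = env_factory
        self._max = max_concurrent_envs
        self._sessions = {}
        self._lock = threading.Lock()

    def open(self):
        """Create an environment for a new session and return its id."""
        with self._lock:
            if len(self._sessions) >= self._max:
                raise SessionCapacityError(
                    f"{len(self._sessions)}/{self._max} sessions in use"
                )
            session_id = uuid.uuid4().hex
            self._sessions[session_id] = self._env_factory()
            return session_id

    def close(self, session_id):
        """Release the slot; unknown ids are ignored."""
        with self._lock:
            self._sessions.pop(session_id, None)

    @property
    def active(self):
        with self._lock:
            return len(self._sessions)


registry = SessionRegistry(env_factory=dict, max_concurrent_envs=2)
first = registry.open()
second = registry.open()
try:
    registry.open()          # third session while two are active
    rejected = False
except SessionCapacityError:
    rejected = True
registry.close(first)
third = registry.open()      # a slot is free again
```

Rejecting the connection outright is the simplest policy; queueing waiting connections is listed under future work in this description.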

API

New types:

  • ConcurrencyConfig(max_concurrent_envs, session_timeout)
  • SessionInfo and ServerCapacityStatus for session metadata.
  • WebSocket message types: WSIncomingMessage (discriminated union of WSResetMessage, WSStepMessage, WSStateMessage, WSCloseMessage).
  • Response types: WSObservationResponse, WSStateResponse, WSErrorResponse.
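
To make the discriminated-union idea concrete: each incoming message carries a `type` field, and that field alone selects which message class to build. The PR does this with Pydantic; the sketch below swaps in plain dataclasses and a lookup table so that it runs with the standard library only. The `action` field and the exact `type` values are assumptions, not the PR's schema.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Union


@dataclass
class WSResetMessage:
    type: str = "reset"


@dataclass
class WSStepMessage:
    action: Dict[str, Any] = field(default_factory=dict)  # assumed payload field
    type: str = "step"


@dataclass
class WSStateMessage:
    type: str = "state"


@dataclass
class WSCloseMessage:
    type: str = "close"


WSIncomingMessage = Union[WSResetMessage, WSStepMessage, WSStateMessage, WSCloseMessage]

# The "discriminator": the value of `type` picks the class to instantiate.
_MESSAGE_TYPES = {
    "reset": WSResetMessage,
    "step": WSStepMessage,
    "state": WSStateMessage,
    "close": WSCloseMessage,
}


def parse_incoming(raw: str) -> WSIncomingMessage:
    """Parse a JSON text frame into the matching message class."""
    data = json.loads(raw)
    try:
        message_cls = _MESSAGE_TYPES[data["type"]]
    except KeyError:
        raise ValueError(f"unknown or missing message type: {data.get('type')!r}")
    payload = {k: v for k, v in data.items() if k != "type"}
    return message_cls(**payload)


step = parse_incoming('{"type": "step", "action": {"message": "Hello!"}}')
reset = parse_incoming('{"type": "reset"}')
```

Dispatching on a single field keeps validation errors specific: an unknown `type` fails immediately instead of being tried against every message shape.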

Usage:

class MyEnvironment(Environment):
    # Must be set to True to allow max_concurrent_envs > 1
    SUPPORTS_CONCURRENT_SESSIONS = True 
    
    # ... implementation ...

# Factory mode for concurrent sessions
app = create_app(
    env=MyEnvironment,  # Pass class, not instance
    action_cls=MyAction,
    observation_cls=MyObservation,
    max_concurrent_envs=4
)

Defaults to max_concurrent_envs=1 for backward compatibility. Environments must set SUPPORTS_CONCURRENT_SESSIONS=True to allow higher concurrency.
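
The startup validation could look roughly like the sketch below. `validate_concurrency` is a hypothetical helper written for illustration, and the `Environment` base class here is a toy stand-in; the flag name, the default of 1, and `ConcurrencyConfigurationError` come from this PR.

```python
class ConcurrencyConfigurationError(ValueError):
    """Raised when the requested concurrency is not supported by the environment."""


class Environment:
    # Conservative default: one session at a time unless a subclass opts in.
    SUPPORTS_CONCURRENT_SESSIONS = False


class SafeEnvironment(Environment):
    SUPPORTS_CONCURRENT_SESSIONS = True


def validate_concurrency(env_cls, max_concurrent_envs=1):
    """Hypothetical startup check; returns the accepted limit or raises."""
    if max_concurrent_envs < 1:
        raise ConcurrencyConfigurationError("max_concurrent_envs must be >= 1")
    supports = getattr(env_cls, "SUPPORTS_CONCURRENT_SESSIONS", False)
    if max_concurrent_envs > 1 and not supports:
        raise ConcurrencyConfigurationError(
            f"{env_cls.__name__} does not set SUPPORTS_CONCURRENT_SESSIONS=True, "
            f"so max_concurrent_envs={max_concurrent_envs} is rejected"
        )
    return max_concurrent_envs


default_limit = validate_concurrency(Environment)        # default of 1 is always allowed
raised_limit = validate_concurrency(SafeEnvironment, 4)  # opted in, so 4 is accepted
```

Failing at startup rather than on the second connection means a misconfigured server never begins serving traffic.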

TODO

  • openenv init needs the WebSocket code integrated into the template

Future Work (Separate PRs)

  • Session timeout enforcement (tracked but not implemented)
  • Resource monitoring (memory/CPU per session)
  • Connection queueing when capacity is reached
  • Mark safe environments as SUPPORTS_CONCURRENT_SESSIONS=True
  • Update envs to support concurrency

rycerzes and others added 5 commits December 4, 2025 23:01
…erver capabilities

- Introduced WebSocketEnvClient for persistent sessions with multi-step interactions.
- Updated HTTPEnvServer to support WebSocket connections and manage multiple concurrent environments.
- Added WebSocket message types and responses for better communication.
- Enhanced Environment interface with concurrency safety attributes.
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 7, 2025
@rycerzes rycerzes changed the base branch from main to release December 7, 2025 20:28
@rycerzes

rycerzes commented Dec 7, 2025

Contributor Author

@burtenshaw draft PR for the WebSocket and concurrency work. I have merged #238 into this as well.

A few notes before #232 gets merged:

  • openenv init generates boilerplate template according to the old structure

  • openenv init needs the WebSocket code integrated into the template:

    • Add WebSocket client example/template
    • Update server templates to show WebSocket endpoint usage
    • Include documentation on CONCURRENCY_SAFE flag and concurrent sessions
  • VectorEnv abstraction for batched operations inspired by Gymnasium

@burtenshaw burtenshaw mentioned this pull request Dec 8, 2025
@burtenshaw

burtenshaw commented Dec 8, 2025

Collaborator

Amazing work @rycerzes. Thanks!

  • openenv init generates boilerplate template according to the old structure.

I'll integrate this in a new PR for you to merge here.

  • VectorEnv abstraction for batched operations inspired by Gymnasium

I think we can leave this for a subsequent PR.

Also, this env might be useful to you. It's basically just a benchmarking env that lets you test concurrency asynchronously like this.

@burtenshaw

Collaborator

@rycerzes could you help me to understand this please:

openenv init generates boilerplate template according to the old structure.

What do you mean by old structure? afaik, with #232, openenv init generates a template whose structure corresponds to the branch, i.e. with imports from:

from openenv.core.env_server.interfaces import Environment
from openenv.core.env_server.types import State

@burtenshaw burtenshaw changed the title feat: WebSocket-based Concurrency Architecture [FEATURE] WebSocket-based Concurrency Architecture Dec 8, 2025
@rycerzes

rycerzes commented Dec 8, 2025

Contributor Author

@burtenshaw

Thanks for the clarification! You're absolutely right - I need to correct my earlier comment.

What do you mean by old structure? afaik, with #232, openenv init generates a template whose structure corresponds to the branch, i.e. with imports from:

from openenv.core.env_server.interfaces import Environment
from openenv.core.env_server.types import State

I must have run openenv init from the main branch when I was testing, which would explain the confusion. The openenv init command on both the impl/concurrency branch and in #232 does generate the correct new structure with openenv.core imports.

I just verified this by running uv run openenv init test_env -o tests/ on the current branch, and it correctly generates all files with the new import structure. I have updated my above comment accordingly 👍


Also, this env might be useful to you. It's basically just a benchmarking env that lets you test concurrency asynchronously like this.

Thanks! That benchmark env would be perfect for testing the concurrency implementation. I'll take a look at it.
Apologies for the confusion on point 1!

@Wauplin Wauplin left a comment

Contributor

Thanks for working on this very important piece @rycerzes! I've left quite a few comments on how I would do things, but some parts are left to the maintainers' decisions 🤗 Especially:

  1. should we allow "instantiate a server by passing an env instead of an env factory" to keep backward compatibility? => I would say "no" since the project is still in an early phase
  2. should we maintain both a "HTTP-based interface" and a "websocket-based interface"? => same, I would say "no" as it means doubling the amount of work (2 paths in the http server and 2 very similar clients to maintain with the same interface but different internal logic). Better to keep only 1 interface that is more robust for the future. End users should not be impacted by this decision (except for the breaking change to adapt).

Apart from that, I usually advise simplifying logic by not adding too many optional features at first. More options usually mean more internal logic and more maintenance burden in the long run. So if something is not explicitly required, let's keep it for later.

Note that I haven't run the code myself. Will give it a try soon!

Comment thread src/openenv/core/env_server/types.py
Comment thread src/openenv/core/env_server/types.py Outdated
Comment thread src/openenv/core/env_server/types.py Outdated
Comment thread src/openenv/core/env_server/types.py Outdated
Comment thread src/openenv/core/env_server/types.py Outdated
Comment thread src/openenv/core/env_server/http_server.py Outdated
Comment thread src/openenv/core/env_server/http_server.py Outdated
Comment thread src/openenv/core/env_server/http_server.py Outdated
Comment thread src/openenv/core/ws_env_client.py Outdated
Comment thread src/openenv/core/ws_env_client.py Outdated
@burtenshaw

burtenshaw commented Dec 8, 2025

Collaborator

@pankit-eng @zkwentz Can you validate these two backward compatibility points from @Wauplin on this PR? In short, should we go all in on websockets or maintain an HTTP implementation?

  • should we allow "instantiate a server by passing an env instead of an env factory" to keep backward compatibility? => I would say "no" since project is still in early phase

Server side app will look like this:

# Factory mode for concurrent sessions
app = create_app(
    env=MyEnvironment,  # Pass class, not instance
    max_concurrent_envs=4
)

  • should we maintain both a "HTTP-based interface" and a "websocket-based interface"? => same, I would say "no" as it means doubling the amount of work (2 paths in the http server and 2 very similar clients to maintain with the same interface but different internal logic). Better to keep only 1 interface that is more robust for the future. End users should not be impacted by this decision (except for the breaking change to adapt).

iiuc, client code will only look like this:

from envs.echo_env import EchoEnv, EchoAction

client = EchoEnv(base_url="ws://localhost:8000/ws")

result = await client.reset()
result = await client.step(EchoAction(...))

@burtenshaw

Collaborator

@rycerzes I tested out this branch and it worked well. I updated the PR description myself with a high-level before and after snippet and some benchmarking info.

@rycerzes

Contributor Author

Thanks @Wauplin for the detailed review! Really appreciate all the feedback - the suggestions on simplifying the message types with discriminators, refactoring the capacity status, and cleaning up the validation logic make a lot of sense. I'll work through these and have them resolved by end of Friday.

@burtenshaw Thanks for testing the branch and updating the PR description with the benchmarking info!

@rycerzes

Contributor Author

Implemented the changes according to @Wauplin's extensive code review comments.

Summary of recent changes:

  • Refactored Message Types: Replaced WSMessage with typed BaseMessage and discriminated WSIncomingMessage unions to simplify validation.
  • Enforced Factory Pattern: create_app now strictly requires an environment class (factory) rather than an instance, removing the mixed-mode complexity.
  • Renaming & Cleanup: Renamed CONCURRENCY_SAFE to SUPPORTS_CONCURRENT_SESSIONS and removed redundant ConcurrencySafetyLevel enums.
  • Web Interface: Updated the web interface to be fully compatible with the new WebSocket architecture.
  • Templates: Updated CLI templates to use Pydantic and relative imports. Thanks to @burtenshaw

Still waiting on feedback from @pankit-eng and @zkwentz regarding the backward compatibility points (HTTP vs WS exclusivity), but otherwise, the implementation is complete.

@rycerzes rycerzes requested a review from Wauplin December 14, 2025 18:06
@burtenshaw

Collaborator

Great work @rycerzes. I'll review tomorrow. Could we update the status to 'Ready for Review', please?

@rycerzes rycerzes marked this pull request as ready for review December 14, 2025 19:12

@burtenshaw burtenshaw left a comment

Collaborator

Great work @rycerzes. From me this is functionally complete and ready to merge. I tested the implementation on a series of benchmark envs and pushed to the hub.

Screenshot 2025-12-15 at 20 25 41

@pankit-eng will have the final say on deprecating http, and you can merge this PR to do that.

@Darktex

Darktex commented Dec 17, 2025

Collaborator

I asked Opus to do a review of this PR and this is what it came up with:

Pull Request #239 Review

Summary

This PR adds WebSocket-based concurrency support to OpenEnv, enabling persistent sessions with configurable concurrency limits. The core architecture is well-designed with proper session isolation, factory pattern enforcement, and clean error handling. However, there are critical issues that need to be addressed before merge.

Overall Assessment: Functionally strong architecture with critical bugs that will break existing environments.

Critical Issues

1. All Existing Environments Pass Instances Instead of Classes

Severity: CRITICAL - Will cause TypeError at startup for all existing environments

The new create_app() function now requires a callable (class/factory), but all existing environments still pass instances:

| File | Current (Broken) | Required |
|---|---|---|
| envs/echo_env/server/app.py:37-40 | env = EchoEnvironment() | EchoEnvironment |
| envs/coding_env/server/app.py:29-33 | env = PythonCodeActEnv() | PythonCodeActEnv |
| envs/chat_env/server/app.py:69-72 | env = ChatEnvironment(...) | Factory function |
| envs/browsergym_env/server/app.py:19-34 | env = BrowserGymEnvironment(...) | Factory function |
| envs/atari_env/server/app.py | Instance | Class |
| envs/dipg_safety_env/server/app.py | Instance | Class |
| envs/git_env/server/app.py | Instance | Class |
| envs/openspiel_env/server/app.py | Instance | Class |
| envs/textarena_env/server/app.py | Instance | Class |

Fix Required: Update all environment app.py files to pass classes or factory functions.

For environments with constructor args (chat_env, browsergym_env), create factory functions:

# chat_env example
def create_chat_environment():
    tokenizer = get_tokenizer()
    return ChatEnvironment(tokenizer=tokenizer, system_prompt=system_prompt)

app = create_app(create_chat_environment, ChatAction, ChatObservation, env_name="chat_env")

2. Duplicate /ws Endpoint Registration Breaks Concurrency

Severity: CRITICAL - WebSocket concurrency won't work when web interface is enabled

src/openenv/core/env_server/web_interface.py:286 registers a /ws endpoint that overrides the one from src/openenv/core/env_server/http_server.py:617.

When ENABLE_WEB_INTERFACE=true:

  1. create_web_interface_app() calls create_fastapi_app() which registers /ws for concurrent sessions
  2. Then it registers another /ws for web UI state updates
  3. The second registration overwrites the first

Result: The concurrent WebSocket sessions feature is completely broken when the web interface is enabled.

Fix: Use different paths:

  • /ws for concurrent sessions (the main feature)
  • /ws/ui or /web/ws for web interface real-time updates
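
Whichever registration ends up winning inside the framework, only one handler can serve a given path, so a startup guard that refuses duplicates would surface this kind of conflict early. The sketch below uses a hypothetical `RouteTable`, not FastAPI's router, and the handler names are made up for illustration.

```python
class DuplicateRouteError(ValueError):
    """Raised when two handlers are registered for the same path."""


class RouteTable:
    """Hypothetical stand-in for a web framework's router."""

    def __init__(self):
        self._routes = {}

    def add_websocket_route(self, path, handler):
        if path in self._routes:
            raise DuplicateRouteError(
                f"{path!r} is already handled by {self._routes[path].__name__}"
            )
        self._routes[path] = handler

    def resolve(self, path):
        return self._routes[path]


def session_ws():   # reset/step/state/close for concurrent sessions
    return "session"


def web_ui_ws():    # real-time state updates for the web interface
    return "ui"


routes = RouteTable()
routes.add_websocket_route("/ws", session_ws)
try:
    routes.add_websocket_route("/ws", web_ui_ws)
    conflict_detected = False
except DuplicateRouteError:
    conflict_detected = True

# Using a separate path, as suggested above, keeps both handlers reachable.
routes.add_websocket_route("/ws/ui", web_ui_ws)
```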

Important Issues

3. echo_env Client Missing WebSocket Support

Severity: IMPORTANT - Inconsistent with PR goals

envs/echo_env/client.py only has EchoEnv (HTTP client), but the CLI template (src/openenv/cli/templates/openenv_env/client.py) includes both HTTP and WebSocket clients.

The echo_env (which serves as a reference implementation) should be updated to include an EchoEnvWS WebSocket client to match the template.


4. HTTP Endpoints Create New Environment Per Request

Severity: IMPORTANT - Stateless HTTP defeats session continuity

In src/openenv/core/env_server/http_server.py:372-393 and src/openenv/core/env_server/http_server.py:407-429, the HTTP /reset and /step endpoints create a new environment instance for every request and immediately close it:

async def reset_handler(...):
    _env = self._env_factory()  # New instance every time
    try:
        # ... do work ...
    finally:
        _env.close()  # Immediately destroyed

This means:

  • HTTP clients have no session continuity
  • Each /step call works on a fresh environment (losing all state from /reset)
  • The HTTP interface is essentially unusable for multi-step episodes

This appears intentional (pushing users toward WebSocket), but the PR description suggests HTTP should still work. Clarify intent or keep a single shared instance for HTTP clients.
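
The effect is easy to reproduce with a toy environment. In the sketch below, `CounterEnv` and `per_request_step` are hypothetical and only mimic the build, use once, close pattern quoted above; they are not code from the PR.

```python
class CounterEnv:
    """Toy environment: reset() zeroes a counter, step() increments it."""

    def __init__(self):
        self.steps = 0

    def reset(self):
        self.steps = 0
        return self.steps

    def step(self):
        self.steps += 1
        return self.steps

    def close(self):
        pass


def per_request_step(env_factory):
    """Mimics the handler pattern above: build, use once, close."""
    env = env_factory()
    try:
        return env.step()
    finally:
        env.close()


# Per-request instances: every call starts from a fresh counter.
stateless = [per_request_step(CounterEnv) for _ in range(3)]  # [1, 1, 1]

# One instance kept for the whole session (what a WebSocket connection holds).
session_env = CounterEnv()
session_env.reset()
stateful = [session_env.step() for _ in range(3)]             # [1, 2, 3]
session_env.close()
```

The per-request variant never advances past the first step, which is why multi-step episodes need a connection-scoped instance.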


Minor Issues

5. Session Timeout Tracked But Not Enforced

Severity: MINOR - Listed as future work

src/openenv/core/env_server/types.py:279-283 defines session_timeout in ConcurrencyConfig, and session activity is tracked in src/openenv/core/env_server/http_server.py:296-309, but no background task enforces timeouts.

This is already noted in the PR description as future work.
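
For reference, enforcement could be as small as a periodic sweep over the tracked activity timestamps. This is a sketch under assumptions: `expired_sessions` and `reap` are hypothetical helpers, and the dictionaries stand in for whatever structures the server actually keeps.

```python
import time


def expired_sessions(last_activity, session_timeout, now=None):
    """Return ids of sessions idle for longer than session_timeout seconds.

    last_activity maps session id -> time.monotonic() timestamp of the last message.
    """
    if session_timeout is None:      # no timeout configured
        return []
    now = time.monotonic() if now is None else now
    return [
        session_id
        for session_id, seen in last_activity.items()
        if now - seen > session_timeout
    ]


def reap(sessions, last_activity, session_timeout, now=None):
    """Close and forget every expired session; returns the ids that were removed."""
    removed = expired_sessions(last_activity, session_timeout, now)
    for session_id in removed:
        env = sessions.pop(session_id)
        last_activity.pop(session_id)
        env.close()
    return removed


class _FakeEnv:
    closed = False

    def close(self):
        self.closed = True


sessions = {"a": _FakeEnv(), "b": _FakeEnv()}
last_activity = {"a": 0.0, "b": 95.0}
idle_env = sessions["a"]
removed = reap(sessions, last_activity, session_timeout=30.0, now=100.0)
```

A background task would call `reap` on an interval; passing `now` explicitly keeps the logic testable without sleeping.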


6. ThreadPoolExecutor Sizing

Severity: MINOR - Performance consideration

src/openenv/core/env_server/http_server.py:167 uses a hardcoded max_workers=32 for the global executor. Consider making this configurable or tied to max_concurrent_envs.
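
One possible sizing rule, shown only as a sketch: scale the pool with the session limit and keep the existing 32 as an upper bound. `executor_size`, `threads_per_env`, and the thread name prefix are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def executor_size(max_concurrent_envs, threads_per_env=1, cap=32):
    """Hypothetical sizing rule: scale with the session limit, bounded by a cap."""
    if max_concurrent_envs < 1 or threads_per_env < 1:
        raise ValueError("max_concurrent_envs and threads_per_env must be >= 1")
    return min(cap, max_concurrent_envs * threads_per_env)


def make_executor(max_concurrent_envs):
    return ThreadPoolExecutor(
        max_workers=executor_size(max_concurrent_envs),
        thread_name_prefix="env-worker",
    )


with make_executor(max_concurrent_envs=4) as pool:
    results = list(pool.map(lambda x: x * x, range(4)))
```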


7. is_concurrency_safe Property Creates Temporary Instance

Severity: MINOR - Wasteful for non-class factories

src/openenv/core/env_server/http_server.py:343-352 creates a temporary environment instance just to check SUPPORTS_CONCURRENT_SESSIONS when the factory isn't a class. This is already done during _validate_concurrency_safety() at init time, so this property could cache the result.
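
Caching could be done with `functools.cached_property`, as in the sketch below. `ServerSketch` is a hypothetical stand-in for the server class; only the property name and the flag come from the PR.

```python
from functools import cached_property


class ServerSketch:
    """Hypothetical stand-in for the server class; only the caching is the point."""

    def __init__(self, env_factory):
        self._env_factory = env_factory

    @cached_property
    def is_concurrency_safe(self):
        factory = self._env_factory
        if isinstance(factory, type):
            # A class can be inspected directly, no instance needed.
            return bool(getattr(factory, "SUPPORTS_CONCURRENT_SESSIONS", False))
        # Arbitrary callable: build one temporary instance, then remember the answer.
        env = factory()
        try:
            return bool(getattr(env, "SUPPORTS_CONCURRENT_SESSIONS", False))
        finally:
            close = getattr(env, "close", None)
            if callable(close):
                close()


class SafeEnv:
    SUPPORTS_CONCURRENT_SESSIONS = True
    instances = 0

    def __init__(self):
        SafeEnv.instances += 1

    def close(self):
        pass


server = ServerSketch(lambda: SafeEnv())
answers = [server.is_concurrency_safe for _ in range(3)]
```

Three reads of the property construct a single temporary environment, since the first result is stored on the instance.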


Positive Aspects

  • Clean discriminated union types: WSIncomingMessage with Field(discriminator="type") is elegant
  • Well-structured exceptions: Custom exceptions (ConcurrencyConfigurationError, SessionCapacityError, EnvironmentFactoryError) with rich context
  • Session lifecycle management: Proper cleanup with locks, dedicated executors per session
  • Factory pattern enforcement: Clear error message when passing instance instead of class
  • SUPPORTS_CONCURRENT_SESSIONS flag: Good safety mechanism to prevent misconfigured concurrency
  • Sync/async flexibility: Environment can implement reset_async/step_async for true async support
  • CLI template includes both clients: Good for new environments

File-by-File Analysis

Core Implementation

| File | Status | Notes |
|---|---|---|
| src/openenv/core/env_server/http_server.py | Good | Clean architecture, but HTTP handlers create new env per request |
| src/openenv/core/env_server/types.py | Good | Well-typed WebSocket message protocol |
| src/openenv/core/env_server/interfaces.py | Good | Clean SUPPORTS_CONCURRENT_SESSIONS flag |
| src/openenv/core/env_server/exceptions.py | Good | Rich exception context |
| src/openenv/core/env_server/web_interface.py | Broken | /ws endpoint conflict |
| src/openenv/core/ws_env_client.py | Good | Clean WebSocket client with context manager |

CLI Templates

| File | Status | Notes |
|---|---|---|
| src/openenv/cli/templates/openenv_env/server/app.py | Good | Correctly uses class, has max_concurrent_envs=1 |
| src/openenv/cli/templates/openenv_env/client.py | Good | Includes both HTTP and WS clients |

Existing Environments

| Environment | Status | Issue |
|---|---|---|
| echo_env | Broken | Passes instance, missing WS client |
| coding_env | Broken | Passes instance |
| chat_env | Broken | Passes instance |
| browsergym_env | Broken | Passes instance |
| atari_env | Broken | Passes instance |
| dipg_safety_env | Broken | Passes instance |
| git_env | Broken | Passes instance |
| openspiel_env | Broken | Passes instance |
| textarena_env | Broken | Passes instance |

Action Items

  • CRITICAL: Fix all environment app.py files to pass classes/factories instead of instances
  • CRITICAL: Fix /ws endpoint conflict in web_interface.py (use /ws/ui or similar)
  • Add EchoEnvWS WebSocket client to echo_env as reference implementation
  • Clarify HTTP behavior: intentionally stateless or should maintain session?
  • Consider adding integration tests for WebSocket concurrency
  • Update documentation to explain factory pattern requirement

Review Decision

REQUEST_CHANGES - The critical issues will break all existing environments and the WebSocket concurrency feature when the web interface is enabled. These must be fixed before merge.

@Darktex Darktex left a comment

Collaborator

Left comments inline and verified the issues that Claude found. Let's please fix them!

I have also reviewed the PR myself (I swear I'm not only a Claude reseller! Lmao), and things look otherwise good. I'm generally allergic to factory patterns but I agree that we need them here.

@rycerzes what is the plan to migrate the existing envs and examples? (we can also try to give this task to copilot, should be very straightforward...).

  def __init__(
      self,
-     env: Environment,
+     env: Callable[[], Environment],

Collaborator

This should be env_factory. I think the more pythonic way of doing factories is arguably passing the class and the args and kwargs, no?

so like this?

app = create_app(
      env_class=ChatEnvironment,
      env_kwargs={"tokenizer": tokenizer, "system_prompt": prompt},
      ...
  )
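
For comparison, a class plus kwargs can be reduced to the same zero-argument factory with `functools.partial`, so the two calling styles are not far apart. `as_env_factory` and `ChatEnvironmentSketch` are hypothetical names used only for this sketch.

```python
from functools import partial


def as_env_factory(env_cls, env_kwargs=None):
    """Hypothetical helper: turn (class, kwargs) into a zero-argument factory."""
    if not callable(env_cls):
        raise TypeError("env_cls must be a class or callable, not an instance")
    return partial(env_cls, **(env_kwargs or {}))


class ChatEnvironmentSketch:
    """Toy stand-in with the same constructor arguments as the example above."""

    def __init__(self, tokenizer, system_prompt):
        self.tokenizer = tokenizer
        self.system_prompt = system_prompt


factory = as_env_factory(
    ChatEnvironmentSketch,
    {"tokenizer": "toy-tokenizer", "system_prompt": "You are helpful."},
)
env_a = factory()
env_b = factory()
```

One thing to keep in mind with this style: the kwargs are evaluated once and shared, so every session receives the same tokenizer object. That is fine for read-only resources but not for anything holding per-session state.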


  Args:
-     env: The Environment instance to wrap
+     env: Environment factory (callable) that creates new instances.

Collaborator

I think that the pattern of passing env_cls also works better given that we are already passing action_cls and observation_cls so it resonates more imho

Collaborator

Claude is right, this now becomes a problem

Collaborator

I think this should be /web/ws

@burtenshaw

Collaborator

Thanks for the review @Darktex.

@rycerzes I'll take on these changes on this branch and merge #252 as well.

@rycerzes

Contributor Author

@rycerzes what is the plan to migrate the existing envs and examples? (we can also try to give this task to copilot, should be very straightforward...).

Thank you for the review, @Darktex. For now, I had planned to migrate the existing envs and examples manually, but I think we could try using Copilot for this as well.

@Darktex Darktex left a comment

Collaborator

Let's land it, but let's at least make the change to the env arg in HTTPServer to env_factory to reduce confusion

@burtenshaw burtenshaw merged commit 72c5c83 into huggingface:release Dec 18, 2025
1 check passed

Labels

CLA Signed This label is managed by the Meta Open Source bot.

4 participants