Architecture: Daemon should be stateless multiplexer, not session database #51

@c-h-

Description

What happened

After 26h of daemon uptime, agentctl daemon status reported 394 active sessions and 79 active locks. After restart, it showed 0. The 394 sessions were mostly stale OpenClaw hook sessions (cron jobs) from 16-23h ago that had already completed on the gateway.

Root cause

Three related issues in session-tracker reaping logic:

1. OpenClaw sessions are never reaped from state (primary bug)

reapStaleEntries() only cleans up sessions based on PID liveness:

```typescript
if ((record.status === 'running' || record.status === 'idle') && record.pid) {
  // Only reaps if PID is dead
}
```

OpenClaw sessions have no PID (they're remote gateway sessions), so the PID check above never applies to them. When the gateway stops returning a session because it has completed, the session-tracker never notices; the session stays in state as running forever.

The discover-first approach correctly queries the gateway each poll cycle, but only adds/updates sessions that appear in discover results. It never removes sessions that disappear from discover results. This is the core gap.

2. Auto-locks are never released for sessions that die without session.stop

autoUnlock() is only called in the session.stop RPC handler (server.ts:421,433). When reapStaleEntries() marks a session as stopped (PID dead), it does NOT call autoUnlock(). This means auto-locks for crashed Claude Code sessions accumulate too.

For the 79 locks specifically: these are likely a mix of stale auto-locks from dead sessions and manual locks that were never released.
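
A minimal sketch of a reap path that releases auto-locks, assuming a record shape and a lock-manager interface that are not taken from the agentctl source: `SessionRecord`, `stoppedAt`, the `LockManager` interface, and the injected `pidAlive` callback are all hypothetical. Only `reapStaleEntries()` and `autoUnlock()` are names from this issue, and the real signatures may differ.

```typescript
// Hypothetical shapes for illustration; the real agentctl types may differ.
interface SessionRecord {
  id: string;
  status: "running" | "idle" | "stopped";
  pid?: number; // absent for remote sessions such as OpenClaw
  stoppedAt?: number; // hypothetical timestamp field (ms since epoch)
}

interface LockManager {
  autoUnlock(sessionId: string): void;
}

// Marks active sessions whose PID is dead as stopped and releases their
// auto-locks in the same step. Returns the IDs that were reaped.
function reapStaleEntries(
  sessions: Map<string, SessionRecord>,
  locks: LockManager,
  pidAlive: (pid: number) => boolean, // liveness check injected by the caller
  now: number = Date.now(),
): string[] {
  const reaped: string[] = [];
  sessions.forEach((record) => {
    const active = record.status === "running" || record.status === "idle";
    // Sessions without a PID are skipped here, as in the current logic.
    if (!active || record.pid === undefined || pidAlive(record.pid)) return;
    record.status = "stopped";
    record.stoppedAt = now;
    locks.autoUnlock(record.id); // the call the current reap path omits
    reaped.push(record.id);
  });
  return reaped;
}
```

Calling `autoUnlock()` where the status changes keeps lock release tied to the state transition rather than to one particular RPC handler, so any future reap path that goes through the same function gets the unlock for free.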

3. pruneOldSessions() only runs on startup

The 7-day prune (STOPPED_SESSION_PRUNE_AGE_MS) runs once, at startPolling(). In a long-running daemon, stopped sessions therefore accumulate until the next restart. The prune should also run periodically (e.g., every hour).
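
One way to run the prune periodically, sketched under assumptions: the `StoppedRecord` shape, the `stoppedAt` field, `PRUNE_INTERVAL_MS`, and `startPeriodicPrune()` are hypothetical names, and the body of `pruneOldSessions()` here is a stand-in rather than the daemon's actual implementation. Only the 7-day age and the hourly interval come from this issue.

```typescript
// Hypothetical record shape; only the fields the prune needs.
interface StoppedRecord {
  id: string;
  status: string;
  stoppedAt?: number; // hypothetical timestamp field (ms since epoch)
}

const STOPPED_SESSION_PRUNE_AGE_MS = 7 * 24 * 60 * 60 * 1000; // 7 days
const PRUNE_INTERVAL_MS = 60 * 60 * 1000; // hourly; hypothetical constant

// Stand-in for the existing prune: drops stopped sessions older than the
// prune age and returns how many were removed.
function pruneOldSessions(
  sessions: Map<string, StoppedRecord>,
  now: number = Date.now(),
): number {
  let pruned = 0;
  sessions.forEach((record, id) => {
    if (record.status !== "stopped" || record.stoppedAt === undefined) return;
    if (now - record.stoppedAt > STOPPED_SESSION_PRUNE_AGE_MS) {
      sessions.delete(id); // deleting during Map.forEach is safe
      pruned++;
    }
  });
  return pruned;
}

// Prunes once immediately (the current startup behavior), then on a timer.
// Returns a function that cancels the timer, for use on daemon shutdown.
function startPeriodicPrune(
  sessions: Map<string, StoppedRecord>,
  intervalMs: number = PRUNE_INTERVAL_MS,
): () => void {
  pruneOldSessions(sessions);
  const timer = setInterval(() => pruneOldSessions(sessions), intervalMs);
  return () => clearInterval(timer);
}
```

Returning a cancel function keeps the timer from outliving the polling loop; the daemon would call it wherever polling is stopped today.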

Expected behavior

  • Sessions that disappear from adapter discover() results should be marked stopped and eventually pruned
  • Auto-locks should be released when a session is reaped (not just on explicit stop)
  • Periodic pruning should run during daemon lifetime, not just on startup

Fix suggestions

  1. In poll(): After discover, build a set of all discovered session IDs. Any session in state with status=running/idle whose adapter matches but whose ID is NOT in the discovered set should be marked stopped. (Be careful: only reap sessions from adapters that successfully returned results, to avoid mass-reaping on transient adapter failures.)

  2. In reapStaleEntries(): When marking a session stopped, also call lockManager.autoUnlock(sessionId).

  3. Periodic prune: Run pruneOldSessions() on a timer (e.g., hourly) in addition to startup.
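
The first suggestion, including the guard against transient adapter failures, could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `TrackedSession` and `DiscoverResult` shapes, the `reconcileMissing()` name, the `onReap` callback, and the adapter labels in the usage note are not taken from the agentctl source.

```typescript
// Hypothetical shapes for illustration; the real agentctl types may differ.
interface TrackedSession {
  id: string;
  adapter: string;
  status: "running" | "idle" | "stopped";
  stoppedAt?: number;
}

// Outcome of one adapter's discover() call in a poll cycle. `ok: false`
// models a transient failure such as a timeout or an unreachable gateway.
type DiscoverResult =
  | { adapter: string; ok: true; sessionIds: string[] }
  | { adapter: string; ok: false };

// Marks active sessions as stopped when their adapter answered successfully
// but no longer lists them. Sessions of failed adapters are left untouched.
function reconcileMissing(
  state: Map<string, TrackedSession>,
  results: DiscoverResult[],
  onReap: (sessionId: string) => void, // e.g. release auto-locks
  now: number = Date.now(),
): string[] {
  // Only adapters that returned results get an entry in this map.
  const seenByAdapter = new Map<string, Set<string>>();
  results.forEach((result) => {
    if (result.ok) seenByAdapter.set(result.adapter, new Set(result.sessionIds));
  });

  const reaped: string[] = [];
  state.forEach((session) => {
    if (session.status === "stopped") return;
    const seen = seenByAdapter.get(session.adapter);
    // No entry means the adapter failed this cycle: skip instead of reaping.
    if (seen === undefined || seen.has(session.id)) return;
    session.status = "stopped";
    session.stoppedAt = now;
    onReap(session.id);
    reaped.push(session.id);
  });
  return reaped;
}
```

For example, if an adapter labeled `"openclaw"` returns only `oc-1` while state also holds an idle `oc-2`, then `oc-2` is marked stopped; if a second adapter fails in the same cycle, its sessions stay as they are. Passing `(id) => lockManager.autoUnlock(id)` as `onReap` would cover the second suggestion for this path as well.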

How to reproduce

  1. Start daemon, wait for OpenClaw adapter to discover sessions
  2. Wait for those sessions to complete on the gateway side
  3. Run agentctl daemon status repeatedly: the session count only ever grows
  4. Run agentctl list -a: it shows hundreds of stale OpenClaw sessions as running/idle

Environment

  • agentctl (current main)
  • macOS, OpenClaw gateway with frequent hook sessions (cron)
