Skip to content

fix(claude): don't time out a turn while it waits on a slow tool approval#382

Merged
mbektas merged 1 commit into
plmbr:mainfrom
pjdoland:fix/381-tool-approval-long-wait
Jun 25, 2026
Merged

fix(claude): don't time out a turn while it waits on a slow tool approval#382
mbektas merged 1 commit into
plmbr:mainfrom
pjdoland:fix/381-tool-approval-long-wait

Conversation

@pjdoland

Copy link
Copy Markdown
Collaborator

Summary

Approving a tool after a long wait (around 30 minutes) threw a "Claude agent response timeout" error, and subsequent prompts then got stuck. The root cause: the response timeout (CLAUDE_AGENT_CLIENT_RESPONSE_TIMEOUT, default 1800s) counted the time the agent spent legitimately blocked on the user's tool approval, so a slow human reply timed out the turn while the worker was still mid-request awaiting the approval. The orphaned request then wedged the next prompt.

Solution

Exclude user-wait time from the response timeout. ChatResponse now tracks the wall-clock time it spends blocked on user input (every tool approval, plan confirm, and AskUserQuestion funnels through wait_for_chat_user_input), and _send_claude_agent_request subtracts that from the timeout window. A slow human reply no longer looks like an unresponsive agent, while the timeout still fires for a genuine post-approval hang.

Cancel in-flight requests on on_close. Because a parked approval no longer self-resolves via the timeout, an abandoned turn (the user closes the tab) would otherwise spin its worker thread and the Claude subprocess forever; cancelling makes the poll loop tear the subprocess down. The handler dict is snapshotted before iterating so a worker popping its own entry at the same instant can't raise mid-iteration.

wait_for_chat_user_input now disconnects its listener in a finally, fixing a latent subscription leak on a cancelled turn.

Testing

Unit tests cover the new ChatResponse wait accounting (starts at zero, counts an in-progress wait, accumulates completed waits, clears in-progress on answer, disconnects the listener) and the on_close cancellation (cancels each in-flight token, continues if one cancel raises, clears the dict). Full suite green: 1235 passed plus the 57-test Claude client suite.

Verified live in a running JupyterLab with a lowered timeout so a "long wait" is fast to reproduce: an approval held about 7x past the timeout did not error, completed on approval, and the next prompt responded normally (both symptoms fixed). Separately, closing the tab mid-approval fired on_close and tore the subprocess down within a poll cycle, so an abandoned approval no longer leaks; ungraceful drops are covered by Tornado's websocket ping timeout firing on_close.

Risks / follow-ups

The accounting assumes a single in-flight approval at a time, which holds because the Claude SDK serializes tool permission prompts; worth a note near wait_for_chat_user_input if that ever changes. The timing uses wall-clock time.time() rather than a monotonic clock, so a system clock adjustment mid-wait could skew the window slightly; cosmetic, optional to switch to a monotonic source.

Closes #381.

…oval (plmbr#381)

A tool approval the user took a long time to confirm (around 30 min) was
counted toward CLAUDE_AGENT_CLIENT_RESPONSE_TIMEOUT, so the turn errored with
"Claude agent response timeout" while the worker was still mid-request awaiting
the approval. The orphaned request then wedged the next prompt.

- Track on ChatResponse the wall-clock time spent blocked on user input (every
  approval, plan confirm, and AskUserQuestion goes through
  wait_for_chat_user_input) and subtract it from the response-timeout window in
  _send_claude_agent_request. A slow human reply no longer looks like an
  unresponsive agent; the timeout still fires for a genuine post-approval hang.
- wait_for_chat_user_input now disconnects its listener in a finally, so a
  cancelled turn can't leak the subscription.
- Cancel in-flight requests in on_close. With the timeout no longer firing for
  a parked approval, an abandoned turn (the user closes the tab) would
  otherwise spin its worker and Claude subprocess forever; cancelling makes the
  poll loop tear the subprocess down. Snapshot the handler dict before
  iterating so a worker popping its own entry can't raise mid-iteration.

Verified live with a lowered timeout: an approval held ~7x past the timeout did
not error, completed on approval, and the next prompt responded normally; and a
tab close mid-approval tore the subprocess down within a poll cycle.
@pjdoland pjdoland added the bug Something isn't working label Jun 25, 2026

@mbektas mbektas left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks a lot @pjdoland !

@mbektas mbektas merged commit 2f7e7a7 into plmbr:main Jun 25, 2026
4 of 5 checks passed
@pjdoland pjdoland deleted the fix/381-tool-approval-long-wait branch June 25, 2026 16:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Approving a tool after a long wait is causing agent to become unresponsive

2 participants