fix(claude): don't time out a turn while it waits on a slow tool approval#382
Merged
Merged
Conversation
…oval (plmbr#381) A tool approval the user took a long time to confirm (around 30 min) was counted toward CLAUDE_AGENT_CLIENT_RESPONSE_TIMEOUT, so the turn errored with "Claude agent response timeout" while the worker was still mid-request awaiting the approval. The orphaned request then wedged the next prompt. - Track on ChatResponse the wall-clock time spent blocked on user input (every approval, plan confirm, and AskUserQuestion goes through wait_for_chat_user_input) and subtract it from the response-timeout window in _send_claude_agent_request. A slow human reply no longer looks like an unresponsive agent; the timeout still fires for a genuine post-approval hang. - wait_for_chat_user_input now disconnects its listener in a finally, so a cancelled turn can't leak the subscription. - Cancel in-flight requests in on_close. With the timeout no longer firing for a parked approval, an abandoned turn (the user closes the tab) would otherwise spin its worker and Claude subprocess forever; cancelling makes the poll loop tear the subprocess down. Snapshot the handler dict before iterating so a worker popping its own entry can't raise mid-iteration. Verified live with a lowered timeout: an approval held ~7x past the timeout did not error, completed on approval, and the next prompt responded normally; and a tab close mid-approval tore the subprocess down within a poll cycle.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Approving a tool after a long wait (around 30 minutes) threw a "Claude agent response timeout" error, and subsequent prompts then got stuck. The root cause: the response timeout (
CLAUDE_AGENT_CLIENT_RESPONSE_TIMEOUT, default 1800s) counted the time the agent spent legitimately blocked on the user's tool approval, so a slow human reply timed out the turn while the worker was still mid-request awaiting the approval. The orphaned request then wedged the next prompt.Solution
Exclude user-wait time from the response timeout.
ChatResponsenow tracks the wall-clock time it spends blocked on user input (every tool approval, plan confirm, and AskUserQuestion funnels throughwait_for_chat_user_input), and_send_claude_agent_requestsubtracts that from the timeout window. A slow human reply no longer looks like an unresponsive agent, while the timeout still fires for a genuine post-approval hang.Cancel in-flight requests on
on_close. Because a parked approval no longer self-resolves via the timeout, an abandoned turn (the user closes the tab) would otherwise spin its worker thread and the Claude subprocess forever; cancelling makes the poll loop tear the subprocess down. The handler dict is snapshotted before iterating so a worker popping its own entry at the same instant can't raise mid-iteration.wait_for_chat_user_inputnow disconnects its listener in afinally, fixing a latent subscription leak on a cancelled turn.Testing
Unit tests cover the new
ChatResponsewait accounting (starts at zero, counts an in-progress wait, accumulates completed waits, clears in-progress on answer, disconnects the listener) and theon_closecancellation (cancels each in-flight token, continues if one cancel raises, clears the dict). Full suite green: 1235 passed plus the 57-test Claude client suite.Verified live in a running JupyterLab with a lowered timeout so a "long wait" is fast to reproduce: an approval held about 7x past the timeout did not error, completed on approval, and the next prompt responded normally (both symptoms fixed). Separately, closing the tab mid-approval fired
on_closeand tore the subprocess down within a poll cycle, so an abandoned approval no longer leaks; ungraceful drops are covered by Tornado's websocket ping timeout firingon_close.Risks / follow-ups
The accounting assumes a single in-flight approval at a time, which holds because the Claude SDK serializes tool permission prompts; worth a note near
wait_for_chat_user_inputif that ever changes. The timing uses wall-clocktime.time()rather than a monotonic clock, so a system clock adjustment mid-wait could skew the window slightly; cosmetic, optional to switch to a monotonic source.Closes #381.