fix: improve websocket error handling with actionable messages#626
Open
fix: improve websocket error handling with actionable messages#626
Conversation
Guimove
commented
Apr 2, 2026
90a28f5 to
d1be6d4
Compare
…n and actionable messages Stop infinite retry loops on permanent websocket errors (permission denied, auth failures). Detect error type and show clear, actionable messages instead of raw websocket close frames. Changes: - Add shared error classifier (IsPermissionError, IsPermanentCloseError, IsInternalServerError) - Attempt token refresh once on 1008 auth errors (JWT only, not env tokens) - Gate dial-level refresh on HTTP 401/403 to avoid corrupting stored credentials - Reset refresh guards after successful connection for long-lived sessions - Show minimum required role (Deployer) on permission denied - Show cluster health message on 1011 (treated as transient, retries) - Stop retry loop and exit cleanly on permanent errors (1007, 1008) - Replace log.Fatal with log.Errorf in port-forward (don't kill listener) - Handle SIGINT alongside SIGTERM, Ctrl+C support when not connected - Drain done channel after write/ping errors to capture real close frame
d1be6d4 to
e6f9e98
Compare
erebe
reviewed
Apr 3, 2026
| if err != nil { | ||
| log.Fatal("error while creating websocket connection", err) | ||
| if !utils.IsUsingEnvToken() && IsAuthDialError(resp) { | ||
| if _, refreshErr := utils.ForceRefreshAccessToken(); refreshErr == nil { |
Contributor
There was a problem hiding this comment.
Is the ForceRefreshAccessToken really needed ? the call createLogWebsocket is already fetching the token and refresh it if it has expired
Contributor
Author
There was a problem hiding this comment.
Even for long term connections ?
Contributor
There was a problem hiding this comment.
Yes, the auth is already made at the establishment of the connection. After the user has 1 hour before the connection is force cut by the server.
if it reconnect, the auth will be re-checked again
Contributor
Author
There was a problem hiding this comment.
Ok so maybe I can remove that part
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Summary
Stop infinite retry loops on permanent websocket errors (permission denied, auth failures). Detect error type and show clear, actionable messages instead of raw websocket close frames. Attempt token refresh once on auth failures (gated on HTTP 401/403 or WS 1008). Exit cleanly on permanent errors instead of hanging.
Before / After
Permission denied (Viewer role on
qovery shell)Before:
After (with API token):
After (with JWT from
qovery auth):No shell-agent on cluster (1011)
Before:
After:
Changes
pkg/wserror.goIsAuthDialErrorpkg/wserror_test.gopkg/shell_test.goutils/auth.goForceRefreshAccessToken()for websocket auth recoveryutils/auth_test.goIsUsingEnvTokenandForceRefreshAccessTokenutils/context.goIsUsingEnvToken()to skip refresh for env-provided tokenspkg/shell.gopkg/log.gopkg/port-forward.golog.Fataltolog.Errorf(don't kill listener)Error classification
Design decisions
dialRefreshed/tokenRefreshed): reset after successful connection for long-lived sessionsIsUsingEnvTokenskips ALL refresh:GetAccessToken()always returns env var directly, stored context refresh is futileTest plan
qovery shellwith Viewer API token — permission denied message, exits immediatelyqovery shellwith invalid API token — 401 Unauthorized, exitsqovery shellwith Viewer token on other org — 403 Forbidden, exitsqovery shellwith no shell-agent (1011) — cluster health message, retries, Ctrl+C exitsqovery shellwith valid Deployer+ token — shell works normally, interactive