[HistoryServer] Propagate error on storage APIs #4858

Open
dentiny wants to merge 9 commits into ray-project:master from dentiny:hjiang/propagate-error-on-read

Conversation

@dentiny dentiny commented May 22, 2026

Contributor

Why are these changes needed?

Errors should be propagated up unless we have a reason not to do so.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Comment thread historyserver/pkg/historyserver/router.go
// for metadata, fileName is generated by historyserver, which will obey the same rule as collector to make sure the historyserver can ready the right file.
//
- GetContent(clusterId string, fileName string) io.Reader
+ GetContent(clusterId string, fileName string) (io.Reader, error)

Member

Thanks @dentiny!

We will discuss the interface change in the next meeting.

Contributor Author

Sure. PoAn is working on an OpenDAL-based storage interface; while reviewing his PR I was surprised to find that quite a few APIs don't propagate errors, not sure why 🤷

@dentiny dentiny changed the title [HistoryServer] Propagate error on read [HistoryServer] Propagate error on storage APIs May 22, 2026
@dentiny dentiny requested a review from JiangJiaWei1103 May 24, 2026 13:20
@dentiny dentiny force-pushed the hjiang/propagate-error-on-read branch from 483e466 to 8dcb43c Compare May 24, 2026 13:35
Comment thread historyserver/pkg/historyserver/reader.go
Comment thread historyserver/pkg/historyserver/router.go
- reader := s.reader.GetContent(rayClusterNameNamespace, logPath)
+ reader, err := s.reader.GetContent(rayClusterNameNamespace, logPath)
+ if err != nil {
+     return nil, utils.NewHTTPError(fmt.Errorf("failed to get log file %s: %w", logPath, err), http.StatusNotFound)

Collaborator

Should we always return 404 here? There could be timeout/network errors.

Contributor Author

  • We don't use HTTP errors or status codes at the storage layer (i.e., GCS/S3/etc.), so it's hard to translate a storage error into an HTTP status code at the reader
  • Failures caused by timeouts and transient network issues should be retried internally at the storage layer, so errors propagated to the reader should be non-retriable ones
  • I updated the error code from not found to internal server error, which is more general

@dentiny dentiny requested a review from rueian May 24, 2026 23:04
Comment thread historyserver/pkg/historyserver/router.go Outdated

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit f3fb687.

Comment thread historyserver/pkg/historyserver/reader.go

@JiangJiaWei1103 JiangJiaWei1103 left a comment

Member

Thanks for the effort!

The one thing blocking this for me: this PR treats every storage error as 500, but the storage layer does not actually distinguish "resource not found" from "storage failure". For example, "not found" and "S3 is down" arrive as the same undifferentiated error.

One possible fix: actually distinguish the two at the storage layer, then let the router decide:

// storage package
var ErrNotFound = errors.New("object not found")
// each backend wraps real not-found: fmt.Errorf("...: %w", storage.ErrNotFound)

// router:
if errors.Is(err, storage.ErrNotFound) {
    // 404 for metadata; empty-JSON / 404 fallback for additional endpoints
} else {
    // 500
}

Another thing worth discussing (non-blocking): listFilesRecursive previously returned whatever it found, skipping any subdirectory that failed; now a single failure fails the whole request. This is a behavior change, and I'd like to hear whether that's expected. Thanks!
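One way to keep both behaviors available is to return the partial listing together with the joined errors and let the caller pick the policy. The sketch below assumes Go 1.20+ for `errors.Join`; the `lister` type, `fakeList`, and the paths are hypothetical and do not reflect the actual `listFilesRecursive` signature.

```go
package main

import (
	"errors"
	"fmt"
)

// lister is a hypothetical stand-in for the storage listing call: it returns
// the files and subdirectories directly under dir.
type lister func(dir string) (files, subdirs []string, err error)

// listBestEffort returns every file it can reach, plus the joined errors from
// subdirectories that could not be listed. A caller that wants the fail-fast
// behavior treats any non-nil error as fatal; a caller that wants the old
// behavior logs the error and uses the partial result.
func listBestEffort(list lister, dir string) ([]string, error) {
	files, subdirs, err := list(dir)
	if err != nil {
		return nil, fmt.Errorf("list %s: %w", dir, err)
	}
	var errs []error
	for _, sub := range subdirs {
		subFiles, subErr := listBestEffort(list, sub)
		files = append(files, subFiles...)
		if subErr != nil {
			errs = append(errs, subErr)
		}
	}
	// errors.Join returns nil when there is nothing to join.
	return files, errors.Join(errs...)
}

// fakeList simulates a tree in which one subdirectory cannot be listed.
func fakeList(dir string) ([]string, []string, error) {
	switch dir {
	case "logs":
		return []string{"logs/a.log"}, []string{"logs/ok", "logs/bad"}, nil
	case "logs/ok":
		return []string{"logs/ok/b.log"}, nil, nil
	default:
		return nil, nil, errors.New("permission denied")
	}
}

func main() {
	files, err := listBestEffort(fakeList, "logs")
	fmt.Println("files:", files)
	fmt.Println("err:", err)
}
```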

logrus.Errorf("Failed to get additional endpoint data for %s: %v", req.Request.URL.Path, err)
resp.WriteErrorString(http.StatusInternalServerError, "Failed to retrieve endpoint data from storage")
return
}

Member

This seems to change behavior for clusters that never enabled serve. The collector never stores /api/serve/applications data for them, so GetContent returns a non-nil err.

Before this PR, that err fell through to the reader == nil branch and emptyResponseForEndpoint() returned {"applications":{}} (as the comment on L1011-1013 describes). After this PR it returns 500, and the frontend shows an error state.

A missing resource is a client-facing 404, not a server-internal 500 (500 tends to trigger alerts / client retries). We may need to distinguish the two.

resp.WriteErrorString(http.StatusInternalServerError, "Failed to retrieve cluster metadata from storage")
return
}
if reader == nil {

Member

Same root issue: since every backend returns an err (not a nil reader) for a missing object, this turns "metadata not found" into a 500 instead of a 404, and the reader == nil branch below becomes unreachable dead code.


3 participants