[HistoryServer] Propagate error on storage APIs by dentiny · Pull Request #4858 · ray-project/kuberay

dentiny · 2026-05-22T04:21:38Z

Why are these changes needed?

Errors should be propagated up unless there is a reason not to.

Checks

I've made sure the tests are passing.
Testing Strategy
- Unit tests
- Manual tests
- This PR is not tested :(

JiangJiaWei1103 · 2026-05-22T07:44:06Z

 	// for metadata, fileName is generated by historyserver, which will obey the same rule as collector to make sure the historyserver can ready the right file.
 	//
-	GetContent(clusterId string, fileName string) io.Reader
+	GetContent(clusterId string, fileName string) (io.Reader, error)


Thanks @dentiny!

We will discuss the interface change in the next meeting.

Sure. PoAn is working on an OpenDAL-based storage interface. While reviewing his PR I was surprised to find that quite a few APIs don't propagate errors, not sure why 🤷

rueian · 2026-05-24T15:26:00Z

-	reader := s.reader.GetContent(rayClusterNameNamespace, logPath)
+	reader, err := s.reader.GetContent(rayClusterNameNamespace, logPath)
+	if err != nil {
+		return nil, utils.NewHTTPError(fmt.Errorf("failed to get log file %s: %w", logPath, err), http.StatusNotFound)


Should we always return 404 here? There could be timeout/network errors.

We don't use HTTP errors or status codes at the storage layer (i.e., GCS/S3/etc.), so it's hard to translate a storage error into an HTTP status code at the reader.

Failures caused by timeouts and transient network issues should be retried internally at the storage layer, so errors propagated to the reader should be non-retriable ones.

I updated the status code from 404 Not Found to 500 Internal Server Error, which is more general.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit f3fb687.

JiangJiaWei1103

Thanks for the effort!

The one thing blocking this for me: this PR treats every storage error as 500, but the storage layer does not actually distinguish "resource not found" from "storage failure". For example, "not found" and "S3 is down" arrive as the same undifferentiated error.

One possible fix: actually distinguish the two at the storage layer, then let the router decide:

// storage package
var ErrNotFound = errors.New("object not found")
// each backend wraps real not-found: fmt.Errorf("...: %w", storage.ErrNotFound)

// router:
if errors.Is(err, storage.ErrNotFound) {
    // 404 for metadata; empty-JSON / 404 fallback for additional endpoints
} else {
    // 500
}

Another thing worth discussing (non-blocking): listFilesRecursive previously returned whatever it found, skipping any failed subdirectories; now any single failure fails the whole request. This is a behavior change, and I'd like to hear whether that's expected. Thanks!
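
To make the two behaviors concrete, here is a small sketch under stated assumptions. It is not the real listFilesRecursive; `lister`, `listRecursive`, and `demoLister` are hypothetical, and the best-effort mode uses `errors.Join` (Go 1.20+) to report skipped subdirectories alongside the partial result.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// lister is a hypothetical stand-in for a backend's directory listing.
// Entries ending in "/" are subdirectories.
type lister func(dir string) ([]string, error)

// listRecursive walks dir. With failFast set, the first listing error aborts
// the walk. Otherwise failed subdirectories are skipped and their errors are
// joined and returned together with the files that were found.
func listRecursive(ls lister, dir string, failFast bool) ([]string, error) {
	entries, err := ls(dir)
	if err != nil {
		return nil, fmt.Errorf("list %s: %w", dir, err)
	}
	var files []string
	var skipped error
	for _, e := range entries {
		if !strings.HasSuffix(e, "/") {
			files = append(files, dir+e)
			continue
		}
		sub, err := listRecursive(ls, dir+e, failFast)
		if err != nil && failFast {
			return nil, err
		}
		files = append(files, sub...)
		skipped = errors.Join(skipped, err)
	}
	return files, skipped
}

// demoLister returns an in-memory tree in which one subdirectory fails.
func demoLister() lister {
	tree := map[string][]string{
		"logs/":    {"a.log", "ok/", "bad/"},
		"logs/ok/": {"b.log"},
	}
	return func(dir string) ([]string, error) {
		if dir == "logs/bad/" {
			return nil, errors.New("permission denied")
		}
		return tree[dir], nil
	}
}

func main() {
	files, err := listRecursive(demoLister(), "logs/", false)
	fmt.Println("best effort:", files, "err:", err)

	files, err = listRecursive(demoLister(), "logs/", true)
	fmt.Println("fail fast:  ", files, "err:", err)
}
```

Returning the partial result together with a non-nil error lets the caller decide whether to serve what was found or fail the request.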

JiangJiaWei1103 · 2026-05-30T13:31:57Z

+		logrus.Errorf("Failed to get additional endpoint data for %s: %v", req.Request.URL.Path, err)
+		resp.WriteErrorString(http.StatusInternalServerError, "Failed to retrieve endpoint data from storage")
+		return
+	}


This seems to change behavior for clusters that never enabled serve. The collector never stores /api/serve/applications data, so GetContent returns a non-nil err.

Before this PR, that err fell through to the reader == nil branch and emptyResponseForEndpoint() returned {"applications":{}} (as described in the comment on L1011-1013). After this PR, it returns 500, and the frontend shows an error state.

A missing resource is a client-facing 404, not a server-internal 500 (500 tends to trigger alerts / client retries). We may need to distinguish the two.

JiangJiaWei1103 · 2026-05-30T13:32:27Z

+		resp.WriteErrorString(http.StatusInternalServerError, "Failed to retrieve cluster metadata from storage")
+		return
+	}
 	if reader == nil {


Same root issue: Since every backend returns an err (not a nil reader) for a missing object, this turns "metadata not found" into 500 instead of 404, and the reader == nil branch below becomes unreachable dead code.

[HistoryServer] Propagate error on read

0f46876

cursor Bot reviewed May 22, 2026

Comment thread historyserver/pkg/historyserver/router.go

dentiny added 2 commits May 21, 2026 21:42

list as well

05d350d

comment address

11bc08e

cursor Bot reviewed May 22, 2026

Comment thread historyserver/pkg/historyserver/router.go

comment

e8ebb87

JiangJiaWei1103 reviewed May 22, 2026

dentiny changed the title ~~[HistoryServer] Propagate error on read~~ [HistoryServer] Propagate error on storage APIs May 22, 2026

dentiny requested a review from JiangJiaWei1103 May 24, 2026 13:20

Merge upstream/master into master

8dcb43c

dentiny force-pushed the hjiang/propagate-error-on-read branch from 483e466 to 8dcb43c May 24, 2026 13:35

cursor Bot reviewed May 24, 2026

Comment thread historyserver/pkg/historyserver/reader.go

list error prop

08aaa0b

cursor Bot reviewed May 24, 2026

Comment thread historyserver/pkg/historyserver/router.go

server internal err

bd2ba08

rueian reviewed May 24, 2026

server error

b0576b0

dentiny requested a review from rueian May 24, 2026 23:04

cursor Bot reviewed May 24, 2026

Comment thread historyserver/pkg/historyserver/router.go Outdated

consistent error

f3fb687

cursor Bot reviewed May 24, 2026

Comment thread historyserver/pkg/historyserver/reader.go

JiangJiaWei1103 reviewed May 30, 2026

[HistoryServer] Propagate error on storage APIs #4858

dentiny wants to merge 9 commits into ray-project:master from dentiny:hjiang/propagate-error-on-read

dentiny commented May 22, 2026 (edited by Future-Outlier)


3 participants
