[HistoryServer] Propagate error on storage APIs#4858
Conversation
| // for metadata, fileName is generated by historyserver, which will obey the same rule as collector to make sure the historyserver can ready the right file. | ||
| // | ||
| GetContent(clusterId string, fileName string) io.Reader | ||
| GetContent(clusterId string, fileName string) (io.Reader, error) |
There was a problem hiding this comment.
Thanks @dentiny!
We will discuss the interface change in the next meeting.
There was a problem hiding this comment.
Sure - - PoAn is working on opendal-based storage interface, while reviewing his PR I surprising found quite a few APIs don't propagate error, not sure why 🤷
483e466 to
8dcb43c
Compare
| reader := s.reader.GetContent(rayClusterNameNamespace, logPath) | ||
| reader, err := s.reader.GetContent(rayClusterNameNamespace, logPath) | ||
| if err != nil { | ||
| return nil, utils.NewHTTPError(fmt.Errorf("failed to get log file %s: %w", logPath, err), http.StatusNotFound) |
There was a problem hiding this comment.
Should we always return 404 here? There could be timeout/network errors.
There was a problem hiding this comment.
- We don't use http error, or status code at storage layer (i.e., GCS/S3/etc), so it's hard to translate into http status code at reader
- Failures caused by timeout and transient network issues should retried internally at storage layer, so error propagated at reader should be non-retriable ones
- I updated the error code from not found to server internal, which is more general
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit f3fb687. Configure here.
JiangJiaWei1103
left a comment
There was a problem hiding this comment.
Thanks for the effort!
The one thing blocking this for me: this PR treats every storage error as 500, but the storage layer does not actually distinguish "resource not found" from "storage failure". For example, "not found" and "S3 is down" arrive as the same undifferentiated error.
One possible fix: actually distinguish the two at the storage layer, then let the router decide:
// storage package
var ErrNotFound = errors.New("object not found")
// each backend wraps real not-found: fmt.Errorf("...: %w", storage.ErrNotFound)
// router:
if errors.Is(err, storage.ErrNotFound) {
// 404 for metadata; empty-JSON / 404 fallback for additional endpoints
} else {
// 500
}Another thing worth discussing (non-blocking): listFilesRecursive previously returned whatever was found, skipping failed subdirectory; now any single failure fails the whole request. This is a behavior change, and I'd like to hear whether that's expected. Thanks!
| logrus.Errorf("Failed to get additional endpoint data for %s: %v", req.Request.URL.Path, err) | ||
| resp.WriteErrorString(http.StatusInternalServerError, "Failed to retrieve endpoint data from storage") | ||
| return | ||
| } |
There was a problem hiding this comment.
This seems to change behavior for clusters that never enabled serve. The collector never stores /api/serve/applications data , so GetContent returns a non-nil err.
Before this PR, that err fell through to the reader == nil branch and emptyResponseForEndpoint() returned {"applications":{}} (as the comment on L1011-1013). After this PR it returns 500, and the frontend shows an error state.
A missing resource is a client-facing 404, not a server-internal 500 (500 tends to trigger alerts / client retries). We may need to distinguish the two.
| resp.WriteErrorString(http.StatusInternalServerError, "Failed to retrieve cluster metadata from storage") | ||
| return | ||
| } | ||
| if reader == nil { |
There was a problem hiding this comment.
Same root issue: Since every backend returns an err (not a nil reader) for a missing object, this turns "metadata not found" into 500 instead of 404, and the reader == nil branch below becomes unreachable dead code.

Why are these changes needed?
Error should be propagated up unless we have reason not to do so.
Checks