[S-TIR][MetaSchedule] Make evolutionary search resilient to trace replay failures #19438
Code Review
This pull request enhances the robustness of the evolutionary search strategy by gracefully handling trace replay failures. Key changes include wrapping schedule application in try-catch blocks within ThreadedTraceApply, introducing an atomic counter to track these failures, and updating PickBestFromDatabase and SampleInitPopulation to log warnings and filter out invalid schedules instead of terminating. The review feedback suggests replacing DLOG with TVM_PY_LOG to ensure that detailed failure information is visible in production builds as well as debug builds.
tlopex
left a comment
Overall LGTM. Could you take a look at the Gemini review and address its suggestions?
Hi @tlopex
Thanks to @tlopex 😄
Hi Committers,
This PR tries to fix issue #17934. Any suggestions would be appreciated.
Root Cause
During EvolutionarySearch candidate generation, trace->ApplyToSchedule(...) could throw ScheduleError.
Solutions
- Catch exceptions in ThreadedTraceApply::Apply and return nullopt instead of crashing.
- Add a failure counter (trace_fail_counter_) and accessor (TraceFailCount()).
- Align PickBestFromDatabase with unmeasured behavior: skip invalid candidates and continue.
- Emit WARNING logs when trace replay failures occur (to avoid silent failures).
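
For readers unfamiliar with the pattern, here is a minimal standalone sketch of the catch, count, and skip approach described above. It is not the actual TVM code: `SafeTraceApply`, `ReplayAll`, and the `Schedule`, `Trace`, and `ScheduleError` types below are hypothetical stand-ins, and the warning goes to `std::cerr` rather than through TVM's logging macros. It also assumes that catching `std::exception` is broad enough; the real patch may catch a different set of exception types.

```cpp
#include <atomic>
#include <functional>
#include <iostream>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the scheduling error thrown during trace replay.
struct ScheduleError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Hypothetical stand-ins: a "schedule" is just a string here, and a "trace"
// is a callable that either produces a schedule or throws.
using Schedule = std::string;
using Trace = std::function<Schedule()>;

class SafeTraceApply {
 public:
  // Replays one trace. A failure is counted and reported as nullopt
  // instead of propagating the exception to the caller.
  std::optional<Schedule> Apply(const Trace& trace) {
    try {
      return trace();
    } catch (const std::exception&) {
      // Atomic so that concurrent worker threads can share one counter.
      fail_counter_.fetch_add(1, std::memory_order_relaxed);
      return std::nullopt;
    }
  }

  int TraceFailCount() const {
    return fail_counter_.load(std::memory_order_relaxed);
  }

 private:
  std::atomic<int> fail_counter_{0};
};

// Replays every trace, keeps only the schedules that replayed successfully,
// and emits a single warning if anything was skipped.
std::vector<Schedule> ReplayAll(const std::vector<Trace>& traces,
                                SafeTraceApply* applier) {
  std::vector<Schedule> valid;
  for (const Trace& trace : traces) {
    std::optional<Schedule> sch = applier->Apply(trace);
    if (sch.has_value()) {
      valid.push_back(std::move(*sch));
    }
  }
  const int failures = applier->TraceFailCount();
  if (failures > 0) {
    std::cerr << "WARNING: " << failures
              << " trace(s) failed to replay and were skipped\n";
  }
  return valid;
}
```

Calling `ReplayAll` on a mix of valid and throwing traces returns only the valid schedules, while `TraceFailCount()` reports how many were dropped, so the caller can decide whether the remaining population is large enough to continue the search.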