[S-TIR][MetaSchedule] Make evolutionary search resilient to trace replay failures#19438

Merged
tlopex merged 3 commits into
apache:mainfrom
cchung100m:issue-17934
Apr 25, 2026
@cchung100m cchung100m commented Apr 24, 2026

Contributor

Hi Committers,

This PR fixes issue #17934. Any suggestions would be appreciated.

Root Cause

  • During EvolutionarySearch candidate generation, trace->ApplyToSchedule(...) could throw ScheduleError.
  • The exception was propagated through parallel execution and aborted tuning.
  • Error handling was inconsistent between measured and unmeasured paths, and failure visibility was limited.
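
The failure mode described above can be illustrated with a minimal, standard-library-only sketch. It is not TVM code: `ReplayTrace` and `ReplayAllUnprotected` are hypothetical stand-ins for the trace replay and the parallel candidate loop, and `std::async` is assumed here in place of TVM's own parallel helper. The point is only that an exception raised in one worker resurfaces on the calling thread and discards the whole batch.

```cpp
#include <cassert>
#include <future>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a trace replay that may fail; not TVM's API.
int ReplayTrace(int trace_id) {
  if (trace_id % 3 == 0) {
    throw std::runtime_error("ScheduleError: replay failed");
  }
  return trace_id * 10;
}

// Unprotected version: one failing worker aborts the whole batch, because
// future::get() rethrows the worker's exception on the calling thread.
std::vector<int> ReplayAllUnprotected(const std::vector<int>& traces) {
  std::vector<std::future<int>> futures;
  for (int id : traces) {
    futures.push_back(std::async(std::launch::async, ReplayTrace, id));
  }
  std::vector<int> results;
  for (auto& f : futures) {
    results.push_back(f.get());  // may rethrow the worker's exception
  }
  return results;
}
```

With this shape, a single bad trace among many valid ones is enough to lose every result collected so far.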

Solutions

  • Catch trace replay failures in ThreadedTraceApply::Apply and return nullopt instead of crashing.
  • Add trace replay failure counting (trace_fail_counter_) and accessor (TraceFailCount()).
  • Align measured path PickBestFromDatabase with unmeasured behavior: skip invalid candidates and continue.
  • Add visible WARNING logs when trace replay failures occur (to avoid silent failures).
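
A rough sketch of the pattern these bullets describe, again using only the C++ standard library rather than TVM's classes. `SafeReplayer`, `ReplayTrace`, and `ReplayAllResilient` are hypothetical names chosen for illustration; the actual change lives in `ThreadedTraceApply::Apply` and uses `trace_fail_counter_` and `TraceFailCount()`, whose exact signatures are not shown in this conversation.

```cpp
#include <atomic>
#include <cassert>
#include <future>
#include <iostream>
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for trace->ApplyToSchedule(...); not TVM's API.
int ReplayTrace(int trace_id) {
  if (trace_id % 3 == 0) {
    throw std::runtime_error("ScheduleError: replay failed");
  }
  return trace_id * 10;
}

// Illustrative analogue of the failure counter; the class name is an assumption.
class SafeReplayer {
 public:
  // Returns std::nullopt instead of letting the exception escape the worker.
  std::optional<int> Apply(int trace_id) {
    try {
      return ReplayTrace(trace_id);
    } catch (const std::exception&) {
      fail_count_.fetch_add(1, std::memory_order_relaxed);
      return std::nullopt;
    }
  }

  int FailCount() const { return fail_count_.load(std::memory_order_relaxed); }

 private:
  std::atomic<int> fail_count_{0};  // shared safely across worker threads
};

// Replays all traces in parallel, skips invalid candidates, and warns if any failed.
std::vector<int> ReplayAllResilient(const std::vector<int>& traces,
                                    SafeReplayer* replayer) {
  std::vector<std::future<std::optional<int>>> futures;
  for (int id : traces) {
    futures.push_back(std::async(
        std::launch::async, [replayer, id] { return replayer->Apply(id); }));
  }
  std::vector<int> results;
  for (auto& f : futures) {
    if (std::optional<int> r = f.get()) {
      results.push_back(*r);  // keep only candidates that replayed cleanly
    }
  }
  if (replayer->FailCount() > 0) {
    std::cerr << "WARNING: " << replayer->FailCount()
              << " trace replay(s) failed and were skipped\n";
  }
  return results;
}
```

The counter is atomic because several workers may fail at the same time, and the warning is emitted once per batch so that failures stay visible without flooding the log.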

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request enhances the robustness of the evolutionary search strategy by gracefully handling trace replay failures. Key changes include wrapping schedule application in try-catch blocks within ThreadedTraceApply, introducing an atomic counter to track these failures, and updating PickBestFromDatabase and SampleInitPopulation to log warnings and filter out invalid schedules instead of terminating. The review feedback suggests replacing DLOG with TVM_PY_LOG to ensure that detailed failure information is visible in production builds as well as debug builds.

Comment thread src/s_tir/meta_schedule/utils.h Outdated
Comment thread src/s_tir/meta_schedule/utils.h Outdated
@cchung100m cchung100m marked this pull request as ready for review April 24, 2026 16:00
@cchung100m

Contributor Author

Hi @tlopex @mshr-h

This PR fixes issue #17934. Any suggestions would be appreciated.

@tlopex tlopex left a comment

Member

Overall LGTM. Could you take a look at Gemini's review comments and apply the suggested fixes?

@cchung100m

Contributor Author

Hi @tlopex
Thanks for the prompt reply. I updated the part you mentioned. 😄

@tlopex tlopex merged commit 0a0dd31 into apache:main Apr 25, 2026
9 checks passed
@cchung100m cchung100m deleted the issue-17934 branch April 26, 2026 00:56
@cchung100m

Contributor Author

Thanks to @tlopex 😄

2 participants