[codex] Fix reusable eval artifact validation by Oseltamivir · Pull Request #1841 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-06-19T01:23:46Z

Summary

validate raw eval artifacts from their meta_env.json logical matrix identity
decouple reuse validation from physical GitHub runner names
add a regression for logical mi300x jobs running on mi300x-amds_04
keep failure tests focused on behavior instead of exact diagnostic strings
skip the test workflow's PR changelog validation when perf-changelog.yaml is unchanged

Root cause

Merge run 27798519988 reused a successful PR sweep, but job 82263580083 rejected its eval artifacts. The validator built expected artifact prefixes with the logical config runner (mi300x), while artifact names contain the physical runner (mi300x-amds_01 or mi300x-amds_04). Valid artifacts were therefore reported as both missing and unexpected.

The initial PR check also failed after all tests passed because test-changelog-gate.yml unconditionally validated the associated PR's changelog diff. This PR does not modify perf-changelog.yaml, so the validator correctly reported that the file had no changes. The workflow now skips only that final validation when the PR has no changelog diff.

Impact

Merge-time ingest can now reuse successful eval artifacts from pooled self-hosted runners without depending on runner-name formatting. Missing, unexpected, and duplicate eval identities are still rejected using the artifact metadata. Changelog-changing PRs retain full append-only validation.

Validation

uv run --with pytest --with pydantic --with pyyaml python -m pytest utils/changelog_gate_tests/test_validate_perf_changelog.py utils/test_find_reusable_sweep_run.py utils/changelog_gate_tests/test_prepare_perf_changelog_merge.py utils/changelog_gate_tests/test_recover_failed_ingest.py utils/test_validate_reusable_sweep_artifacts.py utils/changelog_gate_tests/test_run_sweep_gating.py -v (97 passed)
uv run --with pytest --with pydantic --with pyyaml python -m pytest utils/matrix_logic/ -v (156 passed)
actionlint .github/workflows/test-changelog-gate.yml
replayed validation against downloaded artifacts from source run 27794193288 (passed)

Note

Medium Risk
Touches merge/reuse artifact gating and CI changelog checks for sweep workflows; incorrect logic could allow bad reuse or skip needed validation, but scope is limited to eval artifact matching and optional PR changelog diff.

Overview
Fixes merge-time reuse rejecting valid eval artifacts when directory names use physical runner labels (e.g. mi300x-amds_04) while the matrix expects logical runners (mi300x).

Reusable sweep eval validation no longer matches expected eval_* directory prefixes. It reads each raw eval dir’s meta_env.json, builds the same logical identity as aggregates (expected_eval_keys / eval_key), and still enforces exact coverage for raw dirs and eval_results_all (missing, unexpected, duplicates). eval_key also accepts infmax_model_prefix when model_prefix is absent.

Test changelog workflow skips the associated-PR validate_perf_changelog.py step when the PR diff does not touch perf-changelog.yaml, so infra-only PRs are not failed for “no changelog changes.”

Tests were updated with meta_env.json-based fixtures (including logical mi300x on physical mi300x-amds_04), looser assertions on error text, and a workflow regression for the changelog skip.

^{Reviewed by Cursor Bugbot for commit 44421d2. Bugbot is set up for automated code reviews on this repo. Configure here.}

This reverts commit a3f588c.

claude · 2026-06-19T01:39:15Z

Claude finished @Oseltamivir's task in 1m 2s —— View job

Reviewing PR #1841

Read repo guidelines (CLAUDE.md)
Gather context and read changed files
Review code for bugs, logic errors, and issues
Post review summary

LGTM — no blocking issues found.

The approach is sound: replacing brittle prefix-matching against physical runner names with logical identity extraction from meta_env.json is the right fix. The eval_key() fallback to infmax_model_prefix correctly handles the metadata schema difference, raw_eval_key_rows() has proper error handling for missing/invalid metadata, and the regression test for mi300x on mi300x-amds_04 directly validates the root cause. The workflow skip for unchanged perf-changelog.yaml is a clean guard.

fix: validate reused eval artifacts by metadata

ebb4b2a

github-project-automation Bot added this to InferenceMAX Board Jun 19, 2026

Oseltamivir added 4 commits June 19, 2026 09:29

fix: skip changelog validation when unchanged

9047d33

test: harden changelog gate checks

a3f588c

Revert "test: harden changelog gate checks"

a277907

This reverts commit a3f588c.

Merge branch 'main' into codex/fix-reuse-eval-artifact-validation

44421d2

Oseltamivir marked this pull request as ready for review June 19, 2026 01:38

Oseltamivir requested a review from a team June 19, 2026 01:38

Oseltamivir merged commit be9f2a7 into main Jun 19, 2026
6 checks passed

Oseltamivir deleted the codex/fix-reuse-eval-artifact-validation branch June 19, 2026 01:39

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 19, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[codex] Fix reusable eval artifact validation#1841

[codex] Fix reusable eval artifact validation#1841
Oseltamivir merged 5 commits into
mainfrom
codex/fix-reuse-eval-artifact-validation

Oseltamivir commented Jun 19, 2026 •

edited by cursor Bot

Loading

Uh oh!

claude Bot commented Jun 19, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

Oseltamivir commented Jun 19, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Impact

Validation

Uh oh!

claude Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewing PR #1841

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Oseltamivir commented Jun 19, 2026 •

edited by cursor Bot

Loading

claude Bot commented Jun 19, 2026 •

edited

Loading