Skip to content

[codex] Fix reusable eval artifact validation#1841

Merged
Oseltamivir merged 5 commits into
mainfrom
codex/fix-reuse-eval-artifact-validation
Jun 19, 2026
Merged

[codex] Fix reusable eval artifact validation#1841
Oseltamivir merged 5 commits into
mainfrom
codex/fix-reuse-eval-artifact-validation

Conversation

@Oseltamivir

@Oseltamivir Oseltamivir commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • validate raw eval artifacts from their meta_env.json logical matrix identity
  • decouple reuse validation from physical GitHub runner names
  • add a regression for logical mi300x jobs running on mi300x-amds_04
  • keep failure tests focused on behavior instead of exact diagnostic strings
  • skip the test workflow's PR changelog validation when perf-changelog.yaml is unchanged

Root cause

Merge run 27798519988 reused a successful PR sweep, but job 82263580083 rejected its eval artifacts. The validator built expected artifact prefixes with the logical config runner (mi300x), while artifact names contain the physical runner (mi300x-amds_01 or mi300x-amds_04). Valid artifacts were therefore reported as both missing and unexpected.

The initial PR check also failed after all tests passed because test-changelog-gate.yml unconditionally validated the associated PR's changelog diff. This PR does not modify perf-changelog.yaml, so the validator correctly reported that the file had no changes. The workflow now skips only that final validation when the PR has no changelog diff.

Impact

Merge-time ingest can now reuse successful eval artifacts from pooled self-hosted runners without depending on runner-name formatting. Missing, unexpected, and duplicate eval identities are still rejected using the artifact metadata. Changelog-changing PRs retain full append-only validation.

Validation

  • uv run --with pytest --with pydantic --with pyyaml python -m pytest utils/changelog_gate_tests/test_validate_perf_changelog.py utils/test_find_reusable_sweep_run.py utils/changelog_gate_tests/test_prepare_perf_changelog_merge.py utils/changelog_gate_tests/test_recover_failed_ingest.py utils/test_validate_reusable_sweep_artifacts.py utils/changelog_gate_tests/test_run_sweep_gating.py -v (97 passed)
  • uv run --with pytest --with pydantic --with pyyaml python -m pytest utils/matrix_logic/ -v (156 passed)
  • actionlint .github/workflows/test-changelog-gate.yml
  • replayed validation against downloaded artifacts from source run 27794193288 (passed)

Note

Medium Risk
Touches merge/reuse artifact gating and CI changelog checks for sweep workflows; incorrect logic could allow bad reuse or skip needed validation, but scope is limited to eval artifact matching and optional PR changelog diff.

Overview
Fixes merge-time reuse rejecting valid eval artifacts when directory names use physical runner labels (e.g. mi300x-amds_04) while the matrix expects logical runners (mi300x).

Reusable sweep eval validation no longer matches expected eval_* directory prefixes. It reads each raw eval dir’s meta_env.json, builds the same logical identity as aggregates (expected_eval_keys / eval_key), and still enforces exact coverage for raw dirs and eval_results_all (missing, unexpected, duplicates). eval_key also accepts infmax_model_prefix when model_prefix is absent.

Test changelog workflow skips the associated-PR validate_perf_changelog.py step when the PR diff does not touch perf-changelog.yaml, so infra-only PRs are not failed for “no changelog changes.”

Tests were updated with meta_env.json-based fixtures (including logical mi300x on physical mi300x-amds_04), looser assertions on error text, and a workflow regression for the changelog skip.

Reviewed by Cursor Bugbot for commit 44421d2. Bugbot is set up for automated code reviews on this repo. Configure here.

@Oseltamivir Oseltamivir marked this pull request as ready for review June 19, 2026 01:38
@Oseltamivir Oseltamivir requested a review from a team June 19, 2026 01:38
@claude

claude Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Claude finished @Oseltamivir's task in 1m 2s —— View job


Reviewing PR #1841

  • Read repo guidelines (CLAUDE.md)
  • Gather context and read changed files
  • Review code for bugs, logic errors, and issues
  • Post review summary

LGTM — no blocking issues found.

The approach is sound: replacing brittle prefix-matching against physical runner names with logical identity extraction from meta_env.json is the right fix. The eval_key() fallback to infmax_model_prefix correctly handles the metadata schema difference, raw_eval_key_rows() has proper error handling for missing/invalid metadata, and the regression test for mi300x on mi300x-amds_04 directly validates the root cause. The workflow skip for unchanged perf-changelog.yaml is a clean guard.

@Oseltamivir Oseltamivir merged commit be9f2a7 into main Jun 19, 2026
6 checks passed
@Oseltamivir Oseltamivir deleted the codex/fix-reuse-eval-artifact-validation branch June 19, 2026 01:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

1 participant