Skip to content

ox doctor: 11 registered checks never invoked by default run (DoctorCheckRegistry is metadata-only) #498

@galexy

Description

@galexy

Summary

cmd/ox/doctor_types.go defines a DoctorCheckRegistry map that every doctor check registers into via RegisterDoctorCheck() at init time. The registry looks like the source of truth for doctor execution, and cmd/ox/doctor_registry_example.go:77-79 documents a "Phase 5" plan to have runDoctorChecks iterate DefaultRegistry.RunAll().

That migration was never completed. runDoctorChecks() in cmd/ox/doctor.go hand-wires execution via literal categoryChecks := []checkResult{checkFoo(...), checkBar(...)} slices, calling each check function by name. The DoctorCheckRegistry map is only consulted after execution by enrichCheckResult() to attach slug/fixLevel metadata to the already-computed checkResult.

Consequence

A developer adding a new check:

  1. Writes checkFoo() in doctor_foo.go
  2. Calls RegisterDoctorCheck(DoctorCheck{Slug: CheckSlugFoo, ...}) in the file's init()
  3. Forgets to also add checkFoo(...) to the relevant categoryChecks slice in doctor.go

Result: the check is fully registered, has unit test coverage (tests call it directly), shows up under ox doctor --fix-slug=foo (which uses targeted dispatch — a separate code path that DOES consult the registry), but is silently skipped by bare ox doctor / ox doctor --fix. No error. No warning. The check's tests pass. It simply never runs in production.

How it was discovered

The journal-timezone revert added checkTimezoneScrub to auto-scrub dead timezone: keys from .sageox/config.json and team config.toml during the team-timezone feature deletion. The unit test TestCheckTimezoneScrub_ScrubsProjectAndTeamConfigs passed because it called checkTimezoneScrub() directly. The E2E test TZ-04 failed because it invoked the full ox doctor --fix --yes pipeline as a subprocess and the scrub never happened.

The fix for the tz revert case was a one-line wire-up to runDoctorChecks (separate commit). But a systemic scan of DoctorCheckRegistry vs. the runDoctorChecks call sites found 10 more checks with the same pattern.

Confirmed dormant checks (registered but never invoked by default)

All 10 are reachable via ox doctor --fix-slug=<slug> (targeted path), but ox doctor / ox doctor --fix skips them:

Slug constant Function Defined in
CheckSlugLedgerPathMismatch checkLedgerPathMismatch doctor_git_repos.go:1235
CheckSlugSessionStartHookBug checkSessionStartHookBug doctor_hooks.go:31
CheckSlugEndpointNormalization checkEndpointNormalization doctor_sageox.go:1047
CheckSlugDuplicateRepoMarkers checkDuplicateRepoMarkers doctor_sageox.go:1268
CheckSlugScoreThresholdRange checkScoreThresholdRange doctor_project.go:214
CheckSlugGCBlockedUntracked checkGCBlockedByUntracked doctor_team.go:305
CheckSlugTeamSparseCheckout checkTeamSparseCheckout doctor_team.go:402
CheckSlugLedgerSparseCheckout checkLedgerSparseCheckout doctor_ledger_infra.go:18
CheckSlugCodeDBConsistency checkCodeDBConsistency doctor_ledger_infra.go:82
CheckSlugDaemonDedup checkDaemonDeduplication doctor_daemon_dedup.go:15

The worst offenders are checkEndpointNormalization and checkDuplicateRepoMarkers — those catch real outage-causing misconfigurations per .claude/rules/endpoints.md and have been effectively dormant.

Fix options

(a) Finish Phase 5: make runDoctorChecks iterate the registry

The real fix. Convert runDoctorChecks() to iterate DoctorCheckRegistry (or its Category-indexed view) and call each entry, rather than hand-wiring category slices. This touches the execution engine and risks reordering behavior or accidentally activating a long-dormant check in production that no one noticed was broken. Needs a careful rollout — possibly a feature flag for the transition or a per-check audit as each is turned on, since some of the 10 dormant checks may have bit-rotted while no one was running them.

(b) Cheap guard: compile-time or test-time assertion

Add a test that asserts every RegisterDoctorCheck() slug has a corresponding call site visible to the compiler. Can be implemented as a go test that walks the AST of doctor.go looking for checkFoo(...) calls matching the registered function names, or as a build-time check. Doesn't fix the 11 dormant checks — just prevents adding a 12th.

Suggested order

Do (b) first to prevent regression, then tackle (a) incrementally with a per-check audit pass. Each dormant check should be individually verified before activation: does it still work? Does it produce false positives? Does its auto-fix behavior match current expectations?

Context

Discovered during the team-timezone revert PR (internal beads: ox-b27, found alongside the ox-ykz one-line fix for checkTimezoneScrub). The revert PR deliberately scoped this out per CLAUDE.md's "revert PR is not cleanup time" rule; fixing it in the same PR would blow scope and make bisect confusing. Filing here so the work can be picked up properly.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions