Add YOLO26 pose inference support #6

Open
Tar-ive wants to merge 1 commit into thewebAI:main from Tar-ive:codex/pose26-mlx

@Tar-ive Tar-ive commented May 22, 2026

Summary

This PR adds the minimum YOLO26 pose inference path needed to run yolo26n-pose in the pure MLX runtime. This is part of the support we need for our YOLO pose model in the YOLO26 MLX Build Challenge: we need to run a camera-facing pose model on Apple Silicon through MLX, then use the decoded keypoints in an alarm/gesture application.

The change is inference-only. It does not add pose training or pose loss support.

What changed

  • Added a packaged yolo26-pose.yaml config for the nano pose model path.
  • Added Pose26, matching the YOLO26 pose head layout with separate keypoint and sigma branches.
  • Registered Pose26 in model parsing/imports so the YAML can build the model graph.
  • Added pose filename/task routing so *-pose models select the pose YAML.
  • Added exact PyTorch-to-MLX name mapping for the pose-specific head weights:
    • cv4_kpts
    • cv4_sigma
    • one2one_cv4_kpts
    • one2one_cv4_sigma
  • Added keypoint decode and end-to-end pose postprocess so pose outputs return Boxes and Keypoints.
  • Added focused pose tests for model construction, forward output shape, weight-name mapping, and postprocess keypoint scaling.
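The weight-name mapping above can be sanity-checked with a small helper like the one below. This is a minimal sketch, not the check used in this PR: the helper names and the example key layout (for instance the model.23 prefix) are hypothetical, and only the four branch names come from the list above. It matches whole dotted path components so that cv4_kpts is not mistaken for one2one_cv4_kpts.

```python
# The four pose-specific head branches named in this PR.
POSE_HEAD_BRANCHES = (
    "cv4_kpts",
    "cv4_sigma",
    "one2one_cv4_kpts",
    "one2one_cv4_sigma",
)


def branches_present(keys):
    """Return {branch: bool}, True if any weight key has the branch as a dotted component.

    Matching on whole components avoids counting "one2one_cv4_kpts" as "cv4_kpts".
    """
    found = {name: False for name in POSE_HEAD_BRANCHES}
    for key in keys:
        parts = key.split(".")
        for name in POSE_HEAD_BRANCHES:
            if name in parts:
                found[name] = True
    return found


def missing_branches(keys):
    """List the pose branches that have no weights among the given key names."""
    return [name for name, ok in branches_present(keys).items() if not ok]


# Hypothetical key names, for illustration only.
example_keys = [
    "model.23.cv4_kpts.0.weight",
    "model.23.one2one_cv4_sigma.0.weight",
]
print(missing_branches(example_keys))
```

For a converted .npz file, the key names could be taken from the files attribute of the object returned by numpy.load and passed to missing_branches; an empty list would mean every pose branch has at least one weight.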

Implementation steps

  1. Compared the Ultralytics YOLO26 pose head against the existing MLX YOLO26 detection head.
  2. Mapped the pose head structure into MLX using the existing Detect/Pose conventions already in this repo.
  3. Added the YOLO26 pose YAML and registered Pose26 in the task parser.
  4. Extended weight-name conversion so converted yolo26n-pose.pt weights load into the MLX graph.
  5. Implemented keypoint decode from anchor-relative outputs to image coordinates.
  6. Implemented predictor-level pose postprocessing to unletterbox keypoints and return Keypoints alongside boxes.
  7. Verified converted .npz loading and checked that pose-specific keys were not missing.
  8. Compared MLX pose outputs against Ultralytics PyTorch outputs on the same frames.
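Steps 5 and 6 can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the code in this PR: it assumes raw keypoints are (dx, dy, logit) offsets that are added to the anchor position and multiplied by the stride, and that letterboxing uses centered, unrounded padding. The actual Pose26 decode formula and the repo's letterbox rounding may differ, and both function names are hypothetical.

```python
import numpy as np


def decode_keypoints(raw, anchors, strides):
    """Decode anchor-relative keypoints to model-input pixel coordinates.

    raw:     (N, K, 3) array of (dx, dy, confidence logit) per anchor and keypoint
    anchors: (N, 2) anchor positions in grid units (assumed cell centres)
    strides: (N,) stride of the feature level each anchor belongs to
    Assumed formula: xy = (offset + anchor) * stride.
    """
    xy = (raw[..., :2] + anchors[:, None, :]) * strides[:, None, None]
    conf = 1.0 / (1.0 + np.exp(-raw[..., 2:3]))  # sigmoid on the logit
    return np.concatenate([xy, conf], axis=-1)


def unletterbox_keypoints(kpts, input_hw, orig_hw):
    """Map keypoints from letterboxed input space back to the original image.

    Assumes aspect-preserving resize with symmetric padding (no rounding).
    """
    in_h, in_w = input_hw
    h0, w0 = orig_hw
    gain = min(in_h / h0, in_w / w0)   # resize factor used when letterboxing
    pad_x = (in_w - w0 * gain) / 2.0   # horizontal padding on each side
    pad_y = (in_h - h0 * gain) / 2.0   # vertical padding on each side
    out = kpts.copy()
    out[..., 0] = np.clip((out[..., 0] - pad_x) / gain, 0, w0)
    out[..., 1] = np.clip((out[..., 1] - pad_y) / gain, 0, h0)
    return out


# One anchor, one keypoint, zero offsets, stride 8.
raw = np.zeros((1, 1, 3))
anchors = np.array([[10.5, 20.5]])
strides = np.array([8.0])
decoded = decode_keypoints(raw, anchors, strides)
restored = unletterbox_keypoints(decoded, input_hw=(640, 640), orig_hw=(480, 640))
```

In this example the decoded point is (84, 164) in the 640x640 input; removing the 80 px of vertical padding places it at (84, 84) in the 640x480 original, with the confidence left unchanged.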

Local verification

Commands run locally on this branch:

ruff check src tests
ruff format --check src tests
python -m pytest tests/test_pose26.py -q
python -m pytest tests/test_convert_helpers.py tests/test_cli.py tests/test_tracking.py tests/test_botsort.py -q

Results:

ruff: passed
pose tests: 4 passed
conversion/cli/tracking subset: 52 passed

I also ran the full test suite. It has one pre-existing failure, a segmentation metric assertion that is unrelated to this PR:

127 passed, 1 failed
FAILED tests/test_segmentation.py::TestSegmentationMetrics::test_perfect_predictions
expected mAP50_mask == 1.0, got 0.995

MLX vs PyTorch pose comparison

I benchmarked this on the laptop I am developing on (MacBook, M3 Pro, 18 GB RAM) using the same pattern as the existing inference benchmark scripts: warmup runs, timed end-to-end model.predict, calculate_stats, speedup calculation, device info, and JSON-style result output.

Dataset used for this local comparison:

/Users/tarive/6_7/data/frames
150 image frames
imgsz=640
conf=0.25
warmup=3

Weights:

MLX: /Users/tarive/6_7/models/yolo26n-pose.npz
PyTorch: /Users/tarive/6_7/models/yolo26n-pose.pt
PyTorch device: mps

Timing summary across all 150 frames:

| Model | MLX mean | PyTorch MPS mean | MLX FPS | PyTorch MPS FPS | MLX vs MPS |
| --- | --- | --- | --- | --- | --- |
| yolo26n-pose | 13.97 ms | 14.72 ms | 71.6 | 68.0 | 1.05x |
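For reference, the timing pattern behind these numbers (warmup runs, timed end-to-end calls, then mean, FPS, and speedup) can be sketched as below. The function names here are hypothetical stand-ins and not the repo's calculate_stats; the predict argument is assumed to be any callable that blocks until its results are fully computed, which matters for lazily evaluated or asynchronous backends.

```python
import statistics
import time


def time_predict(predict, frames, warmup=3):
    """Return per-frame latency in milliseconds after untimed warmup calls."""
    for frame in frames[:warmup]:
        predict(frame)  # warmup: not timed
    times_ms = []
    for frame in frames:
        start = time.perf_counter()
        predict(frame)
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return times_ms


def summarize(times_ms):
    """Mean latency and the FPS implied by that mean."""
    mean_ms = statistics.fmean(times_ms)
    return {"mean_ms": mean_ms, "fps": 1000.0 / mean_ms}


def speedup(baseline_mean_ms, candidate_mean_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_mean_ms / candidate_mean_ms


# Reproduce the ratio from the reported means (PyTorch MPS as baseline).
print(round(speedup(14.72, 13.97), 2))
```

With the reported means, the ratio works out to 1.05, matching the MLX vs MPS column.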

Pose parity summary across all 150 frames:

| Metric | Result |
| --- | --- |
| Detection count agreement | 149 / 150 |
| Matched non-empty pose frames | 150 |
| Mean keypoint pixel difference | 2.1618 px |
| Median keypoint pixel difference | 2.0768 px |
| Max frame keypoint pixel difference | 17.9041 px |
| Mean box absolute pixel difference | 1.6756 px |

Key 6_7 gesture joints (part of the hackathon project), mean pixel difference vs PyTorch MPS:

| Keypoint | Mean pixel difference |
| --- | --- |
| left shoulder | 1.2539 px |
| right shoulder | 1.4638 px |
| left elbow | 2.2564 px |
| right elbow | 3.1389 px |
| left wrist | 1.9493 px |
| right wrist | 2.2545 px |
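A per-joint comparison of this kind could be computed along the following lines. This is a sketch under assumptions, not the script used for the numbers above: it assumes the pixel difference is the Euclidean distance between corresponding keypoints, that the model uses the standard 17-keypoint COCO ordering, and that each frame has already been reduced to one matched person per backend. The function names are hypothetical.

```python
import numpy as np

# Indices in the standard COCO 17-keypoint order (assumed layout).
GESTURE_JOINTS = {
    "left shoulder": 5,
    "right shoulder": 6,
    "left elbow": 7,
    "right elbow": 8,
    "left wrist": 9,
    "right wrist": 10,
}


def keypoint_pixel_diff(kpts_a, kpts_b):
    """Euclidean distance per keypoint; inputs are (K, 2) or (K, 3) arrays."""
    a = np.asarray(kpts_a, dtype=float)[..., :2]
    b = np.asarray(kpts_b, dtype=float)[..., :2]
    return np.linalg.norm(a - b, axis=-1)


def per_joint_mean(frames_a, frames_b, joints=GESTURE_JOINTS):
    """Mean distance per named joint across matched frames."""
    dists = np.stack(
        [keypoint_pixel_diff(a, b) for a, b in zip(frames_a, frames_b)]
    )  # shape (frames, K)
    means = dists.mean(axis=0)
    return {name: float(means[idx]) for name, idx in joints.items()}


# Toy example: the two outputs differ only at the left wrist, by (3, 4) pixels.
mlx_kpts = np.zeros((17, 3))
torch_kpts = mlx_kpts.copy()
torch_kpts[9, :2] = [3.0, 4.0]
result = per_joint_mean([mlx_kpts], [torch_kpts])
```

In the toy example the left wrist entry is 5.0 px and every other joint is 0.0 px.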

These results indicate that the MLX pose decode places keypoints on the same body parts as the PyTorch model: across the full local 150-frame sample, the mean keypoint difference is about 2 px at imgsz=640, and the gesture joints differ by roughly 1 to 3 px on average.
