PatchGym

Turn any Git repository into a local SWE-bench-style coding-agent benchmark.

PatchGym mines real Git history, creates hidden-test coding-agent tasks, runs agents against those tasks, and reports whether their patches actually fixed the code.

PatchGym is alpha software: local-first, practical, research-quality, and designed to be read. It is not a hosted leaderboard, not a cloud service, and not a claim that one model or agent wins everywhere.

Install From Source

PatchGym is not published to PyPI. Install it from a source checkout:

git clone https://github.com/nripankadas07/patchgym
cd patchgym
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev]"

60-Second Demo

bash scripts/demo.sh

The demo creates a tiny Git repository, mines one historical bug fix, verifies the hidden-test/oracle split, runs a toy shell agent, grades the patch, and writes:

.patchgym/reports/report.json
.patchgym/reports/report.md
.patchgym/reports/index.html

Expected shape:

mined 1 task(s)
built 1/1 valid task(s)
agent 'bash .../examples/custom_agent/agent.sh' solved 1/1 task(s)
PatchGym demo complete

Flagship Evidence

PatchGym is the flagship because it is a concrete evaluation loop, not just a demo wrapper. It mines real Git history, proves task validity with hidden tests and an oracle patch, runs an agent command in a fresh workspace, and writes auditable reports.

Reviewer-oriented entry points:

Benchmark page: what PatchGym measures, what it does not claim, and how the local benchmark invariant works.
Comparisons: a table against public benchmarks, repo-to-prompt tools, coding agents, and plain test runners.
90-second walkthrough script: the short recorded-demo script for explaining the project quickly and honestly.

How It Works

For each selected historical commit, PatchGym splits the change into:

base commit,
hidden test patch,
oracle solution patch,
task prompt,
validation command.

A task is valid only when:

base + hidden tests fails
base + hidden tests + oracle patch passes
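
The two conditions above can be sketched in a few lines of Python. This is an illustration of the rule, not PatchGym's implementation: is_valid_task and the run_validation callback are hypothetical names, and the callback is assumed to return the validation command's exit code for a workspace prepared with or without the oracle patch.

```python
from typing import Callable


def is_valid_task(run_validation: Callable[[bool], int]) -> bool:
    """Return True only when the hidden tests separate broken from fixed code.

    run_validation(with_oracle) is a hypothetical callback: it prepares
    base + hidden tests (plus the oracle patch when with_oracle is True),
    runs the validation command, and returns its exit code.
    """
    fails_on_base = run_validation(False) != 0      # base + hidden tests must fail
    passes_with_oracle = run_validation(True) == 0  # adding the oracle must pass
    return fails_on_base and passes_with_oracle


# A task whose tests fail on base and pass with the oracle is valid.
print(is_valid_task(lambda with_oracle: 0 if with_oracle else 1))
# Tests that already pass on base cannot detect the bug, so the task is invalid.
print(is_valid_task(lambda with_oracle: 0))
```

Checking both directions matters: a task whose hidden tests pass on the base commit would reward an agent that does nothing, and a task whose oracle patch fails could never be solved.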

During an agent run, PatchGym exports the base commit into a temporary workspace, runs the agent command there, captures the agent diff, applies hidden tests, runs the validation command, and records the result.
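
The ordering of that run can be sketched as below. To stay self-contained, the sketch swaps Git for plain directory copies: the base commit is stood in for by a directory, the hidden test patch by a directory of test files, and the agent diff by a list of changed paths. The names run_task and snapshot and the result keys are hypothetical, not PatchGym's API.

```python
import shlex
import shutil
import subprocess
import tempfile
from pathlib import Path


def snapshot(root: Path) -> dict:
    """Map each file's relative path under root to its bytes."""
    return {
        p.relative_to(root).as_posix(): p.read_bytes()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def run_task(base_dir, hidden_tests_dir, agent_cmd, validation_cmd, timeout=60):
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "workspace"
        # 1. Export the base state into a fresh temporary workspace.
        shutil.copytree(base_dir, workspace)
        before = snapshot(workspace)
        # 2. Run the agent command; the hidden tests are not present yet.
        agent = subprocess.run(agent_cmd, shell=True, cwd=workspace, timeout=timeout)
        # 3. Capture what the agent changed (a stand-in for the agent diff).
        after = snapshot(workspace)
        changed = sorted(
            path
            for path in before.keys() | after.keys()
            if before.get(path) != after.get(path)
        )
        # 4. Only now add the hidden tests.
        shutil.copytree(hidden_tests_dir, workspace, dirs_exist_ok=True)
        # 5. Run the validation command without a shell.
        validation = subprocess.run(
            shlex.split(validation_cmd), cwd=workspace, timeout=timeout
        )
        # 6. Record the result.
        return {
            "agent_returncode": agent.returncode,
            "changed_files": changed,
            "passed": validation.returncode == 0,
        }
```

The point of the ordering is that step 4 happens after step 3: the agent finishes and its changes are captured before any hidden test exists in the workspace.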

CLI

patchgym init
patchgym mine .
patchgym build .
patchgym list
patchgym show <task-id>
patchgym verify <task-id>
patchgym context <task-id>
patchgym run <task-id> --agent "bash examples/custom_agent/agent.sh"
patchgym grade
patchgym report
patchgym replay <task-id>

Older path-oriented usage also works:

patchgym mine /path/to/repo --out .patchgym/tasks --validation "python -m pytest -q"
patchgym verify .patchgym/tasks --repo /path/to/repo
patchgym run .patchgym/tasks --repo /path/to/repo --agent noop

Example Task

A task directory looks like:

.patchgym/tasks/<task-id>/
  task.json
  hidden_tests.patch
  oracle_solution.patch
  context/
    CODEX_TASK.md
    AGENTS.md

The agent receives the prompt/context, not the hidden tests or oracle patch. Maintainers can inspect the oracle patch to audit task quality.

Example Report

The patchgym report command writes JSON, Markdown, and HTML reports. The Markdown report includes:

tasks generated,
pass/fail result,
changed files,
validation command,
duration,
local execution safety note.

Each run directory also writes:

manifest.json: task commits, hidden-test/oracle patch hashes, artifact hashes, changed files, return codes, and totals.
trace.jsonl: a TraceWeave-compatible event stream for agent execution, patch capture, hidden-test application, validation, and grading.

These files are designed to be ingested by SandboxLedger and analyzed by TraceWeave without giving the agent hidden tests or oracle patches.
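
As a rough illustration of why recorded hashes make a run auditable, the sketch below hashes every file in a run directory and later reports which files no longer match. The manifest's actual field names and hash algorithm are not stated here, so SHA-256 and the names sha256_file, hash_artifacts, and changed_since are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_artifacts(run_dir: Path) -> dict:
    """Map each file's path (relative to run_dir) to its SHA-256 hex digest."""
    return {
        p.relative_to(run_dir).as_posix(): sha256_file(p)
        for p in sorted(run_dir.rglob("*"))
        if p.is_file()
    }


def changed_since(run_dir: Path, recorded: dict) -> list:
    """List paths that were added, removed, or modified since `recorded`."""
    current = hash_artifacts(run_dir)
    return sorted(
        path
        for path in recorded.keys() | current.keys()
        if recorded.get(path) != current.get(path)
    )
```

A reviewer who holds the recorded digests can call changed_since on a run directory to see whether a patch or report was edited after the run.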

Safety Warning

PatchGym runs local Git commands, validation commands, tests, and explicit user-provided agent shell commands. Do not run it on untrusted repositories or with untrusted agents unless you use a disposable container, VM, or machine.

shell=True is used only for the explicit agent command. Validation commands are split into arguments and executed without a shell. Agent and validation commands both have timeouts.
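
The difference between the two execution modes can be shown with Python's subprocess module. This is a sketch under assumptions, not PatchGym's code: run_command is a hypothetical helper, and returning None on timeout is a convention chosen for the example.

```python
import shlex
import subprocess


def run_command(command: str, cwd: str, timeout: float, use_shell: bool):
    """Run a command and return its exit code, or None if it timed out.

    With use_shell=True the string is handed to the shell as-is (agent style).
    With use_shell=False it is split into an argument list first, so shell
    operators such as ';' or '>' are passed as plain arguments (validation style).
    """
    args = command if use_shell else shlex.split(command)
    try:
        completed = subprocess.run(args, shell=use_shell, cwd=cwd, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode
```

Splitting the validation command means a stray ";" or "&&" in it is never interpreted as a second command. It does not make the command safe to run on untrusted code; the warning above still applies.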

Limitations

PatchGym works most reliably when tests and fixes land in the same commit. It uses path heuristics to identify test files. It does not provide strong isolation by default. It does not claim public leaderboard readiness or a complete public-benchmark export format.
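
To make the path-heuristic limitation concrete, here is a minimal sketch of what such a heuristic might look like. The rules below (a test or tests directory, a test_ prefix, a _test suffix) are assumptions chosen for illustration and are not PatchGym's actual rules; looks_like_test and split_changed_files are hypothetical names.

```python
from pathlib import PurePosixPath


def looks_like_test(path: str) -> bool:
    """Guess from the path alone whether a changed file is a test file."""
    p = PurePosixPath(path)
    in_test_dir = any(part in {"test", "tests"} for part in p.parts[:-1])
    test_name = p.name.startswith("test_") or p.stem.endswith("_test")
    return in_test_dir or test_name


def split_changed_files(paths):
    """Split a commit's changed paths into (test files, source files)."""
    tests = [p for p in paths if looks_like_test(p)]
    source = [p for p in paths if not looks_like_test(p)]
    return tests, source


tests, source = split_changed_files(
    ["src/pkg/core.py", "tests/test_core.py", "pkg/util_test.py", "spec/core_spec.rb"]
)
print(tests)   # ['tests/test_core.py', 'pkg/util_test.py']
print(source)  # ['src/pkg/core.py', 'spec/core_spec.rb']
```

The last path shows the failure mode: a test file that follows a convention the heuristic does not know about lands on the source side, so it would end up in the oracle patch instead of the hidden tests.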

More limitations are documented in docs/limitations.md.

Comparison

SWE-bench-style public benchmarks are valuable for broad comparison. PatchGym asks a narrower local question: can an agent fix tasks mined from your repository, under your tests, using your project history?

PatchGym is smaller and less comprehensive than public benchmark infrastructure. That is intentional: it is a readable reference harness and a practical local evaluation loop.

Development

python -m pip install --upgrade pip
python -m pip install -e ".[dev]"
ruff check .
pytest -q
python -m build
bash scripts/demo.sh

CI runs the same core gates across Python 3.9 through 3.13: CLI smoke tests, Ruff, pytest, build, wheel install, and demo.

Roadmap

See ROADMAP.md.

License

MIT. See LICENSE.
