Lightweight framework to benchmark and compare LLMs on structured tasks. Run eval suites, get a leaderboard, spot regressions.

| Suite | Task | Metric |
|---|---|---|
| `factual` | 20 factual Q&A questions | Exact/contains match |
| `reasoning` | 15 math & logic problems | Exact match |
| `extraction` | 15 JSON extraction tasks | Field-level accuracy |

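
To make the three metrics in the suite table concrete, here is a minimal sketch of how each could be computed. The function names, the whitespace and case normalization, and the handling of invalid JSON are assumptions for illustration; the scorers shipped in `evals/` may behave differently.

```python
import json


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences don't count."""
    return " ".join(str(text).strip().lower().split())


def exact_match(response, expected):
    # Assumed rule: the whole response must equal the expected answer after normalization.
    return normalize(response) == normalize(expected)


def contains_match(response, expected):
    # Assumed rule: the expected answer may appear anywhere in the response.
    return normalize(expected) in normalize(response)


def field_accuracy(response, expected):
    """Fraction of expected fields that the model's JSON reproduces exactly."""
    try:
        predicted = json.loads(response)
    except json.JSONDecodeError:
        return 0.0  # assumption: unparseable output scores zero
    if not isinstance(predicted, dict) or not expected:
        return 0.0
    hits = sum(1 for key, value in expected.items()
               if key in predicted and predicted[key] == value)
    return hits / len(expected)
```

Under these assumptions, a response that gets one of two expected fields right scores 0.5 on the extraction metric, while a response that is not valid JSON scores 0.
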

```bash
pip install -r requirements.txt
cp .env.example .env  # add your API keys
```

```bash
# benchmark two models across all evals
python cli.py run --models claude-haiku-4-5-20251001 claude-sonnet-4-6

# quick test with 5 items per eval
python cli.py run --models gpt-4o-mini --limit 5 --detail

# save results and view leaderboard later
python cli.py run --models claude-haiku-4-5-20251001 gpt-4o-mini --out results.json
python cli.py leaderboard results.json
```

## Leaderboard

| Model | Factual | Reasoning | Extraction | **Avg** | Latency |
| --- | :---: | :---: | :---: | :---: | :---: |
| `claude-sonnet-4-6` | 95% | 87% | 82% | **88%** | 1240ms |
| `claude-haiku-4-5-20251001` | 90% | 80% | 74% | **81%** | 680ms |
| `gpt-4o-mini` | 85% | 73% | 71% | **76%** | 590ms |

Add your own eval suite in 3 steps:

- Create `datasets/myeval.json`: a list of `{..., "expected": ...}` objects
- Create `evals/myeval.py` with `load()`, `prompt_template()`, `score()` functions
- Add `"myeval": "evals.myeval"` to `EVAL_MODULES` in `src/runner.py`

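
As a rough illustration of the second step, an `evals/myeval.py` module could look like the sketch below. Only the three function names and the `expected` key come from this README; the argument lists, the `question` field, and the 0.0/1.0 score convention are assumptions, so check an existing module in `evals/` for the signatures the runner actually calls.

```python
import json
from pathlib import Path

# Location from step one of the list above.
DATASET_PATH = Path("datasets/myeval.json")


def load(limit=None, path=DATASET_PATH):
    """Read the dataset and optionally keep only the first `limit` items."""
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    return items if limit is None else items[:limit]


def prompt_template(item):
    """Build the prompt for one item; the `question` field is hypothetical."""
    return f"Answer with a single word or number.\n\nQuestion: {item['question']}"


def score(response, item):
    """Return 1.0 when the response equals `expected` after normalization, else 0.0."""
    predicted = " ".join(response.strip().lower().split())
    expected = " ".join(str(item["expected"]).strip().lower().split())
    return 1.0 if predicted == expected else 0.0
```

Keeping `score()` a pure function of the response and the item makes it easy to unit-test without calling any model.
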
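
```python
# Programmatic use: run selected suites from Python instead of the CLI.
from src.runner import run_eval, run_suite
from src.reporter import leaderboard

# Evaluate two models on two suites, 5 items per suite.
results = run_suite(
    models=["claude-haiku-4-5-20251001", "gpt-4o-mini"],
    evals=["factual", "reasoning"],
    limit=5,
)

# Print the leaderboard for these results.
print(leaderboard(results))
```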