biswajeetdev/llm-evals

llm-evals

Lightweight framework to benchmark and compare LLMs on structured tasks. Run eval suites, get a leaderboard, spot regressions.

Eval Suites

| Suite      | Task                     | Metric               |
| ---        | ---                      | ---                  |
| factual    | 20 factual Q&A questions | Exact/contains match |
| reasoning  | 15 math & logic problems | Exact match          |
| extraction | 15 JSON extraction tasks | Field-level accuracy |
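
To make the three metric names concrete, here is a minimal sketch of what each one typically means. The function names, the case-insensitive normalization, and the rule that unparseable JSON scores zero are assumptions for illustration, not the scoring code shipped in this repository.

```python
import json


def exact_match(output: str, expected: str) -> bool:
    """True when the two strings are identical after trimming and lowercasing."""
    return output.strip().lower() == expected.strip().lower()


def contains_match(output: str, expected: str) -> bool:
    """True when the expected answer appears anywhere in the model output."""
    return expected.strip().lower() in output.strip().lower()


def field_accuracy(output_json: str, expected: dict) -> float:
    """Fraction of expected fields that the model reproduced exactly.

    Assumption: output that is not a JSON object scores 0.0.
    """
    try:
        parsed = json.loads(output_json)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict) or not expected:
        return 0.0
    hits = sum(1 for key, value in expected.items()
               if key in parsed and parsed[key] == value)
    return hits / len(expected)
```

Under these assumptions a verbose answer such as "The capital is Paris." passes `contains_match` but fails `exact_match`, which may be why the factual suite lists both.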

Install

pip install -r requirements.txt
cp .env.example .env  # add your API keys

Run

# benchmark two models across all evals
python cli.py run --models claude-haiku-4-5-20251001 claude-sonnet-4-6

# quick test with 5 items per eval
python cli.py run --models gpt-4o-mini --limit 5 --detail

# save results and view leaderboard later
python cli.py run --models claude-haiku-4-5-20251001 gpt-4o-mini --out results.json
python cli.py leaderboard results.json

Example output

## Leaderboard

| Model                     | Factual | Reasoning | Extraction | **Avg** | Latency |
| ---                       | :---:   | :---:     | :---:      | :---:   | :---:   |
| `claude-sonnet-4-6`       | 95%     | 87%       | 82%        | **88%** | 1240ms  |
| `claude-haiku-4-5-20251001` | 90%   | 80%       | 74%        | **81%** | 680ms   |
| `gpt-4o-mini`             | 85%     | 73%       | 71%        | **76%** | 590ms   |
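
The Avg column in the example above is consistent with an unweighted mean of the three suite scores, rounded to the nearest percent. The sketch below reproduces that arithmetic. The helper name `average_score` is hypothetical, and whether the reporter computes the column exactly this way is an assumption.

```python
from typing import Mapping


def average_score(suite_scores: Mapping[str, float]) -> int:
    """Unweighted mean of per-suite percentages, rounded to a whole percent."""
    return round(sum(suite_scores.values()) / len(suite_scores))


# Per-suite percentages copied from the example leaderboard.
example = {
    "claude-sonnet-4-6": {"factual": 95, "reasoning": 87, "extraction": 82},
    "claude-haiku-4-5-20251001": {"factual": 90, "reasoning": 80, "extraction": 74},
    "gpt-4o-mini": {"factual": 85, "reasoning": 73, "extraction": 71},
}

averages = {model: average_score(scores) for model, scores in example.items()}
```

Because the mean is unweighted, the 20-item factual suite and the 15-item suites contribute equally, regardless of item count.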

Extend

Add your own eval suite in 3 steps:

  1. Create datasets/myeval.json — list of {..., "expected": ...} objects
  2. Create evals/myeval.py with load(), prompt_template(), score() functions
  3. Add "myeval": "evals.myeval" to EVAL_MODULES in src/runner.py
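
As a rough illustration of step 2, a new eval module might look like the sketch below. Only the three function names come from the steps above. The signatures, the `question` field, the optional `path` and `limit` parameters, and the contains-style scoring are assumptions, so check an existing module under evals/ for the interface the runner actually expects.

```python
# evals/myeval.py (hypothetical sketch; signatures are assumptions)
import json
from pathlib import Path

# Default dataset location from step 1.
DATASET = Path("datasets/myeval.json")


def load(path=DATASET, limit=None):
    """Return the dataset items, optionally truncated to the first `limit`."""
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    return items if limit is None else items[:limit]


def prompt_template(item):
    """Build the prompt for one item. The "question" field is hypothetical."""
    return f"Answer concisely.\n\nQuestion: {item['question']}"


def score(output, item):
    """Return 1.0 if the expected answer appears in the output, else 0.0."""
    expected = str(item["expected"]).strip().lower()
    return 1.0 if expected in output.lower() else 0.0
```

With a module like this in place, step 3 registers it under the name `myeval`.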

Python API

from src.runner import run_eval, run_suite
from src.reporter import leaderboard

results = run_suite(
    models=["claude-haiku-4-5-20251001", "gpt-4o-mini"],
    evals=["factual", "reasoning"],
    limit=5,
)
print(leaderboard(results))
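
Spotting regressions means comparing scores from two runs. The sketch below works on plain nested dicts of the form `{model: {suite: score}}`. That shape, the `find_regressions` helper, and the `tolerance` parameter are all hypothetical, since the structure returned by `run_suite` and stored in results.json is not documented here, so real results would need to be adapted first.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """List (model, suite, old, new) where the score fell by more than `tolerance`.

    Both arguments are assumed to map model name -> {suite name -> score in [0, 1]}.
    Models or suites missing from `current` are skipped rather than flagged.
    """
    regressions = []
    for model, suites in baseline.items():
        for suite, old in suites.items():
            new = current.get(model, {}).get(suite)
            if new is not None and old - new > tolerance:
                regressions.append((model, suite, old, new))
    return regressions


# Illustrative numbers only, not measured results.
before = {"model-a": {"factual": 0.90, "reasoning": 0.80}}
after = {"model-a": {"factual": 0.85, "reasoning": 0.80}}
flagged = find_regressions(before, after)
```

A small tolerance keeps run-to-run noise on 15 to 20 item suites from being reported as a regression. The right threshold depends on how stable the model outputs are.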
