hermes-compression-eval

Offline evaluation harness for agent/context_compressor.py in hermes-agent. Runs a real conversation fixture through ContextCompressor.compress(), asks the compressor model to answer probe questions from the compressed state, and has a judge model score each answer 0–5 on six dimensions (accuracy, context_awareness, artifact_trail, completeness, continuity, instruction_following).

Methodology adapted from Factory's December 2025 write-up "Evaluating Compression". The scoreboard framing is not adopted.

Why this exists

agent/context_compressor.py decides what survives compression when a session exceeds the context-window threshold. Its prompts and template sections are tuned by hand. Until now there was no signal between "test suite green" and "a user hits a bad summary in production."

This harness gives that signal: edit the compressor prompt, re-run the eval, compare the per-dimension scores against a saved baseline.

Costs

The eval is LLM-graded and non-deterministic. Each probe costs one continuation call plus one grading call. A full run across the three checked-in fixtures with default settings issues about 30 probe pairs (roughly 60 continuation and grading calls) against your configured provider. Budget accordingly. Not appropriate for CI.
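As a rough budgeting aid, the call count follows from the rule above: one continuation call plus one grading call per probe. The sketch below is illustrative only. The function name and the bank sizes passed in are hypothetical, it assumes every run repeats every probe, and it does not count the compression calls themselves.

```python
def estimate_llm_calls(probes_per_fixture, runs=1):
    """Estimate continuation + grading calls for an eval run.

    Assumptions (not taken from the harness source): each probe costs exactly
    two calls, every run repeats every probe, and compression calls are
    excluded from the count.
    """
    if runs < 1:
        raise ValueError("runs must be >= 1")
    probe_pairs = sum(probes_per_fixture) * runs
    return {"probe_pairs": probe_pairs, "llm_calls": 2 * probe_pairs}


# Hypothetical bank sizes within the 10-11 range this README mentions.
print(estimate_llm_calls([10, 10, 11]))          # {'probe_pairs': 31, 'llm_calls': 62}
print(estimate_llm_calls([10, 10, 11], runs=3))  # {'probe_pairs': 93, 'llm_calls': 186}
```

Multiplying the result by your provider's per-call pricing gives an order-of-magnitude budget before you launch a multi-run comparison.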

Install

git clone https://github.com/NousResearch/hermes-compression-eval.git
cd hermes-compression-eval
pip install -r requirements.txt   # openai, fire

The harness imports ContextCompressor and agent.redact from hermes-agent. Make your hermes-agent checkout discoverable in one of three ways (checked in this order):

1. HERMES_AGENT_ROOT=/path/to/hermes-agent: explicit override.
2. ~/.hermes/hermes-agent/: the default location that hermes setup writes to.
3. Sibling directory: clone hermes-agent next to hermes-compression-eval.
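The lookup order above can be sketched as follows. This is a minimal illustration, not the harness's actual code: find_hermes_agent_root is a hypothetical name, and falling through to the next candidate when HERMES_AGENT_ROOT points at a missing directory is an assumption.

```python
import os
from pathlib import Path


def find_hermes_agent_root(eval_dir, env=None, home=None):
    """Return the first existing hermes-agent checkout, or None.

    Hypothetical helper. The candidate order follows the README; the real
    harness may validate the checkout further or fail differently.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    candidates = []
    override = env.get("HERMES_AGENT_ROOT")
    if override:
        # 1. Explicit override wins when it points at a real directory.
        candidates.append(Path(override).expanduser())
    # 2. Default location under the user's home directory.
    candidates.append(home / ".hermes" / "hermes-agent")
    # 3. Sibling of the hermes-compression-eval checkout.
    candidates.append(Path(eval_dir).resolve().parent / "hermes-agent")

    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None
```

A caller would typically pass the directory containing run_eval.py as eval_dir and put the returned path on sys.path before importing from hermes-agent.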

Usage

# Baseline run (writes results/baseline/)
python3 run_eval.py \
    --compressor-provider=nous --compressor-model=openai/gpt-5.4-mini \
    --judge-provider=nous      --judge-model=openai/gpt-5.4-mini \
    --runs=3 --label=baseline

# After editing context_compressor.py prompts, compare:
python3 run_eval.py \
    --compressor-provider=nous --compressor-model=openai/gpt-5.4-mini \
    --judge-provider=nous      --judge-model=openai/gpt-5.4-mini \
    --runs=3 --label=my-tweak \
    --compare-to=results/baseline

results/<label>/report.md is paste-ready for a PR body. Per-run JSON goes to results/<label>/runs/.
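To make the baseline comparison concrete, here is a hedged sketch of a per-dimension delta. The six dimension names come from this README; the input shape (one {dimension: score} dict per run) and both function names are assumptions, and report.py may aggregate differently. A median is used because the test list mentions summariser medians.

```python
from statistics import median

DIMENSIONS = (
    "accuracy",
    "context_awareness",
    "artifact_trail",
    "completeness",
    "continuity",
    "instruction_following",
)


def median_by_dimension(runs):
    """Collapse a list of {dimension: score} dicts into per-dimension medians."""
    return {dim: median(run[dim] for run in runs) for dim in DIMENSIONS}


def score_deltas(baseline_runs, candidate_runs):
    """Return candidate-minus-baseline median for each dimension."""
    base = median_by_dimension(baseline_runs)
    cand = median_by_dimension(candidate_runs)
    return {dim: round(cand[dim] - base[dim], 2) for dim in DIMENSIONS}


# Synthetic scores, invented only to exercise the functions.
baseline = [dict.fromkeys(DIMENSIONS, s) for s in (3, 4, 3)]
candidate = [dict.fromkeys(DIMENSIONS, s) for s in (4, 4, 5)]
print(score_deltas(baseline, candidate)["accuracy"])
```

Because the grading is non-deterministic, small deltas from a handful of runs should be read as noise until they repeat.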

What ships

| Path | Purpose |
| --- | --- |
| `run_eval.py` | Fire CLI, the entry point |
| `compressor_driver.py` | Thin wrapper that forces a single-shot compress() over fixture messages |
| `grader.py` | Two-phase continuation + grading via the OpenAI SDK |
| `rubric.py` | Six-dimension scoring rubric, judge-prompt builder, JSON parser |
| `report.py` | Markdown report rendering + `--compare-to` delta mode |
| `scrub_fixtures.py` | Pipeline to convert real `~/.hermes/sessions/*.jsonl` into public-safe JSON fixtures |
| `fixtures/` | Three checked-in scrubbed sessions (feature-impl, debug, config-build) |
| `probes/` | Three probe banks, 10–11 probes each, covering recall / artifact / continuation / decision |
| `tests/` | 33 hermetic unit tests for non-LLM paths |

Adding a fixture

1. Pick a session under ~/.hermes/sessions/*.jsonl worth measuring.
2. Add a SPECS entry in scrub_fixtures.py (source filename, output name, description, user-message paraphrase, model guess, context length, optional truncate-at).
3. Run python3 scrub_fixtures.py, which writes fixtures/<name>.json.
4. Add a probe bank at probes/<name>.probes.json covering all four types (recall, artifact, continuation, decision).
5. Re-run python3 -m pytest tests/ -q to verify that the new fixture and probe bank load and parse.

See DESIGN.md for the full scrubber pipeline and probe-format spec.
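Before running the test suite, a quick coverage check on a new probe bank can catch a missing probe type. The sketch below is hypothetical: it assumes a probe bank is a JSON list of objects that name their type under a "type" key, which may not match the real format in DESIGN.md. Only the four type names are taken from this README.

```python
import json

REQUIRED_TYPES = {"recall", "artifact", "continuation", "decision"}


def missing_probe_types(probes, type_key="type"):
    """Return the required probe types that the bank does not cover.

    Assumption: `probes` is a list of dicts, each naming its probe type
    under `type_key`. Check DESIGN.md for the actual field name.
    """
    present = {probe.get(type_key) for probe in probes}
    return sorted(REQUIRED_TYPES - present)


# Hypothetical probe bank, inlined as JSON so the sketch runs without files.
bank = json.loads("""
[
  {"type": "recall", "question": "..."},
  {"type": "artifact", "question": "..."},
  {"type": "decision", "question": "..."}
]
""")
print(missing_probe_types(bank))  # ['continuation']
```

An empty list means all four types are present at least once; it says nothing about probe quality.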

Tests

python3 -m pytest tests/ -q

33 hermetic tests cover rubric parsing edge cases, judge-prompt building, report rendering, summariser medians, per-run JSON roundtrip, fixture and probe loading, and a PII smoke check on the checked-in fixtures.
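As an illustration of the kind of non-LLM path these tests cover, the following sketch parses a judge reply into six integer scores. It is not the code in rubric.py: the function name, the assumption that the judge returns a flat JSON object keyed by dimension, and the handling of surrounding prose are all hypothetical.

```python
import json
import re

DIMENSIONS = (
    "accuracy",
    "context_awareness",
    "artifact_trail",
    "completeness",
    "continuity",
    "instruction_following",
)


def parse_judge_scores(reply):
    """Extract six integer scores in [0, 5] from a judge reply, or raise ValueError."""
    # Tolerate prose or code fences around the JSON object.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc

    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 0 <= value <= 5:
            raise ValueError(f"bad or missing score for {dim!r}: {value!r}")
        scores[dim] = value
    return scores
```

Raising on any missing or out-of-range dimension keeps a malformed judge reply from silently skewing a median.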

The LLM paths (continuation + grading) require credentials and real API calls; they're exercised by running the eval itself, not by these tests.

License

MIT, same as hermes-agent.
