NousResearch/hermes-compression-eval
Python
Description: Offline probe-based evaluation harness for hermes-agent's ContextCompressor. Methodology adapted from Factory's Dec 2025 'Evaluating Compression'.
Language: Python
Stars: 9
Forks: 6
Open issues: 1
Created: 2026-05-16T09:41:37Z
Pushed: 2026-05-16T09:41:39Z
Default branch: main
Fork: no
Archived: no
README:
hermes-compression-eval
Offline evaluation harness for agent/context_compressor.py in hermes-agent. Runs a real conversation fixture through ContextCompressor.compress(), asks the compressor model to answer probe questions from the compressed state, and has a judge model score each answer 0–5 on six dimensions (accuracy, context_awareness, artifact_trail, completeness, continuity, instruction_following).
Methodology adapted from Factory's December 2025 write-up *Evaluating Compression*. The scoreboard framing is not adopted.
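To make the scoring step concrete, here is a minimal sketch of validating a judge reply against the six dimensions. It assumes the judge returns a flat JSON object with one integer per dimension; that shape and the `parse_judge_scores` name are illustrative assumptions, not the actual contract of rubric.py.

```python
import json

# Dimension names as listed in this README.
DIMENSIONS = (
    "accuracy",
    "context_awareness",
    "artifact_trail",
    "completeness",
    "continuity",
    "instruction_following",
)


def parse_judge_scores(raw: str) -> dict[str, int]:
    """Parse a judge reply into {dimension: score}; hypothetical reply shape."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int, so it is excluded explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 0 <= value <= 5:
            raise ValueError(f"{dim}: expected an integer 0-5, got {value!r}")
        scores[dim] = value
    return scores


# Illustrative reply, not real judge output.
reply = json.dumps({dim: 4 for dim in DIMENSIONS})
print(parse_judge_scores(reply))
```

Rejecting missing or out-of-range values up front keeps a malformed judge reply from silently skewing a per-dimension average.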
Why this exists
agent/context_compressor.py decides what survives compression when a session exceeds the context-window threshold. Its prompts and template sections are tuned by hand. Until now there was no signal between *"test suite green"* and *"a user hits a bad summary in production."*
This harness gives that signal: edit the compressor prompt, re-run the eval, compare the per-dimension scores against a saved baseline.
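The comparison against a saved baseline boils down to a per-dimension difference. The sketch below shows the idea only; `score_deltas` is a hypothetical helper and the numbers are placeholders, not results from any real run or the logic in report.py.

```python
def score_deltas(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Return candidate minus baseline for every dimension present in both."""
    return {
        dim: round(candidate[dim] - baseline[dim], 2)
        for dim in baseline
        if dim in candidate
    }


# Placeholder scores for illustration only.
baseline = {"accuracy": 4.0, "continuity": 3.5}
candidate = {"accuracy": 4.5, "continuity": 3.0}

for dim, delta in score_deltas(baseline, candidate).items():
    print(f"{dim}: {delta:+.2f}")
```

Because grading is non-deterministic, small deltas are better read as noise unless they hold up across several runs.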
Costs
LLM-graded and non-deterministic. Each probe = 1 continuation call + 1 grading call. A full run across the three checked-in fixtures with default settings issues ~30 probe pairs against your configured provider. Budget accordingly. Not appropriate for CI.
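For budgeting, the call count can be estimated from the probe count. This is a rough sketch under two assumptions: that repeated runs scale the cost linearly, and that the compression calls themselves are excluded. `estimate_calls` is a hypothetical helper, not part of the harness.

```python
def estimate_calls(probe_count: int, runs: int = 1) -> int:
    """Estimate probe-related API calls: one continuation plus one grading call per probe."""
    calls_per_probe = 2  # continuation + grading, per the README
    return probe_count * calls_per_probe * runs


# About 30 probes in a single pass, then the same bank repeated three times
# (assumes repeated runs multiply the cost linearly).
print(estimate_calls(30))
print(estimate_calls(30, runs=3))
```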
Install
    git clone https://github.com/NousResearch/hermes-compression-eval.git
    cd hermes-compression-eval
    pip install -r requirements.txt  # openai, fire
The harness imports ContextCompressor and agent.redact from hermes-agent. Point it at your hermes-agent checkout in one of three ways (checked in this order):
1. HERMES_AGENT_ROOT=/path/to/hermes-agent: explicit override.
2. ~/.hermes/hermes-agent/: the default location hermes setup writes.
3. Sibling directory: clone hermes-agent next to hermes-compression-eval.
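The lookup order above can be sketched as follows. This is an illustration only: `find_hermes_agent_root` is a hypothetical name, and falling through to the next candidate when the override does not point at an existing directory is an assumption rather than documented behaviour.

```python
import os
from pathlib import Path


def find_hermes_agent_root(eval_dir: Path) -> Path:
    """Return the first existing hermes-agent checkout, tried in the README's order."""
    candidates = []

    # 1. Explicit override via environment variable.
    override = os.environ.get("HERMES_AGENT_ROOT")
    if override:
        candidates.append(Path(override).expanduser())

    # 2. Default location written by hermes setup.
    candidates.append(Path.home() / ".hermes" / "hermes-agent")

    # 3. Sibling directory next to this repository.
    candidates.append(eval_dir.resolve().parent / "hermes-agent")

    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(
        "hermes-agent checkout not found; tried: " + ", ".join(map(str, candidates))
    )
```

Listing every path that was tried in the error message makes a misconfigured environment quick to diagnose.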
Usage
    # Baseline run (writes results/baseline/)
    python3 run_eval.py \
      --compressor-provider=nous --compressor-model=openai/gpt-5.4-mini \
      --judge-provider=nous --judge-model=openai/gpt-5.4-mini \
      --runs=3 --label=baseline

    # After editing context_compressor.py prompts, compare:
    python3 run_eval.py \
      --compressor-provider=nous --compressor-model=openai/gpt-5.4-mini \
      --judge-provider=nous --judge-model=openai/gpt-5.4-mini \
      --runs=3 --label=my-tweak \
      --compare-to=results/baseline
results/<label>/report.md is paste-ready for a PR body. Per-run JSON goes to results/<label>/runs/.
What ships
| Path | Purpose |
|---|---|
| run_eval.py | Fire CLI, the entry point |
| compressor_driver.py | Thin wrapper that forces a single-shot compress() over fixture messages |
| grader.py | Two-phase continuation + grading via the OpenAI SDK |
| rubric.py | Six-dimension scoring rubric, judge-prompt builder, JSON parser |
| report.py | Markdown report rendering + --compare-to delta mode |
| scrub_fixtures.py | Pipeline to convert real ~/.hermes/sessions/*.jsonl into public-safe JSON fixtures |
| fixtures/ | Three checked-in scrubbed sessions (feature-impl, debug, config-build) |
| probes/ | Three probe banks, 10–11 probes each, covering recall / artifact / continuation / decision |
| tests/ | 33 hermetic unit tests for non-LLM paths |
Adding a fixture
1. Pick a session under ~/.hermes/sessions/*.jsonl worth measuring.
2. Add a SPECS entry in scrub_fixtures.py (source filename, output name, description, user-message paraphrase, model guess, context length, optional truncate-at).
3. Run python3 scrub_fixtures.py, which writes fixtures/.json.
4. Add a probe bank at probes/.probes.json covering all four types (recall, artifact, continuation, decision).
5. Re-run python3 -m pytest tests/ -q to verify it loads and parses.
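Before running the tests, it can help to confirm that a new probe bank covers all four types from step 4. The sketch below assumes a probe bank is a list of objects with a "type" field; that schema is a guess for illustration, and the real format is specified in DESIGN.md.

```python
# Probe types named in this README.
PROBE_TYPES = {"recall", "artifact", "continuation", "decision"}


def missing_probe_types(probes: list[dict]) -> set[str]:
    """Return the probe types that have no probe in the bank (hypothetical schema)."""
    seen = {probe.get("type") for probe in probes}
    return PROBE_TYPES - seen


# Illustrative bank with a hypothetical "type" key per probe.
bank = [
    {"type": "recall"},
    {"type": "artifact"},
    {"type": "continuation"},
]
print(missing_probe_types(bank))
```

An empty set means every type is represented at least once.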
See DESIGN.md for the full scrubber pipeline and probe-format spec.
Tests
python3 -m pytest tests/ -q
33 hermetic tests cover rubric parsing edge cases, judge-prompt building, report rendering, summariser medians, per-run JSON roundtrip, fixture and probe loading, and a PII smoke check on the checked-in fixtures.
The LLM paths (continuation + grading) require credentials and real API calls; they're exercised by running the eval itself, not by these tests.
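The "summariser medians" covered by the tests refer to aggregating scores across repeated runs. A minimal sketch of that idea follows, assuming each run yields one score per dimension; `median_by_dimension` is a hypothetical name, not the function in this repository.

```python
from statistics import median


def median_by_dimension(runs: list[dict[str, float]]) -> dict[str, float]:
    """Collapse repeated runs into one median score per dimension."""
    if not runs:
        return {}
    # Assumes every run reports the same dimensions as the first one.
    return {dim: median(run[dim] for run in runs) for dim in runs[0]}


# Placeholder scores for three runs, for illustration only.
runs = [
    {"accuracy": 3, "continuity": 4},
    {"accuracy": 5, "continuity": 4},
    {"accuracy": 4, "continuity": 2},
]
print(median_by_dimension(runs))
```

A median is less sensitive than a mean to one outlier run, which suits a non-deterministic, LLM-graded eval.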
License
MIT, same as hermes-agent.
Notability
notability 3.0/10. Low star count, routine new repo.