amazon-science/StaminaBench
Language: Python
License: NOASSERTION
Stars: 0
Forks: 0
Open issues: 5
Created: 2026-06-15T20:04:53Z
Pushed: 2026-06-20T00:24:05Z
Default branch: main
Fork: no
Archived: no
README:
StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns
[paper]()
StaminaBench is a framework for evaluating LLM-powered coding agents on iterative software engineering tasks. The primary benchmark, Iterative REST Server Generation, asks an agent to implement a REST API from a natural-language specification, runs a test suite against the agent's server, feeds back failures, and iterates. After each turn the schema evolves (new entities, renamed fields, new guard conditions, etc.) and the agent must keep the server consistent with the updated spec.
How it fits together
There are three moving parts:
1. Scenario data: a deterministic schema (entities, fields, actions, analytics) plus, per turn, a natural-language spec, a pytest suite, and a ground-truth Flask server. Generated either programmatically (offline, no LLM) or via LLM (richer, requires Bedrock).
2. Agent harness: a thin Python wrapper around a CLI (Mini-SWE, OpenHands, OpenCode, …) that talks to the agent over stdin/stdout, records a trajectory, and runs the agent's command inside a Docker container.
3. Evaluation loop: for each scenario, hand the agent the spec + tests + a working dir, let it produce a server, run the test suite, feed failures back, and either move on after the turn passes or retry up to attempt_limit times. Then evolve the schema and repeat for n_turns.
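The evaluation loop in item 3 can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: `TurnReport`, `run_turn`, `run_scenario`, and the two callables are hypothetical names, `attempt_limit` is assumed to count total attempts per turn, and the schema evolution is assumed to be already reflected in the per-turn spec list.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TurnReport:
    """Outcome of one test-suite run (hypothetical structure)."""
    passed: int
    total: int
    failures: List[str] = field(default_factory=list)


def run_turn(spec: str,
             produce_server: Callable[[str, List[str]], str],
             run_tests: Callable[[str], TurnReport],
             attempt_limit: int) -> TurnReport:
    """Let the agent try one turn, feeding failures back between attempts."""
    if attempt_limit < 1:
        raise ValueError("attempt_limit must be at least 1")
    feedback: List[str] = []
    for _ in range(attempt_limit):
        server = produce_server(spec, feedback)
        report = run_tests(server)
        if report.passed == report.total:
            break  # turn passed, move on to the next turn
        feedback = report.failures  # retry with the failing tests as feedback
    return report


def run_scenario(specs: List[str],
                 produce_server: Callable[[str, List[str]], str],
                 run_tests: Callable[[str], TurnReport],
                 attempt_limit: int) -> List[TurnReport]:
    """One report per turn; each entry of specs is the evolved spec for that turn."""
    return [run_turn(s, produce_server, run_tests, attempt_limit) for s in specs]
```

The report returned for a turn is the one from its last attempt, so a turn that never passes still contributes its partial pass count.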
Scoring is passed_tests / total_tests per turn, averaged across turns and then scenarios.
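As a small worked example of this two-level average, the sketch below computes a score from per-turn `(passed_tests, total_tests)` pairs. The function names and the input layout are assumptions made for illustration; the repository may store results differently.

```python
from statistics import mean
from typing import List, Tuple

Turn = Tuple[int, int]  # (passed_tests, total_tests) for one turn


def scenario_score(turns: List[Turn]) -> float:
    """Average the per-turn pass rate over all turns of one scenario."""
    return mean(passed / total for passed, total in turns)


def benchmark_score(scenarios: List[List[Turn]]) -> float:
    """Average the scenario scores over all scenarios."""
    return mean(scenario_score(turns) for turns in scenarios)


scenarios = [
    [(1, 2), (1, 1)],  # pass rates 0.5 and 1.0, scenario score 0.75
    [(1, 4), (3, 4)],  # pass rates 0.25 and 0.75, scenario score 0.5
]
print(benchmark_score(scenarios))  # 0.625
```

Because the average is taken per turn first, a turn with few tests weighs as much as a turn with many tests.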
Supported Agents
| Agent | Module | Docker image | Dockerfile |
|---|---|---|---|
| Mini-SWE | staminabench/agents/mini_swe_agent.py | agent-benchmarking:mini | staminabench/agents/docker/Dockerfile.mini |
| OpenHands | staminabench/agents/openhands_agent.py | agent-benchmarking:openhands | staminabench/agents/docker/Dockerfile.openhands |
| OpenCode | staminabench/agents/opencode_agent.py | agent-benchmarking:opencode | staminabench/agents/docker/Dockerfile.opencode |
| Qwen Code | staminabench/agents/qwen_code_agent.py | user-supplied | — |
| Kimi CLI | staminabench/agents/kimi_cli_agent.py | user-supplied | — |
| Vibe | staminabench/agents/vibe_agent.py | user-supplied | — |
| Mock (testing) | staminabench/agents/mock_agent.py | — | — |
All CLI agents share the template-method runner in staminabench/agents/cli_agent.py. Adding a new agent means subclassing CLIBasedAgent and implementing _build_command, _parse_response, and _should_retry.
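To illustrate the template-method pattern, here is a self-contained sketch. `CLIBasedAgentSketch` is a hypothetical stand-in written for this example, not the real `CLIBasedAgent`; only the three hook names come from the README, and their parameters, return types, the `run` method, `max_retries`, and the `my-agent-cli` command are all assumptions.

```python
from abc import ABC, abstractmethod
from typing import Callable, List


class CLIBasedAgentSketch(ABC):
    """Hypothetical stand-in for CLIBasedAgent; real signatures may differ."""

    max_retries = 2  # assumed knob, not taken from the repository

    def run(self, prompt: str, execute: Callable[[List[str]], str]) -> str:
        # Template method: the control flow is fixed here, while the
        # agent-specific steps are delegated to the three hooks below.
        raw = ""
        for attempt in range(self.max_retries + 1):
            raw = execute(self._build_command(prompt))
            if not self._should_retry(raw, attempt):
                break
        return self._parse_response(raw)

    @abstractmethod
    def _build_command(self, prompt: str) -> List[str]:
        """Return the argv used to invoke the agent CLI."""

    @abstractmethod
    def _parse_response(self, raw: str) -> str:
        """Turn raw CLI output into the agent's answer."""

    @abstractmethod
    def _should_retry(self, raw: str, attempt: int) -> bool:
        """Decide whether the CLI call should be repeated."""


class EchoAgent(CLIBasedAgentSketch):
    """Toy subclass; 'my-agent-cli' is a made-up command name."""

    def _build_command(self, prompt: str) -> List[str]:
        return ["my-agent-cli", "--prompt", prompt]

    def _parse_response(self, raw: str) -> str:
        return raw.strip()

    def _should_retry(self, raw: str, attempt: int) -> bool:
        return raw.strip() == ""  # retry on empty output
```

Passing `execute` as a callable keeps the sketch testable without Docker. The real runner presumably executes the command inside the agent's container instead.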
Setup
1. Python environment
The project is managed with uv. Install uv (curl -LsSf https://astral.sh/uv/install.sh | sh), then from the repo root:
uv sync --extra dev
That creates a .venv/ with everything pinned by uv.lock. Run anything through uv run:
uv run python -m staminabench.run_eval ...
uv run pytest tests/
Some transitive deps (notably numpy) compile native extensions — uv's bundled Python ships its own libc, but you still need a C/C++ toolchain on the host (gcc ≥ 9.3, libc headers). On stock Ubuntu 22.04+ / Debian 12+ / macOS with Xcode CLT this is already there; older distros may need attention.
2. AWS / Bedrock
The default model backend is Bedrock. You need:
- an AWS profile with Bedrock access (aws configure / ~/.aws/credentials), or an EC2 instance role with Bedrock permissions
- the model IDs you plan to use enabled in the AWS console under Bedrock → Model access.
- Default used by this repo: zai.glm-5 (Z.ai GLM-5 on Bedrock).
- Region: us-east-1 (configurable via AWS_REGION).
For long-running jobs on EC2, prefer the instance role and unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN so session tokens don't expire mid-run.
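If the job is launched from a Python wrapper rather than a shell, the same effect can be approximated by building a child environment without the static credentials. This is a sketch under that assumption; the helper name is hypothetical and nothing here is part of StaminaBench.

```python
import os
from typing import Dict, Mapping, Optional

# Static credentials that would otherwise take precedence over the instance role.
STATIC_CREDENTIAL_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
)


def env_without_static_credentials(
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Copy an environment, dropping the static AWS credential variables."""
    source = os.environ if env is None else env
    return {k: v for k, v in source.items() if k not in STATIC_CREDENTIAL_VARS}


# Typical use: subprocess.run(cmd, env=env_without_static_credentials())
```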
If you only want to run the programmatic benchmark (no LLM-sampled scenarios) you don't need Bedrock — the benchmark generator runs entirely offline.
3. Docker images
Each CLI agent runs inside its own container. All required CLIs are installed from the public registries (PyPI / npm); the build context is just src/staminabench/agents/docker/ and nothing outside the repo is needed.
From the repo root:
bash src/staminabench/agents/docker/build.sh                 # build all three images
bash src/staminabench/agents/docker/build.sh mini            # or build just one
bash src/staminabench/agents/docker/build.sh mini opencode   # or a subset
bash src/staminabench/agents/docker/test.sh                  # smoke-test all built images
Resulting images (~1.5–1.8 GB each):
| Tag | CLI | Source |
|---|---|---|
| agent-benchmarking:mini | mini | mini-swe-agent on PyPI |
| agent-benchmarking:opencode | opencode | opencode-ai on npm |
| agent-benchmarking:openhands | openhands | openhands on PyPI |
Pass the tag as docker.image_name=agent-benchmarking: when running staminabench.run_eval.
For Qwen Code, Kimi CLI, and Vibe, supply your own image — those upstreams don't publish a clean PyPI/npm wheel that installs in a base Ubuntu container.
Env vars forwarded into the container
The harness automatically forwards the host's AWS credential chain into the container so Bedrock works inside: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, AWS_BEARER_TOKEN_BEDROCK, AWS_REGION, AWS_PROFILE. Anything else needs to be added explicitly via docker.env_vars or docker.extra_docker_args.
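The README does not show how the harness does the forwarding. One plausible way to picture it is a helper that turns the listed variables into `docker run -e NAME=value` arguments. The function below is a hypothetical sketch, and the `extra` parameter only mimics what `docker.env_vars` might contribute.

```python
import os
from typing import List, Mapping, Optional

# Variables the README lists as forwarded automatically.
FORWARDED_AWS_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_BEARER_TOKEN_BEDROCK",
    "AWS_REGION",
    "AWS_PROFILE",
)


def docker_env_args(
    host_env: Optional[Mapping[str, str]] = None,
    extra: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Build `-e NAME=value` arguments for `docker run` (illustrative only)."""
    env = os.environ if host_env is None else host_env
    args: List[str] = []
    for name in FORWARDED_AWS_VARS:
        if name in env:  # forward only what is actually set on the host
            args += ["-e", f"{name}={env[name]}"]
    for name, value in (extra or {}).items():
        args += ["-e", f"{name}={value}"]  # explicitly requested extras
    return args
```

Variables outside the allow-list, such as `PATH`, are not passed through unless they are supplied explicitly.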
Host UID mismatch
The container runs with -u $(id -u):$(id -g) so files written into the mounted workspace are owned by the host user. If your host UID isn't in the image's /etc/passwd (common on shared dev machines / EC2 with custom UIDs), tools inside the container can't find a home directory and fail to write their config (e.g. OpenCode failing on ~/.local, OpenHands on its persistence dir).
Workaround: run as root inside the container by passing extra docker args:
docker.extra_docker_args="-u 0:0 -e HOME=/root -e XDG_CONFIG_HOME=/root/.config -e XDG_DATA_HOME=/root/.local/share"
The OpenCode image already ships its config under both /home/ubuntu/.config/opencode/ and /root/.config/opencode/ so this workaround works without a rebuild.
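The two user modes described in this section can be summarized in a short sketch. The helper name and the boolean switch are hypothetical; the argument strings are the ones quoted above.

```python
import os
import shlex
from typing import List

# The root workaround from the README, as a single argument string.
ROOT_FALLBACK = (
    "-u 0:0 -e HOME=/root "
    "-e XDG_CONFIG_HOME=/root/.config "
    "-e XDG_DATA_HOME=/root/.local/share"
)


def docker_user_args(run_as_root: bool) -> List[str]:
    """Return docker arguments selecting the in-container user."""
    if run_as_root:
        # Host UID is unknown to the image: run as root with explicit home dirs.
        return shlex.split(ROOT_FALLBACK)
    # Default: match the host user so workspace files keep host ownership.
    return ["-u", f"{os.getuid()}:{os.getgid()}"]
```

Running as root means files written to the mounted workspace will be owned by root on the host, which is the trade-off of this workaround.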
Running an Evaluation
Config-driven via OmegaConf. Two input channels: --configs (YAML files) and...
Notability
Notability 6.0/10. New benchmark repo from Amazon Research.