What does this repo signal mean?

Amazon (Nova) published amazon-science/StaminaBench (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo amazon-science/StaminaBench · language Python · description: Benchmark for testing long-context language model performance. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Amazon (Nova) Repo: amazon-science/StaminaBench

Captured source

GitHub/github.com/amazon-science/StaminaBench

amazon-science/StaminaBench repository metadata

published Jun 15, 2026 · seen 6d · captured 6d · http 200 · method plain

amazon-science/StaminaBench

Language: Python

License: NOASSERTION

Stars: 0

Forks: 0

Open issues: 5

Created: 2026-06-15T20:04:53Z

Pushed: 2026-06-20T00:24:05Z

Default branch: main

Fork: no

Archived: no

README:

StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

[paper]()

StaminaBench is a framework for evaluating LLM-powered coding agents on iterative software engineering tasks. The primary benchmark, Iterative REST Server Generation, asks an agent to implement a REST API from a natural-language specification, runs a test suite against the agent's server, feeds back failures, and iterates. After each turn the schema evolves (new entities, renamed fields, new guard conditions, etc.) and the agent must keep the server consistent with the updated spec.

How it fits together

There are three moving parts:

1. Scenario data: a deterministic schema (entities, fields, actions, analytics) plus, per turn, a natural-language spec, a pytest suite, and a ground-truth Flask server. Generated either programmatically (offline, no LLM) or via LLM (richer, requires Bedrock).
2. Agent harness: a thin Python wrapper around a CLI (Mini-SWE, OpenHands, OpenCode, …) that talks to the agent over stdin/stdout, records a trajectory, and runs the agent's command inside a Docker container.
3. Evaluation loop: for each scenario, hand the agent the spec + tests + a working dir, let it produce a server, run the test suite, feed failures back, and either move on after the turn passes or retry up to attempt_limit times. Then evolve the schema and repeat for n_turns.
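The evaluation loop in item 3 can be sketched roughly as below. This is a minimal illustration, not the repo's implementation: run_scenario, TurnResult, and the two callables agent_step and run_tests are hypothetical names, and the sketch assumes the per-turn spec and tests are looked up by turn index (which stands in for "evolving the schema").

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TurnResult:
    turn: int
    attempts: int  # how many attempts were used on this turn
    passed: int    # tests passed on the last attempt
    total: int     # tests in this turn's suite


def run_scenario(
    n_turns: int,
    attempt_limit: int,
    agent_step: Callable[[int, List[str]], None],
    run_tests: Callable[[int], Tuple[int, int, List[str]]],
) -> List[TurnResult]:
    """Hypothetical driver: one scenario, n_turns turns, bounded retries."""
    results = []
    for turn in range(n_turns):
        feedback: List[str] = []  # failures from the previous attempt
        attempts = passed = total = 0
        for attempts in range(1, attempt_limit + 1):
            # The agent sees the turn's spec (by index) plus any failures.
            agent_step(turn, feedback)
            passed, total, feedback = run_tests(turn)
            if passed == total:
                break  # turn passed, move on to the next schema version
        results.append(TurnResult(turn, attempts, passed, total))
    return results
```

In this sketch a turn that never passes still records the counts from its last attempt, so the loop always advances after attempt_limit tries. Whether the real harness scores the last attempt or the best one is not stated in the excerpt.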

Scoring is passed_tests / total_tests per turn, averaged across turns and then scenarios.
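The two-level average can be written out as a short sketch. The function names are hypothetical, and the sketch assumes every turn has at least one test and every scenario at least one turn.

```python
from typing import List, Tuple

# Each turn is a (passed_tests, total_tests) pair.
Turn = Tuple[int, int]


def scenario_score(turns: List[Turn]) -> float:
    """Mean of the per-turn pass rates for one scenario."""
    return sum(passed / total for passed, total in turns) / len(turns)


def benchmark_score(scenarios: List[List[Turn]]) -> float:
    """Mean of the scenario scores (turns first, then scenarios)."""
    return sum(scenario_score(turns) for turns in scenarios) / len(scenarios)


scenarios = [
    [(3, 4), (4, 4)],  # scenario score: (0.75 + 1.0) / 2 = 0.875
    [(1, 2)],          # scenario score: 0.5
]
print(benchmark_score(scenarios))  # 0.6875
```

Averaging in this order weights every scenario equally regardless of how many tests it contains. Pooling all tests in the example would instead give 8 / 10 = 0.8.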

Supported Agents

| Agent | Module | Docker image | Dockerfile |
|---|---|---|---|
| Mini-SWE | staminabench/agents/mini_swe_agent.py | agent-benchmarking:mini | staminabench/agents/docker/Dockerfile.mini |
| OpenHands | staminabench/agents/openhands_agent.py | agent-benchmarking:openhands | staminabench/agents/docker/Dockerfile.openhands |
| OpenCode | staminabench/agents/opencode_agent.py | agent-benchmarking:opencode | staminabench/agents/docker/Dockerfile.opencode |
| Qwen Code | staminabench/agents/qwen_code_agent.py | user-supplied | - |
| Kimi CLI | staminabench/agents/kimi_cli_agent.py | user-supplied | - |
| Vibe | staminabench/agents/vibe_agent.py | user-supplied | - |
| Mock (testing) | staminabench/agents/mock_agent.py | - | - |

All CLI agents share the template-method runner in staminabench/agents/cli_agent.py. Adding a new agent means subclassing CLIBasedAgent and implementing _build_command, _parse_response, and _should_retry.
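To illustrate the template-method pattern, here is a self-contained sketch. The excerpt does not show the real CLIBasedAgent, so the stand-in base class below, its run method, and every method signature are assumptions; only the three hook names come from the README. The toy agent calls the Python interpreter in place of a real agent CLI.

```python
import subprocess
import sys
from abc import ABC, abstractmethod
from typing import List


class CLIBasedAgentStandIn(ABC):
    """Hypothetical stand-in for CLIBasedAgent; signatures are assumed."""

    def run(self, prompt: str, max_retries: int = 2) -> str:
        # Template method: fixed control flow, agent-specific hooks.
        for _ in range(max_retries + 1):
            proc = subprocess.run(
                self._build_command(prompt), capture_output=True, text=True
            )
            if not self._should_retry(proc.returncode, proc.stdout):
                return self._parse_response(proc.stdout)
        raise RuntimeError("agent command kept asking for a retry")

    @abstractmethod
    def _build_command(self, prompt: str) -> List[str]: ...

    @abstractmethod
    def _parse_response(self, stdout: str) -> str: ...

    @abstractmethod
    def _should_retry(self, returncode: int, stdout: str) -> bool: ...


class EchoAgent(CLIBasedAgentStandIn):
    """Toy agent whose 'CLI' just prints the prompt back."""

    def _build_command(self, prompt: str) -> List[str]:
        return [sys.executable, "-c", "import sys; print(sys.argv[1])", prompt]

    def _parse_response(self, stdout: str) -> str:
        return stdout.strip()

    def _should_retry(self, returncode: int, stdout: str) -> bool:
        return returncode != 0


print(EchoAgent().run("hello"))  # hello
```

The real runner also records a trajectory and executes the command inside Docker. Both are omitted here to keep the pattern visible.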

Setup

1. Python environment

The project is managed with uv. Install uv (curl -LsSf https://astral.sh/uv/install.sh | sh), then from the repo root:

uv sync --extra dev

That creates a .venv/ with everything pinned by uv.lock. Run anything through uv run:

uv run python -m staminabench.run_eval ...
uv run pytest tests/

Some transitive deps (notably numpy) compile native extensions — uv's bundled Python ships its own libc, but you still need a C/C++ toolchain on the host (gcc ≥ 9.3, libc headers). On stock Ubuntu 22.04+ / Debian 12+ / macOS with Xcode CLT this is already there; older distros may need attention.

2. AWS / Bedrock

The default model backend is Bedrock. You need:

an AWS profile with Bedrock access (aws configure / ~/.aws/credentials), or an EC2 instance role with Bedrock permissions
the model IDs you plan to use enabled in the AWS console under Bedrock → Model access.
Default used by this repo: zai.glm-5 (Z.ai GLM-5 on Bedrock).
Region: us-east-1 (configurable via AWS_REGION).

For long-running jobs on EC2, prefer the instance role and unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN so session tokens don't expire mid-run.
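If the evaluation is launched from a Python script and not from an interactive shell, the same effect as the unset command can be had by filtering the environment handed to the child process. This is a small sketch with a hypothetical helper name, and it is not part of the repo.

```python
import os
from typing import Dict, Mapping

# The three variables the README says to unset on EC2.
STATIC_CREDENTIAL_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
)


def env_for_instance_role(env: Mapping[str, str]) -> Dict[str, str]:
    """Copy env without static credentials so the instance role is used."""
    return {k: v for k, v in env.items() if k not in STATIC_CREDENTIAL_VARS}


clean_env = env_for_instance_role(os.environ)
# Pass it to the child process, e.g. subprocess.run(cmd, env=clean_env).
```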

If you only want to run the programmatic benchmark (no LLM-sampled scenarios) you don't need Bedrock — the benchmark generator runs entirely offline.

3. Docker images

Each CLI agent runs inside its own container. All required CLIs are installed from the public registries (PyPI / npm); the build context is just src/staminabench/agents/docker/ and nothing outside the repo is needed.

From the repo root:

bash src/staminabench/agents/docker/build.sh # build all three images
bash src/staminabench/agents/docker/build.sh mini # or build just one
bash src/staminabench/agents/docker/build.sh mini opencode # or a subset
bash src/staminabench/agents/docker/test.sh # smoke-test all built images

Resulting images (~1.5–1.8 GB each):

| Tag | CLI | Source |
|---|---|---|
| agent-benchmarking:mini | mini | mini-swe-agent on PyPI |
| agent-benchmarking:opencode | opencode | opencode-ai on npm |
| agent-benchmarking:openhands | openhands | openhands on PyPI |

Pass the tag as docker.image_name=agent-benchmarking: when running staminabench.run_eval.

For Qwen Code, Kimi CLI, and Vibe, supply your own image — those upstreams don't publish a clean PyPI/npm wheel that installs in a base Ubuntu container.

Env vars forwarded into the container

The harness automatically forwards the host's AWS credential chain into the container so Bedrock works inside: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, AWS_BEARER_TOKEN_BEDROCK, AWS_REGION, AWS_PROFILE. Anything else needs to be added explicitly via docker.env_vars or docker.extra_docker_args.
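One plausible way to build such forwarding flags is sketched below. How the harness actually does it is not shown in the excerpt, so the function name and the shape of the extra argument are assumptions. The sketch relies on the standard docker run behaviour that -e NAME (with no value) copies NAME from the environment of the docker client.

```python
from typing import Dict, List, Mapping, Optional

# Variables the README lists as forwarded automatically.
FORWARDED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_BEARER_TOKEN_BEDROCK",
    "AWS_REGION",
    "AWS_PROFILE",
)


def docker_env_args(
    host_env: Mapping[str, str],
    extra: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Hypothetical helper: build '-e' flags for a docker run command."""
    args: List[str] = []
    for name in FORWARDED_VARS:
        if name in host_env:
            # Name only: docker reads the value from the client's
            # environment, so secrets stay out of the argument list.
            args += ["-e", name]
    for key, value in (extra or {}).items():
        args += ["-e", f"{key}={value}"]  # explicit additions
    return args


print(docker_env_args({"AWS_REGION": "us-east-1", "HOME": "/home/me"},
                      {"FOO": "bar"}))
# ['-e', 'AWS_REGION', '-e', 'FOO=bar']
```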

Host UID mismatch

The container runs with -u $(id -u):$(id -g) so files written into the mounted workspace are owned by the host user. If your host UID isn't in the image's /etc/passwd (common on shared dev machines / EC2 with custom UIDs), tools inside the container can't find a home directory and fail to write their config (e.g. OpenCode failing on ~/.local, OpenHands on its persistence dir).

Workaround: run as root inside the container by passing extra docker args:

docker.extra_docker_args="-u 0:0 -e HOME=/root -e XDG_CONFIG_HOME=/root/.config -e XDG_DATA_HOME=/root/.local/share"

The OpenCode image already ships its config under both /home/ubuntu/.config/opencode/ and /root/.config/opencode/ so this workaround works without a rebuild.
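The decision between the default -u flag and the root workaround can be sketched as a check against the image's passwd contents. This is an illustrative helper with hypothetical names, not something the repo provides. It assumes you have already obtained the text of the image's /etc/passwd, for example with docker run --rm --entrypoint cat followed by the image tag and /etc/passwd.

```python
from typing import List

# The workaround flags quoted above, split into argv form.
ROOT_WORKAROUND = [
    "-u", "0:0",
    "-e", "HOME=/root",
    "-e", "XDG_CONFIG_HOME=/root/.config",
    "-e", "XDG_DATA_HOME=/root/.local/share",
]


def uid_in_passwd(passwd_text: str, uid: int) -> bool:
    """True if any passwd entry (name:pw:uid:gid:...) has this UID."""
    for line in passwd_text.splitlines():
        fields = line.split(":")
        if len(fields) >= 3 and fields[2] == str(uid):
            return True
    return False


def user_args(passwd_text: str, uid: int, gid: int) -> List[str]:
    """Use the host UID if the image knows it, else fall back to root."""
    if uid_in_passwd(passwd_text, uid):
        return ["-u", f"{uid}:{gid}"]
    return list(ROOT_WORKAROUND)


image_passwd = (
    "root:x:0:0:root:/root:/bin/bash\n"
    "ubuntu:x:1000:1000:Ubuntu:/home/ubuntu:/bin/bash\n"
)
print(user_args(image_passwd, 1000, 1000))  # ['-u', '1000:1000']
print(user_args(image_passwd, 5001, 5001)[:2])  # ['-u', '0:0']
```

Running as root means files written to the mounted workspace are owned by root on the host, which is the trade-off the default -u flag is meant to avoid.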

Running an Evaluation

Config-driven via OmegaConf. Two input channels: --configs (YAML files) and...

Excerpt shown — open the source for the full document.

Notability

notability 6.0/10

New benchmark repo from Amazon Research.