novitalabs/R2E-Gym
forked from R2E-Gym/R2E-Gym
Description: [COLM 2025] Official repository for R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents
Language: Python
License: Apache-2.0
Stars: 0
Forks: 0
Open issues: 0
Created: 2026-01-20T10:28:49Z
Pushed: 2026-01-20T12:31:20Z
Default branch: main
Fork: yes
Parent repository: R2E-Gym/R2E-Gym
Archived: no
README: R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents
Naman Jain*,1, Jaskirat Singh*,2, Manish Shetty1, Liang Zheng2, Koushik Sen1, Ion Stoica1
1UC Berkeley, 2ANU *Equal contribution, ^Equal supervision
📃 Paper • 🤗 Data & Models • 🌐 Project Page
---
🚨 UPDATES
🔥 NEW: DeepSWE Models Available! We've released DeepSWE, our latest state-of-the-art SWE agent models. They are trained with **rLLM** and achieve exceptional performance on SWE-Bench.
- 🤗 Model: agentica-org/DeepSWE-Preview
- 📋 Reproduction Guides: Check out our detailed reproduction guides in the [reproduction/](./reproduction/) folder:
  - [DEEPSWE_REPRODUCTION.MD](./reproduction/DEEPSWE_REPRODUCTION.MD) - Complete guide for reproducing DeepSWE results
  - [DEEPSWE_TTS_REPRODUCTION.MD](./reproduction/DEEPSWE_TTS_REPRODUCTION.MD) - Test-time scaling reproduction guide
---
We present R2E-Gym, the largest procedurally curated environment for training real-world SWE-Agents. We show that R2E-Gym enables more scalable training and test-time scaling, achieving 51% on the SWE-Bench Verified benchmark. This reflects a new state-of-the-art for open-weight SWE-Agents and is, for the first time, competitive with proprietary models such as o1 and sonnet-3.5-v2 with tools.

R2E-Gym is powered by two main contributions: (a) SWE-GEN, a synthetic data curation recipe for building executable training environments without relying on human-written tests and issues; and (b) Hybrid Inference-Time Scaling, which shows that while both execution-based and execution-free verifiers elicit inference-time gains, significantly better performance can be achieved by leveraging the strengths of both. Overall, the final approach reflects SOTA performance for open-weight SWE-Agents, while also being competitive with some proprietary model baselines.
---
> While LLM-based SWE-Agents have demonstrated remarkable improvements, state-of-the-art performance is largely driven by proprietary models, with open models lagging behind. Closing this performance gap requires addressing two core challenges. First, we need scalable methods to curate diverse, high-quality execution environments for training. Second, we need efficient strategies for scaling test-time compute. R2E-Gym presents a joint framework for addressing both of these challenges.
R2E-Gym Environment
We create R2E-Gym, the largest procedurally curated gym environment for training real-world SWE-Agents, consisting of more than 8.1K problems across 13 repos, with executable gym environments, unit tests, and natural-language task descriptions.
Synthetic Data Enables Scalable Agent Training
R2E-Gym is powered by SWE-GEN — a novel synthetic data curation recipe that enables collection of a large number of executable training environments without reliance on human-written pull requests (PRs) or unit tests. We show that instead of using human-written PRs, good-quality execution environments can directly be curated from commits. Compared to PR-based data collection, we find that this approach enables more scalable data curation and agent-training, resulting in a SOTA pass@1 performance of 34.4% on the challenging SWE-Bench Verified benchmark.
Hybrid Test-time Scaling
Finally, we introduce Hybrid Test-time Scaling, a novel paradigm for scaling test-time compute. We show that while both execution-based and execution-free verifiers elicit inference-time gains, they exhibit complementary strengths and weaknesses. Leveraging the strengths of each approach allows significantly better performance when scaling test-time compute, resulting in a 51% pass@1 performance on the SWE-Bench Verified benchmark, reflecting a new state-of-the-art for open-weight SWE-Agents.
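To make the idea of combining the two verifier types concrete, here is a minimal sketch of one way to select a final patch from several rollouts. It is not the scoring rule used in the paper: the `Candidate` class, the `select_patch` function, the `top_n` parameter, and the shortlist-then-rerank strategy are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One candidate patch with two hypothetical verifier scores in [0, 1]."""

    patch_id: str
    exec_based: float  # assumed: fraction of generated tests the patch passes
    exec_free: float   # assumed: a learned verifier's correctness estimate


def select_patch(candidates: list[Candidate], top_n: int = 3) -> Candidate:
    """Shortlist the top_n candidates by execution-free score, then return the
    shortlisted candidate with the best execution-based score.
    Ties on the execution-based score are broken by the execution-free score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    shortlist = sorted(candidates, key=lambda c: c.exec_free, reverse=True)[:top_n]
    return max(shortlist, key=lambda c: (c.exec_based, c.exec_free))


# Made-up scores for four rollouts of the same task.
rollouts = [
    Candidate("a", exec_based=1.0, exec_free=0.20),
    Candidate("b", exec_based=1.0, exec_free=0.70),
    Candidate("c", exec_based=0.5, exec_free=0.90),
    Candidate("d", exec_based=0.0, exec_free=0.80),
]
best = select_patch(rollouts, top_n=3)
print(best.patch_id)
```

In this toy data, the execution-free score alone would pick `c` and the execution-based score alone cannot separate `a` from `b`; using both signals resolves the choice.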
---
🔧 Setup
> [!IMPORTANT]
> Installation is required!

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
# activate venv
uv venv
source .venv/bin/activate
uv sync && uv pip install -e .
🚀 Quickstart
- Usage: An R2E-Gym environment can be used as follows:
from r2egym.agenthub.environment.env import EnvArgs, RepoEnv
from r2egym.agenthub.agent.agent import AgentArgs, Agent
from pathlib import Path
from datasets import load_dataset
# load gym dataset [R2E-Gym/R2E-Gym-Subset, R2E-Gym/R2E-Gym-Full, R2E-Gym/SWE-Bench-Verified, R2E-Gym/SWE-Bench-Lite]
ds = load_dataset("R2E-Gym/R2E-Gym-Lite")
split = 'train' # split of the dataset [train, test]
# load gym environment
env_index = 100 # index of the environment, in [0, len(ds[split]))
env_args = EnvArgs(ds = ds[split][env_index])
env = RepoEnv(env_args)
# load agent
agent_args = AgentArgs.from_yaml(Path('./src/r2egym/agenthub/config/edit_fn_calling.yaml'))
# define llm: ['claude-3-5-sonnet-20241022', 'gpt-4o', 'vllm/R2E-Gym/R2EGym-32B-Agent']
agent_args.llm_name = 'claude-3-5-sonnet-20241022'
agent = Agent(name="EditingAgent", args=agent_args)
# run the agent (note: disable fn_calling for R2E-Gym agents)
output = agent.run(env, max_steps=40, use_fn_calling=True)

> [!NOTE]
> The output of the agent is a Trajectory object, which contains detailed stats including the full agent trajectory, problem statement, max execution time, exit reason, and output patch. Please refer to src/r2egym/agenthub/agent/agent.py and src/r2egym/agenthub/trajectory/trajectory.py for more details.
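The comment in the snippet above says to disable function calling for R2E-Gym agents. A minimal sketch of deriving that flag from the model name follows. The helper `should_use_fn_calling` is hypothetical and not part of the repository, and it assumes that R2E-Gym agent checkpoints can be recognized by an `R2E-Gym/` segment in the name, as in the `vllm/R2E-Gym/R2EGym-32B-Agent` example.

```python
def should_use_fn_calling(llm_name: str) -> bool:
    """Hypothetical helper: return False for R2E-Gym agent checkpoints.

    Assumption: such checkpoints contain an "R2E-Gym/" segment in their
    model name; every other model is run with function calling enabled.
    """
    return "r2e-gym/" not in llm_name.lower()


for name in ["claude-3-5-sonnet-20241022", "gpt-4o", "vllm/R2E-Gym/R2EGym-32B-Agent"]:
    print(name, should_use_fn_calling(name))
```

With such a helper, the run call could be written as `agent.run(env, max_steps=40, use_fn_calling=should_use_fn_calling(agent_args.llm_name))`.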
- Reward Calculation: All R2E-Gym environments support automated reward calculation using unit tests.
# calculate reward
out = env.runtime._calculate_reward()
- Gym Environment Stats: The detailed stats for each environment (including natural language task description, repo name, ground truth patch) can be easily accessed as,
# get the environment stats
env_stats_dict = env.get_stats()
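When rewards from the reward calculation step are collected over many environments, they can be aggregated into a single resolve rate. The sketch below assumes, without confirmation from the source, that a reward of 1 means the environment's unit tests passed and anything lower means the task is unresolved. The function `resolve_rate` and the sample values are illustrative only.

```python
def resolve_rate(rewards: list[float]) -> float:
    """Fraction of environments counted as resolved.

    Assumption: a reward of 1 (or more) marks a resolved task; the actual
    return value of the reward calculation should be checked in the source.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    return sum(1 for r in rewards if r >= 1.0) / len(rewards)


# Made-up rewards from five environments, one rollout each.
rewards = [1.0, 0.0, 1.0, 0.0, 0.0]
print(f"{resolve_rate(rewards):.1%}")
```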
> [!TIP]
> R2EGym environments also offer a range of other convenient functions, such as apply_patch,…