Repo · Databricks (DBRX) · published Mar 9, 2026

databricks/pilot-commit

Python

Description: Pilot-Commit: Rollout allocation for efficient group-based RL post-training

Language: Python

License: Apache-2.0

Stars: 2

Forks: 0

Open issues: 0

Created: 2026-03-09T16:30:06Z

Pushed: 2026-05-28T06:15:30Z

Default branch: main

Fork: no

Archived: no

README:

Woojeong Kim<sup>1,2</sup> · Ziyi Yang<sup>1</sup> · Jing Nathan Yan<sup>2</sup> · Jialu Liu<sup>1</sup>

<sup>1</sup>Databricks · <sup>2</sup>Cornell University

| ![Results 1.5B](assets/cover-1.5b-deepmath-n128.png) | ![Results 8B](assets/cover-8b-n64.png) |
|:---:|:---:|
| *Qwen2.5-1.5B-Math, DeepMath-103K (n=128)* | *Qwen3-8B, Polaris-53K (n=64)* |

TL;DR: We propose Pilot-Commit, a rollout allocation framework for group-based RL post-training that replaces uniform sampling with a targeted, budget-aware strategy—reaching baseline accuracy up to 1.9× faster than GRPO and 4.0× faster than DAPO in cumulative rollouts.

Method

RL post-training for LLMs is bottlenecked by rollout generation. Group-based methods like GRPO compute advantages from multiple rollouts per prompt, yet allocate budget uniformly—wasting compute on prompts with collapsed reward distributions that yield no learning signal.
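To see why a collapsed reward distribution gives no learning signal, consider the group-normalized advantage used by GRPO-style methods. The snippet below is a minimal sketch, not the repository's code: the function name `group_advantages`, the `eps` value, and the use of the population standard deviation are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumption: population std and a small eps, chosen for illustration.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every rollout fails: all rewards equal the group mean, so every advantage is 0.
collapsed = group_advantages([0.0] * 8)

# One success in four: the successful rollout gets a positive advantage,
# the failures get negative ones.
mixed = group_advantages([1.0, 0.0, 0.0, 0.0])
```

With all-zero (or all-one) rewards, each advantage is exactly zero, so the rollouts spent on that prompt contribute nothing to the policy gradient.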

Pilot-Commit decouples prompt evaluation from exploitation:

  • Pilot stage: Estimates per-prompt informativeness using a fraction of the budget
  • Commit stage: Allocates remaining rollouts to high-leverage prompts while skipping low-signal prompts

![Pilot-Commit Overview](assets/main_final.png)

*Overview of Pilot-Commit. In the pilot stage, a fraction of the rollout budget is used to estimate empirical reward variance per prompt. In the commit stage, the remaining budget is allocated to selected prompts. Prompts deemed too hard are deferred; prompts deemed too easy are evicted from future sampling.*
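The two stages described in the caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the repository's allocation rule: rewards are assumed binary, `pilot_stats` and `commit_allocation` are hypothetical names, and the remaining budget is split evenly over prompts whose pilot rewards show nonzero variance.

```python
def pilot_stats(pilot_rewards):
    """Success rate and empirical Bernoulli variance from binary pilot rewards."""
    p = sum(pilot_rewards) / len(pilot_rewards)
    return p, p * (1.0 - p)


def commit_allocation(pilot, total_budget):
    """Split the budget left after the pilot over prompts with nonzero variance.

    `pilot` maps a prompt id to its list of binary pilot rewards.
    Assumption: an even split; the actual method may weight prompts differently.
    """
    spent = sum(len(r) for r in pilot.values())
    remaining = total_budget - spent
    selected = [pid for pid, r in pilot.items() if pilot_stats(r)[1] > 0.0]
    if not selected or remaining <= 0:
        return {}
    per_prompt = remaining // len(selected)
    return {pid: per_prompt for pid in selected}


pilot = {
    "a": [0, 0, 0, 0],  # all failures: zero variance
    "b": [1, 0, 0, 1],  # mixed outcomes: informative
    "c": [1, 1, 1, 1],  # all successes: zero variance
}
alloc = commit_allocation(pilot, total_budget=24)
```

Here the pilot uses 12 of 24 rollouts, and the remaining 12 all go to prompt `"b"`, the only one with a non-degenerate reward distribution.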

Results

Cumulative rollouts to reach target accuracy (n = rollouts per prompt, M = million):

| Config | n | PC | GRPO | DAPO | vs GRPO | vs DAPO |
|--------|---|-----|------|------|---------|---------|
| 1.5B DeepMath | 128 | 10.6M | 16.0M | 42.0M | 1.5× | 4.0× |
| 4B Polaris | 64 | 1.7M | 2.5M | — | 1.5× | — |
| 8B Polaris | 64 | 1.0M | 1.6M | 2.7M | 1.5× | 2.6× |
| 14B Polaris | 64 | 2.2M | 4.1M | 4.9M | 1.9× | 2.3× |

*— indicates DAPO did not reach target accuracy within the training budget.*
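The speedup columns are ratios of cumulative rollouts (baseline divided by Pilot-Commit). The sketch below recomputes them from the rounded million-rollout entries in the table; because those entries are rounded, the recomputed ratios can differ from the reported factors by about 0.1. The variable and function names are illustrative.

```python
# (PC, GRPO, DAPO) cumulative rollouts in millions, copied from the table.
# None marks the run where DAPO did not reach the target accuracy.
rollouts = {
    "1.5B DeepMath": (10.6, 16.0, 42.0),
    "4B Polaris": (1.7, 2.5, None),
    "8B Polaris": (1.0, 1.6, 2.7),
    "14B Polaris": (2.2, 4.1, 4.9),
}


def speedup(baseline, pc):
    """Baseline rollouts divided by Pilot-Commit rollouts (None if no baseline value)."""
    return None if baseline is None else baseline / pc


# Maps each config to (speedup vs GRPO, speedup vs DAPO).
speedups = {
    name: (speedup(grpo, pc), speedup(dapo, pc))
    for name, (pc, grpo, dapo) in rollouts.items()
}
```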

See the paper for full results across exploration settings.

Getting Started

Our implementation is based on volcengine/verl.

1. Environment Setup

Option A: Docker (recommended)

Use a pre-built Docker image with all dependencies:

```bash
docker pull hiyouga/verl:ngc-th2.7.1-cu12.6-vllm0.10.0

docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" \
  --cap-add=SYS_ADMIN -v .:/workspace --name pilot-commit \
  hiyouga/verl:ngc-th2.7.1-cu12.6-vllm0.10.0 sleep infinity
docker start pilot-commit
docker exec -it pilot-commit bash
```

Then inside the container:

```bash
cd /workspace
pip install --no-deps -e .
pip install math-verify==0.8.0 torch==2.7.1
pip install --upgrade 'pyarrow>=19.0.0'
```

Option B: Custom environment

Follow the verl installation guide to set up CUDA, cuDNN, and inference engines, then:

```bash
pip install --no-deps -e .
```

2. Download & Preprocess Data

```bash
bash data/e2e_process_data.sh
```

3. Training

Qwen2.5-1.5B-Math on DeepMath-103K (2 nodes, n=128):

```bash
# GRPO baseline
bash train_scripts/grpo_qwen2.5-1.5b_n128.sh

# DAPO baseline
bash train_scripts/dapo_qwen2.5-1.5b_n128.sh

# Pilot-Commit (ours)
bash train_scripts/pc_qwen2.5-1.5b_p32c96.sh
```

Qwen3-4B on Polaris-53K (4 nodes, n=64):

```bash
# GRPO baseline
bash train_scripts/grpo_qwen3-4b_n64.sh

# DAPO baseline
bash train_scripts/dapo_qwen3-4b_n64.sh

# Pilot-Commit (ours)
bash train_scripts/pc_qwen3-4b_p16c48.sh
```

See the train_scripts/ folder for all scripts.

Core Implementation

The Pilot-Commit algorithm is implemented in recipe/pc/:

```
recipe/pc/
├── main_pc.py              # Entry point (Hydra)
├── pc_ray_trainer.py       # RayPCTrainer: pilot-commit training loop
├── replay_buffer.py        # Replay buffer for pilot survivors
├── utils.py                # Prompt selection utilities
└── config/
    └── pc_trainer.yaml     # Default configuration
```

Key Components

Pilot-Commit Trainer (pc_ray_trainer.py):

  • RayPCTrainer extends RayPPOTrainer with pilot-commit sampling logic
  • Implements diversity-based prompt filtering
  • Manages replay buffer for deferred prompts
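The idea of a replay buffer with bounded staleness (compare `buffer_max_off_steps` in the configuration) can be sketched as below. This is a hypothetical illustration, not the contents of `replay_buffer.py`: the class name `StalenessBuffer`, its methods, and the rule that an entry is stale once it is more than `max_off_steps` steps old are all assumptions.

```python
from collections import deque


class StalenessBuffer:
    """Hypothetical buffer that drops entries older than `max_off_steps` steps."""

    def __init__(self, max_off_steps):
        self.max_off_steps = max_off_steps
        self._items = deque()  # (step_added, prompt_id), oldest first

    def add(self, prompt_id, step):
        """Store a prompt id together with the training step at which it was added."""
        self._items.append((step, prompt_id))

    def pop_fresh(self, current_step):
        """Return prompt ids that are still fresh and empty the buffer."""
        fresh = [
            pid
            for added, pid in self._items
            if current_step - added <= self.max_off_steps
        ]
        self._items.clear()
        return fresh


buf = StalenessBuffer(max_off_steps=4)
buf.add("hard-1", step=0)
buf.add("hard-2", step=3)
fresh = buf.pop_fresh(current_step=5)  # "hard-1" is 5 steps old and is dropped
```

Bounding staleness keeps buffered data from drifting too far from the current policy, since entries collected many steps ago reflect an older model.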

Configuration (config/pc_trainer.yaml) sets the key hyperparameters for Pilot-Commit:

```yaml
algorithm:
  diversity_threshold_upper: 0.25   # p_upper: skip prompts with success rate > threshold
  diversity_threshold_lower: 0.125  # p_lower: defer prompts with success rate < threshold
  exclude_threshold_upper: 1.0      # p_solve: evict prompts exceeding this threshold
  buffer_max_off_steps: 4           # Maximum staleness for replay buffer
  exploration:
    n: 8                            # Number of pilot rollouts per prompt
```
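One way to read the three thresholds is as a decision rule on the pilot success rate. The sketch below follows the comments in the configuration but is not taken from `utils.py`: `classify_prompt` and the action labels are hypothetical, and the eviction test uses `>=` (an assumption) so that the default `p_solve` of 1.0 can trigger for fully solved prompts.

```python
def classify_prompt(success_rate, p_lower=0.125, p_upper=0.25, p_solve=1.0):
    """Map a pilot success rate to a hypothetical action label."""
    if success_rate >= p_solve:   # assumption: ">=" so that 1.0 is reachable
        return "evict"            # solved: drop from future sampling
    if success_rate > p_upper:
        return "skip"             # too easy at this step
    if success_rate < p_lower:
        return "defer"            # too hard for now
    return "commit"               # spend the remaining rollouts here


# With 8 pilot rollouts, the success rate is k / 8 for k successes.
n_pilot = 8
actions = {k: classify_prompt(k / n_pilot) for k in range(n_pilot + 1)}
```

Under these defaults and 8 pilot rollouts, prompts with 1 or 2 successes are committed, prompts with 0 successes are deferred, prompts with 3 to 7 successes are skipped, and prompts with 8 successes are evicted.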

Citation

If you find this work useful, please cite:

```bibtex
@article{kim2025pilotcommit,
  title   = {Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training},
  author  = {Kim, Woojeong and Yang, Ziyi and Yan, Jing Nathan and Liu, Jialu},
  year    = {2025},
  journal = {arXiv preprint}
}
```

Acknowledgements

This implementation is built on top of verl (HybridFlow). We thank the verl team for their excellent RL training framework.
