RepoAmazon (Nova)Amazon (Nova)published May 12, 2026seen 5d

amazon-science/PROF-GRPO

Python

Open original ↗

Captured source

source ↗
published May 12, 2026seen 5dcaptured 14hhttp 200method plain

amazon-science/PROF-GRPO

Language: Python

License: Apache-2.0

Stars: 3

Forks: 0

Open issues: 1

Created: 2026-05-12T19:43:55Z

Pushed: 2026-06-10T17:16:54Z

Default branch: main

Fork: no

Archived: no

README:

Table of Contents

  • [Introduction](#introduction)

Introduction

Current Dilemma: Process vs. Outcome

Current Reinforcement learning with verifiable rewards (RLVR) faces a challenge:

  • Outcome Reward Models (ORMs): They are "results-oriented," caring only whether the final answer is correct. This is reliable but also "blind"—they cannot distinguish between a logically perfect solution and one with flawed reasoning steps but happens to be correct.
  • Process Reward Models (PRMs): They are "process-oriented" and can evaluate each step of the reasoning. This provides finer-grained guidance, but PRMs themselves are often inaccurate and susceptible to "reward hacking" by the model.

Simply blending the two often introduces unstable training signals and can even lead to worse performance.

Our Solution: PROF - PRocess cOnsistency Filter

We propose PRocess cOnsistency Filter (PROF), an effective data process curation method that the strengths of process rewards (PRMs) and outcome rewards (ORMs). The core idea of PROF is not to use the PRM's score directly for training, but rather to use it as a "consistency filter" to select the highest-quality training data. Specifically, PROF filters the correct and incorrect samples separately and remove the most deceptive "noise":

  • Correct: remove the responses with the lowest PRM values averaged over steps
  • Incorrect: remove the responses with the highest PRM values averaged over steps

The number to remove in correct and incorrect group is to balance the correct-incorrect ratio.

Main Results

| Model | Algorithm | Math500 | Minerva Math | Olympiad Bench | AIME24 | AMC23 | Average | | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | | Qwen2.5-Math-1.5B-base | Base | 39.9 | 11.4 | 19.1 | 3.5 | 23.6 | 19.5 | | | GRPO | 70.3 | 29.1 | 33.0 | 9.0 | 44.5 | 37.2 | | | Blend | 67.6 | 27.8 | 31.1 | 7.7 | 42.5 | 35.3 | | | PROF-GRPO | 73.2 | 30.0 | 36.1 | 9.6 | 49.1 | 39.6 | | Qwen2.5-Math-7B-base | Base | 42.0 | 12.8 | 19.2 | 12.9 | 30.0 | 23.4 | | | GRPO | 81.6 | 37.2 | 45.5 | 20.6 | 64.4 | 49.9 | | | Blend | 81.7 | 36.7 | 45.0 | 15.2 | 58.0 | 47.3 | | | PROF-GRPO | 83.1 | 39.0 | 47.8 | 17.5 | 70.9 | 51.7 |

_Table 1: Performance of different algorithms across five benchmarks including Math500, Minerva Math, Olympiad Bench, AMC2023 and AIME2024. We denote Blend-PRM-GRPO by Blend for short. We tune all the algorithms to their best performance and refer the readers to Appendix for the detailed parameter setup. The reported accuracy is average@16 under temperature 1.0._

Main Takeways:

1. PROF-GRPO v.s. Vanilla GRPO: Compared to vanilla GRPO, PROF-GRPO consistently improves final accuracy by around $2\\%$. More improtantly, PROF-GRPO can effectively differentiate and filter out the flawed responses with correct final answer and reduce misleading gradients, while ORM cannot distinguish those samples. An example of flawed responses filtered out by PROF is presented below.

2. PROF-GRPO v.s. Blend: Blend denotes naively mixing PRM with ORM in the objective and is a common way to apply PRM into RL training. As shown in the figure and table above, Blend is outperformed by PROF-GRPO by over $4\\%$ and suffers from severe reward hacking due to the mixing gradients of PRM and ORM. 3. Improved Reasoning Quality: Our method not only increases the accuracy of the final answer but also strengthens the quality and reliability of the model's intermediate reasoning steps by segmenting CoT into more simple and easy-to-verify steps.

4. Filter separately: By comparing the performance with w/o Separation, which ranks and filters correct and incorrect samples together, we find that separate then into two subgroups and balance the correct/incorrect ratio is essential. Additionally, filtering only correct samples with consistency and incorrect samples randomly (Filter-Correct) performs comparably to filtering both correct and incorrect samples with consistency (Filter-Both), while "Both" converges more efficiently. This implies that the quality of correct responses is the crucial part of online training. Moreover, through experiments on more rollout numbers and LLaMA-3.2-3B-instruct, we find that for robustness, especially when the PRM is untrustworthy or out-of-distribution, Filter-correct is the safer option as it constrains the PRM's influence.

Quick Start

1) Environment Setup

# 1. Create and activate a clean python environment
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip

# 2. Install project package
pip install -e .

# 3. Optional extras (pick based on your runtime backend)
# vLLM rollout
pip install -e ".[vllm]"
# math evaluation utilities
pip install -e ".[math]"

If you use GPU training, make sure your CUDA/PyTorch stack matches your machine.

2) Data Preprocessing

# Build Math500 parquet files
python scripts/data_preprocess/math_dataset.py \
--local_dir "$HOME/PRM_filter/data/math500"

# Build Numina-Math parquet files
python scripts/data_preprocess/numina_math.py \
--local_dir "$HOME/PRM_filter/data/numina_math"

After preprocessing, confirm these files exist:

  • $HOME/PRM_filter/data/math500/normal_train.parquet
  • $HOME/PRM_filter/data/math500/normal_test.parquet
  • $HOME/PRM_filter/data/numina_math/normal_train.parquet

3) Run Training Scripts (scripts/)

All scripts are under scripts/:

  • run_grpo.sh: GRPO baseline
  • run_prof_grpo.sh: PROF-GRPO (both correct/incorrect filtering)
  • run_prof_grpo_correct.sh: PROF-GRPO (correct-side filtering focus)
  • run_blend_grpo.sh: Blend-PRM-GRPO baseline
  • process_ckp.sh: checkpoint merge/upload helper

Run examples:

# GRPO baseline
bash scripts/run_grpo.sh

# PROF-GRPO (both)
bash scripts/run_prof_grpo.sh

# PROF-GRPO (correct)
bash scripts/run_prof_grpo_correct.sh

# Blend baseline
bash scripts/run_blend_grpo.sh

Training logs are written to logs//.log.

4) Evaluation and Plotting

# Evaluate a model checkpoint (adjust paths/model args as needed)
python eval/main_eval.py \
--model_name_or_path "Qwen/Qwen2.5-Math-7B" \
--dataset_name_or_path…

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Low stars, routine research repo