QwenLM/RationaleRM
Language: Python
Stars: 35
Forks: 3
Open issues: 2
Created: 2026-02-02T14:15:02Z
Pushed: 2026-03-18T08:49:41Z
Default branch: main
Fork: no
Archived: no
README:
---
📖 Overview
RationaleRM is a research project that investigates how to align not just the *outcomes* but also the *reasoning processes* of reward models with human judgments. We find that generative reward models (GenRMs) and LLM-as-a-Judge systems exhibit Deceptive Alignment: a model may reach the same final verdict as humans through superficial or even incorrect reasoning.
To address this, we propose the Rationale Consistency metric, which measures the alignment between the model's reasoning process and human judgment rationales. We also design the MetaJudge framework to compute this metric: it decomposes human and model rationales into atomic units, then performs strict one-to-one semantic matching to precisely quantify their consistency.
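As a rough illustration of the strict one-to-one matching idea, the sketch below pairs each human atomic rationale with at most one model rationale and reports the fraction of human rationales that found a partner. Everything named here is an assumption made for illustration: MetaJudge uses an LLM to decide whether two rationales match, whereas `token_overlap` is a crude stand-in, and `rationale_consistency`, the greedy pairing order, and the 0.5 threshold are hypothetical choices rather than the project's implementation.

```python
from typing import Callable, List


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase tokens; a stand-in for the LLM matcher."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def rationale_consistency(
    human: List[str],
    model: List[str],
    similarity: Callable[[str, str], float] = token_overlap,
    threshold: float = 0.5,
) -> float:
    """Fraction of human rationales matched by a distinct model rationale."""
    if not human:
        return 0.0
    # Score every (human, model) pair and visit the best candidates first.
    pairs = sorted(
        (
            (similarity(h, m), i, j)
            for i, h in enumerate(human)
            for j, m in enumerate(model)
        ),
        reverse=True,
    )
    used_human, used_model = set(), set()
    for score, i, j in pairs:
        if score < threshold:
            break  # remaining pairs are all below the match threshold
        # One-to-one: each rationale may take part in at most one match.
        if i in used_human or j in used_model:
            continue
        used_human.add(i)
        used_model.add(j)
    return len(used_human) / len(human)
```

Because a model rationale is consumed once it is matched, repeating the same point several times cannot cover more than one human rationale.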
Core Contributions:
- 🔍 MetaJudge Framework: Decomposes human rationales into atomic units and uses LLMs for strict one-to-one semantic matching
- 📊 Rationale Consistency Metric: Detects deceptive alignment and separates frontier models (e.g., GPT-5, Gemini 3 Pro) that outcome accuracy alone cannot tell apart
- 🛠️ Hybrid Reward Training: Combines rationale reward (Average Precision) and outcome reward to prevent "rationale degeneration"
- 🏆 SOTA Performance: Achieves best results on RM-Bench (87.1%) and JudgeBench (82.0%)
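To make the hybrid reward idea more concrete, here is a minimal sketch of Average Precision computed over a model's checklist and blended with a binary outcome reward. The README states only that the two rewards are combined; the weighted sum, the `alpha` parameter, the choice to normalise by the number of human rationales, and the function names are all assumptions for illustration, not the training code.

```python
from typing import Sequence


def average_precision(matched: Sequence[bool], num_human: int) -> float:
    """Average Precision over the model's checklist in its output order.

    matched[k] is True when the k-th model item was matched to a human
    rationale. num_human is the number of human rationales, used here
    as the normaliser (an assumption).
    """
    if num_human <= 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, is_match in enumerate(matched, start=1):
        if is_match:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / num_human


def hybrid_reward(
    outcome_correct: bool,
    matched: Sequence[bool],
    num_human: int,
    alpha: float = 0.5,  # hypothetical weight on the rationale reward
) -> float:
    """Weighted sum of outcome reward and rationale reward (assumed form)."""
    outcome = 1.0 if outcome_correct else 0.0
    rationale = average_precision(matched, num_human)
    return (1.0 - alpha) * outcome + alpha * rationale
```

Under this assumed form, a model that picks the right answer but whose rationales match nothing earns only the outcome share of the reward, which is the kind of pressure intended to discourage rationale degeneration.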
---
🚨 Key Finding: The Deceptive Alignment Trap
We evaluated 19 frontier models and found two critical flaws when relying solely on outcome accuracy:
Outcome Accuracy Cannot Distinguish Frontier Models
In the green region, although multiple models achieve similar outcome accuracy, rationale consistency clearly distinguishes stronger models (such as GPT-5, o3, Gemini 3 Pro) from weaker ones (such as Claude 3.5, GPT-4.1).
Outcome Accuracy Cannot Detect Deceptive Alignment
The most typical example is the comparison between o3 and o3-mini: both have similar outcome accuracy, but o3-mini's rationale consistency is nearly 50% lower. o3-mini relies on surface cues (such as formatting, emojis) to make judgments, while o3 performs rigorous fact-checking like humans do.
> 💡 Key Insight: Models can make correct choices for wrong reasons. Outcome accuracy alone cannot detect this deceptive alignment.
---
📉 Training Finding: Outcome-Only Supervision Leads to Rationale Degeneration
Training dynamics comparison: Similar outcome rewards, but significantly different rationale rewards
The figure above shows a key finding during training: outcome-only supervision leads to a continuous decline in how consistent the model's reasoning process is with human rationales.
- Left: Both methods achieve nearly identical outcome rewards, indicating models can learn to select correct answers
- Right: Rationale rewards show significant divergence — without rationale consistency constraints, model rationale rewards continuously decline, ultimately 24.2% lower than our method
This reveals the Rationale Degeneration phenomenon: when intermediate reasoning processes are not incentivized, models abandon high-cost evidence verification and instead rely on cheaper surface cues to achieve similar outcome rewards.
---
🏆 Main Results
We evaluate on two challenging benchmarks:
- RM-Bench: Evaluates model ability to distinguish subtle differences and style biases
- JudgeBench: Emphasizes deep judgment and logical reasoning
| Model | RM-Bench | JudgeBench | Avg |
| :--- | :---: | :---: | :---: |
| Generative Reward Models | | | |
| RM-R1-Distilled-Qwen-32B | 83.9 | 78.8 | 81.4 |
| RRM-32B | 73.1 | 75.7 | 74.4 |
| Nemotron-Super-49B | 82.7 | 77.2 | 80.0 |
| RewardAnything-8B-v1 | 83.1 | 62.6 | 72.9 |
| GRAM-R² | 85.7 | 81.0 | 83.4 |
| Outcome-Only Baselines | | | |
| Qwen3-14B (Outcome-Only) | 83.6 | 70.0 | 76.8 |
| Qwen3-30B-A3B (Outcome-Only) | 84.9 | 75.7 | 80.3 |
| Our Method (Outcome + Rationale) | | | |
| Qwen3-14B (Ours) | 86.7 | 79.1 | 82.9 |
| Qwen3-30B-A3B (Ours) | 87.1 | 82.0 | 84.6 |
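The Avg column appears to be the arithmetic mean of the RM-Bench and JudgeBench scores, rounded half up to one decimal place. The check below reproduces it from the table values; `mean_one_decimal` and `ROWS` are names introduced here for illustration, and the rounding rule is inferred from the numbers rather than stated by the authors.

```python
from decimal import Decimal, ROUND_HALF_UP

# (RM-Bench, JudgeBench, reported Avg), copied from the results table.
ROWS = {
    "RM-R1-Distilled-Qwen-32B": ("83.9", "78.8", "81.4"),
    "RRM-32B": ("73.1", "75.7", "74.4"),
    "Nemotron-Super-49B": ("82.7", "77.2", "80.0"),
    "RewardAnything-8B-v1": ("83.1", "62.6", "72.9"),
    "GRAM-R²": ("85.7", "81.0", "83.4"),
    "Qwen3-14B (Outcome-Only)": ("83.6", "70.0", "76.8"),
    "Qwen3-30B-A3B (Outcome-Only)": ("84.9", "75.7", "80.3"),
    "Qwen3-14B (Ours)": ("86.7", "79.1", "82.9"),
    "Qwen3-30B-A3B (Ours)": ("87.1", "82.0", "84.6"),
}


def mean_one_decimal(a: str, b: str) -> Decimal:
    """Mean of two scores, rounded half up to one decimal place."""
    mean = (Decimal(a) + Decimal(b)) / 2
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Collect any rows whose reported Avg differs from the recomputed mean.
mismatches = [
    name
    for name, (rm_bench, judge_bench, avg) in ROWS.items()
    if mean_one_decimal(rm_bench, judge_bench) != Decimal(avg)
]
print(mismatches)
```

`Decimal` is used instead of floats so that values ending in 5, such as 81.35, round the same way on every platform.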
> 💡 Our method effectively reverses the rationale consistency decline observed during outcome-only training (from 25% to 37%).
---
🚀 Quick Start
Project Structure
RationaleRM/
├── metajudge_infer.py                      # Semantic matching inference script
├── metajudge_infer.sh                      # Shell script for running inference
├── metajudge_analysis.py                   # Analysis script for computing metrics
├── images/                                 # Images
│   ├── overall_compare.png
│   └── reward_compare.png
├── data/                                   # Datasets
│   ├── helpsteer3_test_1000.jsonl          # Test set: 1000 samples
│   └── helpsteer3_human_checklist.jsonl    # Full dataset (22,116 samples)
└── example/                                # Example data for testing
    ├── infer_input_10samples.jsonl
    ├── model-low_deceptive_alignment.jsonl
    └── model-high_deceptive_alignment.jsonl
Step 1: Prepare Data
Input data should be in JSONL format with the following fields:
- human-checklist: List of human atomic rationales (reference)
- {model}-checklist: List of model-generated atomic rationales to be evaluated
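Before running inference it can help to confirm that every record carries both checklist fields. The helper below is a hypothetical pre-flight check, not part of the repository: `check_record` and `load_jsonl` are names invented here, and the only facts taken from the README are the `human-checklist` and `{model}-checklist` field names and the JSONL format.

```python
import json
from typing import Any, Dict, List


def check_record(record: Dict[str, Any], model_name: str) -> List[str]:
    """Return a list of problems found in one input record (empty if none)."""
    problems = []
    for key in ("human-checklist", f"{model_name}-checklist"):
        value = record.get(key)
        # Each checklist is expected to be a list of rationale strings.
        if not isinstance(value, list) or not all(isinstance(x, str) for x in value):
            problems.append(f"{key} must be a list of strings")
    return problems


def load_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

A typical use would be to call `check_record(rec, "model-low_deceptive_alignment")` on each record returned by `load_jsonl` and stop if any problem list is non-empty.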
Example:
{
  "domain": "general",
  "context": [...],
  "response1": "...",
  "response2": "...",
  "human-checklist": [
    "Response 1 lacks polysyllabic rhymes",
    "Response 2's meter is inconsistent"
  ],
  "model-low_deceptive_alignment-checklist": [
    "Response A's rhyme scheme is forced",
    "Response B's rhythm feels awkward"
  ]
}
Step 2: Run Inference
The inference script evaluates how well each model-generated checklist item matches the human checklist:
# Set environment variables
export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="https://api.openai.com/v1"  # Optional, defaults to OpenAI

# Run inference
python metajudge_infer.py \
    --input-file data/helpsteer3_test_1000.jsonl \
    --output-file output/results.jsonl \
    --model gpt-4o \
    --model-be-evaluated model-low_deceptive_alignment \
    --concurrent-requests 5
Or use the shell script:
bash metajudge_infer.sh
Key parameters:
- --input-file: Path to input JSONL file
- --output-file: Path for output results
- --model: LLM model for semantic matching (e.g., gpt-4o, qwen-plus)
- --model-be-evaluated: The critic model whose checklist will be evaluated
- --concurrent-requests: Number of parallel API requests
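When inference is launched from Python rather than a shell, the documented flags can be assembled programmatically. This is only a convenience sketch: `build_infer_command` and `build_env` are hypothetical helpers, and the flag names, environment variable names, and default base URL are taken from the commands shown above.

```python
import os
from typing import Dict, List


def build_infer_command(
    input_file: str,
    output_file: str,
    model: str,
    model_be_evaluated: str,
    concurrent_requests: int = 5,
) -> List[str]:
    """Assemble the metajudge_infer.py invocation as an argument list."""
    return [
        "python", "metajudge_infer.py",
        "--input-file", input_file,
        "--output-file", output_file,
        "--model", model,
        "--model-be-evaluated", model_be_evaluated,
        "--concurrent-requests", str(concurrent_requests),
    ]


def build_env(api_key: str, base_url: str = "https://api.openai.com/v1") -> Dict[str, str]:
    """Copy the current environment and add the API settings."""
    env = dict(os.environ)
    env["OPENAI_API_KEY"] = api_key
    env["OPENAI_BASE_URL"] = base_url
    return env
```

The two results can be passed to `subprocess.run(cmd, env=env, check=True)` from the repository root; passing the command as a list avoids shell quoting issues with file paths.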
API configuration (via environment…