What does this repo signal mean?

Qwen (Alibaba Cloud) published QwenLM/RationaleRM (Python). Repository signals like this one can expose tooling, eval, infrastructure, or model-adjacent work before it appears in a launch post. High-signal details: repo QwenLM/RationaleRM · language Python · low-star repo from the Qwen team. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Captured source

GitHub/github.com/QwenLM/RationaleRM

QwenLM/RationaleRM repository metadata

published Feb 2, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

QwenLM/RationaleRM

Language: Python

Stars: 35

Forks: 3

Open issues: 2

Created: 2026-02-02T14:15:02Z

Pushed: 2026-03-18T08:49:41Z

Default branch: main

Fork: no

Archived: no

README:

---

📖 Overview

RationaleRM is a research project that investigates how to align not just the *outcomes* but also the *reasoning processes* of reward models with human judgments. We find that generative reward models (GenRMs) and LLM-as-a-Judge setups exhibit Deceptive Alignment: a model may reach the same final result as humans through superficial or even incorrect reasoning.

To address this, we propose the Rationale Consistency metric, which measures the alignment between the model's reasoning process and human judgment rationales. We also design the MetaJudge framework to compute this metric: it decomposes human and model rationales into atomic units, then performs strict one-to-one semantic matching to precisely quantify their consistency.
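The one-to-one matching idea can be illustrated with a minimal sketch. It assumes a pairwise judgment function is already available: in the project that role is played by an LLM, while here a hypothetical keyword check (`toy_match`) stands in for it. Scoring consistency as the fraction of human rationales that find a partner is also an assumption, not the repository's confirmed definition.

```python
from typing import Callable, List, Set, Tuple

# Hypothetical keyword-to-topic map standing in for the LLM judge.
TOPICS = {"rhyme": "rhyme", "rhymes": "rhyme", "meter": "meter", "rhythm": "meter"}


def toy_match(human_item: str, model_item: str) -> bool:
    """Toy semantic match: both rationales mention the same topic."""
    def topics(text: str) -> Set[str]:
        return {TOPICS[word] for word in text.lower().split() if word in TOPICS}
    return bool(topics(human_item) & topics(model_item))


def one_to_one_match(
    human: List[str],
    model: List[str],
    is_match: Callable[[str, str], bool],
) -> List[Tuple[int, int]]:
    """Greedily pair each human rationale with at most one unused model rationale."""
    used: Set[int] = set()
    pairs: List[Tuple[int, int]] = []
    for h_idx, h_item in enumerate(human):
        for m_idx, m_item in enumerate(model):
            if m_idx not in used and is_match(h_item, m_item):
                used.add(m_idx)  # a model rationale can be matched only once
                pairs.append((h_idx, m_idx))
                break
    return pairs


def rationale_consistency(
    human: List[str],
    model: List[str],
    is_match: Callable[[str, str], bool] = toy_match,
) -> float:
    """Assumed score: share of human rationales that found a partner."""
    if not human:
        return 0.0
    return len(one_to_one_match(human, model, is_match)) / len(human)


human = ["Response 1 lacks polysyllabic rhymes", "Response 2's meter is inconsistent"]
model = ["Response A's rhyme scheme is forced", "Response B's rhythm feels awkward"]
print(rationale_consistency(human, model))
```

The `used` set is what makes the matching strict: two human rationales cannot both claim the same model rationale, so a model cannot inflate its score by restating one point several ways.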

Core Contributions:

🔍 MetaJudge Framework: Decomposes human rationales into atomic units and uses LLMs for strict one-to-one semantic matching
📊 Rationale Consistency Metric: Effectively detects deceptive alignment and distinguishes frontier models (e.g., GPT-5 or Gemini 3 Pro)
🛠️ Hybrid Reward Training: Combines rationale reward (Average Precision) and outcome reward to prevent "rationale degeneration"
🏆 SOTA Performance: Achieves best results on RM-Bench (87.1%) and JudgeBench (82.0%)
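
The rationale reward is described as Average Precision. As a rough sketch, the standard Average Precision over an ordered list of binary match flags looks like the code below. This excerpt does not say how the flags are ordered or how the rationale and outcome rewards are combined, so the `hybrid_reward` function, its weighted sum, and the `alpha` parameter are assumptions for illustration only.

```python
from typing import Optional, Sequence


def average_precision(hits: Sequence[bool], n_relevant: Optional[int] = None) -> float:
    """Standard Average Precision over an ordered list of binary match flags.

    n_relevant is the total number of relevant items; when omitted, the
    number of hits in the list is used as the denominator.
    """
    precision_sum = 0.0
    found = 0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            found += 1
            precision_sum += found / rank  # precision at this rank
    denom = found if n_relevant is None else n_relevant
    return precision_sum / denom if denom else 0.0


def hybrid_reward(outcome_correct: bool, hits: Sequence[bool], alpha: float = 0.5) -> float:
    """Assumed combination: weighted sum of outcome and rationale rewards."""
    outcome = 1.0 if outcome_correct else 0.0
    return alpha * outcome + (1.0 - alpha) * average_precision(hits)


# A correct verdict backed by one matched and one unmatched rationale.
print(hybrid_reward(True, [True, False]))
```

Under this assumed combination, a model that picks the right answer with no matched rationales earns only the outcome share of the reward, which is the pressure the hybrid scheme is meant to apply.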

---

🚨 Key Finding: The Deceptive Alignment Trap

We evaluated 19 frontier models and found two critical flaws when relying solely on outcome accuracy:

Outcome Accuracy Cannot Distinguish Frontier Models

In the green region, although multiple models achieve similar outcome accuracy, rationale consistency clearly distinguishes stronger models (such as GPT-5, o3, Gemini 3 Pro) from weaker ones (such as Claude 3.5, GPT-4.1).

Outcome Accuracy Cannot Detect Deceptive Alignment

The most typical example is the comparison between o3 and o3-mini: both have similar outcome accuracy, but o3-mini's rationale consistency is nearly 50% lower. o3-mini relies on surface cues (such as formatting, emojis) to make judgments, while o3 performs rigorous fact-checking like humans do.

> 💡 Key Insight: Models can make correct choices for wrong reasons. Outcome accuracy alone cannot detect this deceptive alignment.

---

📉 Training Finding: Outcome-Only Supervision Leads to Rationale Degeneration

Training dynamics comparison: Similar outcome rewards, but significantly different rationale rewards

The figure above shows a key finding during training: outcome-only supervision leads to continuous decline in model-human reasoning process consistency.

Left: Both methods achieve nearly identical outcome rewards, indicating models can learn to select correct answers
Right: Rationale rewards show significant divergence — without rationale consistency constraints, model rationale rewards continuously decline, ultimately 24.2% lower than our method

This reveals the Rationale Degeneration phenomenon: when intermediate reasoning processes are not incentivized, models abandon high-cost evidence verification and instead rely on cheaper surface cues to achieve similar outcome rewards.

---

🏆 Main Results

We evaluate on two challenging benchmarks:

RM-Bench: Evaluates model ability to distinguish subtle differences and style biases
JudgeBench: Emphasizes deep judgment and logical reasoning

| Model | RM-Bench | JudgeBench | Avg |
| :--- | :---: | :---: | :---: |
| Generative Reward Models | | | |
| RM-R1-Distilled-Qwen-32B | 83.9 | 78.8 | 81.4 |
| RRM-32B | 73.1 | 75.7 | 74.4 |
| Nemotron-Super-49B | 82.7 | 77.2 | 80.0 |
| RewardAnything-8B-v1 | 83.1 | 62.6 | 72.9 |
| GRAM-R² | 85.7 | 81.0 | 83.4 |
| Outcome-Only Baselines | | | |
| Qwen3-14B (Outcome-Only) | 83.6 | 70.0 | 76.8 |
| Qwen3-30B-A3B (Outcome-Only) | 84.9 | 75.7 | 80.3 |
| Our Method (Outcome + Rationale) | | | |
| Qwen3-14B (Ours) | 86.7 | 79.1 | 82.9 |
| Qwen3-30B-A3B (Ours) | 87.1 | 82.0 | 84.6 |
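
The Avg column appears to be the mean of the two benchmark scores, rounded half-up to one decimal place. The short check below recomputes it from the reported numbers and also derives the average-score gain of the hybrid method over its outcome-only baseline. The helper names are illustrative, and `Decimal` is used so that values such as 81.35 round the same way as in the table.

```python
from decimal import Decimal, ROUND_HALF_UP

# (RM-Bench, JudgeBench, reported Avg), copied from the results table.
ROWS = {
    "RM-R1-Distilled-Qwen-32B": ("83.9", "78.8", "81.4"),
    "RRM-32B": ("73.1", "75.7", "74.4"),
    "Nemotron-Super-49B": ("82.7", "77.2", "80.0"),
    "RewardAnything-8B-v1": ("83.1", "62.6", "72.9"),
    "GRAM-R²": ("85.7", "81.0", "83.4"),
    "Qwen3-14B (Outcome-Only)": ("83.6", "70.0", "76.8"),
    "Qwen3-30B-A3B (Outcome-Only)": ("84.9", "75.7", "80.3"),
    "Qwen3-14B (Ours)": ("86.7", "79.1", "82.9"),
    "Qwen3-30B-A3B (Ours)": ("87.1", "82.0", "84.6"),
}


def mean_half_up(a: str, b: str) -> Decimal:
    """Mean of two scores, rounded half-up to one decimal place."""
    mean = (Decimal(a) + Decimal(b)) / 2
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Rows whose reported Avg differs from the recomputed mean.
mismatches = [
    name for name, (rm, jb, avg) in ROWS.items()
    if mean_half_up(rm, jb) != Decimal(avg)
]

# Average-score gain of the hybrid method over the outcome-only baseline.
gains = {
    size: Decimal(ROWS[f"{size} (Ours)"][2]) - Decimal(ROWS[f"{size} (Outcome-Only)"][2])
    for size in ("Qwen3-14B", "Qwen3-30B-A3B")
}

print(mismatches, gains)
```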

> 💡 Our method effectively reverses the rationale consistency decline observed during outcome-only training (from 25% to 37%).

---

🚀 Quick Start

Project Structure

RationaleRM/
├── metajudge_infer.py # Semantic matching inference script
├── metajudge_infer.sh # Shell script for running inference
├── metajudge_analysis.py # Analysis script for computing metrics
├── images/ # Images
│ ├── overall_compare.png
│ └── reward_compare.png
├── data/ # Datasets
│ ├── helpsteer3_test_1000.jsonl # Test set: 1000 samples
│ └── helpsteer3_human_checklist.jsonl # Full dataset (22,116 samples)
└── example/ # Example data for testing
    ├── infer_input_10samples.jsonl
    ├── model-low_deceptive_alignment.jsonl
    └── model-high_deceptive_alignment.jsonl

Step 1: Prepare Data

Input data should be in JSONL format with the following fields:

human-checklist: List of human atomic rationales (reference)
{model}-checklist: List of model-generated atomic rationales to be evaluated

Example:

{
  "domain": "general",
  "context": [...],
  "response1": "...",
  "response2": "...",
  "human-checklist": [
    "Response 1 lacks polysyllabic rhymes",
    "Response 2's meter is inconsistent"
  ],
  "model-low_deceptive_alignment-checklist": [
    "Response A's rhyme scheme is forced",
    "Response B's rhythm feels awkward"
  ]
}
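
Before running inference, it can help to confirm that each record carries the two checklist fields. The sketch below is not part of the repository: `iter_records` and `checklist_problems` are hypothetical helpers, and the rule that both checklists must be lists of strings is inferred from the example above.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def iter_records(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Parse JSONL content (an open file or any iterable of lines), skipping blanks."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def checklist_problems(record: Dict[str, Any], model_name: str) -> List[str]:
    """Return a message for each checklist field that is missing or malformed."""
    problems = []
    for key in ("human-checklist", f"{model_name}-checklist"):
        value = record.get(key)
        if not isinstance(value, list) or not all(isinstance(item, str) for item in value):
            problems.append(f"{key}: expected a list of strings")
    return problems


# One in-memory record shaped like the example above.
sample_line = json.dumps({
    "domain": "general",
    "human-checklist": ["Response 1 lacks polysyllabic rhymes"],
    "model-low_deceptive_alignment-checklist": ["Response A's rhyme scheme is forced"],
})

for record in iter_records([sample_line]):
    print(checklist_problems(record, "model-low_deceptive_alignment"))
```

To check a file, pass an open handle instead of a list, for example `iter_records(open("example/infer_input_10samples.jsonl", encoding="utf-8"))`.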

Step 2: Run Inference

The inference script evaluates how well each model-generated checklist item matches the human checklist:

# Set environment variables
export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="https://api.openai.com/v1" # Optional, defaults to OpenAI

# Run inference
python metajudge_infer.py \
    --input-file data/helpsteer3_test_1000.jsonl \
    --output-file output/results.jsonl \
    --model gpt-4o \
    --model-be-evaluated model-low_deceptive_alignment \
    --concurrent-requests 5

Or use the shell script:

bash metajudge_infer.sh

Key parameters:

--input-file: Path to input JSONL file
--output-file: Path for output results
--model: LLM used for semantic matching (e.g., gpt-4o, qwen-plus)
--model-be-evaluated: The critic model whose checklist will be evaluated
--concurrent-requests: Number of parallel API requests
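
To evaluate several critic models in a loop, the same invocation can be driven from Python. This is a sketch under assumptions: `build_command` and `run_inference` are hypothetical wrappers, not repository code. Only the flags listed above are used, and the script is assumed to be launched from the repository root.

```python
import os
import subprocess
import sys
from typing import List


def build_command(
    input_file: str,
    output_file: str,
    judge_model: str,
    evaluated_model: str,
    concurrent_requests: int = 5,
) -> List[str]:
    """Assemble the metajudge_infer.py call from the documented flags."""
    return [
        sys.executable, "metajudge_infer.py",
        "--input-file", input_file,
        "--output-file", output_file,
        "--model", judge_model,
        "--model-be-evaluated", evaluated_model,
        "--concurrent-requests", str(concurrent_requests),
    ]


def run_inference(cmd: List[str]) -> None:
    """Fail early if the API key is missing, then run the script."""
    if not os.environ.get("OPENAI_API_KEY"):
        raise RuntimeError("OPENAI_API_KEY is not set")
    subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit


cmd = build_command(
    "data/helpsteer3_test_1000.jsonl",
    "output/results.jsonl",
    "gpt-4o",
    "model-low_deceptive_alignment",
)
print(" ".join(cmd[1:]))
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.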

API configuration (via environment...

Excerpt shown — open the source for the full document.

Notability

notability 4.0/10

Low-star repo from Qwen team