Amazon (Nova), published May 26, 2026

Diverse reasoning traces teach LLMs to make better decisions


Training LLMs to reason in parallel: How global forking tokens improve accuracy - Amazon Science


Conversational AI


How to train language models to generate diverse, accurate reasoning paths using tokens that control distinct reasoning strategies.

By Sheng Jia, Xiao Wang, Shiva Kasiviswanathan

May 26, 2026

5 min read


Overview by Amazon Nova

Amazon researchers introduce set-supervised fine-tuning (SSFT) and global forking policy optimization (GFPO) to train language models that generate diverse reasoning paths. SSFT and GFPO improve single-shot accuracy on AIME 2025 and LiveCodeBench benchmarks without mode collapse. Global forking tokens are used to elicit distinct reasoning modes, enabling the model to produce diverse, high-quality reasoning paths. SSFT models reasoning as a set of complete solution paths, while GFPO selects the most effective reasoning mode for each input. The combined approach of SSFT and GFPO results in gains of 5% to 7% in single-shot accuracy on standard benchmarks.


Large language models (LLMs) are pretrained on huge volumes of unlabeled data, but afterward, they're typically post-trained on specific tasks such as instruction following, avoiding harmful outputs, and reasoning, or providing justifications for the outputs they generate.

Parallel reasoning, in which multiple, diverse reasoning paths are generated and compared for the same problem, is emerging as a key tool for understanding the limits of LLMs' reasoning capability. It also underpins techniques for testing LLMs such as self-consistency, where multiple reasoning paths are aggregated to improve accuracy.

LLMs are generally optimized for reasoning through supervised fine-tuning (SFT), in which each training example is labeled with a single, human-verified reasoning trace. Given the usefulness of parallel reasoning for evaluation, the question naturally arises: Can we expand the limits of LLMs' reasoning capacities by training them on diverse reasoning traces for each question?

In a paper we presented at this year's International Conference on Learning Representations (ICLR), we propose a method for doing just that, which avoids some previously identified pitfalls of parallel reasoning.
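As a rough illustration of the self-consistency idea mentioned above, the sketch below aggregates the final answers of several independently sampled reasoning paths by majority vote. The function name `majority_vote` and the sample answers are assumptions made for this example; they are not taken from the paper, and real pipelines typically normalize answers before comparing them.

```python
from collections import Counter
from typing import List


def majority_vote(final_answers: List[str]) -> str:
    """Return the answer that the largest number of reasoning paths agree on.

    `final_answers` holds one extracted final answer per sampled path.
    This is a minimal sketch: answers are compared as exact strings.
    """
    if not final_answers:
        raise ValueError("need at least one reasoning path")
    counts = Counter(final_answers)
    # most_common(1) returns a one-element list of (answer, count) pairs
    answer, _count = counts.most_common(1)[0]
    return answer


# Hypothetical final answers from five sampled reasoning paths
sampled = ["408", "408", "418", "408", "398"]
voted = majority_vote(sampled)
print(voted)
```

Here three of the five paths agree on "408", so that answer is returned even though two paths went wrong.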

For each question, we gather multiple reasoning traces from different models and sources, capturing diverse solution strategies that serve as supervision for parallel reasoning.

To prompt a single LLM to adopt different reasoning strategies, we introduce a set of global forking tokens (illustrated in the figure below) in the post-training phase, each intended to elicit a distinct reasoning mode. These tokens enable the model to generate diverse, high-quality reasoning paths for the same problem.
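One way to picture how forking tokens could be used at inference time is to build one prompt per token for the same question and decode each prompt separately. The sketch below assumes this setup; the token strings `<fork_1>`, `<fork_2>`, and so on, the prompt layout, and both function names are hypothetical placeholders, since the excerpt does not give the actual token names or template.

```python
from typing import List


def make_forking_tokens(num_modes: int) -> List[str]:
    """Create placeholder forking-token strings (hypothetical names)."""
    return [f"<fork_{i}>" for i in range(1, num_modes + 1)]


def build_parallel_prompts(question: str, forking_tokens: List[str]) -> List[str]:
    """Return one prompt per forking token, all sharing the same question.

    Each prompt ends with a different token, so a model trained to
    associate tokens with reasoning modes could continue each one
    with a different strategy.
    """
    return [f"{question}\n{token}" for token in forking_tokens]


tokens = make_forking_tokens(3)
prompts = build_parallel_prompts("What is 17 * 24?", tokens)
for prompt in prompts:
    print(repr(prompt))
```

Each prompt would then be passed to the model independently, yielding one reasoning path per token.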

Under naïve SFT, different tokens fail to specialize: they achieve similar accuracy (top) and exhibit comparable reasoning effort (bottom), indicating mode collapse.
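The symptom described in this caption, tokens with similar accuracy and similar reasoning effort, can be checked with a simple diagnostic. The sketch below is an assumption-laden illustration, not the authors' evaluation code: it uses trace length as a stand-in for reasoning effort, and the tolerance values and token names are arbitrary, hypothetical choices.

```python
from statistics import mean
from typing import Dict, List, Tuple

# One outcome per evaluated question: (answer was correct, trace length in tokens)
Outcome = Tuple[bool, int]


def per_token_stats(results: Dict[str, List[Outcome]]) -> Dict[str, Tuple[float, float]]:
    """Map each forking token to (accuracy, mean trace length)."""
    stats = {}
    for token, outcomes in results.items():
        accuracy = mean(1.0 if correct else 0.0 for correct, _ in outcomes)
        avg_length = mean(length for _, length in outcomes)
        stats[token] = (accuracy, avg_length)
    return stats


def looks_collapsed(stats: Dict[str, Tuple[float, float]],
                    acc_tol: float = 0.02, len_tol: float = 0.05) -> bool:
    """Flag likely mode collapse when tokens barely differ.

    acc_tol is an absolute accuracy gap; len_tol is a relative length gap.
    Both thresholds are illustrative assumptions. Lengths must be positive.
    """
    accuracies = [acc for acc, _ in stats.values()]
    lengths = [length for _, length in stats.values()]
    acc_spread = max(accuracies) - min(accuracies)
    len_spread = (max(lengths) - min(lengths)) / max(lengths)
    return acc_spread <= acc_tol and len_spread <= len_tol


# Hypothetical evaluation results for two forking tokens
similar = {
    "<fork_1>": [(True, 100), (False, 100)],
    "<fork_2>": [(True, 101), (False, 99)],
}
distinct = {
    "<fork_1>": [(True, 100), (True, 120)],
    "<fork_2>": [(False, 400), (True, 480)],
}
print(looks_collapsed(per_token_stats(similar)))
print(looks_collapsed(per_token_stats(distinct)))
```

In the first data set both tokens reach the same accuracy with nearly identical trace lengths, which is the pattern the caption associates with mode collapse; in the second, the tokens differ in both accuracy and effort.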

However, naïve…

Excerpt shown — open the source for the full document.
