ModelInclusionAI (Ant Group)InclusionAI (Ant Group)published Mar 1, 2026seen 5d

inclusionAI/AReaL-SEA-235B-A22B

Open original ↗

Captured source

source ↗
published Mar 1, 2026seen 5dcaptured 11hhttp 200method plaintask text-generationlicense apache-2.0params 235Bdownloads 25likes 6

AReaL-SEA-235B-A22B — Interactive Tool-Using Agent

AReaL-SEA-235B-A22B is a multi-turn interactive tool-using agent fine-tuned from Qwen3-235B-A22B-Thinking-2507 using SFT + Reinforcement Learning with Verifiable Rewards.

Highlights

  • Achieves 81.3% average pass^1 across all three τ²-bench domains, surpassing GPT-5 (80.0%) and Qwen3-Max-Thinking (80.7%).
  • Trained entirely on self-evolving synthetic data — no human annotation required.
  • End-to-end post-training (SFT → RL) powered by [AReaL](https://github.com/inclusionAI/AReaL), using fully asynchronous GRPO with trajectory-level group-relative advantages and dynamic filtering.

Performance

Mix training results on τ²-bench (trained on combined data from all three domains):

| Domain | pass^1 | pass^2 | pass^3 | pass^4 | pass@4 | |---|---|---|---|---|---| | Airline | 71.0 | 68.0 | 66.5 | 66.0 | 80.0 | | Retail | 79.0 | 67.5 | 63.5 | 57.9 | 95.6 | | Telecom | 93.0 | 88.6 | 81.6 | 81.6 | 100.0 | | Average | 81.3 | 74.7 | 70.5 | 68.5 | 91.9 |

Comparison with Frontier Models

| Model | Airline p^1 | Retail p^1 | Telecom p^1 | Avg p^1 | |---|---|---|---|---| | AReaL-SEA-235B-A22B | 71.0 | 79.0 | 93.0 | 81.3 | | Gemini 3.0 Pro | 73.0 | 85.3 | 98.0 | 85.4 | | Claude-Sonnet-4.5 | 70.0 | 86.2 | 98.0 | 84.7 | | GPT-5 | 62.5 | 81.6 | 95.8 | 80.0 | | Qwen3-Max-Thinking | 71.0 | 75.4 | 95.8 | 80.7 | | Deepseek-v3.2 | 63.8 | 81.1 | 96.2 | 80.4 |

Training

Method

1. Synthetic Data Generation: A hierarchical self-evolving multi-agent framework generates multi-turn tool-use dialogues with executable per-instance verification functions, covering three domains: Airline, Retail, and Telecom. 2. Supervised Fine-Tuning (SFT): The base model is first fine-tuned on the synthetic dialogues. 3. Reinforcement Learning (GRPO): The SFT checkpoint is further trained via GRPO with trajectory-level group-relative advantages, dynamic filtering, and verifier-based outcome rewards. A fine-tuned user model ensures stable rollouts.

Infrastructure

All RL training is conducted using the [AReaL](https://github.com/inclusionAI/AReaL) framework on 80 H200 GPUs (10 nodes). AReaL's fully asynchronous pipeline decouples rollout generation from policy training, maximizing GPU utilization for large-scale multi-turn agentic RL.

Hyperparameters

| | SFT | RL | |---|---|---| | Batch Size | 128 | 256 (16×16) | | Learning Rate | 1e-5 | 1e-5 | | Epochs / Steps | 10 epochs | — | | Max Context Length | 32,768 | 32,768 | | Max Gen Tokens / Turn | — | 8,192 | | Temperature | — | 1.0 |

Training Data

This repo includes the synthetic training data:

| File | Description | Samples | |---|---|---| | sft_merge.jsonl | SFT training data (all 3 domains) | 33,531 | | rl_merge.jsonl | RL training data with verification functions | 1,982 | | tau2_rl_database/ | Environment database states for RL rollouts | — |

Data Format

Each sample in rl_merge.jsonl contains:

  • id: Unique task identifier (e.g., airline_1, telecom_1)
  • user_scenario: User persona, instructions, known information, and behavioral guidance
  • evaluation_criteria: Ground-truth action sequences and assertion-based verification functions
  • db_path: Path to the corresponding environment database

Usage

The model can be used as a drop-in replacement for any Qwen3-235B-A22B-compatible inference setup. For τ²-bench evaluation:

# Follow the τ²-bench evaluation protocol
# Use GPT-4.1 as user simulator for fair comparison
# Report pass^k metrics (all k attempts must succeed)

For integration with the AReaL training framework, refer to the Tau2 Customer Service example.

Citation

@article{gao2025sea,
title={From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents},
author={Gao, Jiaxuan and Chen, Jiaao and He, Chuyi and Wang, Wei-Chen and Xu, Shusheng and Wang, Hanrui and Jin, Di and Wu, Yi},
journal={arXiv preprint arXiv:2601.22607},
year={2025}
}

@article{fu2025areal,
title={AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning},
author={Fu, Wei and Gao, Jiaxuan and Shen, Xujie and Zhu, Chen and Mei, Zhiyu and He, Chuyi and Xu, Shusheng and Wei, Guo and Mei, Jun and Wang, Jiashu and Yang, Tongkai and Yuan, Binhang and Wu, Yi},
journal={arXiv preprint arXiv:2505.24298},
year={2025}
}

Notability

notability 3.0/10

Low downloads, routine model release