What does this model signal mean?

InclusionAI (Ant Group) published inclusionAI/DR-Venus-4B-RL. This model signal is evidence of what shipped on model infrastructure and how the release is positioned. High-signal details: 65 HF downloads · a 4B-parameter language model fine-tuned with reinforcement learning for reasoning. onlylabs links this event to 1 captured evidence page and 6 related model signals.

InclusionAI (Ant Group) Model: inclusionAI/DR-Venus-4B-RL

Captured source

Hugging Face/huggingface.co/inclusionAI/DR-Venus-4B-RL

inclusionAI/DR-Venus-4B-RL model card

published Apr 21, 2026 · seen Jun 6 · captured Jun 11 · http 200 · method plain · params 4.4B · downloads 65 · likes 14

DR-Venus-4B-RL

DR-Venus-4B-RL is the reinforcement-learned DR-Venus checkpoint built on top of inclusionAI/DR-Venus-4B-SFT. It is a 4B deep research agent designed for long-horizon web research with explicit tool use, evidence collection, and answer generation.

This model is trained entirely on open data. Starting from the SFT checkpoint, DR-Venus-4B-RL applies long-horizon agentic RL with IGPO-style information gain rewards and format-aware turn-level supervision to improve execution reliability under long tool-use trajectories.

What This Model Is For

This checkpoint is intended for:

long-horizon deep research with tool-augmented reasoning
improving execution reliability beyond supervised imitation
evidence-grounded answering with search and visit
deployment in the official DR-Venus inference pipeline

It is not primarily optimized for:

plain chat without tools
generic short-context instruction following
use cases that do not need multi-step retrieval and browsing

Model Details

Base model: Qwen/Qwen3-4B-Thinking-2507
Initialization checkpoint: inclusionAI/DR-Venus-4B-SFT
Training stage: agentic reinforcement learning
Training framework: `verl` + IGPO algorithm
Tool setting: search + visit
Maximum rollout horizon: 200 interaction steps
Maximum rollout context length: 256K
Intended domain: long-horizon open-domain research and evidence-grounded question answering
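
The tool setting and rollout horizon above can be pictured as a bounded agent loop. The following is a minimal sketch, not the official DR-Venus pipeline: `run_episode`, `Trajectory`, the action dictionary shape, and the toy policy and tools are all hypothetical names introduced for illustration. Only the two tool names (search, visit) and the 200-step cap come from the model card.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

MAX_STEPS = 200  # maximum rollout horizon stated in the model card


@dataclass
class Trajectory:
    """Hypothetical container for one research episode."""
    question: str
    turns: List[dict] = field(default_factory=list)
    answer: Optional[str] = None


def run_episode(
    question: str,
    policy: Callable[[Trajectory], dict],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = MAX_STEPS,
) -> Trajectory:
    """Alternate policy actions and tool observations until an answer or the step cap."""
    traj = Trajectory(question)
    for _ in range(max_steps):
        # Assumed action shape: {"type": "search" | "visit" | "answer", "arg": str}
        action = policy(traj)
        if action["type"] == "answer":
            traj.answer = action["arg"]
            break
        observation = tools[action["type"]](action["arg"])
        traj.turns.append({"action": action, "observation": observation})
    return traj


# Toy stand-ins so the sketch runs without a model or network access.
def toy_policy(traj: Trajectory) -> dict:
    if len(traj.turns) == 0:
        return {"type": "search", "arg": traj.question}
    if len(traj.turns) == 1:
        return {"type": "visit", "arg": "https://example.com/result"}
    return {"type": "answer", "arg": "answer grounded in 2 observations"}


toy_tools = {
    "search": lambda query: f"results for: {query}",
    "visit": lambda url: f"page text from: {url}",
}

episode = run_episode("example question", toy_policy, toy_tools)
print(len(episode.turns), episode.answer)
```

If the policy never emits an answer action, the loop stops at the step cap and `answer` stays `None`. How the real pipeline handles that case is not described in this excerpt.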

How DR-Venus Builds RL Supervision

DR-Venus-4B-RL is trained with dense turn-level supervision tailored to deep research:

1. The model starts from the DR-Venus supervised checkpoint.
2. For each query, the agent interacts with the environment over multi-turn search and visit trajectories.
3. IGPO uses information gain rewards to measure whether an intermediate turn increases the model's probability of producing the ground-truth answer.
4. Information gain rewards are combined with outcome rewards and turn-level format-aware penalties.
5. The policy is optimized using an IGPO objective with fine-grained credit assignment, specifically tailored for the long-horizon nature of deep research rollouts.

This design improves supervision density, credit assignment, and data efficiency compared with sparse trajectory-level RL alone.
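
As a rough illustration of steps 3 and 4, the sketch below computes a per-turn information gain as the change in the probability assigned to the ground-truth answer, then folds in a format penalty and an outcome reward. The function names, the choice to add the outcome reward to the final turn, and the flat per-turn penalty are assumptions made for this example. The actual IGPO reward shaping is defined in the DR-Venus paper and repository.

```python
from typing import List


def information_gain_rewards(answer_probs: List[float]) -> List[float]:
    """Per-turn gain in the probability of the ground-truth answer.

    answer_probs[0] is the probability before any turn;
    answer_probs[t] is the probability after turn t.
    """
    return [answer_probs[t] - answer_probs[t - 1] for t in range(1, len(answer_probs))]


def turn_rewards(
    answer_probs: List[float],
    outcome_reward: float,
    format_ok: List[bool],
    format_penalty_scale: float = 1.0,  # matches the scale listed in the training recipe
) -> List[float]:
    """Combine information gain, format penalties, and the outcome reward (assumed scheme)."""
    rewards = []
    for t, gain in enumerate(information_gain_rewards(answer_probs)):
        reward = gain
        if not format_ok[t]:
            # Assumption: a malformed turn pays a flat penalty.
            reward -= format_penalty_scale
        rewards.append(reward)
    if rewards:
        # Assumption: the trajectory-level outcome reward lands on the last turn.
        rewards[-1] += outcome_reward
    return rewards


probs = [0.10, 0.30, 0.25, 0.60]  # made-up probabilities for a three-turn rollout
rewards = turn_rewards(probs, outcome_reward=1.0, format_ok=[True, False, True])
print([round(r, 2) for r in rewards])
```

In this toy rollout the second turn both lowers the answer probability and breaks the format, so it receives a negative reward even though the trajectory ends correctly. That is the kind of dense, turn-level signal a single trajectory-level reward cannot provide.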

Training Data

This model is trained from open-data supervision constructed from:

the DR-Venus SFT checkpoint as initialization
REDSearcher 1K RL query-answer pairs
online rollouts with the DR-Venus search + visit tool environment

In the current paper setup:

RL is performed entirely on open query-answer pairs
rollout groups are sampled with long-horizon agent interaction
generation is performed with up to 200 interaction steps per query

For more implementation details, please refer to the DR-Venus GitHub repository.

Training Recipe

The RL checkpoint is trained with the following setup reported in the current paper draft:

algorithm: IGPO-style agentic RL
rollout group size: 8
training batch size: 16
learning rate: 1e-6
rollout temperature: 1.0
rollout top-p: 0.95
maximum context length: 256K
maximum generation length per turn: 8,192
discount factor: 0.95
format penalty scale: 1.0
training framework: `verl` with vLLM rollout engine and FSDP trainer

The current paper configuration also enables browse-aware IG assignment and IG-scale style reward balancing.
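
For readers who want the recipe in a form they can pass around in code, here is a hedged sketch that records the listed values in a dataclass and shows one common way a discount factor is applied to turn-level rewards. `RLRecipe` and `discounted_returns` are hypothetical names, not part of `verl` or the DR-Venus codebase, and whether IGPO applies the 0.95 discount exactly this way is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RLRecipe:
    """Values copied from the Training Recipe list; field names are illustrative."""
    rollout_group_size: int = 8
    train_batch_size: int = 16
    learning_rate: float = 1e-6
    rollout_temperature: float = 1.0
    rollout_top_p: float = 0.95
    max_context_length: str = "256K"  # exact token count is not stated in the card
    max_generation_length_per_turn: int = 8192
    discount_factor: float = 0.95
    format_penalty_scale: float = 1.0


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Standard discounted return: G_t = r_t + gamma * G_{t+1}, computed backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


recipe = RLRecipe()
# A three-turn rollout where only the last turn is rewarded.
returns = discounted_returns([0.0, 0.0, 1.0], recipe.discount_factor)
print([round(g, 4) for g in returns])
```

With a discount of 0.95, a reward earned at the final turn still reaches earlier turns at nearly full strength over short spans, but it decays substantially across the 200-step horizon the card allows.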

Evaluation Summary

DR-Venus-4B-RL improves over the SFT checkpoint on five of the six tracked deep research benchmarks (GAIA Text-Only drops from 65.4 to 64.4) and posts the best score among the listed open models under 9B on those five.

Results Against Open Models Under 9B

| Model | BrowseComp | BrowseComp-ZH | GAIA (Text-Only) | xBench-DS-2505 | xBench-DS-2510 | DeepSearchQA |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| DeepDive-9B-SFT | 5.6 | 15.7 | -- | 35.0 | -- | -- |
| DeepDive-9B-RL | 6.3 | 15.1 | -- | 38.0 | -- | -- |
| WebSailor-7B | 6.7 | 14.2 | 37.9 | 34.3 | -- | -- |
| OffSeeker-8B-SFT | 10.6 | 24.2 | 47.6 | 48.0 | -- | -- |
| OffSeeker-8B-DPO | 12.8 | 26.6 | 51.5 | 49.0 | -- | -- |
| WebExplorer-8B-RL | 15.7 | 32.0 | 50.0 | 53.7 | 23.0 | 17.8 |
| AgentCPM-Explore-4B | 24.1 | 29.1 | 63.9 | 70.0 | 34.0 | 32.8 |
| DR-Venus-4B-SFT | 26.8 | 35.7 | 65.4 | 69.0 | 35.3 | 37.7 |
| DR-Venus-4B-RL | 29.1 | 37.7 | 64.4 | 74.7 | 40.7 | 39.6 |

Relative to the SFT checkpoint, DR-Venus-4B-RL improves:

BrowseComp by +2.3
BrowseComp-ZH by +2.0
xBench-DS-2505 by +5.7
xBench-DS-2510 by +5.4
DeepSearchQA by +1.9

These gains are associated with better formatting accuracy, more reliable tool use, and stronger long-horizon execution stability.
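
The listed deltas can be checked directly against the two DR-Venus rows of the results table. The short script below recomputes them. The scores are copied from the table and the variable names are illustrative.

```python
# Scores copied from the DR-Venus-4B-SFT and DR-Venus-4B-RL table rows.
SFT = {
    "BrowseComp": 26.8,
    "BrowseComp-ZH": 35.7,
    "GAIA (Text-Only)": 65.4,
    "xBench-DS-2505": 69.0,
    "xBench-DS-2510": 35.3,
    "DeepSearchQA": 37.7,
}
RL = {
    "BrowseComp": 29.1,
    "BrowseComp-ZH": 37.7,
    "GAIA (Text-Only)": 64.4,
    "xBench-DS-2505": 74.7,
    "xBench-DS-2510": 40.7,
    "DeepSearchQA": 39.6,
}

# Round to one decimal to match the table's precision and avoid float noise.
deltas = {name: round(RL[name] - SFT[name], 1) for name in SFT}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
```

The recomputed values match the five improvements listed above. The one benchmark absent from that list, GAIA (Text-Only), comes out at -1.0.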

Usage

This checkpoint should be used with the official DR-Venus inference pipeline.

git clone https://github.com/inclusionAI/DR-Venus
cd DR-Venus/Inference
pip install -r requirements.txt
# then configure the model path in run_demo.sh or run_web_demo.sh
bash run_demo.sh

For reproducing RL training or understanding the rollout setup, see the `RL` directory in the official repository.

License and Release Notes

Please verify license compatibility with:

the upstream base model
the released supervision data
the external tools and judge models used in training or evaluation

This section can be updated later with the final project-specific license statement.

Citation

If you use this checkpoint, please cite the DR-Venus project.

@article{venus2026drvenus,
title={DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data},
author={Venus Team and Dai, Sunhao and Deng, Yong and Lin, Jinzhen and Song, Yusheng and Wang, Guoqing and Wu, Xiaofeng and Zhou, Yuqi and Yang, Shuo and Ying, Zhenzhe and Zhang, Zhanwei and Meng, Changhua and Wang, Weiqiang},
journal={arXiv...

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Low download count, small model.