inclusionAI/Ring-V2
Python
Captured source
source ↗inclusionAI/Ring-V2
Description: Ring-V2 is a reasoning MoE LLM provided and open-sourced by InclusionAI.
Language: Python
License: MIT
Stars: 99
Forks: 8
Open issues: 1
Created: 2025-09-22T15:49:26Z
Pushed: 2025-10-23T06:26:07Z
Default branch: main
Fork: no
Archived: no
README:
Ring-V2
🤗 Hugging Face   |   🤖 ModelScope
News
- [2025-10]:🎉 Add Ring-linear-2.0 📖 Technical Report
- [2025-10]:🎉 Add Ring-1T Model
- [2025-09]:🎉 Add Ring-linear-2.0 Series
- [2025-09]:🎉 Add Ring-flash-2.0 Model
- [2025-09]:🎉 Add Ring-mini-2.0 Model
Introduction
Ring-V2 is a family of reasoning MoE LLMs with a range of sizes provided and open-sourced by InclusionAI, derived from Ling-V2. These models achieve leading performance in complex reasoning at similar sizes, while maintaining high inference speed thanks to their highly sparse architecture.
Model Downloads
| Model | Context Length | Download | |:----------------------:| :----------------: |:--------------------------------------------------------------------------------------------------------------------------------------------------------------------:| | Ring-1T | 64K -> 128K (YaRN) | 🤗 HuggingFace 🤖 ModelScope | | Ring-1T-FP8 | 64K -> 128K (YaRN) | 🤗 HuggingFace 🤖 ModelScope | | Ring-flash-2.0 | 32K -> 128K (YaRN) | 🤗 HuggingFace 🤖 ModelScope | | Ring-mini-2.0 | 32K -> 128K (YaRN) | 🤗 HuggingFace 🤖 ModelScope |
Note: If you are interested in previous version, please visit the past model collections in Huggingface or ModelScope.
Quickstart
🤗 Hugging Face Transformers
Here is a code snippet to show you how to use the chat model with transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "inclusionAI/Ring-flash-2.0"
model = AutoModelForCausalLM.from_pretrained(
model_name,
dtype="auto",
device_map="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "Give me a short introduction to large language models."
messages = [
{"role": "system", "content": "You are Ring, an assistant created by inclusionAI"},
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt", return_token_type_ids=False).to(model.device)
generated_ids = model.generate(
**model_inputs,
max_new_tokens=8192
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]🤖 ModelScope
If you're in mainland China, we strongly recommend you to use our model from 🤖 ModelScope.
Deployment
vLLM
vLLM supports offline batched inference or launching an OpenAI-Compatible API Service for online inference.
Environment Preparation
Since the Pull Request (PR) has not been submitted to the vLLM community at this stage, please prepare the environment by following the steps below:
git clone -b v0.10.0 https://github.com/vllm-project/vllm.git cd vllm wget https://raw.githubusercontent.com/inclusionAI/Ring-V2/refs/heads/main/inference/vllm/bailing_moe_v2.patch git apply bailing_moe_v2.patch pip install -e .
Offline Inference:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ring-flash-2.0")
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=16384)
llm = LLM(model="inclusionAI/Ring-flash-2.0", dtype='bfloat16')
prompt = "Give me a short introduction to large language models."
messages = [
{"role": "system", "content": "You are Ring, an assistant created by inclusionAI"},
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
outputs = llm.generate([text], sampling_params)Online Inference:
vllm serve inclusionAI/Ring-flash-2.0 \ --tensor-parallel-size 2 \ --pipeline-parallel-size 1 \ --use-v2-block-manager \ --gpu-memory-utilization 0.90
To handle long context in vLLM using YaRN, we need to follow these two steps: 1. Add a rope_scaling field to the model's config.json file, for example:
{
...,
"rope_scaling": {
"factor": 4.0,
"original_max_position_embeddings": 32768,
"type": "yarn"
}
}2. Use an additional parameter --max-model-len to specify the desired maximum context length when starting the vLLM service.
For detailed guidance, please refer to the vLLM `instructions`.
SGLang
Environment Preparation
We will later submit our model to SGLang official release, now we can prepare the environment following steps:
pip3 install sglang==0.5.2rc0 sgl-kernel==0.3.7.post1
You can use docker image as well:
docker pull lmsysorg/sglang:v0.5.2rc0-cu126
Then you should apply patch to sglang installation:
# patch command is needed, run `yum install -y patch` if needed patch -d `python -c 'import sglang;import os; print(os.path.dirname(sglang.__file__))'` -p3 < inference/sglang/bailing_moe_v2.patch
Run Inference
BF16 and FP8 models are supported by SGLang now, it depends on the dtype of the model in ${MODEL_PATH}. They both share the same command in the following:
- Start server:
python -m sglang.launch_server \ --model-path $MODLE_PATH \ --host 0.0.0.0 --port $PORT \ --trust-remote-code \ --attention-backend fa3
MTP is supported for base model, and not yet for…
Excerpt shown — open the source for the full document.
Notability
notability 5.0/10New repo with moderate traction.