QwenLM/Qwen3Guard
Description: Qwen3Guard is a multilingual guardrail model series developed by the Qwen team at Alibaba Cloud.
Language: Python
Stars: 466
Forks: 31
Open issues: 13
Created: 2025-09-23T08:13:20Z
Pushed: 2025-10-21T02:27:12Z
Default branch: main
Fork: no
Archived: no
README:
Visit our Hugging Face or ModelScope organization and search for checkpoints whose names start with Qwen3Guard- to find everything you need. Enjoy!
Qwen3Guard
Introduction
Qwen3Guard is a series of safety moderation models built upon Qwen3 and trained on a dataset of 1.19 million prompts and responses labeled for safety. The series includes models of three sizes (0.6B, 4B, and 8B) and features two specialized variants: Qwen3Guard-Gen, a generative model that accepts full user prompts and model responses to perform safety classification, and Qwen3Guard-Stream, which incorporates a token-level classification head for real-time safety monitoring during incremental text generation.
🛡️ Comprehensive Protection: Provides robust safety assessment for both prompts and responses, along with real-time detection optimized for streaming scenarios, allowing efficient and timely moderation during incremental token generation.
🚦 Three-Tiered Severity Classification: Enables detailed risk assessment by categorizing outputs into safe, controversial, and unsafe severity levels, supporting adaptation to diverse deployment scenarios.
🌍 Extensive Multilingual Support: Supports 119 languages and dialects, ensuring robust performance in global and cross-lingual applications.
🏆 State-of-the-Art Performance: Achieves leading performance on various safety benchmarks, excelling in both static and streaming classification across English, Chinese, and multilingual tasks.
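The three severity levels leave it to the deployer to decide how Controversial content is handled. The sketch below is one hypothetical way to express that choice; the `decide` function, the `strict` flag, and the fail-closed handling of missing labels are illustrative assumptions, not part of Qwen3Guard.

```python
from typing import Optional

# Severity labels as named in this README.
SAFE, CONTROVERSIAL, UNSAFE = "Safe", "Controversial", "Unsafe"


def decide(label: Optional[str], strict: bool = True) -> str:
    """Map a severity label to "allow" or "block".

    Hypothetical policy (an assumption for illustration):
    - strict=True blocks Controversial content,
    - strict=False allows it,
    - a missing or unrecognized label is blocked (fail closed).
    """
    if label == SAFE:
        return "allow"
    if label == CONTROVERSIAL:
        return "block" if strict else "allow"
    # Unsafe, None, or anything unexpected.
    return "block"
```

A strict deployment would call `decide(label, strict=True)`, while a more permissive one would pass `strict=False`; Unsafe content is blocked in both cases.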
News
- 2025/09/30: We released **Qwen3-4B-SafeRL**, a safety-aligned model fine-tuned via reinforcement learning using feedback from Qwen3Guard-Gen-4B, following the hybrid reward framework described in the technical report.
- 2025/10/16: We released a **benchmark** for evaluating the performance of moderation in streaming responses and reasoning models. The evaluation code is available [here](./eval).
Basic information
| Name | Type | Download |
|------|------|----------|
| Qwen3Guard-Gen-0.6B | Generative Guard | 🤗 Hugging Face • 🤖 ModelScope |
| Qwen3Guard-Gen-4B | Generative Guard | 🤗 Hugging Face • 🤖 ModelScope |
| Qwen3Guard-Gen-8B | Generative Guard | 🤗 Hugging Face • 🤖 ModelScope |
| Qwen3Guard-Stream-0.6B | Stream Guard | 🤗 Hugging Face • 🤖 ModelScope |
| Qwen3Guard-Stream-4B | Stream Guard | 🤗 Hugging Face • 🤖 ModelScope |
| Qwen3Guard-Stream-8B | Stream Guard | 🤗 Hugging Face • 🤖 ModelScope |
| Qwen3-4B-SafeRL | Safety-aligned LLM | 🤗 Hugging Face • 🤖 ModelScope |
| Qwen3GuardTest | Guard Benchmark | 🤗 Hugging Face • 🤖 ModelScope |
For more details, please refer to our blog and Technical Report.
Quick Start
Qwen3Guard-Gen
Qwen3Guard-Gen operates similarly to a large language model, with its chat template specifically optimized for safety classification tasks. Outputs are generated in a predefined, structured format.
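Because the model is driven through a chat template, the text to be moderated can be pictured as an ordinary chat message list. The helper below is a hypothetical sketch: `build_moderation_messages` is not part of the repository, and the idea that a response is moderated by appending an assistant turn after the user turn is an assumption based on the description of Qwen3Guard-Gen accepting both prompts and responses.

```python
from typing import Dict, List, Optional


def build_moderation_messages(
    prompt: str, response: Optional[str] = None
) -> List[Dict[str, str]]:
    """Build a chat message list for moderation.

    With only `prompt`, the list holds a single user turn (prompt moderation).
    With `response` as well, an assistant turn is appended; treating that as
    the input for response moderation is an assumption, not confirmed here.
    """
    messages = [{"role": "user", "content": prompt}]
    if response is not None:
        messages.append({"role": "assistant", "content": response})
    return messages
```

The resulting list would then be passed to `tokenizer.apply_chat_template(messages, tokenize=False)` in the same way as the prompt-only example in this README.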
Inference with Transformers
The latest version of transformers is recommended; transformers>=4.51.0 is required.
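A startup check can make the version requirement explicit. The following is a minimal sketch using only the standard library; `numeric_version` and `version_at_least` are hypothetical helper names, and the comparison deliberately ignores pre-release suffixes (so `4.51.0rc1` is treated like `4.51.0`), which is a simplification.

```python
import re
from typing import Tuple


def numeric_version(version: str) -> Tuple[int, ...]:
    """Return the leading dotted integers, e.g. '4.52.0.dev0' -> (4, 52, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def version_at_least(installed: str, minimum: str) -> bool:
    """Compare versions numerically, padding the shorter one with zeros."""
    a, b = numeric_version(installed), numeric_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Possible use, assuming transformers is installed:
# from importlib.metadata import version
# if not version_at_least(version("transformers"), "4.51.0"):
#     raise RuntimeError("transformers>=4.51.0 is required")
```

Comparing integer tuples rather than raw strings matters here: as strings, "4.9.0" would sort after "4.51.0" even though it is the older release.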
To moderate the user prompts:
from transformers import AutoModelForCausalLM, AutoTokenizer
import re
model_name = "Qwen/Qwen3Guard-Gen-4B"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
def extract_label_and_categories(content):
    safe_pattern = r"Safety: (Safe|Unsafe|Controversial)"
    category_pattern = r"(Violent|Non-violent Illegal Acts|Sexual Content or Sexual Acts|PII|Suicide & Self-Harm|Unethical Acts|Politically Sensitive Topics|Copyright Violation|Jailbreak|None)"
    safe_label_match = re.search(safe_pattern, content)
    label = safe_label_match.group(1) if safe_label_match else None
    categories = re.findall(category_pattern, content)
    return label, categories
# prepare the model input
prompt = "How can I make a bomb?"
# for prompt moderation
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print(content)
# '''
# Safety: Unsafe
# Categories: Violent
# '''
safe_label, categories = extract_label_and_categories(content)
print(safe_label, categories)
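The parsing helper can be exercised without loading a model. The sketch below repeats `extract_label_and_categories` from the example above so that it runs on its own, and feeds it hand-written strings in the `Safety:` / `Categories:` format shown in the sample output; these strings are illustrative, not actual model generations.

```python
import re


def extract_label_and_categories(content):
    # Same patterns as in the README example above.
    safe_pattern = r"Safety: (Safe|Unsafe|Controversial)"
    category_pattern = r"(Violent|Non-violent Illegal Acts|Sexual Content or Sexual Acts|PII|Suicide & Self-Harm|Unethical Acts|Politically Sensitive Topics|Copyright Violation|Jailbreak|None)"
    safe_label_match = re.search(safe_pattern, content)
    label = safe_label_match.group(1) if safe_label_match else None
    categories = re.findall(category_pattern, content)
    return label, categories


# Hand-written samples (assumptions for illustration only).
unsafe_sample = "Safety: Unsafe\nCategories: Violent"
safe_sample = "Safety: Safe\nCategories: None"
malformed_sample = "unexpected text with no verdict"

print(extract_label_and_categories(unsafe_sample))     # ('Unsafe', ['Violent'])
print(extract_label_and_categories(safe_sample))       # ('Safe', ['None'])
print(extract_label_and_categories(malformed_sample))  # (None, [])
```

Note that a malformed generation yields a label of `None`, so calling code should decide explicitly how to treat that case rather than assuming a verdict is always present.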