Repo · Qwen (Alibaba Cloud) · published Sep 23, 2025

QwenLM/Qwen3Guard

Python

Description: Qwen3Guard is a multilingual guardrail model series developed by the Qwen team at Alibaba Cloud.

Language: Python

Stars: 466

Forks: 31

Open issues: 13

Created: 2025-09-23T08:13:20Z

Pushed: 2025-10-21T02:27:12Z

Default branch: main

Fork: no

Archived: no

README:

💜 Qwen Chat | 🤗 Hugging Face | 🤖 ModelScope | 📑 Blog | 📖 Documentation

📄 Tech Report | 💬 WeChat (微信) | 🫨 Discord

Visit our Hugging Face or ModelScope organization (click links above), search checkpoints with names starting with Qwen3Guard-, and you will find all you need! Enjoy!

Qwen3Guard

Introduction

Qwen3Guard is a series of safety moderation models built upon Qwen3 and trained on a dataset of 1.19 million prompts and responses labeled for safety. The series includes models of three sizes (0.6B, 4B, and 8B) and features two specialized variants: Qwen3Guard-Gen, a generative model that accepts full user prompts and model responses to perform safety classification, and Qwen3Guard-Stream, which incorporates a token-level classification head for real-time safety monitoring during incremental text generation.

🛡️ Comprehensive Protection: Provides robust safety assessment for prompts and responses, along with real-time detection optimized for streaming scenarios, allowing efficient and timely moderation during incremental token generation.

🚦 Three-Tiered Severity Classification: Enables detailed risk assessment by categorizing outputs into safe, controversial, and unsafe severity levels, supporting adaptation to diverse deployment scenarios.

🌍 Extensive Multilingual Support: Supports 119 languages and dialects, ensuring robust performance in global and cross-lingual applications.

🏆 State-of-the-Art Performance: Achieves leading performance on various safety benchmarks, excelling in both static and streaming classification across English, Chinese, and multilingual tasks.
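
One way the three severity levels could support different deployment scenarios is a small policy layer on top of the label. The sketch below is illustrative only: the `Action` enum, the `decide` function, and the `strict` flag are hypothetical names that are not part of Qwen3Guard, and the mapping from label to action is an assumption rather than a recommended policy.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    """Hypothetical moderation actions; not defined by Qwen3Guard."""
    ALLOW = "allow"
    REVIEW = "review"
    BLOCK = "block"


def decide(label: Optional[str], strict: bool = False) -> Action:
    """Map a severity label to an action under an assumed policy.

    `strict=True` treats Controversial content like Unsafe content;
    `strict=False` routes it to human review instead.
    """
    if label == "Safe":
        return Action.ALLOW
    if label == "Unsafe":
        return Action.BLOCK
    if label == "Controversial":
        return Action.BLOCK if strict else Action.REVIEW
    # Missing or unrecognized label: fail closed (an assumption of this sketch).
    return Action.BLOCK


if __name__ == "__main__":
    for lab in ("Safe", "Controversial", "Unsafe", None):
        print(lab, decide(lab).value, decide(lab, strict=True).value)
```

A stricter deployment would call `decide(label, strict=True)`, while a more permissive one could keep the default and send Controversial items to review.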

!image/jpeg

News

  • 2025/09/30: We released **Qwen3-4B-SafeRL**, a safety-aligned model fine-tuned via reinforcement learning using feedback from Qwen3Guard-Gen-4B, following the hybrid reward framework described in the technical report.
  • 2025/10/16: We released a **benchmark** for evaluating the performance of moderation in streaming responses and reasoning models. The evaluation code is available [here](./eval).

Basic information

| Name | Type | Download |
|------|------|----------|
| Qwen3Guard-Gen-0.6B | Generative Guard | 🤗 Hugging Face • 🤖 ModelScope |
| Qwen3Guard-Gen-4B | Generative Guard | 🤗 Hugging Face • 🤖 ModelScope |
| Qwen3Guard-Gen-8B | Generative Guard | 🤗 Hugging Face • 🤖 ModelScope |
| Qwen3Guard-Stream-0.6B | Stream Guard | 🤗 Hugging Face • 🤖 ModelScope |
| Qwen3Guard-Stream-4B | Stream Guard | 🤗 Hugging Face • 🤖 ModelScope |
| Qwen3Guard-Stream-8B | Stream Guard | 🤗 Hugging Face • 🤖 ModelScope |
| Qwen3-4B-SafeRL | Safety-aligned LLM | 🤗 Hugging Face • 🤖 ModelScope |
| Qwen3GuardTest | Guard Benchmark | 🤗 Hugging Face • 🤖 ModelScope |

For more details, please refer to our blog and Technical Report.

Quick Start

Qwen3Guard-Gen

Qwen3Guard-Gen operates similarly to a large language model, with its chat template specifically optimized for safety classification tasks. Outputs are generated in a predefined, structured format.

Inference with Transformers

transformers>=4.51.0 is required; the latest version is recommended.
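
The minimum version stated above can be checked before loading a model. This is a minimal sketch using only the standard library; `parse_version` and `meets_minimum` are hypothetical helpers written for this example, and the comparison deliberately ignores pre-release suffixes such as `rc1` or `.dev0`, which is a simplification.

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = (4, 51, 0)  # minimum version stated in this README


def parse_version(version: str) -> tuple:
    """Return the first numeric components of a version string, padded to three."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. "dev0"
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: tuple = MIN_TRANSFORMERS) -> bool:
    """True if the installed version is at least the minimum (suffixes ignored)."""
    return parse_version(installed) >= minimum


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if meets_minimum(installed) else "too old"
        print(f"transformers {installed}: {status}")
```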

To moderate user prompts:

import re

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3Guard-Gen-4B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)


def extract_label_and_categories(content):
    # pull the severity label and any category names out of the structured output
    safe_pattern = r"Safety: (Safe|Unsafe|Controversial)"
    category_pattern = r"(Violent|Non-violent Illegal Acts|Sexual Content or Sexual Acts|PII|Suicide & Self-Harm|Unethical Acts|Politically Sensitive Topics|Copyright Violation|Jailbreak|None)"
    safe_label_match = re.search(safe_pattern, content)
    label = safe_label_match.group(1) if safe_label_match else None
    categories = re.findall(category_pattern, content)
    return label, categories


# prepare the model input
prompt = "How can I make a bomb?"
# for prompt moderation
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128
)
# keep only the newly generated tokens, dropping the prompt tokens
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)
print(content)
# '''
# Safety: Unsafe
# Categories: Violent
# '''
safe_label, categories = extract_label_and_categories(content)
print(safe_label, categories)
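
The parsing step above can be exercised without loading a model, which may help when testing downstream logic. The sketch below reuses the label and category patterns from the example; the `Verdict` dataclass and `parse_verdict` function are hypothetical names introduced here. It also restricts category matching to the line starting with `Categories:`, a variation on the example above that assumes the two-line output format shown in its comments.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

SAFETY_RE = re.compile(r"Safety: (Safe|Unsafe|Controversial)")
CATEGORY_RE = re.compile(
    r"(Violent|Non-violent Illegal Acts|Sexual Content or Sexual Acts|PII|"
    r"Suicide & Self-Harm|Unethical Acts|Politically Sensitive Topics|"
    r"Copyright Violation|Jailbreak|None)"
)


@dataclass
class Verdict:
    """Hypothetical container for a parsed moderation result."""
    label: Optional[str]
    categories: List[str] = field(default_factory=list)


def parse_verdict(content: str) -> Verdict:
    """Parse 'Safety: ...' and 'Categories: ...' lines from model output text."""
    label_match = SAFETY_RE.search(content)
    label = label_match.group(1) if label_match else None
    categories: List[str] = []
    for line in content.splitlines():
        # Only look for category names on the Categories line (assumed format).
        if line.strip().startswith("Categories:"):
            categories = CATEGORY_RE.findall(line)
    return Verdict(label=label, categories=categories)


if __name__ == "__main__":
    sample = "Safety: Unsafe\nCategories: Violent"
    print(parse_verdict(sample))
```

If the output cannot be parsed, `label` is `None` and `categories` is empty, so callers can decide how to handle an unexpected response.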
