Model · Moonshot AI (Kimi) · published Jun 21, 2025

moonshotai/Kimi-VL-A3B-Thinking-2506

task: image-text-to-text · license: MIT · library: transformers · params: 16B · downloads: 8.5k · likes: 361

> [!Note]
> This is an improved version of Kimi-VL-A3B-Thinking. Please consider using this updated model instead of the previous version.

> [!Note]
> Please visit our tech blog for the recommended inference recipe for this model: Kimi-VL-A3B-Thinking-2506: A Quick Navigation

1. Introduction

This is an updated version of Kimi-VL-A3B-Thinking, with the following improvements:

  • It Thinks Smarter while Consuming Fewer Tokens: The 2506 version reaches better accuracy on multimodal reasoning benchmarks: 56.9 on MathVision (+20.1), 80.1 on MathVista (+8.4), 46.3 on MMMU-Pro (+3.3), and 64.0 on MMMU (+2.1), while reducing thinking length by 20% on average.
  • It Sees Clearer with Thinking: Unlike the previous version, which specializes in thinking tasks, the 2506 version also achieves the same or better results on general visual perception and understanding, e.g. MMBench-EN-v1.1 (84.4), MMStar (70.4), RealWorldQA (70.0), and MMVet (78.4), matching or surpassing our non-thinking model (Kimi-VL-A3B-Instruct).
  • It Extends to Video Scenarios: The 2506 version also improves on video reasoning and understanding benchmarks. It sets a new state of the art for open-source models on VideoMMMU (65.2), while retaining good general video understanding (71.9 on Video-MME, matching Kimi-VL-A3B-Instruct).
  • It Extends to Higher Resolution: The 2506 version supports 3.2 million total pixels in a single image, 4x that of the previous version. This leads to non-trivial improvements on high-resolution perception and OS-agent grounding benchmarks: 83.2 on V* Benchmark (without extra tools), 52.8 on ScreenSpot-Pro, and 52.5 on OSWorld-G (full set with refusal).
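The 3.2-million-pixel limit in the last bullet can be turned into a quick pre-check for input images. The following is a minimal sketch, not the model's actual preprocessing: the helper name `fit_to_pixel_budget` and the square-root downscaling strategy are assumptions made for illustration, and the model's own processor may resize images differently.

```python
import math

# 3.2 million total pixels per image, as stated in the model card.
MAX_PIXELS = 3_200_000


def fit_to_pixel_budget(width: int, height: int, max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
    """Return (width, height) scaled down so that width * height <= max_pixels.

    Hypothetical helper: keeps the aspect ratio and never upscales.
    """
    if width * height <= max_pixels:
        return width, height
    # Scaling both sides by sqrt(budget / area) brings the area down to the budget.
    scale = math.sqrt(max_pixels / (width * height))
    # Flooring both sides guarantees the product stays within the budget.
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


if __name__ == "__main__":
    for size in [(1920, 1080), (4000, 3000)]:
        print(size, "->", fit_to_pixel_budget(*size))
```

An image that already fits (such as 1920x1080) is returned unchanged, while a larger one is shrunk just enough to fall under the budget.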

2. Performance

Comparison with efficient models and two previous versions of Kimi-VL (GPT-4o results are included for reference and shown in italics):

Comparison with 30B-70B open-source models:

Text results, comparison with 30B-level non-thinking VLMs:

| Benchmark (Metric) | Kimi-VL-A3B-Thinking-2506 | Qwen2.5-VL-32B | Gemma3-27B-IT |
|--------------------|---------------------------|----------------|---------------|
| MMLU               | 82.0                      | 78.4           | 76.9          |
| MMLU-Pro           | 68.5                      | 68.8           | 67.5          |
| MATH               | 91.8                      | 82.2           | 89.0          |
| GPQA-Diamond       | 42.3                      | 46.0           | 46.0          |

3. Usage

3.1. Inference with vLLM (recommended)

Because this is a long-decode model that can generate up to 32K tokens, we recommend using vLLM for inference. vLLM already supports the Kimi-VL series.

MAX_JOBS=4 pip install vllm==0.9.1 blobfile flash-attn --no-build-isolation

> [!Note]
> It is important to explicitly install flash-attn to avoid CUDA out-of-memory.
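Before loading the model, it can help to confirm that the packages from the install command are importable in the current environment. This is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the module names assume that `flash-attn` is imported as `flash_attn`.

```python
import importlib.util


def missing_packages(module_names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names for the packages in the pip command (assumed mapping).
    required = ["vllm", "blobfile", "flash_attn"]
    missing = missing_packages(required)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only confirms that each package can be located; it does not verify versions or that the build matches the installed CUDA toolkit.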

import requests
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

model_path = "moonshotai/Kimi-VL-A3B-Thinking-2506"
llm = LLM(
    model_path,
    trust_remote_code=True,
    max_num_seqs=8,
    max_model_len=131072,
    limit_mm_per_prompt={"image": 256},
)

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

sampling_params = SamplingParams(max_tokens=32768, temperature=0.8)


def extract_thinking_and_summary(text: str, bot: str = "◁think▷", eot: str = "◁/think▷") -> tuple[str, str]:
    """Split the model output into (thinking, summary) around the think markers."""
    # Thinking was opened but never closed (e.g. truncated output): nothing usable.
    if bot in text and eot not in text:
        return "", ""
    if eot in text:
        # Tolerate a missing opening marker: everything before eot counts as thinking.
        start = text.index(bot) + len(bot) if bot in text else 0
        end = text.index(eot)
        return text[start:end].strip(), text[end + len(eot):].strip()
    # No think markers at all: the whole text is the summary.
    return "", text


OUTPUT_FORMAT = "--------Thinking--------\n{thinking}\n\n--------Summary--------\n{summary}"

# Download the demo image.
url = "https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/resolve/main/images/demo6.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

messages = [
    {"role": "user", "content": [{"type": "image", "image": ""}, {"type": "text", "text": "What kind of cat is this? Answer with one word."}]}
]
# Render the chat template to a prompt string; the image itself is passed separately below.
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

outputs = llm.generate([{"prompt": text, "multi_modal_data": {"image": image}}], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

thinking, summary = extract_thinking_and_summary(generated_text)
print(OUTPUT_FORMAT.format(thinking=thinking, summary=summary))
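To see how output is split around the `◁think▷` and `◁/think▷` markers without loading the model, the parsing step can be tried on hand-written strings. This is a standalone sketch: `split_thinking` is a hypothetical variant written for illustration, and the sample strings are made up rather than real model output. Unlike the helper in the model card, it keeps partial thinking when the closing marker is missing.

```python
BOT, EOT = "◁think▷", "◁/think▷"


def split_thinking(text: str, bot: str = BOT, eot: str = EOT) -> tuple[str, str]:
    """Return (thinking, summary) for one generated string (illustrative helper)."""
    before, sep, after = text.partition(eot)
    if not sep:
        # No closing marker: either truncated thinking or plain text.
        if bot in text:
            return text.split(bot, 1)[1].strip(), ""
        return "", text.strip()
    # Drop the opening marker if present; otherwise treat all of `before` as thinking.
    return before.split(bot, 1)[-1].strip(), after.strip()


# Hand-written samples, not real model output.
samples = {
    "complete": f"{BOT}It has pointed ears.{EOT}Abyssinian",
    "truncated": f"{BOT}It has pointed",
    "plain": "Abyssinian",
}

if __name__ == "__main__":
    for name, sample in samples.items():
        print(name, "->", split_thinking(sample))
```

Keeping the partial thinking in the truncated case makes it easier to tell when `max_tokens` was too small for the model to finish reasoning.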

3.2. Inference with 🤗 Hugging Face Transformers

This section shows how to run inference with the transformers library. We recommend python=3.10, torch>=2.1.0, and transformers=4.48.2 as the development environment.
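The recommended versions can be checked programmatically before running the example. The sketch below uses only the standard library; `version_tuple`, `meets_minimum`, and `report` are hypothetical helper names, and the parser handles only the leading numeric components of a version string, which is a simplification of full version semantics.

```python
import re
import sys
from importlib import metadata


def version_tuple(version: str) -> tuple[int, ...]:
    """Parse leading numeric components, ignoring local tags such as '+cu121'."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if `installed` is at least `minimum`, padding short versions with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)) >= b + (0,) * (width - len(b))


def report() -> None:
    """Print the status of the recommended environment from the model card."""
    print("python", ".".join(map(str, sys.version_info[:3])), "(3.10 recommended)")
    for package, minimum in [("torch", "2.1.0"), ("transformers", "4.48.2")]:
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            print(package, "is not installed")
            continue
        status = "ok" if meets_minimum(installed, minimum) else f"older than {minimum}"
        print(package, installed, status)


if __name__ == "__main__":
    report()
```

Note that the model card pins transformers to 4.48.2 exactly, while this sketch only checks for a minimum; newer releases may or may not work with the model's remote code.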

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor


def extract_thinking_and_summary(text: str, bot: str = "◁think▷", eot: str = "◁/think▷") -> tuple[str, str]:
    """Split the model output into (thinking, summary) around the think markers."""
    # Thinking was opened but never closed (e.g. truncated output): nothing usable.
    if bot in text and eot not in text:
        return "", ""
    if eot in text:
        # Tolerate a missing opening marker: everything before eot counts as thinking.
        start = text.index(bot) + len(bot) if bot in text else 0
        end = text.index(eot)
        return text[start:end].strip(), text[end + len(eot):].strip()
    # No think markers at all: the whole text is the summary.
    return "", text


OUTPUT_FORMAT = "--------Thinking--------\n{thinking}\n\n--------Summary--------\n{summary}"

url = "https://huggingface.co/spaces/moonshotai/Kimi-VL-A3B-Thinking/resolve/main/images/demo6.jpeg"

model_path = "moonshotai/Kimi-VL-A3B-Thinking-2506"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image_paths = [url]
# The paths are URLs, so download each image instead of opening it as a local file.
images = [Image.open(requests.get(path, stream=True).raw) for path in image_paths]
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path} for image_path in image_paths
        ] + [{"type": "text", "text": "What kind of cat is this? Answer with one word."}],
    },
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
inputs = processor(images=images, text=text, return_tensors="pt", padding=True, truncation=True).to(model.device)
# Note: temperature only takes effect when sampling is enabled (do_sample=True),
# either here or in the model's generation config.
generated_ids = model.generate(**inputs, max_new_tokens=32768, temperature=0.8)
# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
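The trimming step in the example relies on `generate` returning the prompt tokens followed by the new tokens for each sequence. The sketch below isolates that step with plain Python lists so it can be run without a model; `trim_prompt_tokens` is a hypothetical helper name, and the token IDs are made up for illustration.

```python
def trim_prompt_tokens(input_ids: list[list[int]], generated_ids: list[list[int]]) -> list[list[int]]:
    """Remove each prompt's tokens from the front of the matching generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Made-up token IDs: a 4-token prompt followed by 3 newly generated tokens.
prompt_ids = [[101, 7, 8, 9]]
full_ids = [[101, 7, 8, 9, 42, 43, 2]]

if __name__ == "__main__":
    print(trim_prompt_tokens(prompt_ids, full_ids))  # prints [[42, 43, 2]]
```

Without this step, decoding would repeat the rendered prompt at the start of the response.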

4. Citation

Excerpt shown — open the source for the full document.
