baidu/ERNIE-4.5-VL-28B-A3B-Thinking
Captured source
source ↗🚀 Introducing ERNIE-4.5-VL-28B-A3B-Thinking: A Breakthrough in Multimodal AI
🔥 Demo
Model Highlights
Built upon the powerful ERNIE-4.5-VL-28B-A3B architecture, the newly upgraded ERNIE-4.5-VL-28B-A3B-Thinking achieves a remarkable leap forward in multimodal reasoning capabilities. 🧠✨ Through an extensive mid-training phase, the model absorbed a vast and highly diverse corpus of premium visual-language reasoning data. This massive-scale training process dramatically boosted the model's representation power while deepening the semantic alignment between visual and language modalities—unlocking unprecedented capabilities in nuanced visual-textual reasoning. 📊
The model leverages cutting-edge multimodal reinforcement learning techniques on verifiable tasks, integrating GSPO and IcePop strategies to stabilize MoE training combined with dynamic difficulty sampling for exceptional learning efficiency. ⚡ Responding to strong community demand, we've significantly strengthened the model's grounding performance with improved instruction-following capabilities, making visual grounding functions more accessible than ever. 🎯 Additionally, our innovative "Thinking with Images" feature, when paired with tools like image zooming and image search, dramatically elevates the model's ability to process fine-grained details and handle long-tail visual knowledge. 🔍🖼️
Together, these enhancements form a critical foundation for developing sophisticated multimodal agents, empowering developers and researchers to create next-generation AI applications that push the boundaries of what's possible in visual-language understanding. 🤖🌟

Key Capabilities
As a lightweight model that activates only 3B parameters ⚡, ERNIE-4.5-VL-28B-A3B-Thinking closely matches the performance of the industry's top flagship models across various benchmarks. 🚀
- Visual Reasoning 🧠👁️: Bolstered by large-scale reinforcement learning, the model demonstrates exceptional multi-step reasoning, chart analysis, and causal reasoning capabilities in complex visual tasks! 📊✨
- STEM Reasoning 🔬📐: Leveraging its powerful visual abilities, the model achieves a leap in performance on STEM tasks like solving problems from photos, easily handling even complex questions! 🎯💡
- Visual Grounding 📍🎨: Features more precise grounding and flexible instruction execution, easily triggering grounding functions in complex industrial scenarios for a significant efficiency boost! ⚙️💪
- Thinking with Images 🤔🔍: The model thinks like a human, capable of freely zooming in and out of images to grasp every detail and uncover all information. 🖼️✨
- Tool Utilization 🛠️⚡: Empowered by robust tool-calling capabilities, the model can instantly use functions like image search to easily identify long-tail knowledge and achieve comprehensive information retrieval! 🔎📚
- Video Understanding 🎬🎥: The model possesses outstanding temporal awareness and event localization abilities, accurately identifying content changes across different time segments in a video, making video analysis smarter and more efficient! ⏱️🌟
Quickstart
Using transformers Library
Requirement: transformers <= 4.57.6
Here is an example of how to use the transformers library for inference:
import torch
from transformers import AutoProcessor, AutoTokenizer, AutoModelForCausalLM
model_path = 'baidu/ERNIE-4.5-VL-28B-A3B-Thinking'
model = AutoModelForCausalLM.from_pretrained(
model_path,
device_map="auto",
dtype=torch.bfloat16,
trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model.add_image_preprocess(processor)
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What color clothes is the girl in the picture wearing?"
},
{
"type": "image_url",
"image_url": {
"url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example1.jpg"
}
},
]
},
]
text = processor.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
device = next(model.parameters()).device
inputs = inputs.to(device)
generated_ids = model.generate(
inputs=inputs['input_ids'].to(device),
**inputs,
max_new_tokens=1024,
use_cache=False
)
output_text = processor.decode(generated_ids[0][len(inputs['input_ids'][0]):])
print(output_text)vLLM Inference
Install vLLM
pip install decord pip install uv uv pip install vllm==0.11.2 --torch-backend=auto
Run vLLM
# 80G*1 GPU,If an error occurs, add the --gpu-memory-utilization 0.95 and try again vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code
Run vLLM using reasoning-parser and tool-call-parser
# 80G*1 GPU,If an error occurs, add the --gpu-memory-utilization 0.95 and try again vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code \ --reasoning-parser ernie45 \ --tool-call-parser ernie45 \ --enable-auto-tool-choice
Run vLLM for video understanding (ensure your vLLM version includes PR#31274 for accurate timestamp rendering)
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code \
--reasoning-parser ernie45 \
--media-io-kwargs '{"video": {"num_frames": 180, "fps": 2}}'FastDeploy Inference
Quickly deploy services using FastDeploy as shown below. For more detailed usage, refer to the FastDeploy GitHub Repository.
Note: For single-card deployment, at least 48GB of GPU memory is required.
fastdeploy serve --model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
--max-model-len 131072 \
--max-num-seqs 32 \
--port 8180 \
--quantization wint8 \
--reasoning-parser ernie-45-vl-thinking \
--tool-call-parser ernie-45-vl-thinking \
--mm-processor-kwargs '{"image_max_pixels": 12845056 }'Finetuning with ERNIEKit
ERNIEKit is a training toolkit based on PaddlePaddle, specifically designed for the ERNIE series of…
Excerpt shown — open the source for the full document.
Notability
notability 5.0/10Low downloads but from major lab