Repo: Zhipu AI (GLM), published Jun 28, 2025

zai-org/GLM-V

Python


Description: GLM-4.6V/4.5V/4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Language: Python

License: Apache-2.0

Stars: 2326

Forks: 171

Open issues: 12

Created: 2025-06-28T08:44:06Z

Pushed: 2026-05-16T05:42:14Z

Default branch: main

Fork: no

Archived: no

README:

GLM-V

[Read this in Chinese.](./README_zh.md)

👋 Join our WeChat and Discord communities.

📖 Check out the GLM-4.6V blog and GLM-4.5V & GLM-4.1V paper.

📍 Try online or use the API.

Introduction

Vision-language models (VLMs) have become a cornerstone of intelligent systems. As real-world AI tasks grow more complex, VLMs need reasoning capabilities that go beyond basic multimodal perception, with better accuracy, comprehensiveness, and intelligence, to enable complex problem solving, long-context understanding, and multimodal agents.

Through our open-source work, we aim to explore the technological frontier together with the community while empowering more developers to create exciting and innovative applications.

This open-source repository contains our `GLM-4.6V`, `GLM-4.5V` and `GLM-4.1V` series models. For performance and details, see [Model Overview](#model-overview). For known issues, see [Fixed and Remaining Issues](#fixed-and-remaining-issues).

Project Updates

  • News: 2026/03/28: We have released multiple GLM-V related Skills covering several specialized areas, such as GLM-V-Grounding and GLM-V-Prompt-Gen. You are welcome to try them [here](skills) and in GLM-skills.

  • News: 2025/11/10: We released UI2Code^N, an RL-enhanced UI coding model with UI-to-code, UI-polish, and UI-edit capabilities. The model is trained on top of GLM-4.1V-Base. Check it out here.

  • News: 2025/10/27: We released Glyph, a framework for scaling context length through visual-text compression. The Glyph model is trained on top of GLM-4.1V-Base. Check it out here.

  • News: 2025/08/11: We released GLM-4.5V with significant improvements across multiple benchmarks. We also open-sourced our handcrafted desktop assistant app for debugging. Once connected to GLM-4.5V, it can capture visual information from your PC screen via screenshots or screen recordings. Feel free to try it out or customize it into your own multimodal assistant. Click here to download the installer or [build from source](examples/vllm-chat-helper/README.md)!

  • News: 2025/07/16: We have open-sourced the VLM Reward System used to train GLM-4.1V-Thinking. View the [code repository](glmv_reward) and run it locally with `python examples/reward_system_demo.py`.

  • News: 2025/07/01: We released GLM-4.1V-9B-Thinking and its technical report.

Model Implementation Code

  • GLM-4.5V and GLM-4.6V model algorithm: see the full implementation in transformers.

  • GLM-4.1V-9B-Thinking model algorithm: see the full implementation in transformers.

  • Both implementations share identical multimodal preprocessing but use different conversation templates, so take care not to mix them up.

Model Downloads

| Model                | Download Links                | Type             |
|----------------------|-------------------------------|------------------|
| GLM-4.6V             | 🤗 Hugging Face 🤖 ModelScope | Hybrid Reasoning |
| GLM-4.6V-FP8         | 🤗 Hugging Face 🤖 ModelScope | Hybrid Reasoning |
| GLM-4.6V-Flash       | 🤗 Hugging Face 🤖 ModelScope | Hybrid Reasoning |
| GLM-4.5V             | 🤗 Hugging Face 🤖 ModelScope | Hybrid Reasoning |
| GLM-4.5V-FP8         | 🤗 Hugging Face 🤖 ModelScope | Hybrid Reasoning |
| GLM-4.1V-9B-Thinking | 🤗 Hugging Face 🤖 ModelScope | Reasoning        |
| GLM-4.1V-9B-Base     | 🤗 Hugging Face 🤖 ModelScope | Base             |

  • Hugging Face provides model weights in GGUF format. You can download the GGUF models of GLM-V from here.

Use Cases

Grounding

GLM-4.5V, GLM-4.6V, and GLM-4.1V have precise grounding capabilities. Given a prompt that requests the location of a specific object, the model can reason step by step and identify the bounding boxes of the target object. The query prompt supports complex descriptions of the target object as well as specified output formats, for example:

> - Help me to locate `<expr>` in the image and give me its bounding boxes.
> - Please pinpoint the bounding box [[x1,y1,x2,y2], …] in the image as per the given description.

Here, `<expr>` is the description of the target object. The output bounding box is a quadruple $[x_1, y_1, x_2, y_2]$ holding the coordinates of the top-left and bottom-right corners. Each value is normalized by the image width (for x) or the image height (for y) and then scaled by 1000, so every coordinate falls in the range 0 to 1000 regardless of image size.
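To draw or crop a returned box, the 0 to 1000 values have to be mapped back to pixels. The following is a minimal sketch of that conversion under the normalization described above. The function names `to_pixel_box` and `to_normalized_box` are hypothetical helpers, not part of this repository, and rounding to the nearest pixel is an assumption.

```python
from typing import Sequence, Tuple


def to_pixel_box(
    box: Sequence[float], width: int, height: int
) -> Tuple[int, int, int, int]:
    """Map an [x1, y1, x2, y2] box on the 0-1000 scale to pixel coordinates.

    x values are scaled by the image width, y values by the image height.
    Rounding to the nearest pixel is an assumption of this sketch.
    """
    x1, y1, x2, y2 = box
    return (
        round(x1 / 1000 * width),
        round(y1 / 1000 * height),
        round(x2 / 1000 * width),
        round(y2 / 1000 * height),
    )


def to_normalized_box(
    box: Sequence[float], width: int, height: int
) -> Tuple[float, float, float, float]:
    """Inverse mapping: pixel coordinates back to the 0-1000 scale."""
    x1, y1, x2, y2 = box
    return (
        x1 / width * 1000,
        y1 / height * 1000,
        x2 / width * 1000,
        y2 / height * 1000,
    )


if __name__ == "__main__":
    # A box covering the lower-middle region of a 1920x1080 image.
    print(to_pixel_box([250, 500, 750, 1000], width=1920, height=1080))
```

Keeping the model output resolution-independent means the same box stays valid if the image is resized before display, as long as the aspect ratio handling matches what was sent to the model.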

In the response, the special tokens `<|begin_of_box|>` and `<|end_of_box|>` mark the bounding box in the answer. The bracket style may vary ([], [[]], (), <>, etc.), but the meaning is the same: the brackets enclose the coordinates of the box.
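Because the bracket style varies, a parser that ignores brackets and only collects numbers between the box markers tends to be more robust than one that expects a fixed format. The sketch below illustrates this. It assumes the markers are the literal strings `<|begin_of_box|>` and `<|end_of_box|>` (passed as parameters so they can be changed), and `extract_boxes` is a hypothetical helper, not a function from this repository.

```python
import re
from typing import List

# Matches integers and decimals; brackets and separators are ignored.
NUMBER_PATTERN = re.compile(r"-?\d+(?:\.\d+)?")


def extract_boxes(
    response: str,
    begin_token: str = "<|begin_of_box|>",  # assumed marker string
    end_token: str = "<|end_of_box|>",      # assumed marker string
) -> List[List[float]]:
    """Collect [x1, y1, x2, y2] boxes found between the box markers.

    Works for [], [[]], () and <> styles because only the numbers are read.
    Numbers are grouped in fours; an incomplete trailing group is dropped.
    """
    span_pattern = re.compile(
        re.escape(begin_token) + r"(.*?)" + re.escape(end_token), re.DOTALL
    )
    boxes: List[List[float]] = []
    for span in span_pattern.findall(response):
        numbers = [float(n) for n in NUMBER_PATTERN.findall(span)]
        for i in range(0, len(numbers) - 3, 4):
            boxes.append(numbers[i:i + 4])
    return boxes


if __name__ == "__main__":
    sample = "The cup is at <|begin_of_box|>[[120,340,560,780]]<|end_of_box|>."
    print(extract_boxes(sample))
```

The returned values are still on the 0 to 1000 scale. Text between the markers that contains no complete group of four numbers yields no boxes, so non-coordinate answers are skipped rather than misread.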

GUI Agent

  • examples/gui-agent: Demonstrates prompt construction and output handling for GUI Agents, including strategies for mobile, PC, and web. Prompt templates differ between GLM-4.1V and GLM-4.5V.

Quick Demo

  • examples/vlm-helper: A desktop assistant for GLM multimodal models…
