Ming-UniVision: Joint Image Understanding and Generation via a Unified Continuous Tokenizer
Captured source
source ↗Ming-UniVision: Joint Image Understanding and Generation via a Unified Continuous Tokenizer | INCLUSION AI
Skip to main content GITHUB 🤗 Hugging Face | 🤖 ModelScope
🚀 Technical Highlights
First Continuous Unified Tokenizer for Vision: MingTok seamlessly supports both image understanding and generation within a single continuous latent space—eliminating quantization and bridging modalities.
First NTP-style Autoregressive MLLM with Unified Continuous Visual Tokens: By building on MingTok, Ming-UniVision unifies vision and language under a shared next-token prediction framework, enabling end-to-end autoregressive modeling of diverse vision tasks.
Reduced Representational Competition → 3.5× Faster Convergence: The unified continuous representation aligns semantic understanding and generative dynamics, significantly accelerating joint training without performance trade-offs.
Multi-Round In-Context Learning in a Single Feature Space: All operations—understanding, generation, and editing—occur in the same continuous space, eliminating costly cross-space conversions and enabling simpler, more efficient training and inference.
The Challenge: The Inverse Nature of Seeing and Drawing
Autoregression—the powerful paradigm of modeling the world by “predicting the next token”—has already unified diverse modalities like language and audio. The next frontier is to bring visual understanding (seeing) and visual generation (drawing) into this unified sequence‑to‑sequence framework.
However, this ambition encounters a deep challenge: in many respects, understanding and generation are inverse tasks.
Understanding: Pixels → high‑dimensional, abstract semantic concepts
Generation: Concepts → fine‑grained, high‑fidelity pixels
These tasks have drastically different—and often competing—preferences for their underlying visual representation.
Why Previous Approaches Fell Short
Existing models attempt unification via two limited strategies:
Asymmetric Designs: Use different, heterogeneous feature spaces for each task. During multi‑turn interactions, this forces inefficient “round‑trips” between spaces, causing latency and complexity.
Shared Discrete Tokens: Unify the token space but introduce quantization errors. This hurts image fidelity and degrades understanding capability.
Our Solution: Ming-UniVision and MingTok
To break this impasse, we introduce Ming-UniVision , a new generation of autoregressive vision‑language model built on a foundational innovation: MingTok .
MingTok is the first visual tokenizer based on a continuous latent space. It delivers a truly unified and efficient representation that serves as the bedrock for Ming‑UniVision’s unified NTP (Next‑Token Prediction) framework—harmonizing image understanding, generation, and editing in one in‑context multimodal loop.
The Core Design: A Three-Stage Architecture to Reconcile Competition
At the heart of Ming-UniVision is the MingTok tokenizer, a three-stage sequential architecture elegantly designed to reconcile the competing representational demands of understanding and generation within a single framework.
Figure 1: (a) Existing models use separate visual representations. (b) MingTok, the engine of Ming-UniVision, uses a unified scheme for both semantic and generative representations. (c) This unified approach leads to over 3.5x faster training convergence.
Low-level Encoder: Maps an input image into a sequence of compact, continuous latent codes, optimized for high-quality and efficient autoregressive generation.
Semantic Decoder: Autoregressively "refines" the compact latent codes into high-dimensional, rich semantic features aligned with top-tier understanding models like CLIP.
Pixel Decoder: Serves as a quality-assurance module, ensuring the original image can be reconstructed with high fidelity, guaranteeing a high-fidelity representation process.
The Key Innovation: MingTok creates a unified, differentiable interface. The high-level features for understanding can be directly fed as conditional input for the next round of generation or editing. This completely eliminates the costly detour through pixel space.
The Breakthrough: A Fundamental Leap in Efficiency
By integrating MingTok, Ming-UniVision achieves competitive results on both understanding and generation tasks. The shared continuous latent space unlocks two fundamental layers of efficiency, resolving bottlenecks that have plagued previous architectures.
Figure 2: On general recognition tasks, our method approaches the performance of models with separated representations and significantly outperforms other unified representation models. For generation, our model shows a clear advantage on fine-grained tasks.
1. A Revolution in Training: >3.5x Faster Convergence
Traditional approaches expend massive resources aligning heterogeneous representations, creating an intrinsic "task competition" that slows learning. MingTok solves this at its root.
Synergistic Enhancement: Our ablation studies show that using MingTok for both tasks fosters a synergy where understanding and generation capabilities enhance each other, rather than competing.
>3.5x Speedup: By avoiding inefficient alignment, the model focuses its energy on learning, reaching the same performance level in a fraction of the time compared to traditional schemes.
Figure 3: The performance drop between generation-only training and joint training is minimal with MingTok, proving the advantage of our unified approach.
2. A Revolution in Interaction: Goodbye to the "Pixel Round-Trip"
The efficiency of multi-turn interactions (e.g., generate → edit → re-generate ) depends on the "understanding-generation" loop. This is precisely where traditional architectures falter.
Architecture Type Multi-turn Capability Core Bottleneck Interaction Path Efficiency & Fidelity DiT-based Models ❌ Not Natively Supported Non-autoregressive, stateless N/A (Full process restart) Low Hybrid Architectures ⚠️ Supported, but Inefficient Dual-branch, un-unified spaces Latent → Pixel → Feature Low, complex, lossy Unified AR ⚠️ Supported, but Inefficient Heterogeneous spaces Latent → Pixel → Feature Low, lossy Ming-UniVision ✅ Native & Highly Efficient Unified Continuous Space Feature → Feature High & High-Fidelity
As the table shows, any architecture with separated spaces is doomed to the inefficient Latent → Pixel → Feature round-trip. This "pixel detour" introduces massive latency and causes contextual…
Excerpt shown — open the source for the full document.
Notability
notability 7.0/10Notable research on unified tokenizer