What does this writing signal mean?

InclusionAI (Ant Group) published Ming-UniVision: Joint Image Understanding and Generation via a Unified Continuous Tokenizer. This writing signal gives public context for research themes, product direction, policy, or launch framing. High-signal details: Notable research on unified tokenizer. onlylabs links this event to 1 captured evidence page and 6 related writing signals.

InclusionAI (Ant Group) Writing: Ming-UniVision: Joint Image Understanding and Generation via a Unified Continuous Tokenizer

Captured source

source ↗

inclusion-ai.org/inclusion-ai.org/blog/mingtok

Ming-UniVision: Joint Image Understanding and Generation via a Unified Continuous Tokenizer

published Oct 1, 2025 · seen Jun 5 · captured Jun 7 · http 200 · method plain

🚀 Technical Highlights

First Continuous Unified Tokenizer for Vision: MingTok seamlessly supports both image understanding and generation within a single continuous latent space—eliminating quantization and bridging modalities.

First NTP-style Autoregressive MLLM with Unified Continuous Visual Tokens: By building on MingTok, Ming-UniVision unifies vision and language under a shared next-token prediction framework, enabling end-to-end autoregressive modeling of diverse vision tasks.

Reduced Representational Competition → 3.5× Faster Convergence: The unified continuous representation aligns semantic understanding and generative dynamics, significantly accelerating joint training without performance trade-offs.

Multi-Round In-Context Learning in a Single Feature Space: All operations—understanding, generation, and editing—occur in the same continuous space, eliminating costly cross-space conversions and enabling simpler, more efficient training and inference.

The Challenge: The Inverse Nature of Seeing and Drawing

Autoregression—the powerful paradigm of modeling the world by “predicting the next token”—has already unified diverse modalities like language and audio. The next frontier is to bring visual understanding (seeing) and visual generation (drawing) into this unified sequence‑to‑sequence framework.

However, this ambition encounters a deep challenge: in many respects, understanding and generation are inverse tasks.

Understanding: Pixels → high‑dimensional, abstract semantic concepts

Generation: Concepts → fine‑grained, high‑fidelity pixels

These tasks have drastically different—and often competing—preferences for their underlying visual representation.

Why Previous Approaches Fell Short

Existing models attempt unification via two limited strategies:

Asymmetric Designs: Use different, heterogeneous feature spaces for each task. During multi‑turn interactions, this forces inefficient “round‑trips” between spaces, causing latency and complexity.

Shared Discrete Tokens: Unify the token space but introduce quantization errors. This hurts image fidelity and degrades understanding capability.
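The quantization error mentioned for shared discrete tokens can be illustrated with a minimal sketch. The 2-D codebook and the helper names `quantize` and `quantization_error` are hypothetical and chosen only for illustration; they are not taken from MingTok or any prior tokenizer. The sketch shows that snapping a continuous latent to its nearest codebook entry discards information unless the latent already sits on an entry.

```python
import math

# Hypothetical 2-D codebook standing in for a discrete visual vocabulary.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]


def quantize(latent):
    """Return (index, code) of the codebook entry nearest to `latent`."""
    index = min(range(len(CODEBOOK)), key=lambda i: math.dist(latent, CODEBOOK[i]))
    return index, CODEBOOK[index]


def quantization_error(latent):
    """Euclidean distance between a latent and its quantized code."""
    _, code = quantize(latent)
    return math.dist(latent, code)


latent = (0.8, 0.3)
index, code = quantize(latent)
# The continuous latent (0.8, 0.3) is replaced by the code (1.0, 0.0).
print(index, code, round(quantization_error(latent), 4))  # prints: 1 (1.0, 0.0) 0.3606
```

A continuous tokenizer passes the latent through unchanged, so this particular source of error does not arise. Real codebooks are far larger and higher-dimensional, which shrinks the error but does not remove it.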

Our Solution: Ming-UniVision and MingTok

To break this impasse, we introduce Ming-UniVision, a new-generation autoregressive vision‑language model built on a foundational innovation: MingTok.

MingTok is the first unified visual tokenizer based on a continuous latent space. It delivers a single, efficient representation that serves as the bedrock for Ming‑UniVision’s unified NTP (Next‑Token Prediction) framework, harmonizing image understanding, generation, and editing in one in‑context multimodal loop.

The Core Design: A Three-Stage Architecture to Reconcile Competition

At the heart of Ming-UniVision is the MingTok tokenizer, a three-stage sequential architecture elegantly designed to reconcile the competing representational demands of understanding and generation within a single framework.

Figure 1: (a) Existing models use separate visual representations. (b) MingTok, the engine of Ming-UniVision, uses a unified scheme for both semantic and generative representations. (c) This unified approach leads to over 3.5x faster training convergence.

Low-level Encoder: Maps an input image into a sequence of compact, continuous latent codes, optimized for high-quality and efficient autoregressive generation.

Semantic Decoder: Autoregressively "refines" the compact latent codes into high-dimensional, rich semantic features aligned with top-tier understanding models like CLIP.

Pixel Decoder: Serves as a quality-assurance module, ensuring the original image can be reconstructed with high fidelity from the learned representation.

The Key Innovation: MingTok creates a unified, differentiable interface. The high-level features for understanding can be directly fed as conditional input for the next round of generation or editing. This completely eliminates the costly detour through pixel space.
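The data flow through the three stages can be sketched as a toy pipeline. Everything here is a hypothetical stand-in: the class `ToyTokenizer`, its method names, and the parameters `patch_size` and `feature_dim` are invented for illustration and are not MingTok's API. The stages are simple arithmetic rather than learned networks, the semantic decoder is not autoregressive, and feeding the pixel decoder from the semantic features is an assumption the text above does not confirm. The sketch only shows how the representations change shape from stage to stage.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class ToyTokenizer:
    """Illustrative stand-in for a three-stage tokenizer; not MingTok's real API."""

    patch_size: int = 4   # pixels summarized by one latent code (assumed)
    feature_dim: int = 3  # width of each semantic feature vector (assumed)

    def encode(self, pixels: Vector) -> Vector:
        """Low-level encoder: pixels -> compact continuous latent codes."""
        p = self.patch_size
        return [sum(pixels[i:i + p]) / p for i in range(0, len(pixels), p)]

    def semantic_decode(self, latents: Vector) -> List[Vector]:
        """Semantic decoder: each latent code -> a wider feature vector."""
        return [[z * (k + 1) for k in range(self.feature_dim)] for z in latents]

    def pixel_decode(self, features: List[Vector]) -> Vector:
        """Pixel decoder: features -> pixels (assumed input; lossy in general)."""
        return [f[0] for f in features for _ in range(self.patch_size)]


tok = ToyTokenizer()
pixels = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
latents = tok.encode(pixels)              # 2 compact codes
features = tok.semantic_decode(latents)   # 2 vectors of width 3
restored = tok.pixel_decode(features)     # back to 8 pixels
print(len(pixels), len(latents), len(features[0]))  # prints: 8 2 3
```

In this toy, `features` is the object that a later generation or editing turn would consume directly, which is the property the text attributes to the unified interface.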

The Breakthrough: A Fundamental Leap in Efficiency

By integrating MingTok, Ming-UniVision achieves competitive results on both understanding and generation tasks. The shared continuous latent space unlocks two fundamental layers of efficiency, resolving bottlenecks that have plagued previous architectures.

Figure 2: On general recognition tasks, our method approaches the performance of models with separated representations and significantly outperforms other unified representation models. For generation, our model shows a clear advantage on fine-grained tasks.

1. A Revolution in Training: >3.5x Faster Convergence

Traditional approaches expend massive resources aligning heterogeneous representations, creating an intrinsic "task competition" that slows learning. MingTok solves this at its root.

Synergistic Enhancement: Our ablation studies show that using MingTok for both tasks fosters a synergy where understanding and generation capabilities enhance each other, rather than competing.

>3.5x Speedup: By avoiding inefficient alignment, the model focuses its energy on learning, reaching the same performance level in a fraction of the time compared to traditional schemes.

Figure 3: The performance drop between generation-only training and joint training is minimal with MingTok, proving the advantage of our unified approach.

2. A Revolution in Interaction: Goodbye to the "Pixel Round-Trip"

The efficiency of multi-turn interactions (e.g., generate → edit → re-generate) depends on the "understanding-generation" loop. This is precisely where traditional architectures falter.

| Architecture Type | Multi-turn Capability | Core Bottleneck | Interaction Path | Efficiency & Fidelity |
| --- | --- | --- | --- | --- |
| DiT-based Models | ❌ Not Natively Supported | Non-autoregressive, stateless | N/A (Full process restart) | Low |
| Hybrid Architectures | ⚠️ Supported, but Inefficient | Dual-branch, un-unified spaces | Latent → Pixel → Feature | Low, complex, lossy |
| Unified AR | ⚠️ Supported, but Inefficient | Heterogeneous spaces | Latent → Pixel → Feature | Low, lossy |
| Ming-UniVision | ✅ Native & Highly Efficient | Unified Continuous Space | Feature → Feature | High & High-Fidelity |

As the table shows, any architecture with separated spaces is doomed to the inefficient Latent → Pixel → Feature round-trip. This "pixel detour" introduces massive latency and causes contextual...
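The difference between the two interaction paths can be made concrete by counting representation changes per turn. This is a rough sketch: the path lists are copied from the "Interaction Path" column of the table, while the function names and the labels "separated spaces" and "unified space" are hypothetical. It counts hops only and says nothing about actual latency or fidelity.

```python
from typing import Dict, List

# Per-turn paths, as listed in the "Interaction Path" column of the table.
PATHS: Dict[str, List[str]] = {
    "separated spaces": ["latent", "pixel", "feature"],
    "unified space": ["feature", "feature"],
}


def conversions_per_turn(path: List[str]) -> int:
    """Count hops within one turn that change the representation space."""
    return sum(1 for a, b in zip(path, path[1:]) if a != b)


def total_conversions(path: List[str], turns: int) -> int:
    """Conversions accumulated over a multi-turn session."""
    return conversions_per_turn(path) * turns


for name, path in PATHS.items():
    print(name, total_conversions(path, turns=3))
```

For a three-turn session this gives 6 conversions for the separated-space path and 0 for the unified path. Each conversion in the first case is a decode or re-encode step where cost and information loss can accumulate.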

Excerpt shown — open the source for the full document.

Notability

notability 7.0/10

Notable research on unified tokenizer