Serving MiniMax-M3 for efficient inference: Unlocking 1M-Token Context and Multimodality Without Regrets
Published 6/2/2026
Authors
Yubo Wang, Michael Granado, Connor Li, Jue Wang, Brian Mak, Wei Gong, Hiral Jasani, Yineng Zhang, Dan Fu
Together AI is the preferred cloud partner for MiniMax M3 and will host the open-weights model as a developer endpoint upon its public release. Our Inference and Kernel teams delivered significant engineering work to serve M3 efficiently, including a KV-Block-Major sparse attention kernel, a novel paged attention integration for MSA, a highly optimized index scoring kernel, and a Rust-based multimodal preprocessing gateway. Together, these optimizations yield 81–125% throughput improvements across different concurrency levels. Serving MiniMax M3 at scale in production validates Together AI as the go-to inference platform for models that push the frontier on the hard systems problems that make real-world deployment possible.
MiniMax has launched its latest state-of-the-art model, M3, and Together AI is excited to be the preferred cloud partner, enabling MiniMax to serve M3 efficiently in production at scale. Once MiniMax M3 is released as an open-weights model over the next few days, Together AI will also host the model as an endpoint for developers directly. Behind that scale is the exceptional work of our Inference and Kernel teams, who drove deep performance optimizations and ensured production-grade reliability for a model that pushes the frontier: a 1M-token context window, native multimodality, and an architecture that demands serious engineering to serve efficiently. In this post, we'll walk through how we made it happen. Congratulations to the MiniMax team on a landmark model launch and continued innovation.
MiniMax M3 is an all-in-one model that brings together state-of-the-art coding performance, agentic workflow support, and native multimodal reasoning. It is also designed to support a 1M-token context while remaining economical to serve. This makes it a good fit for real-world tasks where long documents, codebases, tool use, images, and iterative reasoning often appear together in a single, heavy context. Compared to the previous generation, M3 is harder to serve: its new capabilities require optimization along more dimensions, including sparse attention computation, larger KV cache management, and multimodal processing.
Architecture / Characteristics
The most novel architectural change in M3 is MiniMax Sparse Attention (MSA), which is designed to address the attention-computation bottleneck seen in MiniMax M2.7. Its block-sparse attention mechanism caps the maximum number of tokens each query can attend to, reducing the cost of long-context processing and making much longer context windows practical. This brings a speedup of more than 9x in the prefill stage and more than 15x in the decoding stage.
In essence, MSA's computation has two parts: a scoring step that determines the K most relevant blocks to attend to for each KV group, followed by dense attention between the query token and those blocks. This design preserves expressiveness along the KV-group dimension while still limiting the maximum number of KV tokens a query token attends to. The attention computation itself no longer scales as N^2 with context length, which makes it well suited to long-context workloads. We measured the kernel execution time breakdown under an agentic-style traffic shape (60K prefix cache) at a concurrency of 8 on NVIDIA B200. MSA significantly lowers the share of per-iteration wall time spent in the actual attention computation.
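The two-part structure described above can be illustrated with a minimal sketch, written in plain Python for a single query and a single KV group. This is not MiniMax's or Together AI's implementation. The function names (`select_blocks`, `sparse_attention`) are hypothetical, and the block scoring rule (maximum query-key dot product within a block) is an assumption made only for illustration, since the post does not specify how scores are computed. Causal masking is omitted.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_blocks(q, keys, block_size, top_k):
    """Part 1: score every KV block and return the ids of the top_k blocks."""
    n_blocks = math.ceil(len(keys) / block_size)
    scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        # Assumed scoring rule (not from the post): best q.k match in the block.
        scores.append(max(dot(q, k) for k in block))
    ranked = sorted(range(n_blocks), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:top_k])

def sparse_attention(q, keys, values, block_size, top_k):
    """Part 2: dense softmax attention restricted to the selected blocks."""
    chosen = select_blocks(q, keys, block_size, top_k)
    idx = [i for b in chosen
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in idx]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) / z
           for d in range(dim)]
    return out, chosen

if __name__ == "__main__":
    keys = [[float(i % 3), float(i % 2)] for i in range(8)]
    values = [[float(i), 1.0] for i in range(8)]
    out, chosen = sparse_attention([1.0, 0.5], keys, values, block_size=2, top_k=2)
    print("selected blocks:", chosen, "output:", out)
```

In this sketch the query attends to at most `top_k * block_size` tokens regardless of how long `keys` is, which is the sense in which the attention step stops growing quadratically with context length. The scoring pass in this toy version still touches every key.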
Besides the change in attention architecture, M3 also ships with multimodal support: a vision component and new image and video preprocessing functionality. Given these fundamental changes, Together AI worked closely with MiniMax's engineering team to tackle the newly emerging challenges. The major ones include:
- Although the MiniMax sparse attention computation itself is highly efficient, supporting a 1M-token context length is still challenging from an engineering perspective.
- Video and image processing are inherently more complicated than text tokenization.
Optimizations
KV-Block-Major Sparse Attention
During prefill, attention computation can still be a major cost for long-context inputs, since for each token we need to compute over Selected Block * KV Head Group * Tokens. Block-sparse attention naturally lets multiple queries attend to the same key-value blocks. If we iterate over queries and compute attention against each query's key-value blocks, we therefore repeat the same KV movement from HBM to SRAM on the GPU. Iterating over key-value blocks in the outer loop and over the query tokens that attend to them in the inner loop gives better arithmetic intensity, because each KV block is moved only once. To achieve this, we need to reorganize the mapping from {q, kv block} into {kv block, q} and reimplement the attention kernel. Because each step computes only a partial output O for one KV block, we need a final “reduction” based on the log-sum-exp (LSE) to rescale the partial outputs and sum them. The process is as follows:
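A minimal CPU-side sketch of this loop reordering and the LSE-based reduction, in plain Python, may help make the steps concrete. It is an illustration under stated assumptions, not the actual GPU kernel. The names `partial_attention` and `kv_block_major_attention` are hypothetical, `selected[i]` is assumed to hold the block ids already chosen for query `i` (at least one per query), and the sketch covers a single head with no causal masking.

```python
import math
from collections import defaultdict

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def partial_attention(q, k_block, v_block):
    """Attention of one query over one KV block.

    Returns the block-local normalized output and its log-sum-exp (LSE).
    """
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, k) * scale for k in k_block]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    dim = len(v_block[0])
    out = [sum(w * v[d] for w, v in zip(weights, v_block)) / z for d in range(dim)]
    return out, m + math.log(z)

def kv_block_major_attention(queries, k_blocks, v_blocks, selected):
    """selected[i] lists the KV block ids chosen for query i (assumed non-empty)."""
    # Reorganize the mapping {q -> kv blocks} into {kv block -> queries}.
    block_to_queries = defaultdict(list)
    for qi, blocks in enumerate(selected):
        for b in blocks:
            block_to_queries[b].append(qi)

    partials = defaultdict(list)  # qi -> list of (partial O, LSE)
    for b, q_ids in block_to_queries.items():  # outer loop: fetch each KV block once
        k_block, v_block = k_blocks[b], v_blocks[b]
        for qi in q_ids:                       # inner loop: queries sharing this block
            partials[qi].append(partial_attention(queries[qi], k_block, v_block))

    # Final reduction: rescale each partial O by exp(LSE_block - LSE_total) and sum.
    outputs = []
    for qi in range(len(queries)):
        lses = [lse for _, lse in partials[qi]]
        m = max(lses)
        total = m + math.log(sum(math.exp(l - m) for l in lses))
        dim = len(partials[qi][0][0])
        outputs.append([sum(math.exp(lse - total) * o[d] for o, lse in partials[qi])
                        for d in range(dim)])
    return outputs
```

The reduction works because each block-local output is a softmax-weighted average over that block only. Weighting it by `exp(LSE_block - LSE_total)` restores that block's share of the full softmax, so the merged result should match attention computed directly over the union of the selected blocks.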
Integrate MSA with Paged Attention
In modern inference engines, paged attention is often used to manage the KV cache for requests. Most highly optimized attention kernels are written to support a fixed set of page sizes. What blocks us from using these kernels directly is that the selected blocks differ across KV groups. At…