NVIDIA/TensorRT-LLM v1.3.0rc18
Repository: NVIDIA/TensorRT-LLM
Tag: v1.3.0rc18
Published: 2026-06-10T00:10:37Z
Prerelease: yes
Release notes:
- Known Issues
- DSV3.2 will crash with an IMA (illegal memory access) in various long-running perf tests on GB200/GB300 when the CuteDSL MoE backend is used. To work around this issue, use another MoE backend.
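
The workaround above amounts to making sure the affected backend is not selected in the options you pass to the runtime. A minimal sketch, assuming the options are held as a dict loaded from a YAML file, that the MoE backend lives under a `moe_config.backend` key, and that `CUTEDSL` and `CUTLASS` are the backend identifiers. None of these names are confirmed by these release notes, and `avoid_cutedsl_moe` is a hypothetical helper, so check the documentation for your version:

```python
from copy import deepcopy

# Assumed names (not confirmed by the release notes): the `moe_config.backend`
# key and the backend identifiers "CUTEDSL" and "CUTLASS".
AFFECTED_BACKEND = "CUTEDSL"
FALLBACK_BACKEND = "CUTLASS"


def avoid_cutedsl_moe(options: dict, fallback: str = FALLBACK_BACKEND) -> dict:
    """Return a copy of `options` with the affected MoE backend swapped out.

    Options that do not select the affected backend are returned unchanged.
    """
    patched = deepcopy(options)
    moe_config = patched.get("moe_config")
    if isinstance(moe_config, dict):
        backend = str(moe_config.get("backend", "")).upper()
        if backend == AFFECTED_BACKEND:
            moe_config["backend"] = fallback
    return patched


if __name__ == "__main__":
    original = {"moe_config": {"backend": "CUTEDSL"}}
    print(avoid_cutedsl_moe(original))
```

The helper copies the input rather than mutating it, so the original options stay available for comparison once the known issue is resolved.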
- Model Support
- Support Nemotron-H NVFP4 checkpoint on Hopper (#14775)
- Add Qwen image support (#13449)
- Support Step-3.7-Flash model (#14711)
- Add Cosmos3-Nano and Cosmos3-Super support (#14824)
- Add AFMoE Trinity support (#13148)
- API
- Add logprobs_simple_format option to return logprobs as a flat list[float] (#13972)
- trtllm-serve, trtllm-eval, trtllm-bench: Make CLI flags take precedence over --config/--extra_llm_api_options YAML (#14812)
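
The precedence rule for CLI flags versus YAML options can be pictured as a merge in which explicitly given flags win and everything else falls back to the YAML value. The sketch below illustrates that rule only; it is not how trtllm-serve, trtllm-eval, or trtllm-bench are implemented. The parser, the two flags, and `merge_options` are hypothetical stand-ins, and it assumes an unset flag is represented as `None`:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; default=None marks a flag as 'not given'."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_batch_size", type=int, default=None)
    parser.add_argument("--tp_size", type=int, default=None)
    return parser


def merge_options(yaml_options: dict, cli_args: argparse.Namespace) -> dict:
    """Start from the YAML values, then let explicitly set CLI flags override."""
    merged = dict(yaml_options)
    for key, value in vars(cli_args).items():
        if value is not None:  # only flags the user actually passed
            merged[key] = value
    return merged


if __name__ == "__main__":
    yaml_options = {"max_batch_size": 8, "tp_size": 2}  # as if loaded from YAML
    args = build_parser().parse_args(["--max_batch_size", "64"])
    print(merge_options(yaml_options, args))
```

Here `max_batch_size` comes from the command line while `tp_size` keeps its YAML value, because only the former was passed explicitly.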
- Feature
- Upgrade NIXL to v1.0.1 and UCX to 1.21 (#14436)
- Refactor DWDP from CUDA IPC to CUDA VMM + MNNVL (#14453)
- Enable FlashInfer GDN decoding kernel for Qwen3.5 (#13645)
- Add per-expert LoRA support with Cutlass backend (#14801)
- Reduce OpenAI stream postprocess overhead (#14708)
- Add encoder CUDA graph support to llm.encode() (#14326)
- Use a Triton kernel for C++ mamba hybrid state update (#14869)
- Fuse masked gather + finalize-scale into one Triton kernel in DeepGemmFusedMoE (#14592)
- Support KVCacheManagerV2 adjust() in single GPU + agg PyExecutor loop (#14578)
- Add disk cache config for KVCacheManagerV2 (#14845)
- Add Wan I2V generation example (#14981)
- Add LTX-2 visual generation example (#14976)
- Update flashinfer-python from 0.6.12rc2 to 0.6.12 (#14805)
- Fix
- Fix mamba-out-of-block error with ADP + BS=1 + disagg (#14853)
- Fix XQA IMA for invalid pages with sliding window (#14459)
- Propagate event loop errors to await_responses callers (#12735)
- Fix Mamba replay mode accuracy issues (#14509)
- Fix PyExecutor hang in disagg TP prefill (#14020)
- Fix stale runtime metadata issues during MLA fallback transitions (#14049)
- Fix KVCacheManagerV2 block counting correctness issues (#14725)
- Canonicalize multimodal cache-key serialization to prevent hash collisions (#14800)
- Fix LTX-2 audio PE padding issues (#14818)
- Release KVCacheManagerV1 blocks on MAX_UTILIZATION pause (#14723)
- Fix config sharing issue for Qwen3-VL (#14766)
- Enforce request and buffer index lifecycle integrity (#14768)
- Add nemotron-v3 as the proper nemotron-h reasoning parser (#14900)
- Clamp KV pool window sizes to max_seq_len (#14905)
- Fix mamba block calculation (#14524)
- Add trust_remote_code=True to the LLM(...) constructor to fix various model loading issues (#14892)
- Fix deep EP partial warp sync for GPT-OSS shapes (#14977)
- Add warmup for trtllm-gen fmha JIT kernels (#14851)
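
The cache-key fix in #14800 is described only as canonicalizing serialization, so the following is a generic illustration of why that helps, not the change itself. It assumes the key is derived from a dict of JSON-serializable fields; `naive_key` and `canonical_cache_key` are hypothetical names. Plain concatenation lets different inputs collapse into the same bytes, while a canonical form (sorted keys, explicit delimiters) keeps distinct inputs distinct and makes equal inputs hash equally regardless of key order:

```python
import hashlib
import json


def naive_key(parts: list) -> str:
    """Collision-prone: concatenation loses the boundaries between parts."""
    return hashlib.sha256("".join(parts).encode("utf-8")).hexdigest()


def canonical_cache_key(item: dict) -> str:
    """Hash a canonical JSON form: sorted keys, fixed separators, ASCII only."""
    payload = json.dumps(
        item, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Different inputs, same naive key: "ab" + "c" equals "a" + "bc".
    print(naive_key(["ab", "c"]) == naive_key(["a", "bc"]))
    # The canonical form keeps the list structure, so the keys differ.
    print(
        canonical_cache_key({"parts": ["ab", "c"]})
        == canonical_cache_key({"parts": ["a", "bc"]})
    )
    # Key order does not matter once keys are sorted.
    print(
        canonical_cache_key({"a": 1, "b": 2})
        == canonical_cache_key({"b": 2, "a": 1})
    )
```

Real multimodal inputs carry binary tensors and metadata that JSON cannot represent directly, so an actual implementation would need its own canonical encoding for those fields.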
- Documentation
- Add VisualGen API walkthrough example and docs page (#14685)
- Add Nemotron 3 Ultra doc (#14964, #15113)
- Test & Infra
- Pipe stderr separately in subprocess calls to improve error reporting in Allure (#14750)
- Remove obsolete tests (#14995, #14660, #14992, #14952, #14749)
- Parallelize post stages: Rerun Report, Test Coverage, and AI Failure Analysis (#14528)
- Relocate tests to right-sized stages (#14684)
- Move non-default-feature tests to post merge (#15038)
What's Changed
- [None][test] Update datasets path by @JennyLiu-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/14671
- [None][infra] Update new .test_durations by @EmmaQiaoCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14661
- [TRTLLM-13015][feat] drop complex visual_gen CLI example scripts by @zhenhuaw-me in https://github.com/NVIDIA/TensorRT-LLM/pull/14632
- [https://nvbugs/6117811][fix] Fix XQA IMA for invalid pages with sliding window by @pengbowang-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/14459
- [None][feat] Tune mamba config by env variables by @Wanli-Jiang in https://github.com/NVIDIA/TensorRT-LLM/pull/14730
- [None][test] Update moe backend for ctx and acceptance length env by @fredricz-20070104 in https://github.com/NVIDIA/TensorRT-LLM/pull/14803
- [None][test] Update precision of previous device step time by @fredricz-20070104 in https://github.com/NVIDIA/TensorRT-LLM/pull/14809
- [None][infra] Waive 12 failed cases for main in post-merge 2749 by @ZhanruiSunCh in https://github.com/NVIDIA/TensorRT-LLM/pull/14802
- [TRTLLM-12971][infra] Fix parse classname logic in timeout result by @yiqingy0 in https://github.com/NVIDIA/TensorRT-LLM/pull/14559
- [https://nvbugs/6038228][fix] Propagate event loop errors to await_responses callers by @JunyiXu-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/12735
- [TRTLLM-12288][feat] Support Nemotron-H nvfp4 ckpt on Hopper by @JadoTu in https://github.com/NVIDIA/TensorRT-LLM/pull/14775
- [TRTLLM-12596][feat] Support simple logprob format by @tongyuantongyu in https://github.com/NVIDIA/TensorRT-LLM/pull/13972
- [None][fix] Stabilize Mamba replay state update by @sunnyqgg in https://github.com/NVIDIA/TensorRT-LLM/pull/14509
- [None][feat] Upgrade NIXL to v1.0.1 and UCX to 1.21 by @chuangz0 in https://github.com/NVIDIA/TensorRT-LLM/pull/14436
- [None][feat] Refactor DWDP from CUDA IPC to CUDA VMM + MNNVL composite VA by @tianyuz-nv in https://github.com/NVIDIA/TensorRT-LLM/pull/14453
- [TRTLLM-10947][perf] eagle3: use cudaMemcpy2DAsync custom op for hidden-state capture by @pcicotti in https://github.com/NVIDIA/TensorRT-LLM/pull/14479
- [None][fix] PyExecutor Hang in Disagg TP Prefill by @jthomson04 in https://github.com/NVIDIA/TensorRT-LLM/pull/14020
- [https://nvbugs/6240561][fix] Autodeploy fix the deepseek accuracy drop by @nvchenghaoz in https://github.com/NVIDIA/TensorRT-LLM/pull/14774
- [#12702][feat] Autodeploy deprecate the legacy triton attention by @nvchenghaoz in https://github.com/NVIDIA/TensorRT-LLM/pull/14194
- [None][test] Waive 5 failed cases for main in QA CI by @tensorrt-cicd in https://github.com/NVIDIA/TensorRT-LLM/pull/14789
- [None][test] Waive 7 failed cases for main in QA CI by @tensorrt-cicd in https://github.com/NVIDIA/TensorRT-LLM/pull/14791
- [https://nvbugs/6240561][fix] Fix AutoDeploy DeepSeek-R1 accuracy drop by @taylor-yb-lee in https://github.com/NVIDIA/TensorRT-LLM/pull/14793
- [#14588][fix] [AutoDeploy] Fix OOM of DeepSeek-R1 NVFP4 for tp=4 by @taylor-yb-lee in https://github.com/NVIDIA/TensorRT-LLM/pull/14477
- [https://nvbugs/6179761][fix] Save LTX-2 BF16 weights to speed up perf by @yibinl-nvidia in https://github.com/NVIDIA/TensorRT-LLM/pull/14639
- [TRTLLM-13028][doc] Add VisualGen API walkthrough example and docs page by…