NVIDIA/cudnn-frontend v1.25.0
v1.25.0 release
Repository: NVIDIA/cudnn-frontend
Tag: v1.25.0
Published: 2026-06-10T21:11:51Z
Prerelease: no
Release notes:
cuDNN Frontend v1.25.0 Release Notes
cuDNN has moved completely to GitHub for development. Please direct your PRs to develop and file issues on GitHub.
cuDNN Frontend v1.25.0 is the recommended version for cuDNN 9.23.0 and later releases.
Updates to Graph API 🚀 🚀
SDPA
- `cu_seqlens` in unified SDPA — the unified SDPA path now accepts cumulative sequence-length tensors, enabling variable-length (packed) batches without padding.
- Ragged offset multiplier: added frontend support for the per-tensor ragged offset multiplier (`CUDNN_ATTR_TENSOR_RAGGED_OFFSET_MULTIPLIER`), letting ragged offsets be stored in coarser units and scaled back to element offsets by the engine. Exposed through `Tensor_attributes` (getters/setters, validation, serialization) and the Python `tensor()` bindings. Requires cuDNN 9.24.0.
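To make the two SDPA items above concrete, here is a minimal sketch of the arithmetic only, in plain Python with no cuDNN calls. It assumes `cu_seqlens` follows the common convention of a prefix sum with a leading zero, and that the multiplier is applied as `element_offset = stored_offset * multiplier`. The function names and the per-token stride are illustrative and are not part of the frontend API.

```python
from itertools import accumulate


def cu_seqlens_from_lengths(seq_lens):
    """Prefix sum with a leading 0: entry i is where sequence i starts."""
    return [0] + list(accumulate(seq_lens))


def element_offsets(stored_offsets, multiplier):
    """Scale coarse offsets to element offsets (assumed: offset * multiplier)."""
    return [offset * multiplier for offset in stored_offsets]


# Three packed sequences of 3, 5 and 2 tokens, stored back to back without padding.
seq_lens = [3, 5, 2]
cu_seqlens = cu_seqlens_from_lengths(seq_lens)

# Hypothetical layout: each token occupies num_heads * head_dim elements.
num_heads, head_dim = 4, 64
token_stride = num_heads * head_dim

# Offsets kept in token units; the multiplier converts them to element units.
offsets_in_elements = element_offsets(cu_seqlens, token_stride)
```

Under these assumptions, the same small token-unit tensor can describe both the sequence boundaries and, once scaled, the element offsets into the packed buffer.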
Structured plan pinning
- Added `get_engine_and_knobs_at_index`, which returns the structured `(engine_id, {KnobType_t: value})` for a plan instead of a stringified tag, so a tuned plan can be persisted and replayed exactly via `create_execution_plan(engine_id, knobs)` even as plan enumeration drifts across versions. Available in C++ (`Graph`, `Execution_plan_list`) and Python.
- Extended `KnobType_t` with `SWAP_AB`, `INPUT_TMA_ENABLE`, and `OUTPUT_TMA_ENABLE`.
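One way to persist such a structured plan is a small JSON round trip, sketched below. This is not the frontend's own serialization: the `KnobType` enum is a hypothetical stand-in for `KnobType_t`, the engine ID and knob values are made up, and no cuDNN call is made.

```python
import json
from enum import Enum


class KnobType(Enum):
    """Hypothetical stand-in for the frontend's KnobType_t."""
    SWAP_AB = "SWAP_AB"
    INPUT_TMA_ENABLE = "INPUT_TMA_ENABLE"
    OUTPUT_TMA_ENABLE = "OUTPUT_TMA_ENABLE"


def dump_plan(engine_id, knobs):
    """Serialize (engine_id, {KnobType: value}) to a JSON string."""
    payload = {
        "engine_id": engine_id,
        "knobs": {knob.name: value for knob, value in knobs.items()},
    }
    return json.dumps(payload, sort_keys=True)


def load_plan(text):
    """Rebuild the (engine_id, {KnobType: value}) pair from JSON."""
    payload = json.loads(text)
    knobs = {KnobType[name]: value for name, value in payload["knobs"].items()}
    return payload["engine_id"], knobs


# Made-up values standing in for what get_engine_and_knobs_at_index might return.
pinned = (7, {KnobType.SWAP_AB: 1, KnobType.INPUT_TMA_ENABLE: 0})
restored = load_plan(dump_plan(*pinned))
```

Storing knobs by name rather than by position is what keeps the record stable if plan enumeration order changes. The restored pair is the kind of input the notes describe passing to `create_execution_plan(engine_id, knobs)`, though the exact Python signature is not shown in these notes.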
Reduction
- Added optional `group_offset` support to the reduction node (`Reduction_attributes::set_group_offset`), so cuDNN FE can express per-expert reductions for MoE grouped GEMM workloads. Wires `CUDNN_ATTR_OPERATION_REDUCTION_GROUP_OFFSET_DESC` with runtime version checks (cuDNN ≥ 9.24.0), and exposes the optional argument through the Python `reduction` binding.
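Callers who want to guard the 9.24.0-only features themselves could use a version gate like the sketch below. It assumes the cuDNN 9 integer encoding `major * 10000 + minor * 100 + patch` (so 9.24.0 is `92400`). The helper names are hypothetical, and where the integer comes from (for example a backend-version query in the Python package) is left to the caller.

```python
def decode_cudnn_version(version):
    """Split an integer such as 92400 into (major, minor, patch).

    Assumes the cuDNN 9 encoding: major * 10000 + minor * 100 + patch.
    """
    major, rest = divmod(version, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def supports_group_offset(version):
    """Hypothetical helper: True when the backend is at least cuDNN 9.24.0."""
    return decode_cudnn_version(version) >= (9, 24, 0)
```

Comparing decoded tuples rather than raw integers keeps the threshold readable and avoids mistakes when the minor version has two digits.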
Open-Source Kernels 🚀 🚀
- Row-scale grouped GEMM quantization — added row-scale support to the grouped GEMM quant path.
- DSA — fixed CuTe-DSL guards and added the SM90 indexer-forward kernel.
- dgeglu — config values are now compile-time constants instead of runtime values.
General Improvements ✨✨
- Static linking of libcudnn is now supported.
- libcudart loading: the selected libcudart can be overridden via `CUDNN_FRONTEND_CUDART_LIB_NAME`, and the shim now warns instead of throwing when multiple libcudart libraries are found, improving robustness in containerized environments.
- Windows / MSVC: consolidated `getenv` access and fixed C4996/C4005 compiler warnings on MSVC.
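As a sketch of how the libcudart override might be applied, the helper below builds an environment for a child process with `CUDNN_FRONTEND_CUDART_LIB_NAME` set. The helper name and the library name `libcudart.so.12` are illustrative. It is also an assumption that the variable must be set before the frontend first loads libcudart, and these notes do not say whether a bare file name or a full path is expected.

```python
import os


def env_with_cudart_override(lib_name, base_env=None):
    """Return a copy of the environment with the libcudart override set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDNN_FRONTEND_CUDART_LIB_NAME"] = lib_name
    return env


# Hypothetical library name; substitute the libcudart your container ships.
child_env = env_with_cudart_override(
    "libcudart.so.12", base_env={"PATH": "/usr/bin"}
)
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when launching the process that uses the frontend, which leaves the parent environment unchanged.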
Bug Fixes 🐛
- Fixed variant-pack-template lifecycle bugs and added defensive null checks.
- Deserialize-owned containers are now cleared on re-deserialize to prevent stale state.
- Use a static signature for `sfd_col_d_srelu_tensor`.
Samples
- Skip TensorIR MemBound / compile-time-const samples on consumer Blackwell (SM12x).
- Skip the flexible-graph SDPA backward sample on SM120 and above.
Benchmarking 📊
- Added an autoregressive video DiT SDPA configuration with GB200 / GB300 results.
- Updated the SDPA benchmarking artifacts and removed stale H200 artifacts.
Acknowledgements
External contributors
- Thanks @take-cheeze for adding support for static linking of libcudnn.
- Thanks Ziang Li for adding row-scale support to the grouped GEMM quant path.
- Thanks Jiayu Sun for the DSA CuTe-DSL guard fixes and the SM90 indexer-forward kernel.