NVIDIA/cudnn-frontend v1.25.0

v1.25.0 release

Repository: NVIDIA/cudnn-frontend

Tag: v1.25.0

Published: 2026-06-10T21:11:51Z

Prerelease: no

Release notes:

cuDNN Frontend v1.25.0 Release Notes

cuDNN development has moved completely to GitHub. Please direct your PRs to the develop branch and file issues on GitHub.

cuDNN Frontend v1.25.0 is the recommended version for cuDNN 9.23.0 and later releases.

Updates to Graph API 🚀 🚀

SDPA

  • `cu_seqlens` in unified SDPA — the unified SDPA path now accepts cumulative sequence-length tensors, enabling variable-length (packed) batches without padding.
  • Ragged offset multiplier — added frontend support for the per-tensor ragged offset multiplier (CUDNN_ATTR_TENSOR_RAGGED_OFFSET_MULTIPLIER), letting ragged offsets be stored in coarser units and scaled back to element offsets by the engine. Exposed through Tensor_attributes (getters/setters, validation, serialization) and the Python tensor() bindings. Requires cuDNN 9.24.0.
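
To make the two SDPA items concrete, here is a minimal pure-Python sketch of the arithmetic involved. It does not call cuDNN. The helper names, the example sizes, and the choice of "one token = heads × head_dim elements" as the multiplier are assumptions for illustration; the actual unit semantics are defined by cuDNN.

```python
from itertools import accumulate

def cu_seqlens_from_lengths(lengths):
    """Return cumulative sequence lengths with a leading 0 (batch + 1 entries)."""
    return [0, *accumulate(lengths)]

def element_offsets(stored_offsets, multiplier):
    """Scale coarse-unit ragged offsets back to element offsets."""
    return [off * multiplier for off in stored_offsets]

# Three sequences packed back to back, with no padding between them.
seq_lens = [3, 5, 2]
cu_seqlens = cu_seqlens_from_lengths(seq_lens)

# Assumption: offsets are stored in token units and one token spans
# num_heads * head_dim elements, so that product is the multiplier.
num_heads, head_dim = 4, 8
multiplier = num_heads * head_dim
offsets = element_offsets(cu_seqlens, multiplier)

print(cu_seqlens)  # [0, 3, 8, 10]
print(offsets)     # [0, 96, 256, 320]
```

Under this assumption the stored ragged offsets are the same numbers as `cu_seqlens`, which is the practical appeal of keeping offsets in coarser units.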

Structured plan pinning

  • Added get_engine_and_knobs_at_index, which returns a structured (engine_id, {KnobType_t: value}) pair for a plan instead of a stringified tag. A tuned plan can therefore be persisted and replayed exactly via create_execution_plan(engine_id, knobs), even as plan enumeration drifts across versions. Available in C++ (Graph, Execution_plan_list) and Python.
  • Extended KnobType_t with SWAP_AB, INPUT_TMA_ENABLE, and OUTPUT_TMA_ENABLE.
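
One way to persist such a pair is a small JSON round trip, sketched below. This is a hedged illustration, not the frontend's API: `KnobType` is a stand-in enum that lists only the three knob names mentioned above, `dump_plan` and `load_plan` are hypothetical helpers, and the engine id and knob values are made up. The exact Python signatures of get_engine_and_knobs_at_index and create_execution_plan are not shown in these notes, so they appear only in comments.

```python
import json
from enum import Enum

class KnobType(Enum):
    """Stand-in for KnobType_t (assumption: the real enum has more members)."""
    SWAP_AB = "SWAP_AB"
    INPUT_TMA_ENABLE = "INPUT_TMA_ENABLE"
    OUTPUT_TMA_ENABLE = "OUTPUT_TMA_ENABLE"

def dump_plan(engine_id, knobs):
    """Serialize (engine_id, {knob: value}) to JSON, keyed by knob name."""
    payload = {
        "engine_id": engine_id,
        "knobs": {knob.name: value for knob, value in knobs.items()},
    }
    return json.dumps(payload, sort_keys=True)

def load_plan(text):
    """Inverse of dump_plan: rebuild the (engine_id, knobs) pair."""
    payload = json.loads(text)
    knobs = {KnobType[name]: value for name, value in payload["knobs"].items()}
    return payload["engine_id"], knobs

# In real code this pair would come from get_engine_and_knobs_at_index(...).
engine_id, knobs = 7, {KnobType.SWAP_AB: 1, KnobType.INPUT_TMA_ENABLE: 0}

saved = dump_plan(engine_id, knobs)
restored = load_plan(saved)
assert restored == (engine_id, knobs)
# The restored pair would then be passed to create_execution_plan(engine_id, knobs).
```

Keying the file by knob name rather than by integer value keeps a saved plan readable and avoids depending on enum numbering.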

Reduction

  • Added optional group_offset support to the reduction node (Reduction_attributes::set_group_offset), so cuDNN FE can express per-expert reductions for MoE grouped GEMM workloads. Wires CUDNN_ATTR_OPERATION_REDUCTION_GROUP_OFFSET_DESC with runtime version checks (cuDNN ≥ 9.24.0), and exposes the optional argument through the Python reduction binding.
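
As a rough mental model of a per-expert reduction, the pure-Python sketch below sums rows within each expert's slice of a stacked matrix. It does not use cuDNN, and the offset layout is an assumption: here `group_offsets[i]` is taken to be the exclusive end row of expert `i`. The layout cuDNN actually expects may differ.

```python
def grouped_row_sum(rows, group_offsets):
    """Sum rows within each group.

    Assumption: group_offsets[i] is the exclusive end row of group i,
    and groups are laid out contiguously starting at row 0.
    """
    width = len(rows[0]) if rows else 0
    out = []
    start = 0
    for end in group_offsets:
        group = rows[start:end]
        if group:
            # Column-wise sum over the rows that belong to this group.
            out.append([sum(col) for col in zip(*group)])
        else:
            # An expert that received no rows reduces to zeros.
            out.append([0] * width)
        start = end
    return out

# Five token rows routed to three experts: 2 rows, 0 rows, 3 rows.
rows = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
group_offsets = [2, 2, 5]

result = grouped_row_sum(rows, group_offsets)
print(result)  # [[4, 6], [0, 0], [21, 24]]
```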

Open-Source Kernels 🚀 🚀

  • Row-scale grouped GEMM quantization — added row-scale support to the grouped GEMM quant path.
  • DSA — fixed CuTe-DSL guards and added the SM90 indexer-forward kernel.
  • dgeglu — config values are now compile-time constants instead of runtime values.

General Improvements ✨✨

  • Static linking of libcudnn is now supported.
  • libcudart loading — the selected libcudart can be overridden via CUDNN_FRONTEND_CUDART_LIB_NAME, and the shim now warns instead of throwing when multiple libcudart libraries are found, improving robustness in containerized environments.
  • Windows / MSVC — consolidated getenv access and fixed C4996/C4005 compiler warnings on MSVC.
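
The libcudart behavior described above can be pictured with the following sketch. Only the environment variable name comes from these notes. The function `choose_cudart`, the "use the first library found" policy, and the library file names are assumptions for illustration and do not reflect the shim's actual implementation.

```python
import os
import warnings

ENV_VAR = "CUDNN_FRONTEND_CUDART_LIB_NAME"  # name taken from the release notes

def choose_cudart(found, env=os.environ):
    """Pick a libcudart name: an explicit override wins, else the first found.

    Hypothetical policy: with several candidates and no override,
    emit a warning rather than raising.
    """
    override = env.get(ENV_VAR)
    if override:
        return override
    if not found:
        raise RuntimeError("no libcudart library found")
    if len(found) > 1:
        warnings.warn(
            f"multiple libcudart libraries found: {found}; using {found[0]}"
        )
    return found[0]

# Illustrative file names, as might coexist in a container image.
candidates = ["libcudart.so.12", "libcudart.so.13"]

print(choose_cudart(candidates, env={}))  # warns, then picks the first
print(choose_cudart(candidates, env={ENV_VAR: "libcudart.so.13"}))
```

In a real process the override would be set in the environment before the frontend is loaded, for example in the container's entrypoint.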

Bug Fixes 🐛

  • Fixed variant-pack-template lifecycle bugs and added defensive null checks.
  • Deserialize-owned containers are now cleared on re-deserialize to prevent stale state.
  • Switched to a static signature for sfd_col_d_srelu_tensor.

Samples

  • Skip TensorIR MemBound / compile-time-const samples on consumer Blackwell (SM12x).
  • Skip the flexible-graph SDPA backward sample on SM120 and above.

Benchmarking 📊

  • Added an autoregressive video DiT SDPA configuration with GB200 / GB300 results.
  • Updated the SDPA benchmarking artifacts and removed stale H200 artifacts.

Acknowledgements

External contributors

  • Thanks @take-cheeze for adding support for static linking of libcudnn.
  • Thanks Ziang Li for adding row-scale support to the grouped GEMM quant path.
  • Thanks Jiayu Sun for the DSA CuTe-DSL guard fixes and the SM90 indexer-forward kernel.