Fork · FriendliAI · published Jul 8, 2025

friendliai/cutlass

forked from NVIDIA/cutlass

Description: CUDA Templates for Linear Algebra Subroutines

License: NOASSERTION

Stars: 0

Forks: 0

Open issues: 0

Created: 2025-07-08T10:38:38Z

Pushed: 2025-07-06T12:49:22Z

Default branch: main

Fork: yes

Parent repository: NVIDIA/cutlass

Archived: no

README: ![ALT](./media/images/gemm-hierarchy-with-epilogue-no-labels.png "Complete CUDA GEMM decomposition")

Overview

CUTLASS 4.1.0

_CUTLASS 4.1.0 - July 2025_

CUTLASS is a collection of abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement. CUTLASS decomposes these "moving parts" into reusable, modular software components and abstractions.

Primitives for different levels of a conceptual parallelization hierarchy can be specialized and tuned via custom tiling sizes, data types, and other algorithmic policy. The resulting flexibility simplifies their use as building blocks within custom kernels and applications.
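The hierarchical decomposition described above can be illustrated with a small pure-Python sketch. This is not CUTLASS code: the function names and the tile sizes (`tile_m`, `tile_n`, `tile_k`) are hypothetical. The loops only mimic the idea of splitting the output into tiles and marching along K in tile-sized steps.

```python
def tiled_gemm(A, B, M, N, K, tile_m=4, tile_n=4, tile_k=2):
    """Compute C = A @ B on lists of lists using a two-level tiling.

    Illustrative only: tile sizes and names are assumptions, not CUTLASS defaults.
    """
    C = [[0.0] * N for _ in range(M)]
    # Outer level: each (m0, n0) output tile plays the role of a threadblock tile.
    for m0 in range(0, M, tile_m):
        for n0 in range(0, N, tile_n):
            # Mainloop: step along K one tile at a time, accumulating partial sums.
            for k0 in range(0, K, tile_k):
                # Inner level: multiply-accumulate over one tile. The min() bounds
                # let problem sizes be non-multiples of the tile sizes.
                for m in range(m0, min(m0 + tile_m, M)):
                    for n in range(n0, min(n0 + tile_n, N)):
                        acc = C[m][n]
                        for k in range(k0, min(k0 + tile_k, K)):
                            acc += A[m][k] * B[k][n]
                        C[m][n] = acc
    return C


def reference_gemm(A, B, M, N, K):
    """Untiled triple loop, used only to check the tiled version."""
    return [[sum(A[m][k] * B[k][n] for k in range(K)) for n in range(N)]
            for m in range(M)]
```

Changing the tile sizes changes the loop structure but not the result, which is the sense in which tiling is a tunable policy rather than part of the algorithm's definition.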

CUTLASS has provided CUDA C++ template abstractions for high-performance linear algebra since 2017. These abstractions support a wide range of computations, including mixed-precision computations, specialized data-movement (async copy), and multiply-accumulate abstractions for FP64, FP32, TF32, FP16, BF16, FP32 emulation via tensor core instructions, 8b floating point types (e5m2 and e4m3), block scaled data types (NVIDIA NVFP4 and OCP standard MXFP4, MXFP6, MXFP8), narrow integer types (4 and 8b signed and unsigned integers), and binary 1b data types (where architectures natively support such data types) across NVIDIA's Volta, Turing, Ampere, Ada, Hopper, and Blackwell architectures.
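To make the 8b floating point names concrete, here is a minimal decoder sketch. It assumes that e5m2 and e4m3 follow the OCP FP8 encodings (5 exponent bits with bias 15 and IEEE-style infinities for e5m2; 4 exponent bits with bias 7, no infinities, and a single NaN mantissa pattern for e4m3). The text above does not spell out these layouts, and the helper names are hypothetical rather than CUTLASS APIs.

```python
import math


def decode_fp8(byte, exp_bits, man_bits, bias, ieee_specials):
    """Decode one 8-bit float (1 sign bit, then exponent, then mantissa)."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    max_exp = (1 << exp_bits) - 1
    max_man = (1 << man_bits) - 1

    if ieee_specials and exp == max_exp:
        # e5m2-style: an all-ones exponent encodes infinity or NaN.
        return sign * math.inf if man == 0 else math.nan
    if not ieee_specials and exp == max_exp and man == max_man:
        # e4m3-style (assumed): only the all-ones pattern is NaN, no infinities.
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def decode_e5m2(byte):
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15, ieee_specials=True)


def decode_e4m3(byte):
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7, ieee_specials=False)
```

Under these assumptions, e4m3 trades dynamic range for one extra mantissa bit, while e5m2 keeps the wider exponent range.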

To this rich ecosystem of C++ based kernel programming abstractions, CUTLASS 4 adds CUTLASS DSLs. These are Python-native interfaces for writing high-performance CUDA kernels based on core CUTLASS and CuTe concepts without any performance compromises. This allows for a much smoother learning curve, orders-of-magnitude faster compile times, native integration with DL frameworks without writing glue code, and much more intuitive metaprogramming that does not require deep C++ expertise.

Overall, we envision CUTLASS DSLs as a family of domain-specific languages (DSLs). With the release of 4.0, we are releasing the first of these, CuTe DSL. This is a low-level programming model that is fully consistent with CuTe C++ abstractions, exposing core concepts such as layouts, tensors, hardware atoms, and full control over the hardware thread and data hierarchy.

CuTe DSL demonstrates optimal matrix multiply and other linear algebra operations targeting the programmable, high-throughput _Tensor Cores_ implemented by NVIDIA's Ampere, Hopper, and Blackwell architectures.

We believe it will become an indispensable tool for students, researchers, and performance engineers alike: it flattens the learning curve of GPU programming, speeds up prototyping of kernel designs, and helps bring optimized solutions into production.

CuTe DSL is currently in public beta and will graduate out of beta by the end of summer 2025.

To get started quickly, please refer to:

What's New in CUTLASS 4.1

CuTe DSL

  • More examples demonstrating how to use CuTe DSL to write peak-performance kernels
    • Blackwell Mamba2 SSD
  • API updates
    • for loop
      • Python built-in `range` now always generates IR and executes at runtime
      • `cutlass.range` is an advanced `range` with IR-level unrolling and pipelining control
      • Deprecated `cutlass.range_dynamic`; please replace it with `range` or `cutlass.range`
      • Experimental: added `pipelining` control for compiler-generated software pipeline code
    • while/if
      • `while`/`if` now generate IR and execute at runtime by default unless `cutlass.const_expr` is specified for the predicate
      • Deprecated `cutlass.dynamic_expr`; please remove it
    • Rename mbarrier functions to reduce ambiguity
    • Modify SyncObject API (`MbarrierArray`, `NamedBarrier`, `TmaStoreFence`) to match `std::barrier`
    • Change pipeline `create` function to take only keyword arguments, and make `barrier_storage` optional
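The difference between control flow that "generates IR and executes at runtime" and control flow resolved while the kernel is being traced can be illustrated with a toy model. The sketch below is purely conceptual: `ToyTracer`, `dynamic_for`, and `static_for` are hypothetical names and are not part of the CuTe DSL API. `dynamic_for` is loosely analogous to a loop that is lowered to IR, while `static_for` mimics what happens when the loop is evaluated at trace time as a constant expression.

```python
class ToyTracer:
    """Hypothetical stand-in for a tracing compiler; not part of CUTLASS."""

    def __init__(self):
        self.ir = []  # flat list of emitted pseudo-instructions

    def emit(self, op):
        self.ir.append(op)

    def dynamic_for(self, trip_count, body):
        # Runtime loop: the body is traced once and wrapped in a single loop op,
        # so the amount of IR does not depend on the trip count.
        self.emit(f"for i in 0..{trip_count}:")
        body(self, "i")
        self.emit("end for")

    def static_for(self, trip_count, body):
        # Trace-time loop: ordinary Python iteration replays the body once per
        # iteration, which yields fully unrolled IR.
        for i in range(trip_count):
            body(self, str(i))


def body(tracer, index):
    # One multiply-accumulate step, recorded as text for illustration.
    tracer.emit(f"  acc += a[{index}] * b[{index}]")
```

In this model, `dynamic_for(4, body)` records three pseudo-instructions regardless of the trip count, whereas `static_for(4, body)` records four copies of the body. The real DSL's lowering is more involved; the sketch only shows why the distinction matters for code size and for what must be known at compile time.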

CUTLASS C++

  • Further enhance Blackwell SM100 Attention kernels in example 77.
    • Add variable sequence length support for FMHA Backward kernel.
    • Add varlen test support to Backward runner.
    • Code supports empty batch sequences.
  • Replace `subbyte_iterator` with `cute::recast_ptr` when constructing logical iterators/arrays.
  • CuTe changes:
    • Rewrite ArithTuple and ScaledBasis for robustness and clarity.
    • Remove buggy and kludgy `get_layoutA|B|C_MN` and friends from Atoms/TiledX.
    • Factor out `print_latex` and friends and rewrite.
    • Factor out `print_svg` and friends and rewrite.
  • Support Blackwell SM100 SIMT FFMA2 kernels.
  • Support residual add for implicit GEMM kernels.
  • Various fixes for CUTLASS C++ Python interface's EVT tracer:
    • Add verifier for sm90 to report invalid input.
    • When adding an edge to the graph, if the edge already exists, add an identity compute node to avoid having multiple parallel edges.
    • Register the tanh, sigmoid, exp, and gelu operations with the Python AST frontend.
    • Replace the NotImplemented error by packing all nodes into a single topological visitor node as a fallback.
  • Fix profiler bugs in exhaustive perf search.
    • Fix incorrect cluster shape output issue when doing exhaustive search.
    • Fix a bug in profiler grouped GEMM for setting tile scheduler swizzles, cluster shapes, and raster orders.
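The EVT tracer fix for duplicate edges, listed above, can be sketched in isolation. The class below is a hypothetical illustration of the idea, not the tracer's actual data structure: when the same source feeds the same destination twice (for example `x * x`), the second edge is routed through a fresh identity node so that the graph never contains two parallel edges.

```python
class ToyGraph:
    """Hypothetical DAG container; names are illustrative, not the EVT tracer API."""

    def __init__(self):
        self.nodes = {}     # node name -> operation kind
        self.edges = set()  # (src, dst) pairs; a set cannot hold parallel edges
        self._next_id = 0

    def add_node(self, name, op):
        self.nodes[name] = op

    def add_edge(self, src, dst):
        """Add src -> dst; return the inserted identity node's name, if any."""
        if (src, dst) in self.edges:
            # The edge already exists: route the second use through an identity
            # compute node so both inputs stay distinguishable.
            ident = f"identity_{self._next_id}"
            self._next_id += 1
            self.add_node(ident, "identity")
            self.edges.add((src, ident))
            self.edges.add((ident, dst))
            return ident
        self.edges.add((src, dst))
        return None
```

For `x * x`, the first `add_edge("x", "mul")` adds a direct edge and the second yields `x -> identity_0 -> mul`, so the multiply node still sees two distinct producers.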

Note: CUTLASS 4.x builds are known to be broken on Windows platforms for all CUDA toolkits. The CUTLASS team is working on a fix.

See the [CHANGELOG](https://docs.nvidia.com/cutlass/CHANGELOG.html) for details of all past releases and updates.

Performance

CUTLASS primitives are very efficient. When used to construct device-wide GEMM kernels, they exhibit nearly optimal utilization of peak theoretical…

Notability

notability 1.0/10

Routine fork of existing library