Repo: DeepSeek, published Feb 13, 2025

deepseek-ai/DeepGEMM

Cuda


Description: DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Language: Cuda

License: MIT

Stars: 7364

Forks: 1040

Open issues: 78

Created: 2025-02-13T09:09:21Z

Pushed: 2026-06-04T06:01:18Z

Default branch: main

Fork: no

Archived: no

README:

DeepGEMM

DeepGEMM is a unified, high-performance tensor core kernel library that brings together the key computation primitives of modern large language models — GEMMs (FP8, FP4, BF16), fused MoE with overlapped communication (Mega MoE), MQA scoring for the lightning indexer, HyperConnection (HC), and more — into a single, cohesive CUDA codebase. All kernels are compiled at runtime via a lightweight Just-In-Time (JIT) module, requiring no CUDA compilation during installation.

DeepGEMM leverages some concepts from CUTLASS and CuTe, but avoids heavy reliance on their templates or algebras. The library is designed for simplicity, with only a limited number of core kernel functions, making it a clean and accessible resource for learning NVIDIA GPU kernel optimization techniques.

Despite its lightweight design, DeepGEMM's performance matches or exceeds expert-tuned libraries across various matrix shapes.

News

  • 2026.04.16: Mega MoE, FP8xFP4 GEMM, FP4 Indexer, PDL, faster JIT compilation and more.
      • Please see #304 for more details.
      • For Mega MoE benchmarks, refer to #316.
  • 2025.09.28: DeepGEMM now supports scoring kernels (weighted ReLU MQA logits) for the lightning indexer for DeepSeek v3.2.
      • Please see #200 for more details.
  • 2025.07.20: DeepGEMM now supports both SM90 and SM100, and has been fully refactored around a low-CPU-overhead JIT CPP module.
      • NVRTC and post-compilation SASS optimization are both disabled.
      • NVRTC will be supported later.
      • As NVCC 12.9 performs FFMA interleaving automatically, post-compilation optimizations will no longer be supported.
      • Please see #112 for more details.
  • 2025.05.14: DeepGEMM now offers weight gradient kernels for dense and MoE backward! See #95 for details.
  • 2025.05.07: DeepGEMM now supports NVRTC with up to 10x compilation speedup! See #94 for details. Set DG_JIT_USE_NVRTC=1 to enable it (this may cause a performance loss in some cases).
  • 2025.04.18: DeepGEMM now achieves up to 1550 TFLOPS on H800! See #74, #78, #81, #86 and 340d988 for details.

Quick start

Requirements

  • NVIDIA SM90 or SM100 architecture GPU
  • Python 3.8 or higher
  • Compilers with C++20 support
  • CUDA Toolkit:
      • CUDA 12.3 or higher for SM90
          • We highly recommend 12.9 or higher for the best performance
      • CUDA 12.9 or higher for SM100
  • PyTorch 2.1 or higher
  • CUTLASS 4.0 or higher (can be cloned as a Git submodule)
  • {fmt} library (can be cloned as a Git submodule)

Development

# Submodules must be cloned as well (hence --recursive)
git clone --recursive git@github.com:deepseek-ai/DeepGEMM.git
cd DeepGEMM

# Link some essential includes and build the CPP JIT module
cat develop.sh
./develop.sh

Installation

cat install.sh
./install.sh

Then, import deep_gemm in your Python project, and enjoy!

Interfaces

Notices

This library provides optimized GEMM kernels for NVIDIA GPUs and follows the naming convention D = C + A @ B. The default input layout is NT (non-transposed A, transposed B). The SM90 implementation supports only the NT memory layout (row-major A, column-major B), while the SM100 implementation supports all four memory layouts (NT, TN, NN, TT). For example, fp8_gemm_nt computes D = C + A @ B.T.
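The NT convention can be pinned down with a small reference in plain Python. This is only a sketch of the math, not the library's kernel: gemm_nt_reference is a hypothetical helper, it uses nested lists instead of tensors, and it ignores FP8 and scaling entirely. It assumes A is stored as M x K rows and B as N x K rows, so that B is consumed transposed.

```python
def gemm_nt_reference(a, b, c=None):
    """Compute D = C + A @ B.T on nested lists (hypothetical reference helper).

    a: M rows of length K, b: N rows of length K, c: optional M rows of length N.
    """
    m, k = len(a), len(a[0])
    n = len(b)
    assert all(len(row) == k for row in b), "A and B must share the K dimension"
    d = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = c[i][j] if c is not None else 0.0
            for p in range(k):
                # b[j][p] is element (p, j) of B.T
                acc += a[i][p] * b[j][p]
            d[i][j] = acc
    return d


a = [[1.0, 2.0], [3.0, 4.0]]               # M = 2, K = 2
b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # N = 3, K = 2
c = [[10.0, 10.0, 10.0], [0.0, 0.0, 0.0]]  # M x N accumulator input
d = gemm_nt_reference(a, b, c)
```

Because both operands are indexed along K in their innermost dimension, each output element is a dot product of one row of A with one row of B.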

For both architectures, the LHS scaling factor must have a TMA-aligned, transposed layout. The data format of the scaling factors differs between SM90 and SM100:

  • SM90 requires scaling factors in FP32 format.
  • SM100 requires scaling factors in packed UE8M0 format, with four UE8M0 values packed into a single torch.int.
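To make the packed UE8M0 idea concrete, here is a minimal sketch in plain Python. It relies on the usual E8M0 definition (an 8-bit biased exponent with bias 127 and no mantissa, so a byte e stands for 2^(e - 127)). Everything else is an assumption for illustration: the helper names are hypothetical, rounding scales up to the next power of two is one possible choice, and the byte order inside the 32-bit word (first value in the lowest bits) may not match what the kernels expect.

```python
import math

UE8M0_BIAS = 127  # exponent bias of the E8M0 format


def to_ue8m0(scale):
    """Round a positive scale up to a power of two; return its biased exponent byte."""
    assert scale > 0.0
    mantissa, exp = math.frexp(scale)  # scale == mantissa * 2**exp, 0.5 <= mantissa < 1
    exponent = exp - 1 if mantissa == 0.5 else exp  # exact ceil(log2(scale))
    biased = exponent + UE8M0_BIAS
    assert 0 <= biased <= 254, "0xFF is reserved for NaN in E8M0"
    return biased


def from_ue8m0(byte):
    """Decode a biased exponent byte back to a power-of-two scale."""
    return 2.0 ** (byte - UE8M0_BIAS)


def pack4(exponent_bytes):
    """Pack four bytes into one 32-bit word (assumed order: first byte in the lowest bits)."""
    assert len(exponent_bytes) == 4
    word = 0
    for i, b in enumerate(exponent_bytes):
        word |= (b & 0xFF) << (8 * i)
    return word  # unsigned here; a signed 32-bit tensor would reinterpret the same bits


def unpack4(word):
    """Inverse of pack4."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


scales = [1.0, 0.5, 3.0, 0.01]
encoded = [to_ue8m0(s) for s in scales]
packed = pack4(encoded)
```

Rounding up keeps each decoded scale at or above the original, so values divided by it cannot grow past the range they had under the exact scale.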

Please note that operations such as input transposition or FP8 casting must be handled separately by the user; please implement them independently or fuse them into prior kernels. The library provides some simple PyTorch utility functions for these steps, but they may be slow, since our primary focus is on optimizing the GEMM kernels themselves.
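As an illustration of the scaling arithmetic a user-side cast has to perform, the sketch below computes one scale per group of consecutive elements in a row. Several things here are assumptions rather than statements about the library: the group size of 128, the small clamp that avoids division by zero, and the helper name per_group_scales. The only fixed fact is that 448 is the largest finite value of FP8 E4M3. The code stops before the actual FP8 rounding, which a real implementation would do on the GPU.

```python
FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
GROUP = 128           # assumed group size along K, for illustration only


def per_group_scales(row, group=GROUP):
    """Return (scaled_values, scales) with one scale per `group` consecutive elements."""
    assert len(row) % group == 0, "row length must be a multiple of the group size"
    scaled, scales = [], []
    for start in range(0, len(row), group):
        chunk = row[start:start + group]
        amax = max(abs(x) for x in chunk)
        # The 1e-4 floor is an arbitrary choice that keeps all-zero groups well defined.
        scale = max(amax, 1e-4) / FP8_E4M3_MAX
        scales.append(scale)
        scaled.extend(x / scale for x in chunk)  # these values would then be cast to FP8
    return scaled, scales


row = [float(i - 128) for i in range(256)]  # two groups: -128..-1 and 0..127
scaled, scales = per_group_scales(row)
```

Multiplying a scaled value by its group's scale recovers the original, which is what the GEMM does when it applies the scaling factors.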

Normal dense GEMMs (non-grouped)

To perform a basic non-grouped FP8 GEMM, call one of the fp8_gemm_{nt, nn, tn, tt} functions. For more details, please refer to the function documentation.

Grouped GEMMs (contiguous layout)

Unlike traditional grouped GEMMs in CUTLASS, DeepGEMM groups only the M-axis, while N and K must remain fixed. This design is tailored for scenarios where experts in an MoE model share the same shape. For training forward passes or inference prefilling, where each expert may process a varying number of tokens, we concatenate these tokens into a single tensor, referred to as the "contiguous" layout. Note that each expert segment must be aligned to the GEMM M block size (get_mk_alignment_for_contiguous_layout()). For more information, please refer to the m_grouped_fp8_gemm_{nt, nn}_contiguous function documentation.
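The bookkeeping for the contiguous layout can be sketched as follows. This is a hypothetical helper in plain Python, not part of the library: it pads each expert's token count up to the alignment, records where each segment starts, and builds a per-row expert index. The alignment of 4 is chosen only to keep the example readable (the real value comes from get_mk_alignment_for_contiguous_layout()), and marking padding rows with -1 is an assumption.

```python
def build_contiguous_layout(tokens_per_expert, alignment):
    """Return (total_rows, segment_starts, row_to_expert) for an M-grouped layout."""
    starts, row_to_expert = [], []
    offset = 0
    for expert, count in enumerate(tokens_per_expert):
        padded = -(-count // alignment) * alignment  # round count up to a multiple
        starts.append(offset)
        row_to_expert.extend([expert] * count)
        row_to_expert.extend([-1] * (padded - count))  # padding rows (assumed marker)
        offset += padded
    return offset, starts, row_to_expert


# Three experts receiving 3, 0 and 5 tokens, with a toy alignment of 4.
total, starts, row_to_expert = build_contiguous_layout([3, 0, 5], alignment=4)
```

With this layout every block of `alignment` rows belongs to exactly one expert, which is why the alignment must match the GEMM M block size.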

We also provide a K-axis-grouped API for MoE weight backward, in which M and N must remain fixed. Please refer to k_grouped_fp8_gemm_tn_contiguous for more information.

Grouped GEMMs (masked layout)

During the inference decoding phase, when CUDA graph is enabled and the CPU is unaware of the number of tokens each expert receives, we support masked grouped GEMMs. By providing a mask tensor, the kernel computes only the valid portions.

Use m_grouped_fp8_gemm_nt_masked for this purpose and consult the relevant documentation. One example is to feed it the output of the low-latency kernels from DeepEP.
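The masked semantics can be illustrated with a plain-Python reference. It is a sketch under assumptions, not the kernel's actual interface: the helper name, the nested-list shapes ([groups][max_m][K] for A, [groups][N][K] for B), and the use of None for rows that are never written are all hypothetical. The point is only that each group computes its first masked_m[g] rows and skips the rest.

```python
def masked_grouped_gemm_nt_reference(a, b, masked_m):
    """Per-group D[g] = A[g] @ B[g].T, computed only for the first masked_m[g] rows."""
    out = []
    for g, valid_rows in enumerate(masked_m):
        n = len(b[g])
        d_g = [[None] * n for _ in a[g]]  # None marks rows the kernel would not touch
        for i in range(valid_rows):
            for j in range(n):
                d_g[i][j] = sum(x * y for x, y in zip(a[g][i], b[g][j]))
        out.append(d_g)
    return out


a = [
    [[1.0, 2.0], [3.0, 4.0]],  # group 0: max_m = 2, K = 2
    [[5.0, 6.0], [7.0, 8.0]],  # group 1
]
b = [
    [[1.0, 1.0]],              # group 0: N = 1, K = 2
    [[1.0, -1.0]],             # group 1
]
d = masked_grouped_gemm_nt_reference(a, b, masked_m=[1, 2])
```

Since the valid row counts live in a tensor rather than in host-side shapes, the same launch configuration can be reused, which fits the CUDA graph scenario described above.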

V3.2…

