ForkFriendliAIFriendliAIpublished Jul 9, 2025seen 5d

friendliai/flashinfer

forked from flashinfer-ai/flashinfer

Open original ↗

Captured source

source ↗
published Jul 9, 2025seen 5dcaptured 14hhttp 200method plain

friendliai/flashinfer

Description: FlashInfer: Kernel Library for LLM Serving

License: Apache-2.0

Stars: 0

Forks: 0

Open issues: 0

Created: 2025-07-09T05:28:48Z

Pushed: 2025-07-08T20:00:25Z

Default branch: main

Fork: yes

Parent repository: flashinfer-ai/flashinfer

Archived: no

README:

Kernel Library for LLM Serving

| Blog | Documentation | Slack | Discussion Forum |

![Build Status](https://ci.tlcpack.ai/job/flashinfer-ci/job/main/) ![Release](https://github.com/flashinfer-ai/flashinfer/actions/workflows/release_wheel.yml) ![Documentation](https://github.com/flashinfer-ai/flashinfer/actions/workflows/build-doc.yml)

FlashInfer is a library and kernel generator for Large Language Models that provides high-performance implementation of LLM GPU kernels such as FlashAttention, SparseAttention, PageAttention, Sampling, and more. FlashInfer focuses on LLM serving and inference, and delivers state-of-the-art performance across diverse scenarios.

Check our v0.2 release blog for new features!

The core features of FlashInfer include: 1. Efficient Sparse/Dense Attention Kernels: Efficient single/batch attention for sparse(paged)/dense KV-storage on CUDA Cores and Tensor Cores (both FA2 & FA3) templates. The vector-sparse attention can achieve 90% of the bandwidth of dense kernels with same problem size. 2. Load-Balanced Scheduling: FlashInfer decouples plan/run stage of attention computation where we schedule the computation of variable-length inputs in plan stage to alleviate load-imbalance issue. 3. Memory Efficiency: FlashInfer offers Cascade Attention for hierarchical KV-Cache, and implements Head-Query fusion for accelerating Grouped-Query Attention, and efficient kernels for low-precision attention and fused-RoPE attention for compressed KV-Cache. 4. Customizable Attention: Bring your own attention variants through JIT-compilation. 5. CUDAGraph and torch.compile Compatibility: FlashInfer kernels can be captured by CUDAGraphs and torch.compile for low-latency inference. 6. Efficient LLM-specific Operators: High-Performance fused kernel for Top-P, Top-K/Min-P sampling without the need to sorting.

FlashInfer supports PyTorch, TVM and C++ (header-only) APIs, and can be easily integrated into existing projects.

News

  • [Mar 10, 2025] Blog Post Sorting-Free GPU Kernels for LLM Sampling, which explains the design of sampling kernels in FlashInfer.
  • [Mar 1, 2025] Checkout flashinfer's intra-kernel profiler for visualizing the timeline of each threadblock in GPU kernels.
  • [Dec 16, 2024] Blog Post FlashInfer 0.2 - Efficient and Customizable Kernels for LLM Inference Serving
  • [Sept 2024] We've launched a Slack workspace for Flashinfer users and developers. Join us for timely support, discussions, updates and knowledge sharing!
  • [Jan 31, 2024] Blog Post Cascade Inference: Memory-Efficient Shared Prefix Batch Decoding
  • [Jan 31, 2024] Blog Post Accelerating Self-Attentions for LLM Serving with FlashInfer

Getting Started

Using our PyTorch API is the easiest way to get started:

Install from PIP

We provide prebuilt python wheels for Linux. Install FlashInfer with the following command:

# For CUDA 12.6 & torch 2.6
pip install flashinfer-python -i https://flashinfer.ai/whl/cu126/torch2.6
# For other CUDA & torch versions, check https://docs.flashinfer.ai/installation.html

To try the latest features from the main branch, use our nightly-built wheels:

pip install flashinfer-python -i https://flashinfer.ai/whl/nightly/cu126/torch2.6

For a JIT version (compiling every kernel from scratch, NVCC is required), install from PyPI:

pip install flashinfer-python

Install from Source

Alternatively, build FlashInfer from source:

git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
python -m pip install -v .

# for development & contribution, install in editable mode
python -m pip install --no-build-isolation -e . -v

To pre-compile essential kernels ahead-of-time (AOT), run the following command:

# Set target CUDA architectures
export TORCH_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a 10.0a"
# Build AOT kernels. Will produce AOT kernels in aot-ops/
python -m flashinfer.aot
# Build AOT wheel
python -m build --no-isolation --wheel
# Install AOT wheel
python -m pip install dist/flashinfer-*.whl

For more details, refer to the Install from Source documentation.

Trying it out

Below is a minimal example of using FlashInfer's single-request decode/append/prefill attention kernels:

import torch
import flashinfer

kv_len = 2048
num_kv_heads = 32
head_dim = 128

k = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)
v = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)

# decode attention

num_qo_heads = 32
q = torch.randn(num_qo_heads, head_dim).half().to(0)

o = flashinfer.single_decode_with_kv_cache(q, k, v) # decode attention without RoPE on-the-fly
o_rope_on_the_fly = flashinfer.single_decode_with_kv_cache(q, k, v, pos_encoding_mode="ROPE_LLAMA") # decode with LLaMA style RoPE on-the-fly

# append attention
append_qo_len = 128
q = torch.randn(append_qo_len, num_qo_heads, head_dim).half().to(0) # append attention, the last 128 tokens in the KV-Cache are the new tokens
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True) # append attention without RoPE on-the-fly, apply causal mask
o_rope_on_the_fly =…

Excerpt shown — open the source for the full document.

Notability

notability 2.0/10

Routine fork, no notable traction