meituan-longcat/SGLang-FluentLLM
Language: Python
License: Apache-2.0
Stars: 83
Forks: 6
Open issues: 2
Created: 2026-02-03T13:35:56Z
Pushed: 2026-04-29T08:41:09Z
Default branch: main
Fork: no
Archived: no
README:
SGLang-FluentLLM
The LongCat series models have consistently followed the principle of Model–System Co-Design, which introduces unique challenges for both the training and inference systems. To help the community better adopt and use LongCat models, we are open-sourcing part of our inference engine (SGLang-FluentLLM) as well as several key kernels.
Engine
Our inference engine is built on top of the SGLang codebase, with the following enhanced capabilities:
- Refactored the speculative decoding workflow to make it compatible with overlap scheduling
- Combined Target + Verify + Draft into a single CUDA graph to reduce speculative decoding overhead
- Support for Eagle, MTP, and PLD style speculative decoding
- Layer-wise KVCache transfer, overlapping prefill computation with KVCache communication
- Decode Radix Tree Cache to reduce KVCache transfer volume between prefill and decode (PD) instances
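To make the speculative decoding items above more concrete, here is a minimal sketch of PLD-style (prompt lookup) drafting followed by greedy verification. It is an illustration only and is not taken from the SGLang-FluentLLM code: the function names `pld_draft` and `verify_greedy`, the `ngram` and `max_draft` parameters, and the plain-list token representation are all hypothetical, and the real engine runs these steps on the GPU inside a CUDA graph.

```python
from typing import List, Sequence


def pld_draft(tokens: Sequence[int], ngram: int = 2, max_draft: int = 4) -> List[int]:
    """Hypothetical PLD drafter: find the most recent earlier occurrence of
    the trailing n-gram and copy the tokens that followed it as the draft."""
    if len(tokens) <= ngram:
        return []
    tail = list(tokens[-ngram:])
    # Scan right to left, skipping the trailing n-gram itself.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == tail:
            return list(tokens[start + ngram:start + ngram + max_draft])
    return []


def verify_greedy(draft: Sequence[int], target: Sequence[int]) -> List[int]:
    """Greedy verification sketch (assumed, not the engine's exact rule).

    target[i] is the target model's greedy token at draft position i and
    target has one more entry than draft. The longest matching prefix of the
    draft is accepted, then the target's own token at the first mismatch
    (or the extra trailing token if everything matched) is appended."""
    accepted: List[int] = []
    for i, tok in enumerate(draft):
        if tok != target[i]:
            break
        accepted.append(tok)
    accepted.append(target[len(accepted)])
    return accepted


if __name__ == "__main__":
    history = [5, 6, 7, 8, 9, 5, 6]
    draft = pld_draft(history)  # trailing [5, 6] also occurs at the start
    print(draft)
    print(verify_greedy(draft, [7, 8, 1, 2, 3]))
```

Because the draft comes from a lookup rather than a separate model, PLD adds almost no drafting cost; Eagle and MTP replace `pld_draft` with a learned draft head while the verification idea stays similar.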
We sincerely appreciate the solid work and inspiration brought by the SGLang community.
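The Decode Radix Tree Cache item in the list above can be pictured with a small prefix-matching sketch: if the decode side already holds KV entries for a prefix of the incoming request, only the unmatched suffix needs to be transferred from the prefill side. The code below uses a plain per-token trie rather than a compressed radix tree, and the names `PrefixCache`, `match_len`, and `tokens_to_transfer` are hypothetical; eviction, reference counting, and the actual KV tensors are omitted.

```python
from typing import Dict, List, Sequence


class PrefixNode:
    """One trie node per cached token (a radix tree would merge chains)."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixNode"] = {}


class PrefixCache:
    """Hypothetical decode-side index of token prefixes with cached KV."""

    def __init__(self) -> None:
        self.root = PrefixNode()

    def match_len(self, tokens: Sequence[int]) -> int:
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: Sequence[int]) -> None:
        """Record that KV for this whole token sequence is now cached."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())


def tokens_to_transfer(cache: PrefixCache, tokens: Sequence[int]) -> List[int]:
    """Return only the suffix whose KV must still be sent to the decode side."""
    hit = cache.match_len(tokens)
    cache.insert(tokens)
    return list(tokens[hit:])


if __name__ == "__main__":
    cache = PrefixCache()
    print(tokens_to_transfer(cache, [1, 2, 3, 4]))      # nothing cached yet
    print(tokens_to_transfer(cache, [1, 2, 3, 9, 10]))  # shares prefix [1, 2, 3]
```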
Kernels
On the kernels side, we are open-sourcing:
- FlashMLA SwapAB optimizations
- FlashMLA FP8 KVCache + FP8 Compute optimizations
  - This optimization is detailed in the paper **SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining**.
- DeepGemm SwapAB Offset + PDL optimizations
- Communication–computation fused kernel optimizations in FlashInfer
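The README does not spell out what SwapAB means. Assuming it refers to the common trick of swapping the two GEMM operands and computing the transposed product, the underlying identity is `C = A @ B = (B^T @ A^T)^T`. The sketch below checks that identity in plain Python; `matmul`, `transpose`, and `matmul_swap_ab` are hypothetical helper names, and the real kernels apply the swap at the tensor-core instruction level rather than by materializing transposes.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive reference matrix product for small matrices."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m: Matrix) -> Matrix:
    """Return the transpose of a rectangular matrix."""
    return [list(col) for col in zip(*m)]


def matmul_swap_ab(a: Matrix, b: Matrix) -> Matrix:
    """Compute A @ B by swapping operands: C^T = B^T @ A^T, then transpose.

    Assumption: this mirrors the idea behind SwapAB, where the small
    dimension of A (for example a small decode batch) ends up on the other
    side of the multiply."""
    return transpose(matmul(transpose(b), transpose(a)))


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0]]                   # M=1, K=3 (tiny batch)
    b = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # K=3, N=2
    print(matmul(a, b))
    print(matmul_swap_ab(a, b))
```

A plausible motivation, stated here as an assumption, is that the swap moves a small batch dimension onto the side of the matrix instruction that allows finer tile sizes, which reduces padding waste at small batch sizes.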
We would also like to thank the broader LLM inference community. It is an honor for us to grow together with this community.
Note
- We use Dynamo for KVCache-aware request scheduling. As a result, in SGLang-FluentLLM we have removed SGLang’s sgl-model-gateway.
- For multimodal models, we adopt a decoupled architecture that differs from the one used in the SGLang community. Therefore, multimodal support has also been removed from SGLang-FluentLLM itself (even in our internal setup, SGLang-FluentLLM is still used as the LLM backbone for multimodal inference).
- Tested on NVIDIA H800 and H20 GPUs.
How to Use
Please refer to the Quick Start.
Notability
4.0/10: New fork with moderate stars