What does this repo signal mean?

Meituan (LongCat) published meituan-longcat/SGLang-FluentLLM, a Python repository. Repository signals like this one expose tooling, eval, infrastructure, or model-adjacent work before it appears in a launch post. High-signal details: repo meituan-longcat/SGLang-FluentLLM · language Python · new repository with moderate stars (83), built on the SGLang codebase but not a GitHub fork. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Meituan (LongCat) Repo: meituan-longcat/SGLang-FluentLLM

Captured source

GitHub: github.com/meituan-longcat/SGLang-FluentLLM

meituan-longcat/SGLang-FluentLLM repository metadata

published Feb 3, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

meituan-longcat/SGLang-FluentLLM

Language: Python

License: Apache-2.0

Stars: 83

Forks: 6

Open issues: 2

Created: 2026-02-03T13:35:56Z

Pushed: 2026-04-29T08:41:09Z

Default branch: main

Fork: no

Archived: no

README:

SGLang-FluentLLM

The LongCat series models have consistently followed the principle of Model–System Co-Design, which introduces unique challenges for both the training and inference systems. To help the community better adopt and use LongCat models, we are open-sourcing part of our inference engine (SGLang-FluentLLM) as well as several key kernels.

Engine

Our inference engine is built on top of the SGLang codebase, with the following enhanced capabilities:

Refactored the speculative decoding workflow to make it compatible with overlap scheduling
Combined Target + Verify + Draft into a single CUDA graph to reduce speculative decoding overhead
Support for Eagle, MTP, and PLD style speculative decoding
Layer-wise KVCache transfer, overlapping prefill computation with KVCache communication
Decode Radix Tree Cache to reduce KVCache transfer volume between prefill and decode (PD) instances
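
To make the speculative decoding items above more concrete, here is a minimal sketch of PLD-style drafting with greedy verification, assuming PLD refers to prompt lookup decoding. It is not taken from SGLang-FluentLLM: the names propose_draft, verify, and toy_target are hypothetical, and a toy function stands in for the target model.

```python
from typing import Callable, List, Sequence


def propose_draft(tokens: Sequence[int], ngram: int = 2, max_draft: int = 4) -> List[int]:
    """Draft tokens by copying what followed an earlier occurrence of the trailing n-gram."""
    if len(tokens) <= ngram:
        return []
    tail = list(tokens[-ngram:])
    # Scan right to left so the most recent earlier match wins.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == tail:
            follow = start + ngram
            return list(tokens[follow:follow + max_draft])
    return []


def verify(draft: Sequence[int], target_next: Callable[[List[int]], int],
           context: Sequence[int]) -> List[int]:
    """Greedy verification: keep draft tokens while they match the target's choice.

    A real engine scores all draft positions in one target forward pass;
    this sketch calls the target once per position for clarity.
    """
    accepted: List[int] = []
    ctx = list(context)
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the rejected token with the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # one extra token when every draft token is accepted
    return accepted


def toy_target(ctx: List[int]) -> int:
    """Hypothetical stand-in for the target model: counts upward modulo 5."""
    return (ctx[-1] + 1) % 5


if __name__ == "__main__":
    history = [0, 1, 2, 3, 4, 0, 1]
    draft = propose_draft(history)
    print(draft, verify(draft, toy_target, history))
```

Every accepted draft token saves one sequential decode step, which is why the per-step overhead of drafting and verification matters for the CUDA graph and overlap scheduling items in the list.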

We sincerely appreciate the solid work and inspiration brought by the SGLang community.

Kernels

On the kernels side, we are open-sourcing:

FlashMLA SwapAB optimizations
FlashMLA FP8 KVCache + FP8 Compute optimizations
This optimization is detailed in the paper "SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining".
DeepGemm SwapAB Offset + PDL optimizations
Communication–computation fused kernel optimizations in FlashInfer

We would also like to thank the broader LLM inference community. It is an honor for us to grow together with this community.

Note

We use Dynamo for KVCache-aware request scheduling. As a result, in SGLang-FluentLLM we have removed SGLang’s sgl-model-gateway.
For multimodal models, we adopt a decoupled architecture that differs from the one used in the SGLang community. Therefore, multimodal support has also been removed from SGLang-FluentLLM itself (even in our internal setup, SGLang-FluentLLM is still used as the LLM backbone for multimodal inference).
Tested on NVIDIA H800 and H20 GPUs.
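
The KVCache-aware request scheduling mentioned in the first note can be pictured with a small sketch: send each request to the worker whose cached token prefixes overlap it most, discounted by that worker's current load. This is an illustration only and does not use Dynamo's API; shared_prefix_len, pick_worker, the load_weight parameter, and the scoring rule are all assumptions. A production router would more likely index KV blocks in a radix tree than scan sequences linearly.

```python
from typing import Dict, List, Optional, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(request: Sequence[int],
                cached: Dict[str, List[List[int]]],
                load: Dict[str, int],
                load_weight: float = 1.0) -> Optional[str]:
    """Pick the worker with the highest score.

    score = longest cached prefix shared with the request
            - load_weight * number of active requests on the worker
    Ties go to the worker whose name sorts first. Returns None if there are no workers.
    """
    best_worker: Optional[str] = None
    best_score = float("-inf")
    for worker in sorted(cached):
        overlap = max((shared_prefix_len(request, seq) for seq in cached[worker]), default=0)
        score = overlap - load_weight * load.get(worker, 0)
        if score > best_score:
            best_worker, best_score = worker, score
    return best_worker


if __name__ == "__main__":
    cached = {"w0": [[1, 2, 3, 4]], "w1": [[1, 2, 9]]}
    print(pick_worker([1, 2, 3, 4, 5], cached, {"w0": 1, "w1": 0}))
```

The load term keeps one worker with a popular prefix from absorbing all traffic; how cache reuse is weighed against load in practice is a tuning decision this sketch does not settle.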

How to Use

Please refer to Quick Start

Notability

notability 4.0/10

New repository with moderate stars (built on the SGLang codebase, not a GitHub fork)