Repo: ByteDance (Doubao/Seed), published Nov 26, 2025

ByteDance-Seed/SwiftSpec

Python

Description: This is a minimal artifact with state-of-the-art speculative decoding as described in the SwiftSpec paper [ASPLOS'26].

Language: Python

License: Apache-2.0

Stars: 10

Forks: 0

Open issues: 0

Created: 2025-11-26T03:35:19Z

Pushed: 2026-03-24T07:18:48Z

Default branch: main

Fork: no

Archived: no

README:

SwiftSpec: Disaggregated Speculative Decoding and Fused Kernels for Low-Latency LLM Inference

Paper link: [ASPLOS'26]

![overview](figures/overview.png)

This is a minimal artifact with state-of-the-art speculative decoding as described in the SwiftSpec paper (recently accepted in the ASPLOS 2026 summer cycle!).
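For readers new to the technique, the basic draft-then-verify loop behind speculative decoding can be sketched as below. This is a generic greedy illustration with toy stand-in models, not SwiftSpec's tree-based, disaggregated implementation; the names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical and do not come from this repository.

```python
from typing import Callable, List

# A "model" here is any function that maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, depth: int) -> List[int]:
    """Greedy draft-then-verify loop with a chain-shaped draft (not a tree)."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The small draft model proposes `depth` tokens autoregressively.
        proposal: List[int] = []
        for _ in range(depth):
            proposal.append(draft(tokens + proposal))
        # 2. The target model checks each proposed token. A real system would
        #    score all positions in a single batched forward pass.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # keep the target's token at the first mismatch
                break
            accepted.append(tok)
        else:
            # Every drafted token matched, so the target adds one bonus token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:limit]


# Toy stand-ins: the target counts modulo 10; the draft agrees except after a 4.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


baseline = greedy_decode(toy_target, [0], 12)
speculative = speculative_decode(toy_target, toy_draft, [0], 12, depth=3)
print(speculative == baseline)
```

Because every accepted token is checked against the target model, the speculative output matches plain greedy decoding with the target; the gain comes from accepting several tokens per verification step.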

Highlighted results: an average of 369 tokens/second when serving the Llama3.3 70B int4-AWQ model on an NVIDIA 8xH800 GPU node!

Features

  • Disaggregated tree generation: supports both parallel tree generation (as in SwiftSpec) and serial tree generation (as in SpecExec).
  • Latency-optimized kernels: a set of latency-optimized kernels that perform well at low batch sizes, especially for small models.
  • Auto-pad for arbitrary Tensor Parallelism: pads model weights to support any even tensor-parallel degree for supported models.
  • Support for Qwen and Llama models: supports models including Llama3/deepseek-coder/Qwen2/DeepSeek-R1-Distill-Qwen/DeepSeek-R1-Distill-Llama.
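The auto-pad feature above can be illustrated with a minimal sketch: round a weight dimension up to the next multiple of the tensor-parallel degree, zero-pad, and split evenly across ranks. This is an assumption about the general idea only; the helpers `padded_size` and `pad_and_shard_rows` are hypothetical, and the repository's actual padding (for example, its handling of attention heads or quantization groups) may differ.

```python
from typing import List

Matrix = List[List[float]]


def padded_size(dim: int, tp_degree: int) -> int:
    """Smallest multiple of tp_degree that is greater than or equal to dim."""
    return -(-dim // tp_degree) * tp_degree


def pad_and_shard_rows(weight: Matrix, tp_degree: int) -> List[Matrix]:
    """Zero-pad the row dimension to a multiple of tp_degree, then split it evenly."""
    rows, cols = len(weight), len(weight[0])
    total = padded_size(rows, tp_degree)
    # Copy the original rows and append all-zero rows as padding.
    padded = [list(row) for row in weight] + [[0.0] * cols for _ in range(total - rows)]
    per_rank = total // tp_degree
    return [padded[i * per_rank:(i + 1) * per_rank] for i in range(tp_degree)]


# Example: 10 rows do not divide evenly over 6 ranks, so they are padded to 12.
weight = [[float(i)] * 2 for i in range(10)]
shards = pad_and_shard_rows(weight, tp_degree=6)
print(len(shards), [len(s) for s in shards])
```

In this toy case the last rank holds only padding rows; zero rows contribute nothing to a matrix product, which is why zero-padding is a natural choice for this kind of sketch.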

Performance

SwiftSpec Performance Examples

Serving Llama3.3-70B-Instruct INT4-AWQ on 8xH800:

![Swiftspec on 8xH800](figures/swiftspec.gif)

📋 Table of Contents

  • [Installation and Quick Start](#installation-and-quick-start)
  • [Prerequisites](#prerequisites)
  • [0. Install environment](#0-install-environment)
  • [1. Modify the path in exp_configs.py](#1-modify-the-path-in-the-exp_configspy)
  • [2. Download huggingface Models and convert models into tensor parallel checkpoints](#2-download-huggingface-models--convert-models-into-tensor-parallel-checkpoints)
  • [3. Run single request demo](#3-run-single-request-demo)
  • [Model Support](#model-support)
  • [Performance Results](#performance-results)
  • [Inference Speed (Tokens/sec) on an 8xH800 GPU node](#inference-speed-tokenssec-on-an-8xh800-gpu-node)
  • [Acknowledgement](#acknowledgement)
  • [Citation](#citation)

Installation and Quick Start

Prerequisites

  • Python 3.10
  • CUDA 12.4
  • H800 GPU

0. Install environment

git submodule init
git submodule update

# install packages
conda create -n awq python==3.10 -y
conda activate awq

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124

pip install --upgrade pip

cd swiftspec
pip install --use-pep517 -e .
pip install ninja
pip install flash-attn==2.7.3 --no-build-isolation

# install AWQ kernels
cd awq/kernels
python setup.py install

1. Modify the path in the [exp_configs.py](tinychat/utils/exp_configs.py)

awq_prefix = "awq_cache/"  # You don't have to change this
model_path_prefix = "/path/to/huggingface/model"  # change this to the path of the downloaded huggingface model
ckpt_prefix = "/root/workspace/models/"  # change this to any path where you want to store the SwiftSpec ckpt

2. Download huggingface Models && Convert models into tensor parallel checkpoints

cd scripts
python prepare_models.py llama3.3

3. Run single request demo

cd scripts
# launch the demo as a web page
# If you are launching the demo on a GPU server reached over SSH, consider using a tool such as proxychains to forward the port to your local computer (e.g. proxychains4 -f proxychains.conf ssh -L 7860:0.0.0.0:7860 [ssh_name])
python web-demo.py llama3.3

# run on all queries
# to get the data, copy the https://github.com/SafeAILab/EAGLE/tree/main/eagle/data folder into this repo (in the repo root directory)
python bench_exp.py

Model Support

Supported target/draft models

| Model Family | Sizes | Example Script |
|-------------|-------|----------------|
| Llama3/Llama3.3 | 1B/3B/8B/70B | [python prepare_models.py llama3.3](scripts/prepare_models.py) && [python web-demo.py llama3.3](scripts/web-demo.py) |
| deepseek-coder | 1.3B/6.7B/33B | [python prepare_models.py deepseek](scripts/prepare_models.py) && [python web-demo.py deepseek](scripts/web-demo.py) |
| Qwen2 | 0.5B/1.5B/7B/72B | [python prepare_models.py qwen](scripts/prepare_models.py) && [python web-demo.py qwen](scripts/web-demo.py) |
| DeepSeek-R1-Distill-Qwen | 1.5B/7B/32B | [python prepare_models.py r1qwen](scripts/prepare_models.py) && [python web-demo.py r1qwen](scripts/web-demo.py) |
| DeepSeek-R1-Distill-Llama | 8B/70B | [python prepare_models.py r1llama](scripts/prepare_models.py) && [python web-demo.py r1llama](scripts/web-demo.py) |

Performance Results

Inference Speed (Tokens/sec) on an 8xH800 GPU node

| Model | Draft model | Precision | Depth | Target TP | Tokens per second |
|-------|-----------|------|----------|-------------|--------|
| Llama-3.3-70b-Instruct | Llama-3.2-3B | bf16 | 6 | 4 | 369 |
| Llama-3-70b-Instruct | Llama-3.2-3B | bf16 | 6 | 4 | 347 |
| deepseek-coder-33b-instruct | deepseek-coder-1.3b-instruct | bf16 | 5 | 6 | 472 |
| Qwen2-72B-Instruct | Qwen2-1.5B-Instruct | bf16 | 5 | 6 | 274 |
| DeepSeek-R1-Distill-Qwen-32B | DeepSeek-R1-Distill-Qwen-1.5B | bf16 | 5 | 4 | 317 |
| DeepSeek-R1-Distill-Llama-70B | DeepSeek-R1-Distill-Llama-8B | bf16 | 5 | 4 | 268 |

Performance is measured across 6 datasets (the same way the EAGLE series is evaluated).
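To relate these throughput figures to latency, the short sketch below converts tokens per second into average milliseconds per generated token. It assumes the reported numbers are single-request decoding speeds, which this README does not state explicitly; the helper `ms_per_token` is hypothetical, and the values are copied from the table above.

```python
# Tokens-per-second figures copied from the table above.
tokens_per_second = {
    "Llama-3.3-70b-Instruct": 369,
    "Llama-3-70b-Instruct": 347,
    "deepseek-coder-33b-instruct": 472,
    "Qwen2-72B-Instruct": 274,
    "DeepSeek-R1-Distill-Qwen-32B": 317,
    "DeepSeek-R1-Distill-Llama-70B": 268,
}


def ms_per_token(tps: float) -> float:
    """Average time per generated token in milliseconds, given tokens per second."""
    return 1000.0 / tps


for model, tps in tokens_per_second.items():
    print(f"{model}: {ms_per_token(tps):.2f} ms/token")
```

For example, 369 tokens/second corresponds to roughly 2.7 ms per generated token on average.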

Acknowledgement

Thanks to:

  • The llm-awq project, on which a large part of our single-model inference code relies
  • The EAGLE project, from which we adapted the verification step of the speculative decoding methods

Citation

If you find SwiftSpec useful in your research, please cite our paper:

@misc{zhang2025swiftspecultralowlatencyllm,
title={SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding},
author={Ziyi Zhang and Ziheng Jiang and Chengquan Jiang and Menghan Yu and Size Zheng and Haibin Lin and Henry Hoffmann and Xin Liu},
year={2025},
eprint={2506.11309},
archivePrefix={arXiv},
primaryClass={cs.DC},
url={https://arxiv.org/abs/2506.11309},
}
