What does this repo signal mean?

ByteDance (Doubao/Seed) published ByteDance-Seed/SwiftSpec (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo ByteDance-Seed/SwiftSpec · language Python · New repo, low stars, routine fork/job. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

ByteDance (Doubao/Seed) Repo: ByteDance-Seed/SwiftSpec

Captured source

GitHub/github.com/ByteDance-Seed/SwiftSpec

ByteDance-Seed/SwiftSpec repository metadata

published Nov 26, 2025 · seen Jun 5 · captured Jun 11 · http 200 · method plain

ByteDance-Seed/SwiftSpec

Description: This is a minimal artifact with state-of-the-art speculative decoding as described in the SwiftSpec paper [ASPLOS' 26].

Language: Python

License: Apache-2.0

Stars: 10

Forks: 0

Open issues: 0

Created: 2025-11-26T03:35:19Z

Pushed: 2026-03-24T07:18:48Z

Default branch: main

Fork: no

Archived: no

README:

SwiftSpec: Disaggregated Speculative Decoding and Fused Kernels for Low-Latency LLM Inference

Paper link: [ASPLOS'26]

![overview](figures/overview.png)

This is a minimal artifact with state-of-the-art speculative decoding as described in the SwiftSpec paper (recently accepted at ASPLOS 2026 Summer cycle!).
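For readers new to speculative decoding, the sketch below shows the generic greedy draft-and-verify step that the technique builds on. It is not SwiftSpec's tree-based, disaggregated implementation: `verify_greedy` and the toy `target_next` function are hypothetical names introduced here, and the target model is called once per position for clarity, whereas real systems typically score all draft positions in a single forward pass.

```python
from typing import Callable, List


def verify_greedy(
    draft: List[int],
    target_next: Callable[[List[int]], int],
    prefix: List[int],
) -> List[int]:
    """Return the tokens committed by one greedy draft-and-verify step.

    draft:       tokens proposed by the (small) draft model
    target_next: hypothetical stand-in for the target model's greedy next token
    prefix:      tokens already committed before this step
    """
    committed: List[int] = []
    context = list(prefix)
    for tok in draft:
        expected = target_next(context)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest of the draft.
            committed.append(expected)
            return committed
        committed.append(tok)
        context.append(tok)
    # Every draft token matched, so the target contributes one extra token.
    committed.append(target_next(context))
    return committed


def toy_target(context: List[int]) -> int:
    """Toy target model (assumption): next token is the last token plus one, mod 10."""
    return (context[-1] + 1) % 10


# The draft diverges at its third token, so two draft tokens plus one correction are kept.
print(verify_greedy([4, 5, 7, 8], toy_target, [1, 2, 3]))
```

Each step commits at least one token, so the output matches what greedy decoding with the target model alone would produce; the gain comes from accepting several draft tokens per target pass.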

Highlighted results: an average of 369 tokens/second when serving the Llama3.3 70B int4-AWQ model on an Nvidia 8xH800 GPU node!

Features

Disaggregated tree generation: supports both parallel tree generation (as in SwiftSpec) and serial tree generation (as in SpecExec).
Latency-optimized kernels: a set of latency-optimized kernels that perform well at low batch sizes, especially for small models.
Auto-pad for arbitrary tensor parallelism: pads model weights to support any even tensor-parallel degree for supported models.
Support for Qwen and Llama models: supported models include Llama3, deepseek-coder, Qwen2, DeepSeek-R1-Distill-Qwen, and DeepSeek-R1-Distill-Llama.
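The auto-pad feature can be pictured with a minimal sketch, assuming the padding rounds a weight dimension up to the next multiple of the tensor-parallel degree and fills the extra rows with zeros. The helper names `padded_size` and `shard_rows` are hypothetical, and the repository's actual padding logic (which dimensions it pads and how) may differ.

```python
from typing import List


def padded_size(dim: int, tp_degree: int) -> int:
    """Smallest multiple of tp_degree that is greater than or equal to dim."""
    if tp_degree <= 0:
        raise ValueError("tp_degree must be positive")
    return -(-dim // tp_degree) * tp_degree  # ceiling division, then scale back up


def shard_rows(weight: List[List[float]], tp_degree: int) -> List[List[List[float]]]:
    """Zero-pad the rows of a 2-D weight, then split it into tp_degree equal shards."""
    rows = len(weight)
    cols = len(weight[0]) if weight else 0
    total = padded_size(rows, tp_degree)
    # Copy the original rows and append all-zero rows up to the padded size.
    padded = [list(r) for r in weight] + [[0.0] * cols for _ in range(total - rows)]
    per_shard = total // tp_degree
    return [padded[i * per_shard:(i + 1) * per_shard] for i in range(tp_degree)]


# A dimension of 64 does not divide evenly across 6 devices, so it is padded to 66.
print(padded_size(64, 6))

# Five rows across two devices: padded to six rows, three per shard.
shards = shard_rows([[1.0, 2.0]] * 5, 2)
print([len(s) for s in shards])
```

Zero rows are used here so that the padded part contributes nothing to a matrix product; whether that holds for every layer type is an assumption of this sketch.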

Performance

Swiftspec Performance Examples

Serving Llama3.3-70B-Instruct INT4-AWQ on 8xH800:

![Swiftspec on 8xH800](figures/swiftspec.gif)

📋 Table of Contents

[Installation and Quick Start](#installation-and-quick-start)
[Prerequisites](#prerequisites)
[0. Install environment](#0-install-environment)
[1. Download huggingface Models and compare AWQ checkpoints](#1-download-huggingface-models-and-compare-awq-checkpoints)
[2. Convert models into tensor parallel checkpoints](#2-convert-models-into-tensor-parallel-checkpoints)
[3. Run single request demo](#3-run-single-request-demo)
[Model Support](#model-support)
[Performance Results](#performance-results)
[Inference Speed (Tokens/sec) on an 8xH800 GPU node](#inference-speed-tokenssec-on-an-8xh800-gpu-node)
[Acknowledgement](#acknowledgement)
[Citation](#citation)

Installation and Quick Start

Prerequisites

Python 3.10
CUDA 12.4
H800 GPU

0. Install environment

git submodule init
git submodule update

# install packages
conda create -n awq python==3.10 -y
conda activate awq

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124

pip install --upgrade pip

cd swiftspec
pip install --use-pep517 -e .
pip install ninja
pip install flash-attn==2.7.3 --no-build-isolation

# install AWQ kernels
cd awq/kernels
python setup.py install

1. Modify the paths in [exp_configs.py](tinychat/utils/exp_configs.py)

awq_prefix = "awq_cache/" # You don't have to change this
model_path_prefix="/path/to/huggingface/model" # change this to the path to the downloaded huggingface model
ckpt_prefix = "/root/workspace/models/" # change this to any path that you want to store the SwiftSpec ckpt

2. Download Hugging Face models and convert them into tensor-parallel checkpoints

cd scripts
python prepare_models.py llama3.3

3. Run single request demo

cd scripts
# launch the demo as a web page
# If you launch the demo on a GPU server reached over SSH, consider a tool such as proxychains to forward the port to your local computer (e.g. proxychains4 -f proxychains.conf ssh -L 7860:0.0.0.0:7860 [ssh_name])
python web-demo.py llama3.3

# run on all queries
# to get the data, copy the https://github.com/SafeAILab/EAGLE/tree/main/eagle/data folder into this repo's root directory
python bench_exp.py

Model Support

Supported target/draft models

| Model Family | Sizes | Example Script |
|-------------|-------|----------------|
| Llama3/Llama3.3 | 1B/3B/8B/70B | [python prepare_models.py llama3.3](scripts/prepare_models.py) && [python web-demo.py llama3.3](scripts/web-demo.py) |
| deepseek-coder | 1.3b/6.7b/33b | [python prepare_models.py deepseek](scripts/prepare_models.py) && [python web-demo.py deepseek](scripts/web-demo.py) |
| Qwen2-72B | 0.5B/1.5B/7B/72B | [python prepare_models.py qwen](scripts/prepare_models.py) && [python web-demo.py qwen](scripts/web-demo.py) |
| DeepSeek-R1-Distill-Qwen | 1.5B/7B/32B | [python prepare_models.py r1qwen](scripts/prepare_models.py) && [python web-demo.py r1qwen](scripts/web-demo.py) |
| DeepSeek-R1-Distill-Llama | 8B/70B | [python prepare_models.py r1llama](scripts/prepare_models.py) && [python web-demo.py r1llama](scripts/web-demo.py) |

Performance Results

Inference Speed (Tokens/sec) on an 8xH800 GPU node

| Model | Draft model | Precision | Depth | Target TP | Tokens per second |
|-------|-----------|------|----------|-------------|--------|
| Llama-3.3-70b-Instruct | Llama-3.2-3B | bf16 | 6 | 4 | 369 |
| Llama-3-70b-Instruct | Llama-3.2-3B | bf16 | 6 | 4 | 347 |
| deepseek-coder-33b-instruct | deepseek-coder-1.3b-instruct | bf16 | 5 | 6 | 472 |
| Qwen2-72B-Instruct | Qwen2-1.5B-Instruct | bf16 | 5 | 6 | 274 |
| DeepSeek-R1-Distill-Qwen-32B | DeepSeek-R1-Distill-Qwen-1.5B | bf16 | 5 | 4 | 317 |
| DeepSeek-R1-Distill-Llama-70B | DeepSeek-R1-Distill-Llama-8B | bf16 | 5 | 4 | 268 |

Performance is measured across 6 datasets (the same way the EAGLE series is evaluated).
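To read these throughput figures as latency, the short sketch below converts tokens per second into average milliseconds per generated token. It assumes the numbers describe a single request's decode throughput, which the table does not state explicitly; `ms_per_token` is a hypothetical helper and the values are copied from the table above.

```python
# Tokens per second reported in the table above, keyed by target model.
throughput = {
    "Llama-3.3-70b-Instruct": 369,
    "Llama-3-70b-Instruct": 347,
    "deepseek-coder-33b-instruct": 472,
    "Qwen2-72B-Instruct": 274,
    "DeepSeek-R1-Distill-Qwen-32B": 317,
    "DeepSeek-R1-Distill-Llama-70B": 268,
}


def ms_per_token(tokens_per_second: float) -> float:
    """Average time per generated token in milliseconds."""
    return 1000.0 / tokens_per_second


for model, tps in throughput.items():
    print(f"{model}: {ms_per_token(tps):.2f} ms/token")
```

At 369 tokens per second, for example, the average works out to roughly 2.7 ms per token.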

Acknowledgement

Thanks to:

llm-awq project, which a large part of our single model inference code relies on
EAGLE project, from which we adapted the verification of the speculative decoding methods

Citation

If you find SwiftSpec useful in your research, please cite our paper:

@misc{zhang2025swiftspecultralowlatencyllm,
title={SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding},
author={Ziyi Zhang and Ziheng Jiang and Chengquan Jiang and Menghan Yu and Size Zheng and Haibin Lin and Henry Hoffmann and Xin Liu},
year={2025},
eprint={2506.11309},
archivePrefix={arXiv},
primaryClass={cs.DC},
url={https://arxiv.org/abs/2506.11309},
}

Notability

notability 3.0/10

New repo, low stars, routine fork/job