Repo · Arcee AI · published Jul 31, 2024

arcee-ai/DistillKit

Python

Description: An Open Source Toolkit For LLM Distillation

Language: Python

License: Apache-2.0

Stars: 959

Forks: 126

Open issues: 4

Created: 2024-07-31T22:16:09Z

Pushed: 2026-05-12T20:10:14Z

Default branch: main

Fork: no

Archived: no

README:

DistillKit

A flexible and production-ready toolkit for knowledge distillation of large language models, supporting both online and offline distillation workflows with advanced logit compression.

DistillKit powers the training of many of Arcee's popular open-source models, including Virtuoso, SuperNova Medius, and Blitz.

Features

  • Online Distillation: Real-time teacher inference during student training
  • Offline Distillation: Train from pre-captured teacher outputs with advanced compression
  • Advanced Logit Compression: Novel polynomial approximation + quantization + bit-packing achieving high compression ratios while preserving distillation quality
  • Flexible Loss Functions: Composable losses including KL divergence, JSD, TVD, ranking losses, and hidden state alignment
  • Sparse & Dense Support: Efficient sparse distributions (top-k) or exact dense distributions
  • Battle-tested: The infrastructure powering Arcee's distilled model releases
  • HuggingFace Integration: Built on Transformers, TRL, and Accelerate

Why DistillKit?

While online distillation is straightforward, offline distillation at scale requires careful engineering. Simply storing top-k token-logit pairs becomes prohibitively expensive when distilling on billions of tokens.
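To make the storage problem concrete, here is a back-of-the-envelope estimate. It assumes a naive layout of one 32-bit token ID plus one 32-bit float per stored entry. That layout, the function name, and the one-billion-token corpus are illustrative assumptions, not DistillKit's on-disk format.

```python
def naive_topk_bytes(num_tokens: int, k: int, id_bytes: int = 4, value_bytes: int = 4) -> int:
    """Bytes needed to store k (token_id, logit) pairs for every token position."""
    return num_tokens * k * (id_bytes + value_bytes)


tokens = 1_000_000_000  # hypothetical corpus size
k = 128                 # entries kept per token position
total = naive_topk_bytes(tokens, k)
print(f"{total / 1e12:.3f} TB for {tokens:,} tokens at k={k}")
# prints "1.024 TB for 1,000,000,000 tokens at k=128"
```

Under these assumptions every token position costs 1,024 bytes, so a multi-billion-token run lands in the multi-terabyte range before any compression.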

DistillKit's compression system is the result of months of experimentation to strike the delicate balance between storage costs, memory throughput, and distillation quality. Our approach:

1. Polynomial approximation of the logit distribution curve
2. Error-diffusion quantization of residuals to preserve quality
3. Bit-level packing with arbitrary bit widths (1-64 bits)

This enables practical offline distillation workflows that would otherwise be infeasible.
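The bit-level packing step can be sketched in a few lines of standard-library Python. This is only an illustration of packing integers at an arbitrary bit width: `pack_bits` and `unpack_bits` are hypothetical names, and the little-endian layout is an assumption rather than DistillKit's actual byte format.

```python
def pack_bits(values, bit_width):
    """Pack non-negative integers into bytes using exactly bit_width bits each."""
    if not 1 <= bit_width <= 64:
        raise ValueError("bit_width must be in [1, 64]")
    values = list(values)
    acc = 0
    for i, v in enumerate(values):
        if not 0 <= v < (1 << bit_width):
            raise ValueError(f"value {v} does not fit in {bit_width} bits")
        # Place each value at its own bit offset inside one big integer.
        acc |= v << (i * bit_width)
    n_bits = len(values) * bit_width
    return acc.to_bytes((n_bits + 7) // 8, "little")


def unpack_bits(data, bit_width, count):
    """Inverse of pack_bits: recover `count` integers of bit_width bits each."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bit_width) - 1
    return [(acc >> (i * bit_width)) & mask for i in range(count)]


# Token IDs for a 151,936-entry vocabulary fit in 18 bits (2**18 = 262,144).
ids = [0, 151935, 42, 7]
packed = pack_bits(ids, 18)
print(len(packed), "bytes instead of", 4 * len(ids))  # 9 bytes instead of 16
assert unpack_bits(packed, 18, len(ids)) == ids
```

A production implementation would vectorize this rather than build one large Python integer, but the space saving comes from the same idea: spend only as many bits per value as the value range requires.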

Installation

git clone https://github.com/arcee-ai/distillkit.git
cd distillkit
pip install -e .

Optional: Logit Capture

To capture your own teacher outputs, install the capture dependencies:

pip install -e ".[capture]"

For most users, we recommend starting with the pre-captured teacher datasets we provide (see [Datasets](#datasets) below).

Quick Start

Offline Distillation

Train a student model using pre-captured teacher outputs:

# config.yaml
project_name: my-distillation
model: Qwen/Qwen3-8B
output_path: ./output
sequence_length: 8192

dataset:
  train_dataset:
    repo_id: arcee-ai/Qwen3-235B-Logits-Packed-8192 # Pre-captured teacher outputs
    split: train
  prepacked: true

teacher:
  kind: dataset
  logprob_compressor:
    d: 151936 # Vocabulary size
    delta_encoding: true
    error_diffusion: false
    exact_dtype: float32
    exact_k: 32
    k: 128
    polynomial_terms: [0, 1, 2]
    residual_bins: []
    term_dtype: float32

loss_functions:
  - function: cross_entropy
    weight: 0.5
  - function: kl
    weight: 0.5
    temperature: 1.0
    missing_probability_handling: zero
    sparse_chunk_length: 1024

training_args:
  num_train_epochs: 1
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 8
  learning_rate: 2.0e-6
  bf16: true
  optim: adamw_torch
  gradient_checkpointing: true

Run training:

distillkit config.yaml

Online Distillation

For online distillation where the teacher runs alongside student training, see [examples/afm_test.yml](examples/afm_test.yml) for a complete configuration example.

Core Concepts

Knowledge Distillation for LLMs

Knowledge distillation transfers knowledge from a (potentially larger) "teacher" model to a "student" model. Instead of training only on hard labels (the correct token), the student learns from the teacher's probability distribution over tokens, which is a much richer learning signal.

Key benefits:

  • Smaller, faster models with competitive performance
  • Lower inference costs
  • Easier deployment in resource-constrained environments
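As a minimal sketch of what "learning from the teacher's distribution" means for a single token position, the snippet below combines a hard-label cross-entropy term with a KL term against the teacher. It is the textbook formulation in plain Python, not DistillKit's loss code; the function names are hypothetical, and the 0.5/0.5 weights simply mirror the Quick Start config.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def cross_entropy(student_logits, target_index):
    """Hard-label loss: negative log-probability of the correct token."""
    return -log_softmax(student_logits)[target_index]


def kl_divergence(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student): penalizes the student wherever the teacher puts mass."""
    t_logp = log_softmax(teacher_logits, temperature)
    s_logp = log_softmax(student_logits, temperature)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))


def distillation_loss(teacher_logits, student_logits, target_index,
                      ce_weight=0.5, kl_weight=0.5, temperature=1.0):
    """Weighted sum of the hard-label and soft-label terms."""
    return (ce_weight * cross_entropy(student_logits, target_index)
            + kl_weight * kl_divergence(teacher_logits, student_logits, temperature))


teacher = [3.0, 2.5, 0.1, -1.0]   # teacher considers tokens 0 and 1 both plausible
student = [1.0, 0.0, 0.5, 0.0]
print(distillation_loss(teacher, student, target_index=0))
```

The cross-entropy term only sees that token 0 is correct, while the KL term also tells the student that token 1 is nearly as likely, which is the richer signal described above.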

Online vs Offline Distillation

Online Distillation:

  • Teacher runs in real-time during student training
  • No storage overhead
  • Best when: You have sufficient VRAM for both models and dense distributions

Offline Distillation:

  • Teacher outputs pre-captured and compressed
  • Enables training multiple students from the same teacher
  • Best when: VRAM-limited, reusing teacher signals, or training at large scale

Rule of thumb: If you can fit both teacher and student with dense distributions into VRAM, use online distillation. Otherwise, offline distillation with our compression system is the way to go.

Sparse vs Dense Distributions

Dense distributions include probabilities for the full vocabulary. This is more accurate but memory-intensive.
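A quick estimate shows why dense distributions are memory-intensive. The sequence length and vocabulary size below are taken from the Quick Start config; storing each value as a 4-byte float32 and the helper name are assumptions made for illustration.

```python
def dense_logits_bytes(seq_len: int, vocab_size: int,
                       bytes_per_value: int = 4, batch_size: int = 1) -> int:
    """Bytes needed to hold one full-vocabulary distribution per token position."""
    return batch_size * seq_len * vocab_size * bytes_per_value


size = dense_logits_bytes(seq_len=8192, vocab_size=151_936)
print(f"{size / 2**30:.2f} GiB per sequence")  # prints "4.64 GiB per sequence"
```

That figure covers the teacher distribution for a single 8192-token sequence only, before counting the student's own logits, activations, or gradients.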

Sparse distributions store only the top-k tokens, giving a lossy but useful and efficient approximation of the full dense distribution. With sufficient training data, sparse distillation can match the performance of dense distillation.

DistillKit supports both, with automatic chunking for memory-efficient processing of long sequences.
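The following sketch shows one way a sparse top-k teacher signal can be built and used in a KL-style loss. Teacher mass outside the top-k is simply treated as zero here, which is an assumption made for simplicity; the helper names are hypothetical, and this is not a description of how DistillKit's `missing_probability_handling` option behaves.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def top_k_sparse(teacher_logits, k):
    """Keep the k most likely tokens as (token_id, log_prob) pairs, best first."""
    logp = log_softmax(teacher_logits)
    top = sorted(range(len(logp)), key=lambda i: logp[i], reverse=True)[:k]
    return [(i, logp[i]) for i in top]


def covered_mass(sparse):
    """Total teacher probability retained by the sparse representation."""
    return sum(math.exp(lp) for _, lp in sparse)


def sparse_kl(teacher_sparse, student_logits):
    """KL-style sum restricted to the stored tokens; dropped tokens contribute zero."""
    s_logp = log_softmax(student_logits)
    return sum(math.exp(t_lp) * (t_lp - s_logp[i]) for i, t_lp in teacher_sparse)


teacher = [4.0, 2.0, 1.0, 0.0, -1.0]
student = [2.0, 2.0, 0.0, 0.0, 0.0]
sparse = top_k_sparse(teacher, k=2)
print("kept token ids:", [i for i, _ in sparse])
print("teacher mass covered:", round(covered_mass(sparse), 3))
print("sparse KL:", sparse_kl(sparse, student))
```

Because the stored teacher mass sums to less than one, this truncated sum is an approximation of the dense KL divergence and matches it exactly only when k equals the vocabulary size.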

Logit Compression

Our compression system balances storage efficiency with distillation quality:

1. Select top-k logits from teacher output
2. Sort by log-probability, optionally apply delta encoding
3. Fit polynomial to the distribution curve
4. Quantize residuals, with optional error diffusion
5. Bitpack everything into byte vectors

There are lots of knobs you can twiddle here to reach a storage/fidelity tradeoff that works for your particular needs.
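The steps above can be sketched with NumPy as follows. This is a simplified illustration under several assumptions, not DistillKit's implementation: the polynomial is fit over normalized rank, residuals go through a uniform quantizer with one-dimensional error diffusion, and delta encoding, exact leading values, and the final bit-packing are left out. The names `compress` and `decompress` are hypothetical.

```python
import numpy as np


def compress(logprobs, k=32, degree=2, residual_bits=4):
    """Top-k select, sort, fit a polynomial, and quantize the residuals."""
    order = np.argsort(logprobs)[::-1][:k]       # token ids, most likely first
    values = logprobs[order]                     # sorted log-probabilities
    x = np.linspace(0.0, 1.0, k)                 # normalized rank as the curve's x-axis
    coeffs = np.polyfit(x, values, degree)
    residuals = values - np.polyval(coeffs, x)

    lo, hi = residuals.min(), residuals.max()
    levels = (1 << residual_bits) - 1
    step = (hi - lo) / levels if hi > lo else 1.0

    codes = np.empty(k, dtype=np.int64)
    carry = 0.0                                  # quantization error pushed to the next entry
    for i, r in enumerate(residuals):
        target = r + carry
        q = int(np.clip(round((target - lo) / step), 0, levels))
        codes[i] = q
        carry = target - (lo + q * step)
    return order, coeffs, (lo, step), codes


def decompress(coeffs, scale, codes):
    """Rebuild approximate sorted log-probabilities from the compressed pieces."""
    lo, step = scale
    x = np.linspace(0.0, 1.0, len(codes))
    return np.polyval(coeffs, x) + lo + codes * step


rng = np.random.default_rng(0)
logits = rng.normal(size=1000)
logprobs = logits - np.log(np.exp(logits).sum())
order, coeffs, scale, codes = compress(logprobs)
recon = decompress(coeffs, scale, codes)
print("max abs error:", np.max(np.abs(recon - logprobs[order])))
```

With error diffusion, each entry's rounding error is carried into the next one, so individual errors stay within one quantization step while the accumulated error over the whole curve stays within half a step. The integer `codes` and token ids in `order` are what a bit-packing stage would then store at a fixed bit width.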

Recommended configuration (used at Arcee for new captures):

logprob_compressor:
  d: # vocabulary size
  k: 128
  exact_k: 16
  exact_dtype: bfloat16
  polynomial_terms: [0, 1, 2, 3, 4, "sqrt"]
  term_dtype: float32
  residual_bins: []
  delta_encoding: false
  error_diffusion: false

This takes ~300 bytes/token (0.15% of uncompressed distribution size) with minimal quality loss.

If you're a little tight on storage, try the budget pick:

logprob_compressor:
  d: # vocabulary size
  k: 50
  exact_k: 1
  exact_dtype: bfloat16
  polynomial_terms: [0, 1, "sqrt"]
  term_dtype: float32
  residual_bins: []
  delta_encoding: false
  error_diffusion: false

This weighs in at around 114 bytes per token, smaller…
