Repo: Lightning AI, published Mar 18, 2024

Lightning-AI/lightning-thunder

Python


Lightning-AI/lightning-thunder

Description: PyTorch compiler that accelerates training and inference. Get built-in optimizations for performance, memory, parallelism, and easily write your own.

Language: Python

License: Apache-2.0

Stars: 1461

Forks: 114

Open issues: 483

Created: 2024-03-18T15:30:56Z

Pushed: 2026-06-08T20:52:52Z

Default branch: main

Fork: no

Archived: no

README:

Thunder is a source-to-source deep learning compiler for PyTorch that focuses on making it simple to optimize models for training and inference.

It provides:

  • a simple, Pythonic IR capturing the entire computation
  • a rich system of transforms that simultaneously operate on the computation IR, the model, and the weights
  • an extensible dispatch mechanism to fusers and optimized kernel libraries

With Thunder you can:

  • profile deep learning programs easily, map individual ops to kernels and inspect programs interactively
  • programmatically replace sequences of operations with optimized ones and see the effect on performance
  • acquire full computation graphs without graph breaks by flexibly extending the interpreter
  • modify programs to fully utilize bleeding edge kernel libraries on specific hardware
  • write models for single GPU and transform them to run distributed
  • quickly iterate on mixed precision and quantization strategies to search for combinations that minimally affect quality
  • bundle all optimizations in composable recipes, so they can be ported across model families

Ultimately, you should think about Thunder as a highly efficient tool to go from “unoptimized” to “optimized”.

If that interests you, read on to install Thunder and get started quickly.

 

 

Quick start

Install Thunder via pip (more options):

pip install lightning-thunder

pip install -U torch torchvision
pip install nvfuser-cu128-torch28 nvidia-cudnn-frontend # if NVIDIA GPU is present

For older versions of torch

torch==2.7 + CUDA 12.8

pip install lightning-thunder

pip install torch==2.7.0 torchvision==0.22
pip install nvfuser-cu128-torch27 nvidia-cudnn-frontend # if NVIDIA GPU is present

torch==2.6 + CUDA 12.6

pip install lightning-thunder

pip install torch==2.6.0 torchvision==0.21
pip install nvfuser-cu126-torch26 nvidia-cudnn-frontend # if NVIDIA GPU is present

torch==2.5 + CUDA 12.4

pip install lightning-thunder

pip install torch==2.5.0 torchvision==0.20
pip install nvfuser-cu124-torch25 nvidia-cudnn-frontend # if NVIDIA GPU is present
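The nvfuser package names above appear to follow a `nvfuser-cu<CUDA>-torch<torch>` pattern. The sketch below derives the name from a torch and CUDA version pair. The helper `nvfuser_package` is hypothetical, and the pattern is only inferred from the pairings listed in this README; a wheel is not guaranteed to exist for every combination.

```python
def nvfuser_package(torch_version: str, cuda_version: str) -> str:
    """Build a name such as 'nvfuser-cu128-torch27' from '2.7.0' and '12.8'.

    Assumption: the naming pattern is inferred from the examples in this
    README, not taken from nvfuser's packaging documentation.
    """
    torch_major, torch_minor = torch_version.split(".")[:2]
    cuda_major, cuda_minor = cuda_version.split(".")[:2]
    return f"nvfuser-cu{cuda_major}{cuda_minor}-torch{torch_major}{torch_minor}"


# The older-version pairings listed above.
for torch_v, cuda_v in [("2.7.0", "12.8"), ("2.6.0", "12.6"), ("2.5.0", "12.4")]:
    print(f"torch=={torch_v} + CUDA {cuda_v}: pip install {nvfuser_package(torch_v, cuda_v)}")
```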

Advanced install options

Install optional executors

# Float8 support (this will compile from source, be patient)
pip install "transformer_engine[pytorch]"

Install Thunder bleeding edge

pip install git+https://github.com/Lightning-AI/lightning-thunder.git@main

Install Thunder for development

git clone https://github.com/Lightning-AI/lightning-thunder.git
cd lightning-thunder
pip install -e .
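After installing, it can help to check which components are actually importable before compiling a model. The following is a minimal sketch using only the standard library. The mapping from pip package to import name (`thunder`, `nvfuser`, `cudnn`, `transformer_engine`) is an assumption, and the helper `installed_components` is hypothetical.

```python
import importlib.util

# Assumed mapping from pip package name to importable module name.
OPTIONAL_MODULES = {
    "torch": "torch",
    "lightning-thunder": "thunder",
    "nvfuser": "nvfuser",
    "nvidia-cudnn-frontend": "cudnn",
    "transformer_engine": "transformer_engine",
}


def installed_components() -> dict:
    """Report which packages can be found, without importing them."""
    return {
        package: importlib.util.find_spec(module) is not None
        for package, module in OPTIONAL_MODULES.items()
    }


for package, found in installed_components().items():
    print(f"{package}: {'found' if found else 'missing'}")
```

A missing GPU-only package (nvfuser, cuDNN frontend, Transformer Engine) is expected on machines without an NVIDIA GPU.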

Hello world

Define a function or a torch module:

import torch.nn as nn

model = nn.Sequential(nn.Linear(2048, 4096), nn.ReLU(), nn.Linear(4096, 64))

Optimize it with Thunder:

import thunder
import torch

thunder_model = thunder.compile(model)

x = torch.randn(64, 2048)

y = thunder_model(x)

torch.testing.assert_close(y, model(x))

Examples

LLM training

Install LitGPT (without updating other dependencies)

pip install --no-deps 'litgpt[all]'

and run

import thunder
import torch
import litgpt

with torch.device("cuda"):
    model = litgpt.GPT.from_name("Llama-3.2-1B").to(torch.bfloat16)

thunder_model = thunder.compile(model)

inp = torch.ones((1, 2048), device="cuda", dtype=torch.int64)

out = thunder_model(inp)
out.sum().backward()

HuggingFace BERT inference

Install Hugging Face Transformers (version 4.50.2 or later is recommended)

pip install -U transformers

and run

import thunder
import torch
import transformers

model_name = "bert-large-uncased"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

with torch.device("cuda"):
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16
    )
    model.requires_grad_(False)
    model.eval()

    # Tokenize inside the device context so the inputs land on the same device as the model
    inp = tokenizer(["Hello world!"], return_tensors="pt")

thunder_model = thunder.compile(model)

out = thunder_model(**inp)
print(out)

HuggingFace DeepSeek R1 distill inference

Install Hugging Face Transformers (version 4.50.2 or later is recommended)

pip install -U transformers

and run

import torch
import transformers
import thunder

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

with torch.device("cuda"):
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16
    )
    model.requires_grad_(False)
    model.eval()

    # Tokenize inside the device context so the inputs land on the same device as the model
    inp = tokenizer(["Hello world! Here's a long story"], return_tensors="pt")

thunder_model = thunder.compile(model)

out = thunder_model.generate(
    **inp, do_sample=False, cache_implementation="static", max_new_tokens=100
)
print(out)

Vision Transformer inference

import thunder
import torch
import torchvision as tv

with torch.device("cuda"):
    model = tv.models.vit_b_16()
    model.requires_grad_(False)
    model.eval()

    # Create the input inside the device context so it lives on the same device as the model
    inp = torch.randn(128, 3, 224, 224)

out = model(inp)

thunder_model = thunder.compile(model)

out = thunder_model(inp)

Benchmarks

Although Thunder is a tool for optimizing models, rather than an opaque compiler that gets you speedups out of the box, here is a set of benchmarks.

Performance-wise, out of the box Thunder is in the ballpark of torch.compile, especially when using CUDAGraphs. Note, however, that Thunder is not a competitor to torch.compile: it can use torch.compile as one of its fusion executors.

The script examples/quickstart/hf_llm.py demonstrates how to benchmark a model for text generation, forward pass, forward pass with loss, and a full forward + backward computation.
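For quick comparisons of your own, a small timing helper is often enough. The sketch below is not the contents of examples/quickstart/hf_llm.py; `time_ms` and its parameters are hypothetical names. It runs a few warm-up calls (so one-time compilation cost is excluded), then reports the median wall-clock time. The optional `sync` hook exists because GPU work is asynchronous; passing `torch.cuda.synchronize` is the usual choice when timing CUDA code.

```python
import statistics
import time


def time_ms(fn, *args, warmup=3, iters=10, sync=None, **kwargs):
    """Return the median wall-clock time of fn(*args, **kwargs) in milliseconds."""
    for _ in range(warmup):
        fn(*args, **kwargs)  # warm-up calls absorb compilation and caching
    if sync is not None:
        sync()  # make sure warm-up work has finished before timing starts

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args, **kwargs)
        if sync is not None:
            sync()  # wait for asynchronous work before stopping the clock
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


# Stand-in workload so the snippet runs without a GPU.
print(f"sum over 100k ints: {time_ms(sum, range(100_000)):.3f} ms")
```

With a compiled model this would be called as, for example, `time_ms(thunder_model, x, sync=torch.cuda.synchronize)` and compared against `time_ms(model, x, sync=torch.cuda.synchronize)`.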

On an H100 with torch 2.8.0, nvfuser-cu128-torch28, and Transformers 4.55.4, running Llama 3.2 1B, we see the following timings:

Transformers with torch.compile and CUDAGraphs (reduce-overhead mode): 521ms
Transformers with torch.compile but no CUDAGraphs (default mode): 814ms
Transformers without torch.compile: 1493ms…
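To put the quoted timings in relative terms, the arithmetic can be sketched as follows. Only the three entries visible in this excerpt are used; the list is truncated, so no other figures are included. The helper `speedups` is a hypothetical name.

```python
# Timings quoted above, in milliseconds.
timings_ms = {
    "torch.compile + CUDAGraphs (reduce-overhead)": 521,
    "torch.compile, no CUDAGraphs (default)": 814,
    "no torch.compile": 1493,
}


def speedups(timings, baseline_key):
    """Speedup of each entry relative to the baseline (baseline time / entry time)."""
    baseline = timings[baseline_key]
    return {name: baseline / ms for name, ms in timings.items()}


for name, factor in speedups(timings_ms, "no torch.compile").items():
    print(f"{name}: {factor:.2f}x")
# torch.compile + CUDAGraphs (reduce-overhead): 2.87x
# torch.compile, no CUDAGraphs (default): 1.83x
# no torch.compile: 1.00x
```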
