What does this fork signal mean?

Novita AI created the fork novitalabs/modded-nanogpt from KellerJordan/modded-nanogpt. A fork signal like this points to upstream code the lab may be inspecting, patching, or building on. High-signal details: repo novitalabs/modded-nanogpt · parent KellerJordan/modded-nanogpt · routine fork, no notable traction. onlylabs links this event to 1 captured evidence page and 6 related fork signals.

Novita AI Fork: novitalabs/modded-nanogpt

Captured source

GitHub · github.com/novitalabs/modded-nanogpt

novitalabs/modded-nanogpt repository metadata

Published Jun 3, 2026 · seen Jun 5 · captured Jun 11 · HTTP 200 · method plain

novitalabs/modded-nanogpt

Description: NanoGPT (124M) in 90 seconds

License: MIT

Stars: 0

Forks: 0

Open issues: 0

Created: 2026-06-03T08:52:09Z

Pushed: 2026-06-05T08:50:30Z

Default branch: master

Fork: yes

Parent repository: KellerJordan/modded-nanogpt

Archived: no

README:

Modded-NanoGPT

This repository hosts the *NanoGPT speedrun*, in which we (collaboratively|competitively) search for the fastest algorithm to use 8 NVIDIA H100 GPUs to train a language model that attains 3.28 cross-entropy loss on the FineWeb validation set.
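For intuition, the 3.28 target can be restated as a perplexity. This is a minimal sketch, assuming the loss is the mean per-token cross-entropy measured in nats (the usual PyTorch convention; the excerpt does not state the unit). The function names are illustrative and not taken from the repository.

```python
import math

# Target validation loss quoted in the README.
TARGET_LOSS = 3.28


def perplexity(cross_entropy_nats: float) -> float:
    """Convert mean per-token cross-entropy (assumed to be in nats) to perplexity."""
    return math.exp(cross_entropy_nats)


def bits_per_token(cross_entropy_nats: float) -> float:
    """Convert a loss in nats to bits by dividing by ln(2)."""
    return cross_entropy_nats / math.log(2)


print(f"perplexity ~ {perplexity(TARGET_LOSS):.1f}")
print(f"bits per token ~ {bits_per_token(TARGET_LOSS):.2f}")
```

Under that assumption, the target corresponds to a perplexity between 26 and 27.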

(Note: Besides the main track, there is also an [optimization track](records/track_3_optimization) where we try to minimize steps subject to fixed arch/data/bsz and with unlimited wallclock budget.)

The target (3.28 validation loss on FineWeb) follows Andrej Karpathy's GPT-2 replication in llm.c, which attains that loss after running for 45 minutes. The speedrun code also descends from llm.c's PyTorch trainer, which itself descends from NanoGPT, hence the name of the repo. Thanks to the efforts of many contributors, this repo now contains a training algorithm which attains the target performance in:

Under 90 seconds on 8xH100 (the llm.c GPT-2 replication needed 45 minutes)
Under 400M tokens (the llm.c GPT-2 replication needed 10B)
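The two comparisons above can be turned into ratios. This is a rough sketch using only the figures quoted in the README; it assumes the two wallclock numbers are directly comparable, which the excerpt does not confirm for the baseline hardware. Because the speedrun figures are upper bounds ("under"), the ratios are lower bounds.

```python
# Figures quoted in the README.
BASELINE_SECONDS = 45 * 60        # llm.c GPT-2 replication: 45 minutes
SPEEDRUN_SECONDS = 90             # speedrun: under 90 seconds
BASELINE_TOKENS = 10_000_000_000  # llm.c GPT-2 replication: 10B tokens
SPEEDRUN_TOKENS = 400_000_000     # speedrun: under 400M tokens

wallclock_speedup = BASELINE_SECONDS / SPEEDRUN_SECONDS
token_reduction = BASELINE_TOKENS / SPEEDRUN_TOKENS

print(f"wallclock speedup: at least {wallclock_speedup:.0f}x")
print(f"token reduction: at least {token_reduction:.0f}x")
```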

This improvement in training speed has been brought about by the following techniques:

Modernized architecture: Rotary embeddings, QK-Norm, and ReLU²
The Muon optimizer [writeup] [repo]
FP8 matmul for the head, plus asymmetric rescaling and softcapping of the logits
Initialization of projections to zero (muP-like)
Skip connections from embedding to every block as well as from block 3 to 6
Extra embeddings which are mixed into the values in attention layers (inspired by Zhou et al. 2024)
Flash Attention 3 with long-short sliding window attention pattern (inspired by Gemma 2) and window size warmup with YaRN
Align training batch starts with EoS and set a max document length
Accumulate gradients for 2 steps for embedding and lm_head before updating parameters
Single activation input for last 3 attention layers
Polar Express implementation in Muon
Smear module to enable 1 token look back
Sparse attention gate
NorMuon
Cautious Weight Decay w/ schedule tied to LR
Exponential decay of residual stream
Batch size schedule
Max seq length schedule
Partial Key Offset
Multi token prediction
Untie embed and lm_head at 2/3 of training
Additional gating on value embeddings and skip connection
Paired head attention
Bigram hash embedding
MUDD skip connections to residual stream and attention values
Learnable XSA

As well as many systems optimizations.
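Two of the simpler items in the list above, ReLU² and logit softcapping, can be sketched in a few lines. This is an illustration in scalar pure Python, not the repository's implementation. The softcap shown is the plain symmetric tanh form, swapped in for simplicity in place of the asymmetric rescale the README mentions, and the cap value is a hypothetical placeholder.

```python
import math


def relu_squared(x: float) -> float:
    """ReLU² activation: clamp negatives to zero, then square."""
    return max(0.0, x) ** 2


def softcap(logit: float, cap: float) -> float:
    """Symmetric tanh softcap: squashes any logit into the range (-cap, cap)."""
    return cap * math.tanh(logit / cap)


HYPOTHETICAL_CAP = 30.0  # illustrative value, not taken from the repository

for x in (-2.0, 0.5, 3.0):
    print(f"relu_squared({x}) = {relu_squared(x)}")

for z in (-100.0, 1.0, 100.0):
    print(f"softcap({z}) = {softcap(z, HYPOTHETICAL_CAP):.3f}")
```

Small logits pass through nearly unchanged, while large ones saturate just below the cap, which keeps extreme values from dominating the loss.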

Contributors list (growing with each new record): @bozavlado; @brendanh0gan; @fernbear.bsky.social; @Grad62304977; @jxbz; @kellerjordan0; @KoszarskyB; @leloykun; @YouJiacheng; @jadenj3o; @KonstantinWilleke, @alexrgilbert, @adricarda, @tuttyfrutyee, @vdlad; @ryanyang0, @vagrawal, @classiclarryd, @byronxu99, @varunneal, @EmelyanenkoK, @bernard24/https://www.hiverge.ai/, @Gusarich, @li_zichong, @akash5474, @snimu, @roeeshenberg, @ChrisJMcCormick, @dominikkallusky, @acutkosky, @manikbhandari, @andrewbriand, @jrauvola, @soren_dunn_, @photon_mz, @srashedll, @dhrvji, @EmmettBicker, @dualverse-ai, @sisovicm, @moof2x, @samacqua, @Lisennlp, @_djdumpling

---

Running the current record

To run the current record, run the following commands.

git clone https://github.com/KellerJordan/modded-nanogpt.git && cd modded-nanogpt
pip install -r requirements.txt
# downloads only the first 900M training tokens to save time
python data/cached_fineweb10B.py 9
./run.sh

If ./run.sh fails with the error torchrun: command not found, add torchrun to your PATH.
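One way to check for this before launching is to look the command up on PATH. The helper below is a small sketch; find_executable is a hypothetical name introduced here, and it relies only on the standard library's shutil.which.

```python
import shutil
from typing import Optional


def find_executable(name: str) -> Optional[str]:
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


path = find_executable("torchrun")
if path is None:
    # torchrun is typically installed alongside PyTorch, in the bin (or Scripts)
    # directory of the Python environment that contains torch.
    print("torchrun not found on PATH")
else:
    print(f"torchrun found at {path}")
```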

Note: torch.compile will add around 7 minutes of latency the first time you run the code.

Official records are timed on 8 NVIDIA H100 GPUs from https://app.primeintellect.ai/. PrimeIntellect has generously sponsored recent validation runs.

Alternative: Running with Docker (recommended for precise timing)

For cases where the CUDA or NCCL versions aren't compatible with your current system setup, Docker can be a helpful alternative. This approach standardizes the versions of CUDA, NCCL, cuDNN, and Python, reducing dependency issues and simplifying setup. It is useful when only the NVIDIA driver and Docker are available. Note: an NVIDIA driver must already be installed on the system.

git clone https://github.com/KellerJordan/modded-nanogpt.git && cd modded-nanogpt
sudo docker build -t modded-nanogpt .
sudo docker run -it --rm --gpus all -v $(pwd):/modded-nanogpt modded-nanogpt python data/cached_fineweb10B.py 8
sudo docker run -it --rm --gpus all -v $(pwd):/modded-nanogpt modded-nanogpt sh run.sh

To get an interactive docker, you can use...

Excerpt shown — open the source for the full document.

Notability

notability 2.0/10

Routine fork, no notable traction.