Arcee AI, published Jun 23, 2025

Extending AFM-4.5B to 64k Context Length

Charles Goddard

June 23, 2025

From 4k to 64k context through aggressive experimentation, model merging, distillation, and a concerning amount of soup.

The other day Arcee finally announced the first of our from-scratch foundation models, AFM-4.5B. Learning to train a foundation model is a long and arduous journey, and there are many lessons that we will be sharing in the full tech report in the coming weeks. In the meantime, I wanted to pull back the curtain on one particular part of the training process: extending the context length. We extended AFM-4.5B from 4k to 64k context through aggressive experimentation, model merging, distillation, and a concerning amount of soup.

This post will be a pretty unflattering look at the raw meat of the experimental process and the various approaches we tried, eventually arriving at a final model that performs well on both short and long context tasks. Bon appétit.

Disclaimer: AFM-4.5B was recently introduced as a preview, with a full open-weight release (under a CC-BY-NC license) planned for early July. The evaluations presented here are based on our first checkpoint, captured immediately after the completion of our initial mid-training phase on June 3rd. The model continues to undergo additional training, including further pretraining, instruction tuning, and reinforcement learning. As a result, the benchmarks shared in this post reflect only these experiments and should be considered preliminary. Final benchmark results for the official release may differ as the model continues to improve.

Approaches to Long Context Training

Long-context training is a heavily researched topic, and there are many great publications whose shoulders we stood on. Here's a little reading list of papers we found particularly useful:

- SkyLadder: Better and Faster Pretraining via Context Window Scheduling
- ...d Extension of Large Languages Models
- How to Train Long-Context Language Models (Effectively) - "ProLong" models
- PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training
- LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation
- Data Engineering for Scaling Language Models to 128K Context

Of these, SkyLadder stood out with a very interesting finding: given a fixed token budget, models trained on shorter context lengths pretty consistently outperform models trained on longer context lengths on standard benchmarks. Their solution is a training schedule that gradually increases the context length, with excellent results.

Unfortunately, that's haaaaard to implement. I didn't want to modify torchtitan to handle variable context lengths. (Neither did anyone else on the team, so I don't feel too bad about it.) So we instead went for the more traditional approach of doing the initial training on a shorter context length and then extending it at the end.

Bits and pieces from all of these papers made their way into the final model in one way or another, which I'll get to in a bit.

A Quick Note on Evaluation

Evaluating the quality of a context-extended model is a bit tricky. There are great benchmarks for long-context tasks now (like NVIDIA's RULER). But looking just at these can be a little misleading; context extension typically degrades performance on short-context tasks significantly. Long context is great, but what's the point if the model becomes substantially dumber in the process?

In this post I'm going to focus on the short-context performance -- specifically MMLU and Big Bench Hard, as they were highly correlated with our full eval suite -- as a proxy for the amount of degradation. (Rest assured long-context performance was also evaluated. I just don't want to format six thousand tables right now.)

Part 1: Dry Run (Or, Putting Down Tracks While the Train is Moving)

We kicked off the initial pretraining of AFM-4.5B with a 4096-token context length. This was the sweet spot for a tasty MFU while still being long enough to capture some interesting dependencies. This ran for around 5 trillion tokens, at which point we did a small annealing run (decaying the learning rate to zero) to get a preview of the fully trained model.
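
For readers who haven't run one, an annealing run of this kind can be sketched as a learning-rate schedule that ramps down to zero over a fixed number of steps. The post only says the rate was decayed to zero; the linear shape, the function name `annealed_lr`, and the numbers in the usage line are assumptions for illustration, not the actual AFM-4.5B settings.

```python
def annealed_lr(step: int, anneal_steps: int, start_lr: float) -> float:
    """Linearly decay the learning rate from start_lr to zero.

    Assumption: linear decay. The post does not say which decay shape was used.
    """
    # Fraction of the annealing run completed, clamped to [0, 1].
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return start_lr * (1.0 - frac)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    for step in (0, 250, 500, 750, 1000):
        print(step, annealed_lr(step, anneal_steps=1000, start_lr=3e-4))
```

The checkpoint at the end of such a run gives a stable preview of model quality, while the main run can continue from the pre-anneal checkpoint at its original learning rate.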
We then continued the pretrain with an additional 1.5T tokens from our main corpus, before shifting gears for the final 10% of the run. Here we mixed in a more targeted distribution of math, code, and reasoning samples while maintaining a roughly 1:1 split with general web data to avoid catastrophic forgetting (the kids are calling it "midtraining" now, I believe).

While the midtrain was running, I took the opportunity to start testing context extension strategies on the annealed model. This was a perfect dry run for the final model.

The annealed checkpoint scored 65.4% on MMLU and 43.5% on BBH. Already great for its size and training stage. The goal: keep these numbers as high as possible while stretching the usable context window.

Step 1: Positional Embeddings

Pretty much all approaches to long-context training start with modifying the positional embeddings. There are many approaches here -- linear interpolation, NTK-aware, YaRN, LongRoPE, LongRoPE2, probably more that I'm forgetting. Each has production deployments and compelling arguments. Fortunately there's a handy heuristic we can use to pick one: which does DeepSeek use? And for DSv3 (and consequently R1), that's YaRN. So I went with YaRN, copying their parameters for a 4k->128k extension.

As you might expect, just applying YaRN made things look pretty gnarly. MMLU dropped to 55.0% and BBH to 36.0%. This is typical. Training is always needed to recover performance after such a dramatic expansion, and this is where the fun begins.

Step 2: Training Experiments

Experiment 1: PoSE

PoSE is tantalizing because it promises to teach long context by training on short sequences, just with position IDs sampled from the target range. Being able to get to 128k without touching 128k-length data would be a huge win.

The first experiment I ran was applying PoSE to the annealed model with an 8192 sequence length and a 128k target length. Unfortunately, it didn't work so great. Post-PoSE, MMLU...
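
The PoSE idea described above (short sequences, position IDs drawn from the target range) can be sketched in a few lines. This is a minimal two-chunk variant written for illustration: the function name `pose_position_ids`, the choice of exactly two chunks, and keeping the first chunk at position 0 are assumptions, not the configuration used in the experiment.

```python
import random


def pose_position_ids(train_len: int, target_len: int, rng=random) -> list[int]:
    """Sample position IDs for one training sequence, PoSE-style.

    The sequence is split into two chunks. The first keeps its natural
    positions; the second is shifted by a random skip so that, across many
    samples, relative distances up to target_len are seen during training.
    """
    # Random boundary between the two chunks (each chunk gets >= 1 token).
    split = rng.randint(1, train_len - 1)
    # Random skip, bounded so the largest position stays below target_len.
    skip = rng.randint(0, target_len - train_len)
    first = list(range(split))
    second = [pos + skip for pos in range(split, train_len)]
    return first + second


if __name__ == "__main__":
    # 8192-token training sequences, 128k target window (as in the experiment).
    ids = pose_position_ids(8192, 128 * 1024, random.Random(0))
    print(len(ids), ids[0], ids[-1])
```

The tokens themselves are untouched; only the position IDs fed to the rotary embedding change, which is why the compute cost stays that of an 8192-token sequence.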

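
Going back to Step 1, here is a rough sketch of what "applying YaRN" does to the RoPE frequencies. High-frequency dimensions are left alone, low-frequency dimensions are divided by the scale factor, and the dimensions in between are blended with a linear ramp. The default values below (`factor=32` for 4k->128k, `beta_fast=32`, `beta_slow=1`, `head_dim=128`) are illustrative assumptions and are not taken from the AFM-4.5B or DeepSeek configs.

```python
import math


def yarn_inv_freq(head_dim: int = 128, base: float = 10000.0,
                  factor: float = 32.0, orig_ctx: int = 4096,
                  beta_fast: float = 32.0, beta_slow: float = 1.0) -> list[float]:
    """Return YaRN-adjusted RoPE inverse frequencies for one attention head."""

    def correction_dim(num_rotations: float) -> float:
        # Index of the dimension that completes `num_rotations` full turns
        # over the original context window.
        return (head_dim * math.log(orig_ctx / (num_rotations * 2 * math.pi))
                / (2 * math.log(base)))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), head_dim - 1)
    if high == low:
        high += 1  # keep the ramp well-defined

    inv_freq = []
    for i in range(head_dim // 2):
        extrapolated = 1.0 / (base ** (2 * i / head_dim))  # original RoPE
        interpolated = extrapolated / factor               # stretched positions
        # ramp = 0 keeps the original frequency, ramp = 1 fully interpolates.
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)
        inv_freq.append(extrapolated * (1.0 - ramp) + interpolated * ramp)
    return inv_freq


def yarn_attention_scale(factor: float) -> float:
    """Scaling applied to the rotary cos/sin terms, as suggested in the YaRN paper."""
    return 0.1 * math.log(factor) + 1.0


if __name__ == "__main__":
    freqs = yarn_inv_freq()
    print(freqs[0], freqs[-1], yarn_attention_scale(32.0))
```

Because the lowest-frequency dimensions are rescaled abruptly, the model sees rotation angles it was never trained on, which is consistent with the benchmark drop reported before any recovery training.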