Model · Arcee AI · published Dec 14, 2025

arcee-ai/AFM-4.5B-Base-KDA-Only

task image-feature-extraction · license apache-2.0 · library transformers · params 5B · downloads 22 · likes 12

AFM-4.5B-Base-KDA-Only

A research variant of AFM-4.5B-Base where all attention layers have been replaced with Kimi Delta Attention (KDA) through knowledge distillation. This model contains no full-attention layers.

> ⚠️ Research Model: This is an experimental model released for research purposes. For production use, see AFM-4.5B.

More details available in our blog post here: https://www.arcee.ai/blog/distilling-kimi-delta-attention-into-afm-4-5b-and-the-tool-we-used-to-do-it

Overview

This model explores whether full attention can be completely replaced with linear attention mechanisms. Using DistillKit, we distilled the original AFM-4.5B-Base (teacher) into a pure-KDA architecture (student).

Key characteristics:

  • All 24 layers use KDA instead of full attention
  • Trained up to 32k sequence length
  • Linear memory scaling with sequence length
  • Smoother long-context degradation compared to hybrid architectures

Architecture

| Component | Details |
|-----------|---------|
| Parameters | 4.5B |
| Attention Type | Kimi Delta Attention (All layers) |
| Positional Encoding | None (inherent to KDA) |
| Max Training Length | 32k tokens |
| Base Model | AFM-4.5B-Base |
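To make the "fixed-size state" idea behind delta-rule attention concrete, here is a minimal single-head sketch of a gated delta-rule recurrence in plain Python. It is an illustration under stated assumptions, not the KDA layer itself: the function names, the toy dimensions, and the constant `beta`/`alpha` values are made up for the example, whereas a real layer learns its gates, uses many heads, and runs as a chunked GPU kernel.

```python
def delta_step(S, k, v, beta=1.0, alpha=None):
    """One recurrent update of a d_v x d_k state matrix; returns the new state.

    Simplified gated delta rule (illustrative only):
      1. decay each key channel j of the state by alpha[j]
      2. read what the state currently predicts for key k
      3. correct the state toward v along k, scaled by beta
    """
    d_v, d_k = len(S), len(S[0])
    if alpha is None:
        alpha = [1.0] * d_k  # no forgetting
    S = [[S[i][j] * alpha[j] for j in range(d_k)] for i in range(d_v)]
    pred = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    return [[S[i][j] + beta * (v[i] - pred[i]) * k[j] for j in range(d_k)]
            for i in range(d_v)]


def read(S, q):
    """Query the state: returns S @ q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


# Toy setup: a 2 x 2 state can hold two orthogonal keys without interference.
S0 = [[0.0, 0.0], [0.0, 0.0]]
k1, v1 = [1.0, 0.0], [3.0, 4.0]
k2, v2 = [0.0, 1.0], [5.0, 6.0]
S2 = delta_step(delta_step(S0, k1, v1), k2, v2)
print(read(S2, k1), read(S2, k2))  # both values are recovered exactly

# A third, non-orthogonal key must share the same 2 x 2 state.
k3, v3 = [0.6, 0.8], [9.0, 9.0]
S3 = delta_step(S2, k3, v3)
print(read(S3, k3))  # the newest association is stored
print(read(S3, k1))  # the oldest one is now distorted
```

The state never grows beyond `d_v x d_k`, however many tokens are processed. Once more associations are written than the state can keep apart, older ones are overwritten gradually rather than dropped all at once.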

Benchmark Results

Performance compared to the full-attention teacher model:

| Benchmark | Teacher (Full Attn) | KDA-Only |
|-----------|:-------------------:|:--------:|
| MMLU (Avg) | 63.1% | 55.8% |
| ARC-Challenge | 55.6% | 49.9% |
| HellaSwag (Norm) | 78.0% | 74.3% |
| GSM8K (Math) | 52.1% | 26.8% |
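The gaps in the table are easier to compare as absolute drops and as the fraction of the teacher score that the student retains. The short script below computes both from the numbers above; the helper name `summarize` is just for this example.

```python
# Scores copied from the benchmark table above, in percent: (teacher, KDA-only).
scores = {
    "MMLU (Avg)": (63.1, 55.8),
    "ARC-Challenge": (55.6, 49.9),
    "HellaSwag (Norm)": (78.0, 74.3),
    "GSM8K (Math)": (52.1, 26.8),
}


def summarize(scores):
    """Map each benchmark to (drop in points, fraction of teacher score retained)."""
    summary = {}
    for name, (teacher, student) in scores.items():
        summary[name] = (round(teacher - student, 1), round(student / teacher, 3))
    return summary


summary = summarize(scores)
for name, (drop, retained) in summary.items():
    print(f"{name:18s} drop={drop:5.1f} pts  retained={retained:.1%}")
```

On these numbers, GSM8K loses 25.3 points (about half of the teacher score), while the other three benchmarks lose between 3.7 and 7.3 points.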

Key Findings

  • Knowledge benchmarks: KDA-Only performs within statistical range of hybrid approaches on MMLU, ARC, and HellaSwag
  • Math performance: Larger drop on GSM8K compared to hybrid, though this may recover with longer training
  • Long-context behavior: Degrades more smoothly than hybrid models beyond training length—no cliff at 32k, just gradual falloff

Long-Context Performance (NIAH)

The pure-KDA model keeps single-needle retrieval intact past its training length and loses accuracy gradually rather than abruptly:

  • 100% single-needle retrieval up to 65k (beyond training length!)
  • Multi-key retrieval starts to degrade at 4k, but does so smoothly
  • No sharp "cliff" like hybrid models exhibit past 32k

This behavior aligns with expectations for state-space-like architectures: fixed hidden state size creates inherent tension with growing context, but degradation is graceful.
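For readers who want to probe this themselves, a single-needle test prompt can be built roughly as follows. This is an illustrative sketch, not the evaluation harness behind the results in this card: the function names, filler sentence, needle, and cue are invented for the example, and length is counted in filler sentences rather than tokens. Because the model is not instruction-tuned, the prompt ends with a phrase to complete instead of a question.

```python
def build_niah_prompt(filler, needle, cue, n_filler, depth):
    """Hide `needle` among `n_filler` copies of `filler`.

    `depth` is the relative position of the needle (0.0 = start, 1.0 = end).
    `cue` is a completion-style phrase appended at the end of the prompt.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    sentences = [filler] * n_filler
    sentences.insert(round(depth * n_filler), needle)
    return " ".join(sentences) + " " + cue


def needle_retrieved(completion, expected):
    """Score one sample: did the generated text contain the hidden value?"""
    return expected in completion


filler = "The sky is blue."
needle = "The secret code is 7421."
cue = "To repeat, the secret code is"
prompt = build_niah_prompt(filler, needle, cue, n_filler=10, depth=0.5)
print(prompt)
```

Sweeping `n_filler` and `depth`, generating a completion for each prompt, and averaging `needle_retrieved` gives a retrieval rate per context length and needle position.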

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "arcee-ai/AFM-4.5B-Base-KDA-Only"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # load weights in bfloat16
    device_map="auto",           # place the model on the available devices
)

# This is a base model: give it text to continue rather than an instruction.
prompt = "The theory of relativity states that"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

  • Method: Knowledge distillation from AFM-4.5B-Base using DistillKit
  • Teacher: AFM-4.5B-Base (full attention)
  • Student Architecture: All layers converted to KDA
  • Training Length: 32k sequence length

Intended Use

This model is intended for:

  • Research into linear attention mechanisms
  • Studying attention distillation techniques
  • Exploring pure state-space-like architectures for language modeling
  • Benchmarking KDA vs full attention tradeoffs

Limitations

  • Lower math/reasoning performance compared to full attention
  • Not instruction-tuned
  • Research checkpoint—not optimized for production

License

AFM-4.5B-Base-KDA-Only is released under the Apache-2.0 license.
