What does this model signal mean?

Arcee AI published arcee-ai/AFM-4.5B-Base-KDA-Only. This model signal is evidence of what shipped and how the release is positioned. High-signal details: license apache-2.0 · 46 HF downloads · Arcee AI's 4.5B-parameter base model trained using knowledge distillation. onlylabs links this event to 1 captured evidence page and 6 related model signals.

Arcee AI Model: arcee-ai/AFM-4.5B-Base-KDA-Only

Captured source

Hugging Face/huggingface.co/arcee-ai/AFM-4.5B-Base-KDA-Only

arcee-ai/AFM-4.5B-Base-KDA-Only model card

published Dec 14, 2025 · seen Jun 6 · captured Jun 11 · http 200 · method plain · task feature-extraction · license apache-2.0 · library transformers · params 5B · downloads 46 · likes 12

AFM-4.5B-Base-KDA-Only

A research variant of AFM-4.5B-Base where all attention layers have been replaced with Kimi Delta Attention (KDA) through knowledge distillation. This model contains no full-attention layers.

> ⚠️ Research Model: This is an experimental model released for research purposes. For production use, see AFM-4.5B.

More details available in our blog post here: https://www.arcee.ai/blog/distilling-kimi-delta-attention-into-afm-4-5b-and-the-tool-we-used-to-do-it

Overview

This model explores whether full attention can be completely replaced with linear attention mechanisms. Using DistillKit, we distilled the original AFM-4.5B-Base (teacher) into a pure-KDA architecture (student).
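To make "linear attention" concrete, here is a minimal sketch of a delta-rule recurrence, the general family of update that delta-style attention builds on. It is a simplification under stated assumptions: a single head, no gating or decay, tiny dimensions, and unit-norm keys. The function names (`delta_rule_step`, `read_state`) are hypothetical, and this is not the KDA kernel used in the model.

```python
def delta_rule_step(state, key, value, beta=1.0):
    """One recurrent step: S <- S + beta * (v - S k) k^T.

    state is a d_v x d_k matrix (list of rows), key has length d_k,
    value has length d_v. The state shape never changes.
    """
    d_v, d_k = len(state), len(key)
    # What the current state would return for this key: S k
    predicted = [sum(state[i][j] * key[j] for j in range(d_k)) for i in range(d_v)]
    # Move the stored association toward the new value along the key direction
    return [
        [state[i][j] + beta * (value[i] - predicted[i]) * key[j] for j in range(d_k)]
        for i in range(d_v)
    ]


def read_state(state, query):
    """Retrieve the value associated with a query: S q."""
    return [sum(row[j] * query[j] for j in range(len(query))) for row in state]


if __name__ == "__main__":
    state = [[0.0, 0.0], [0.0, 0.0]]
    state = delta_rule_step(state, [1.0, 0.0], [2.0, 3.0])
    state = delta_rule_step(state, [0.0, 1.0], [5.0, 7.0])
    # Writing a new value under the first key overwrites the old association
    state = delta_rule_step(state, [1.0, 0.0], [9.0, 9.0])
    print(read_state(state, [1.0, 0.0]), read_state(state, [0.0, 1.0]))
```

The point of the sketch is that every token is folded into a matrix of fixed size, rather than being appended to a growing cache as in full attention.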

Key characteristics:

All 24 layers use KDA instead of full attention
Trained up to 32k sequence length
Linear memory scaling with sequence length
Smoother long-context degradation compared to hybrid architectures
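One way to read the memory characteristic above at decode time is to compare a full-attention KV cache, which grows with every token, against a fixed-size recurrent state. The back-of-the-envelope sketch below uses the 24 layers stated in the card and 2 bytes per value (bfloat16, as in the usage snippet); the head count and head dimensions are hypothetical placeholders, not the model's published configuration, and auxiliary buffers that a real implementation keeps are ignored.

```python
def kv_cache_bytes(seq_len, n_layers=24, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Decode-time KV cache for full attention: grows with every cached token."""
    # Two tensors (keys and values) per layer, each seq_len x n_kv_heads x head_dim.
    # n_kv_heads and head_dim are assumed values for illustration only.
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_value


def recurrent_state_bytes(n_layers=24, n_heads=8, key_dim=128, value_dim=128, bytes_per_value=2):
    """Recurrent state for linear attention: one key_dim x value_dim matrix per head."""
    # No seq_len argument: the state size does not depend on context length.
    return n_layers * n_heads * key_dim * value_dim * bytes_per_value


if __name__ == "__main__":
    state_mib = recurrent_state_bytes() / 2**20
    for seq_len in (4_096, 32_768, 65_536):
        kv_mib = kv_cache_bytes(seq_len) / 2**20
        print(f"{seq_len:>6} tokens: KV cache {kv_mib:8.1f} MiB, recurrent state {state_mib:5.1f} MiB")
```

Under these assumed dimensions the cache doubles whenever the context doubles, while the recurrent state stays the same size at every length.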

Architecture

| Component | Details |
|-----------|---------|
| Parameters | 4.5B |
| Attention Type | Kimi Delta Attention (All layers) |
| Positional Encoding | None (inherent to KDA) |
| Max Training Length | 32k tokens |
| Base Model | AFM-4.5B-Base |

Benchmark Results

Performance compared to the teacher model and hybrid configurations:

| Benchmark | Teacher (Full Attn) | KDA-Only |
|-----------|:-------------------:|:--------:|
| MMLU (Avg) | 63.1% | 55.8% |
| ARC-Challenge | 55.6% | 49.9% |
| HellaSwag (Norm) | 78.0% | 74.3% |
| GSM8K (Math) | 52.1% | 26.8% |
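To put the gaps in the table on a common scale, the short sketch below computes the absolute drop in percentage points and the fraction of the teacher's score that the KDA-Only student retains. The scores are copied from the table; the helper name `summarize` is hypothetical.

```python
# (teacher score, KDA-Only score) in percent, taken from the benchmark table
SCORES = {
    "MMLU (Avg)": (63.1, 55.8),
    "ARC-Challenge": (55.6, 49.9),
    "HellaSwag (Norm)": (78.0, 74.3),
    "GSM8K (Math)": (52.1, 26.8),
}


def summarize(scores):
    """Return (benchmark, drop in points, retained fraction), largest drop first."""
    rows = [
        (name, teacher - student, student / teacher)
        for name, (teacher, student) in scores.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, drop, retained in summarize(SCORES):
        print(f"{name:<18} drop {drop:5.1f} pts, retains {retained:.1%} of teacher")
```

Sorted this way, GSM8K stands out as the largest gap by a wide margin, while HellaSwag shows the smallest.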

Key Findings

Knowledge benchmarks: KDA-Only performs within statistical range of hybrid approaches on MMLU, ARC, and HellaSwag
Math performance: Larger drop on GSM8K compared to hybrid, though this may recover with longer training
Long-context behavior: Degrades more smoothly than hybrid models beyond training length—no cliff at 32k, just gradual falloff

Long-Context Performance (NIAH)

The pure-KDA model shows interesting long-context characteristics:

100% single-needle retrieval up to 65k (beyond training length!)
Multikey retrieval starts degrading at 4k, but does so smoothly
No sharp "cliff" like hybrid models exhibit past 32k

This behavior aligns with expectations for state-space-like architectures: a fixed hidden-state size is in inherent tension with a growing context, but the degradation is graceful.
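For readers unfamiliar with needle-in-a-haystack (NIAH) tests, a single-needle probe can be sketched as follows. The filler sentence, needle wording, scoring rule, and function names are illustrative assumptions; the card does not say which NIAH harness or prompts produced the results above.

```python
def build_niah_prompt(needle, question, n_filler, depth,
                      filler="The sky was clear and the market was quiet."):
    """Hide one needle sentence inside n_filler copies of a filler sentence.

    depth is a fraction in [0, 1]: 0.0 places the needle first, 1.0 places it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    sentences = [filler] * n_filler
    sentences.insert(int(depth * n_filler), needle)
    return " ".join(sentences) + "\n\n" + question


def is_retrieved(completion, expected):
    """Simplest possible scoring: does the expected answer appear in the output?"""
    return expected in completion


if __name__ == "__main__":
    prompt = build_niah_prompt(
        needle="The secret code is 7421.",
        question="What is the secret code?",
        n_filler=6,
        depth=0.5,
    )
    print(prompt)
```

Sweeping `n_filler` (context length) and `depth` (needle position) and recording `is_retrieved` for each model completion yields the kind of retrieval grid that NIAH results are usually reported from.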

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the tokenizer and the model in bfloat16, placing weights automatically
model_id = "arcee-ai/AFM-4.5B-Base-KDA-Only"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# This is a base model, so prompt it with text to continue rather than an instruction
prompt = "The theory of relativity states that"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

# Sample up to 100 new tokens with nucleus sampling
outputs = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Method: Knowledge distillation from AFM-4.5B-Base using DistillKit
Teacher: AFM-4.5B-Base (full attention)
Student Architecture: All layers converted to KDA
Training Length: 32k sequence length
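As a rough illustration of what knowledge distillation optimizes, the sketch below computes a KL divergence between teacher and student next-token distributions for a single position. This is a generic logit-distillation loss written in plain Python; it is an assumption for illustration, not DistillKit's API or the exact objective used for this checkpoint (the linked blog post covers the actual setup). All function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Penalize the student for diverging from the teacher's token distribution."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return kl_divergence(teacher_probs, student_probs)


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    print(distillation_loss(teacher, teacher))          # identical logits
    print(distillation_loss(teacher, [0.1, 1.0, 2.0]))  # mismatched logits
```

In a real training loop this quantity would be averaged over every token position in a batch and minimized with respect to the student's parameters only.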

Intended Use

This model is intended for:

Research into linear attention mechanisms
Studying attention distillation techniques
Exploring pure state-space-like architectures for language modeling
Benchmarking KDA vs full attention tradeoffs

Limitations

Lower math/reasoning performance compared to full attention
Not instruction-tuned
Research checkpoint—not optimized for production

License

AFM-4.5B is released under the Apache-2.0 license.

Notability

notability 2.0/10

Low downloads, minor model release