Model · Amazon (Nova) · published Mar 31, 2026 · seen 5d

amazon/GKA-primed-HQwen3-8B-Reasoner

published Mar 31, 2026 · seen 5d · captured 18h · http 200 · method plain · task text-generation · license apache-2.0 · library transformers · params 8.5B · downloads 3.9k · likes 3

GKA-primed-HQwen3-8B-Reasoner

GKA-primed-HQwen3-8B-Reasoner is a Hybrid language model consisting of 50% Attention layers and 50% Gated KalmaNet (GKA) layers, primed from Qwen3-8B using the Hybrid Model Factory Priming pipeline. The model is trained for long-context reasoning and supports context lengths of up to 128K tokens.

GKA (pronounced as gee-ka) is a State-Space Model layer inspired by the Kalman Filter that solves an online ridge regression problem at test time, with constant memory and linear compute cost in the sequence length.
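To make the "online ridge regression with constant memory" idea concrete, here is a minimal toy sketch in plain Python. It is not the GKA layer or its kernel: the 2-D keys, scalar values, the `LAMBDA` penalty, and all function names are assumptions chosen for illustration. It only shows that the regression can be updated token by token while the state stays the same size.

```python
# Toy online ridge regression with a fixed-size state (not the GKA kernel).
# Keys are 2-D and values are scalars; every name here is illustrative.

LAMBDA = 0.1  # ridge penalty (assumed value)


def init_state():
    # H accumulates sum of k k^T (2x2); s accumulates sum of v * k (length 2).
    return {"H": [[0.0, 0.0], [0.0, 0.0]], "s": [0.0, 0.0]}


def update(state, k, v):
    # Constant work and memory per token: the state never grows.
    for i in range(2):
        state["s"][i] += v * k[i]
        for j in range(2):
            state["H"][i][j] += k[i] * k[j]


def readout(state, q):
    # Solve (H + LAMBDA * I) x = q in closed form for the 2x2 case,
    # then return s . x, the ridge-regression prediction for query q.
    a = state["H"][0][0] + LAMBDA
    b = state["H"][0][1]
    c = state["H"][1][0]
    d = state["H"][1][1] + LAMBDA
    det = a * d - b * c
    x = [(d * q[0] - b * q[1]) / det, (a * q[1] - c * q[0]) / det]
    return state["s"][0] * x[0] + state["s"][1] * x[1]
```

Streaming pairs generated by a fixed linear map (for example `v = 2 * k[0] - k[1]`) through `update` and then calling `readout` recovers that map up to the small bias introduced by the ridge penalty, regardless of how many tokens were seen.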

By combining Attention with GKA, our Hybrid model achieves up to 2× faster inference at long contexts while closely matching the base Transformer's quality.

Why Hybrid?

Each Primed Hybrid model is initialized from a base Transformer by converting a portion of its Attention layers into State-Space Model (SSM) layers that maintain a fixed-size recurrent state instead of a growing KV cache. At a 50% Hybrid ratio, roughly half the KV cache (which grows linearly with sequence length) is replaced with fixed-size SSM state. The practical benefits:

  • Higher throughput at long contexts — less memory on KV cache means more memory for batching
  • More concurrent sequences — ~2× as many concurrent sequences before hitting memory limits
  • Growing advantage with context length — at long contexts, Attention dominates the forward pass while SSM layers remain negligible in cost. Since the Hybrid model makes roughly half as many Attention calls as the base Transformer, the throughput advantage grows with context length

Increasing the hybridization ratio (that is, replacing more Attention layers with SSM layers) further reduces memory use and increases throughput, typically at the expense of model quality.
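The KV cache saving can be estimated with back-of-the-envelope arithmetic. The sketch below assumes 8 KV heads with a head dimension of 128 (values taken from the public Qwen3-8B configuration, not from this card), bfloat16 storage, and "128K" read as 131,072 tokens. It ignores the fixed-size SSM state, whose size is not given here.

```python
def kv_cache_bytes(attn_layers, tokens, kv_heads=8, head_dim=128, bytes_per_elem=2):
    # Factor 2 covers keys and values; bytes_per_elem=2 matches bfloat16.
    # kv_heads and head_dim are assumed values, not stated in this card.
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_elem * tokens


CONTEXT = 128 * 1024  # assumed meaning of "128K"

base = kv_cache_bytes(36, CONTEXT)    # all 36 layers are Attention
hybrid = kv_cache_bytes(18, CONTEXT)  # 18 Attention layers remain at a 50% ratio

print(base / 2**30, hybrid / 2**30)   # GiB of KV cache per sequence
```

Under these assumptions a single 128K-token sequence needs about 18 GiB of KV cache in the base model and about 9 GiB in the 50% Hybrid, which is where the roughly 2× headroom for concurrent sequences comes from.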

Model Overview

  • Type: Causal Language Model (Hybrid Attention + SSM)
  • Base Model: Qwen3-8B
  • Hybrid Layer Type: Gated KalmaNet (GKA)
  • Hybrid Ratio: 50% (18 Attention + 18 GKA layers)
  • Parameters: ~8B
  • Context Length: 128K natively
  • Precision: bfloat16
  • License: Apache 2.0

Benchmark Results

We consider the following Transformer as a baseline:

1. Qwen3-8B (thinking, from HF): The original Qwen model evaluated in thinking mode, which is the intended mode for reasoning tasks. This serves as the base Transformer from which we start the Priming procedure.

Reasoning Benchmarks

We evaluate on math reasoning (AIME24/25), science (GPQA), coding (LiveCodeBench-v5, SciCode), tool-calling (BFCLv3/v4), and instruction-following (IFBench), using the Nemo Evaluator SDK. The evaluation configuration examples/evaluation/nemo_reasoning_evals.yaml is provided for reproducibility. All evaluations use a 64K generation length.

| Model | AIME24 | AIME25 | GPQA | LiveCodeBench-v5 | BFCLv4 (minus web-search) | BFCLv3 | IFBench | SciCode | Average |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-8B (thinking, from HF) | 78.67 | 71.0 | 57.77 | 57.94 | 68.30 | 66.46 | 31.60 | 10.63 | 55.29 |
| GKA-primed-HQwen3-8B-Reasoner | 82.00 | 73.67 | 61.81 | 63.10 | 66.47 | 62.20 | 38.96 | 6.41 | 56.82 |
| GDN-primed-HQwen3-8B-Reasoner | 82.00 | 73.33 | 61.49 | 62.94 | 63.27 | 57.44 | 37.80 | 2.50 | 55.10 |

*For BFCLv4, we remove the web-search subtask and weight each task by the number of entries (test examples) for that task.*
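The entry-weighted aggregation described in the footnote can be sketched as follows. The subtask scores and entry counts below are hypothetical placeholders, not BFCLv4 data; they only show how weighting by the number of test examples differs from a plain average.

```python
def weighted_score(subtasks):
    # subtasks: list of (score, num_entries) pairs.
    total_entries = sum(n for _, n in subtasks)
    return sum(score * n for score, n in subtasks) / total_entries


# Hypothetical scores and entry counts, for illustration only.
example = [(80.0, 100), (60.0, 300), (50.0, 100)]

print(weighted_score(example))                    # entry-weighted score
print(sum(s for s, _ in example) / len(example))  # unweighted mean, for contrast
```

Large subtasks pull the weighted score toward their own result, so the two numbers diverge whenever subtask sizes are uneven.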

How close are the Hybrid models to the Transformer baseline on complex reasoning tasks? Our Primed Hybrid models are competitive with the Qwen3-8B (thinking, from HF) model despite [...]

> [!NOTE]
> Interestingly, setting num_iter=0 effectively converts the GKA model to a Gated Linear Attention (GLA) model. Thus, one can think of increasing num_iter as improving upon the initial solution of the GLA model.
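One way to picture the role of num_iter is as the number of refinement steps of an iterative linear solve. The toy sketch below uses plain Richardson iteration started from the query itself; the solver choice, the initialization, the `step` size, and all names are assumptions for illustration, since this card does not specify the solver the GKA layer uses. With zero iterations the readout reduces to a linear-attention-style product of the state with the query, and more iterations move it toward the ridge-regression solution.

```python
def solve_iteratively(A, q, num_iter, step):
    # Richardson iteration for A x = q, starting from the initial guess x = q.
    n = len(q)
    x = list(q)
    for _ in range(num_iter):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] + step * (q[i] - Ax[i]) for i in range(n)]
    return x


def iterative_readout(S, A, q, num_iter, step=0.5):
    # S plays the role of the accumulated value-key state (one output channel),
    # A plays the role of the regularized key covariance (H + lambda * I).
    x = solve_iteratively(A, q, num_iter, step)
    return sum(S[i] * x[i] for i in range(len(q)))


# Small, well-conditioned toy inputs (hypothetical values).
A = [[1.5, 0.2], [0.2, 1.0]]
S = [1.0, 2.0]
q = [1.0, 1.0]

print(iterative_readout(S, A, q, num_iter=0))   # plain S . q
print(iterative_readout(S, A, q, num_iter=30))  # close to S . A^{-1} q
```

The step size must be small enough for the iteration to converge on the given matrix; 0.5 works for this toy `A` but is not a general-purpose choice.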

Inference Efficiency

Sustained decode throughput (tokens/s) on 8× H200 GPUs (TP=8), measured during pure decode with a saturated KV cache. Benchmarked with random data (no prefix-caching benefits). See the full Inference guide for methodology and additional models.

| Model | 16K | 32K | 64K | 128K |
|---|---|---|---|---|
| GKA-primed-HQwen3-8B-Reasoner (num_iter=30, default) | 15,892 (1.78×) | 9,159 (1.77×) | 5,173 (1.89×) | 2,736 (2.23×) |
| GKA-primed-HQwen3-8B-Reasoner (num_iter=10) | 17,261 (1.93×) | 9,668 (1.87×) | 5,359 (1.96×) | 2,801 (2.28×) |
| GKA-primed-HQwen3-8B-Reasoner (num_iter=5) | 17,606 (1.97×) | 9,770 (1.89×) | 5,399 (1.97×) | 2,811 (2.29×) |
| GKA-primed-HQwen3-8B-Reasoner (num_iter=1) | 17,485 (1.95×) | 9,780 (1.89×) | 5,413 (1.98×) | 2,812 (2.29×) |
| GDN-primed-HQwen3-8B | 17,479 (1.95×) | 10,080 (1.95×) | 5,521 (2.01×) | 2,863 (2.33×) |
| Qwen3-8B (thinking, from HF) | 8,951 | 5,174 | 2,740 | 1,227 |
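The multipliers in parentheses are consistent with each model's throughput divided by the Qwen3-8B baseline at the same context length. A short check using the default GKA row and the baseline row from the table above (variable names are illustrative):

```python
CONTEXTS = ["16K", "32K", "64K", "128K"]
BASELINE = [8951, 5174, 2740, 1227]      # Qwen3-8B (thinking, from HF), tokens/s
GKA_DEFAULT = [15892, 9159, 5173, 2736]  # GKA-primed, num_iter=30, tokens/s


def speedups(model, baseline):
    # Ratio of model throughput to baseline throughput, rounded as in the table.
    return [round(m / b, 2) for m, b in zip(model, baseline)]


print(dict(zip(CONTEXTS, speedups(GKA_DEFAULT, BASELINE))))
```

The recomputed ratios match the reported 1.78×, 1.77×, 1.89×, and 2.23×.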

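Mean time to first token (TTFT) at the Transformer's saturated batch size, where the Hybrid model still has memory to spare. Multipliers in parentheses are relative to the Qwen3-8B baseline, so values below 1× indicate a lower (better) TTFT than the baseline: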

| Model | 16K | 32K | 64K | 128K |
|---|---|---|---|---|
| GKA-primed-HQwen3-8B-Reasoner (num_iter=30, default) | 35,013 ms (1.26×) | 38,502 ms (1.18×) | 44,893 ms (1.06×) | 53,606 ms (0.85×) |
| GKA-primed-HQwen3-8B-Reasoner (num_iter=10) | 33,008 ms (1.19×) | 36,334 ms (1.11×) | 42,076 ms (0.99×) | 51,404 ms (0.82×) |
| GKA-primed-HQwen3-8B-Reasoner (num_iter=5) | 32,318 ms (1.17×) | 35,690 ms (1.09×) | 41,490 ms (0.98×) | 50,752 ms (0.81×) |
| GKA-primed-HQwen3-8B-Reasoner (num_iter=1) | 31,741 ms (1.14×) | 35,716 ms (1.09×) | 39,963 ms (0.94×) | 50,232 ms (0.80×) |
| GDN-primed-HQwen3-8B | 27,805 ms (1.00×) | 30,975 ms (0.95×) | 36,151 ms (0.85×) | 46,389 ms (0.74×) |
| Qwen3-8B (thinking, from HF) | 27,736 ms | 32,661 ms | 42,462 ms | 62,922 ms |

The decode throughput advantage grows with context length —...

Excerpt shown — open the source for the full document.

Notability

notability 6.0/10

Amazon releases reasoning model, moderate traction