amazon/GKA-primed-HQwen3-32B-Reasoner
GKA-primed-HQwen3-32B-Reasoner is a Hybrid language model consisting of 50% Attention layers and 50% Gated KalmaNet (GKA) layers, primed from Qwen3-32B using the Hybrid Model Factory Priming pipeline. The model is trained for long-context reasoning and supports context lengths of 128K tokens.
GKA (pronounced as gee-ka) is a State-Space Model layer inspired by the Kalman Filter that solves an online ridge regression problem at test time, with constant memory and linear compute cost in the sequence length.
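To make the "online ridge regression with constant memory" idea concrete, here is a minimal pure-Python sketch. It is not the GKA layer: it omits gating, uses scalar targets, and solves the linear system with plain Gaussian elimination, which is an assumption made for readability rather than the solver the model uses. The function names (`make_state`, `update`, `predict`) are hypothetical.

```python
def make_state(d, lam):
    """Fixed-size state: A = lam*I + sum_t k_t k_t^T (d x d), b = sum_t v_t * k_t (d)."""
    A = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    return A, b


def update(state, k, v):
    """Absorb one (key, scalar value) pair; the state size does not grow with sequence length."""
    A, b = state
    d = len(b)
    for i in range(d):
        b[i] += v * k[i]
        for j in range(d):
            A[i][j] += k[i] * k[j]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (works on copies)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def predict(state, q):
    """Ridge solution w = A^{-1} b, then read out w . q for a query q."""
    A, b = state
    w = solve(A, b)
    return sum(wi * qi for wi, qi in zip(w, q))


# Stream three (key, value) pairs generated by the true weights [2, -1].
state = make_state(2, 1e-9)
for k, v in [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]:
    update(state, k, v)
print(predict(state, [3.0, 4.0]))  # close to 2.0, i.e. 2*3 + (-1)*4
```

Each update costs a fixed amount of work for a given `d`, so total compute is linear in the number of tokens while memory stays constant, which is the property the description above attributes to GKA.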
By combining Attention with GKA, our Hybrid model achieves up to 2× faster inference at long contexts while closely matching the base Transformer's quality.
Why Hybrid?
Each Primed Hybrid model is initialized from a base Transformer by converting a portion of its Attention layers into State-Space Model (SSM) layers that maintain a fixed-size recurrent state instead of a growing KV cache. At a 50% Hybrid ratio, roughly half the KV cache (which grows linearly with sequence length) is replaced with fixed-size SSM state. The practical benefits:
- Higher throughput at long contexts — less memory on KV cache means more memory for batching
- More concurrent sequences — ~2× as many concurrent sequences before hitting memory limits
- Growing advantage with context length — at long contexts, Attention dominates the forward pass while SSM layers remain negligible in cost. Since the Hybrid model makes roughly half as many Attention calls as the base Transformer, the throughput advantage grows with context length
Increasing the hybridization ratio (replacing more Attention layers with SSM layers) further reduces memory use and increases throughput, typically at the expense of model quality.
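The memory argument can be illustrated with a rough back-of-the-envelope calculation. The sketch below estimates per-sequence KV cache size for 64 versus 32 Attention layers in bfloat16. The layer counts and precision come from this card; `num_kv_heads=8` and `head_dim=128` are assumptions about the Qwen3-32B configuration and should be checked against the model's config file. The fixed-size GKA state is not included because its size is not stated here.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_attn_layers, seq_len, num_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes of K and V cache for one sequence.

    num_kv_heads and head_dim are ASSUMED values for Qwen3-32B;
    bytes_per_elem=2 corresponds to bfloat16.
    """
    # Factor 2 accounts for storing both keys and values.
    return 2 * num_attn_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len


seq_len = 128 * 1024  # 128K tokens
base = kv_cache_bytes(64, seq_len)    # all 64 layers are Attention
hybrid = kv_cache_bytes(32, seq_len)  # 32 Attention layers (SSM state excluded)
print(base / GIB, hybrid / GIB)  # 32.0 16.0
```

Under these assumptions the Hybrid model's KV cache is half the size at any sequence length, which is where the "~2× as many concurrent sequences" estimate comes from.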
Model Overview
- Type: Causal Language Model (Hybrid Attention + SSM)
- Base Model: Qwen3-32B
- Hybrid Layer Type: Gated KalmaNet (GKA)
- Hybrid Ratio: 50% (32 Attention + 32 GKA layers)
- Parameters: ~32B
- Context Length: 128K natively
- Precision: bfloat16
- License: Apache 2.0
Benchmark Results
We consider the following Transformer as a baseline:
1. Qwen3-32B (thinking, from HF): The original Qwen model evaluated in thinking mode, which is the intended mode for reasoning tasks. This serves as the base Transformer from which we start the Priming procedure.
Reasoning Benchmarks
We evaluate on math reasoning (AIME24/25), science (GPQA), coding (LiveCodeBench-v5, SciCode), tool-calling (BFCLv3/v4), and instruction-following (IFBench) using the NeMo Evaluator SDK. The evaluation configuration examples/evaluation/nemo_reasoning_evals.yaml is provided for reproducibility. All evaluations use a 64K generation length.
| Model | AIME24 | AIME25 | GPQA | LiveCodeBench-v5 | BFCLv4 (minus web-search) | BFCLv3 | IFBench | SciCode | Average |
|-------|--------|--------|------|------------------|---------------------------|--------|---------|---------|---------|
| Qwen3-32B (thinking, from HF) | 86.33 | 70.00 | 65.40 | 64.44 | 69.30 | 69.57 | 32.61 | 15.94 | 59.20 |
| GKA-primed-HQwen3-32B-Reasoner | 87.67 | 81.67 | 67.30 | 70.24 | 70.14 | 66.34 | 48.22 | 12.34 | 62.99 |
*For BFCLv4, we remove the web-search subtask and weight each task by the number of entries (test examples) for that task.*
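The entry-weighted aggregation described in this footnote can be sketched as follows. The subtask names, scores, and entry counts in the usage example are made-up placeholders, not BFCLv4 data, and `entry_weighted_average` is a hypothetical helper name.

```python
def entry_weighted_average(scores, counts):
    """Average subtask scores, weighting each by its number of test entries."""
    total = sum(counts[task] for task in scores)
    return sum(scores[task] * counts[task] for task in scores) / total


# Placeholder numbers for illustration only (NOT real BFCLv4 results).
scores = {"task_a": 80.0, "task_b": 50.0}
counts = {"task_a": 300, "task_b": 100}
print(entry_weighted_average(scores, counts))  # 72.5
```

With these placeholder values the unweighted mean would be 65.0, so weighting by entry count shifts the aggregate toward the larger subtask.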
How close is the Hybrid model to the Transformer baseline on complex reasoning tasks? Our Primed GKA Hybrid outperforms the Qwen3-32B (thinking, from HF) baseline by ~3.8 points on average, despite using [...]

> [!NOTE]
> Interestingly, setting num_iter=0 effectively converts the GKA model to a Gated Linear Attention (GLA) model. Thus, one can think of increasing num_iter as improving upon the initial solution of the GLA model.
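As a quick check on the ~3.8-point gap quoted in this section, the Average column appears to be the unweighted mean of the eight benchmark scores. The short sketch below recomputes it from the table values; the unweighted-mean interpretation is an inference from the numbers, not something the card states.

```python
# Scores copied from the benchmark table, in column order:
# AIME24, AIME25, GPQA, LiveCodeBench-v5, BFCLv4, BFCLv3, IFBench, SciCode
qwen = [86.33, 70.00, 65.40, 64.44, 69.30, 69.57, 32.61, 15.94]
gka = [87.67, 81.67, 67.30, 70.24, 70.14, 66.34, 48.22, 12.34]


def mean(xs):
    """Unweighted arithmetic mean."""
    return sum(xs) / len(xs)


gap = mean(gka) - mean(qwen)
print(f"{mean(qwen):.2f} {mean(gka):.2f} {gap:.1f}")  # 59.20 62.99 3.8
```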
Inference Efficiency
Sustained decode throughput (tokens/s) on 8× H200 GPUs (TP=8), measured during pure decode with a saturated KV cache. Benchmarked with random data (no prefix-caching benefits). See the full Inference guide for methodology and additional models.
| Model | 16K | 32K | 64K | 128K |
|-------|-----|-----|-----|------|
| GKA-primed-HQwen3-32B-Reasoner (num_iter=30, default) | 6,810 (1.29×) | 4,152 (1.45×) | 2,385 (1.82×) | 1,168 (1.99×) |
| GKA-primed-HQwen3-32B-Reasoner (num_iter=10) | 7,778 (1.47×) | 4,534 (1.58×) | 2,537 (1.94×) | 1,200 (2.05×) |
| GKA-primed-HQwen3-32B-Reasoner (num_iter=5) | 8,039 (1.52×) | 4,621 (1.61×) | 2,569 (1.96×) | 1,206 (2.06×) |
| GKA-primed-HQwen3-32B-Reasoner (num_iter=1) | 8,177 (1.54×) | 4,678 (1.63×) | 2,593 (1.98×) | 1,210 (2.06×) |
| GDN-primed-HQwen3-32B | 8,133 (1.53×) | 4,876 (1.70×) | 2,688 (2.06×) | 1,238 (2.11×) |
| Qwen3-32B (thinking, from HF) | 5,299 | 2,865 | 1,308 | 586 |
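The ratios in parentheses in the decode table are consistent with dividing each model's throughput by the Qwen3-32B baseline at the same context length. A small sketch reproducing the default (num_iter=30) row, using only numbers from the table; the helper name `speedups` is hypothetical:

```python
# Decode throughput in tokens/s, copied from the table above.
baseline = {"16K": 5299, "32K": 2865, "64K": 1308, "128K": 586}
gka_default = {"16K": 6810, "32K": 4152, "64K": 2385, "128K": 1168}


def speedups(model, base):
    """Throughput ratio of a model over the baseline, per context length."""
    return {ctx: round(model[ctx] / base[ctx], 2) for ctx in base}


print(speedups(gka_default, baseline))
# {'16K': 1.29, '32K': 1.45, '64K': 1.82, '128K': 1.99}
```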
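Mean time to first token (TTFT) at the Transformer's saturated batch size (the Hybrid model has memory to spare). Ratios in parentheses are relative to the Qwen3-32B baseline, so values below 1× indicate a lower TTFT than the baseline: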
| Model | 16K | 32K | 64K | 128K |
|-------|-----|-----|-----|------|
| GKA-primed-HQwen3-32B-Reasoner (num_iter=30, default) | 52,053 ms (1.32×) | 58,613 ms (1.21×) | 68,241 ms (1.05×) | 84,935 ms (0.90×) |
| GKA-primed-HQwen3-32B-Reasoner (num_iter=10) | 48,560 ms (1.23×) | 55,039 ms (1.13×) | 64,766 ms (0.99×) | 81,410 ms (0.86×) |
| GKA-primed-HQwen3-32B-Reasoner (num_iter=5) | 47,958 ms (1.22×) | 54,320 ms (1.12×) | 63,826 ms (0.98×) | 80,369 ms (0.85×) |
| GKA-primed-HQwen3-32B-Reasoner (num_iter=1) | 46,726 ms (1.19×) | 53,061 ms (1.09×) | 62,645 ms (0.96×) | 79,321 ms (0.84×) |
| GDN-primed-HQwen3-32B | 42,492 ms (1.08×) | 48,417 ms (1.00×) | 57,525 ms (0.88×) | 73,145 ms (0.77×) |
| Qwen3-32B (thinking, from HF) | 39,421 ms | 48,527 ms | 65,104 ms | 94,479 ms |
The decode throughput advantage grows with context length — from 1.29× at 16K to 1.99× at 128K (2.06× with num_iter=1) — thanks to GKA layers...