What does this model signal mean?

Amazon (Nova) published amazon/GKA-primed-HQwen3-32B-Reasoner. This model signal is evidence of what shipped on model infrastructure and how the release is positioned. High-signal details: license apache-2.0 · 14.2K HF downloads · Amazon releases fine-tuned 32B reasoning model, modest traction. onlylabs links this event to 1 captured evidence page and 6 related model signals.

Amazon (Nova) Model: amazon/GKA-primed-HQwen3-32B-Reasoner

Captured source


Hugging Face: huggingface.co/amazon/GKA-primed-HQwen3-32B-Reasoner

amazon/GKA-primed-HQwen3-32B-Reasoner model card


published Mar 31, 2026 · seen Jun 6 · captured Jun 11 · http 200 · method plain · task text-generation · license apache-2.0 · library transformers · params 34B · downloads 14k · likes 3

GKA-primed-HQwen3-32B-Reasoner

GKA-primed-HQwen3-32B-Reasoner is a Hybrid language model consisting of 50% Attention layers and 50% Gated KalmaNet (GKA) layers, primed from Qwen3-32B using the Hybrid Model Factory Priming pipeline. The model is trained for long-context reasoning and supports context lengths of 128K tokens.

GKA (pronounced as gee-ka) is a State-Space Model layer inspired by the Kalman Filter that solves an online ridge regression problem at test time, with constant memory and linear compute cost in the sequence length.
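To make "online ridge regression with constant memory" concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the GKA layer itself: it leaves out gating and any fading-memory weighting, and the class name `OnlineRidge`, the helper `solve`, and the regularizer `lam` are hypothetical. The state is two fixed-size matrices (the running sums of k kᵀ and v kᵀ), so memory does not grow with the number of tokens seen.

```python
def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting.
    M is assumed symmetric positive definite here, so pivots are nonzero."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / aug[i][i]
    return x


class OnlineRidge:
    """Hypothetical illustration: fixed-size state for streaming ridge regression."""

    def __init__(self, d_key, d_val, lam=1e-3):
        self.lam = lam
        self.kk = [[0.0] * d_key for _ in range(d_key)]  # running sum of k k^T
        self.vk = [[0.0] * d_key for _ in range(d_val)]  # running sum of v k^T

    def update(self, k, v):
        # O(d^2) work per token; the state shape never changes.
        for i in range(len(k)):
            for j in range(len(k)):
                self.kk[i][j] += k[i] * k[j]
        for i in range(len(v)):
            for j in range(len(k)):
                self.vk[i][j] += v[i] * k[j]

    def read(self, q):
        # Ridge solution W = VK (KK + lam*I)^-1, applied to the query q.
        d = len(q)
        M = [[self.kk[i][j] + (self.lam if i == j else 0.0) for j in range(d)]
             for i in range(d)]
        x = solve(M, q)
        return [sum(row[j] * x[j] for j in range(d)) for row in self.vk]


layer = OnlineRidge(d_key=2, d_val=2, lam=1e-6)
layer.update([1.0, 0.0], [2.0, 3.0])
layer.update([0.0, 1.0], [-1.0, 4.0])
print(layer.read([1.0, 0.0]))  # close to the value stored for the first key
```

Each `update` costs the same regardless of position in the sequence, which is where the linear total compute and constant memory come from in this simplified picture.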

By combining Attention with GKA, our Hybrid model achieves up to 2× faster inference at long contexts while closely matching the base Transformer's quality.

Why Hybrid?

Each Primed Hybrid model is initialized from a base Transformer by converting a portion of its Attention layers into State-Space Model (SSM) layers that maintain a fixed-size recurrent state instead of a growing KV cache. At a 50% Hybrid ratio, roughly half the KV cache (which grows linearly with sequence length) is replaced with fixed-size SSM state. The practical benefits:

Higher throughput at long contexts — less memory on KV cache means more memory for batching
More concurrent sequences — ~2× as many concurrent sequences before hitting memory limits
Growing advantage with context length — at long contexts, Attention dominates the forward pass while SSM layers remain negligible in cost. Since the Hybrid model makes roughly half as many Attention calls as the base Transformer, the throughput advantage grows with context length

Increasing the hybridization ratio (replacing more Attention layers with SSM layers) further reduces memory use and increases throughput, typically at the expense of model quality.
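The memory argument can be checked with back-of-the-envelope arithmetic. The sketch below assumes 8 key/value heads and a head dimension of 128 for Qwen3-32B; neither number is stated in this card, so treat both as assumptions. The layer counts (64 Attention layers in the base model, 32 in the Hybrid) follow from the 32 + 32 split listed in the Model Overview. The fixed-size SSM state is left out because its size is not given here.

```python
def kv_cache_bytes(seq_len, attn_layers, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV cache size for one sequence.

    kv_heads and head_dim are assumed values, not taken from the model card.
    bytes_per_elem=2 corresponds to bfloat16.
    """
    # Two tensors (K and V) per Attention layer, each seq_len x kv_heads x head_dim.
    return 2 * attn_layers * seq_len * kv_heads * head_dim * bytes_per_elem


GIB = 1024 ** 3
seq_len = 128 * 1024  # 128K tokens

base_gib = kv_cache_bytes(seq_len, attn_layers=64) / GIB
hybrid_gib = kv_cache_bytes(seq_len, attn_layers=32) / GIB
print(base_gib, hybrid_gib)  # 32.0 16.0
```

Under these assumptions a single 128K-token sequence needs about 32 GiB of KV cache in the base Transformer and about 16 GiB in the 50% Hybrid, which is the source of the "~2× as many concurrent sequences" estimate.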

Model Overview

Type: Causal Language Model (Hybrid Attention + SSM)
Base Model: Qwen3-32B
Hybrid Layer Type: Gated KalmaNet (GKA)
Hybrid Ratio: 50% (32 Attention + 32 GKA layers)
Parameters: ~32B
Context Length: 128K natively
Precision: bfloat16
License: Apache 2.0

Benchmark Results

We consider the following Transformer as a baseline:

1. Qwen3-32B (thinking, from HF): The original Qwen model evaluated in thinking mode, which is the intended mode for reasoning tasks. This serves as the base Transformer from which we start the Priming procedure.

Reasoning Benchmarks

Evaluations cover math reasoning (AIME24/25), science (GPQA), coding (LiveCodeBench-v5, SciCode), tool-calling (BFCLv3/v4), and instruction-following (IFBench). They are run with the NeMo Evaluator SDK at a 64K generation length. The evaluation configuration examples/evaluation/nemo_reasoning_evals.yaml is provided for reproducibility.

| Model | AIME24 | AIME25 | GPQA | LiveCodeBench-v5 | BFCLv4 (minus web-search) | BFCLv3 | IFBench | SciCode | Average |
|-------|--------|--------|------|------------------|---------------------------|--------|---------|---------|---------|
| Qwen3-32B (thinking, from HF) | 86.33 | 70.00 | 65.40 | 64.44 | 69.30 | 69.57 | 32.61 | 15.94 | 59.20 |
| GKA-primed-HQwen3-32B-Reasoner | 87.67 | 81.67 | 67.30 | 70.24 | 70.14 | 66.34 | 48.22 | 12.34 | 62.99 |

*For BFCLv4, we remove the web-search subtask and weight each task by the number of entries (test examples) for that task.*
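As a quick sanity check, the Average column can be recomputed from the eight per-benchmark scores, and the entry-weighted BFCLv4 aggregation can be sketched in a few lines. The benchmark scores below are copied from the table. The subtask names and counts passed to `weighted_mean` are hypothetical placeholders, since the per-subtask BFCLv4 numbers are not part of this excerpt.

```python
scores = {
    "Qwen3-32B (thinking, from HF)":
        [86.33, 70.00, 65.40, 64.44, 69.30, 69.57, 32.61, 15.94],
    "GKA-primed-HQwen3-32B-Reasoner":
        [87.67, 81.67, 67.30, 70.24, 70.14, 66.34, 48.22, 12.34],
}

# Unweighted mean over the eight benchmarks, as in the Average column.
averages = {name: sum(vals) / len(vals) for name, vals in scores.items()}
gap = (averages["GKA-primed-HQwen3-32B-Reasoner"]
       - averages["Qwen3-32B (thinking, from HF)"])
print({name: round(avg, 2) for name, avg in averages.items()}, round(gap, 2))


def weighted_mean(task_scores, task_counts):
    """Weight each task's score by its number of test entries."""
    total = sum(task_counts.values())
    return sum(task_scores[t] * task_counts[t] for t in task_scores) / total


# Hypothetical subtasks and counts, for illustration only.
example = weighted_mean({"task_a": 80.0, "task_b": 60.0},
                        {"task_a": 100, "task_b": 300})
print(example)  # 65.0
```

The recomputed means agree with the table to two decimals, and the gap between them is about 3.8 points.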

How close is the Hybrid model to the Transformer baseline on complex reasoning tasks? Our Primed GKA Hybrid outperforms the Qwen3-32B (thinking, from HF) baseline by ~3.8 points on average, despite using [...]

> [!NOTE]
> Interestingly, setting num_iter=0 effectively converts the GKA model to a Gated Linear Attention (GLA) model. Thus, one can think of increasing num_iter as improving upon the initial solution of the GLA model.
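One way to picture the role of num_iter is as the number of refinement steps of an iterative linear solver that starts from a linear-attention-style readout. The sketch below is only an analogy under stated assumptions: this excerpt does not say which solver GKA uses, so plain Richardson iteration is swapped in as a stand-in, gating is omitted, and the function names and the toy state are hypothetical.

```python
def matvec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def readout(kk, vk, q, lam, num_iter):
    """Approximate vk @ (kk + lam*I)^-1 @ q with num_iter Richardson steps.

    kk is the running sum of k k^T, vk the running sum of v k^T.
    With num_iter=0 this returns vk @ q, an (ungated) linear attention readout.
    """
    d = len(q)
    M = [[kk[i][j] + (lam if i == j else 0.0) for j in range(d)] for i in range(d)]
    # M is symmetric positive definite, so trace(M) >= largest eigenvalue
    # and a step of 1/trace(M) keeps the iteration convergent.
    step = 1.0 / sum(M[i][i] for i in range(d))
    x = list(q)  # initial guess: the query itself
    for _ in range(num_iter):
        Mx = matvec(M, x)
        x = [xi + step * (qi - mi) for xi, qi, mi in zip(x, q, Mx)]
    return matvec(vk, x)


# Toy state built from keys [1, 0] and [0, 2] with scalar values 2 and 3.
kk = [[1.0, 0.0], [0.0, 4.0]]
vk = [[2.0, 6.0]]
q = [1.0, 1.0]

for n in (0, 1, 5, 30, 200):
    print(n, readout(kk, vk, q, lam=0.1, num_iter=n))
```

In this toy setting, zero iterations give the linear-attention output, and more iterations move the output toward the exact ridge regression solution, at the price of extra compute per token.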

Inference Efficiency

Sustained decode throughput (tokens/s) on 8× H200 GPUs (TP=8), measured during pure decode with a saturated KV cache. Benchmarked with random data (no prefix-caching benefits). See the full Inference guide for methodology and additional models.

| Model | 16K | 32K | 64K | 128K |
|-------|-----|-----|-----|------|
| GKA-primed-HQwen3-32B-Reasoner (num_iter=30, default) | 6,810 (1.29×) | 4,152 (1.45×) | 2,385 (1.82×) | 1,168 (1.99×) |
| GKA-primed-HQwen3-32B-Reasoner (num_iter=10) | 7,778 (1.47×) | 4,534 (1.58×) | 2,537 (1.94×) | 1,200 (2.05×) |
| GKA-primed-HQwen3-32B-Reasoner (num_iter=5) | 8,039 (1.52×) | 4,621 (1.61×) | 2,569 (1.96×) | 1,206 (2.06×) |
| GKA-primed-HQwen3-32B-Reasoner (num_iter=1) | 8,177 (1.54×) | 4,678 (1.63×) | 2,593 (1.98×) | 1,210 (2.06×) |
| GDN-primed-HQwen3-32B | 8,133 (1.53×) | 4,876 (1.70×) | 2,688 (2.06×) | 1,238 (2.11×) |
| Qwen3-32B (thinking, from HF) | 5,299 | 2,865 | 1,308 | 586 |
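The multipliers in parentheses are each row's throughput divided by the Qwen3-32B baseline at the same context length. A short sketch that recomputes them from the table, and also compares the num_iter=1 and num_iter=30 rows to show how the cost of extra solver iterations shrinks as context grows (variable names are illustrative):

```python
contexts = ["16K", "32K", "64K", "128K"]

# Decode throughput in tokens/s, copied from the table.
baseline = [5299, 2865, 1308, 586]       # Qwen3-32B (thinking, from HF)
gka_iter30 = [6810, 4152, 2385, 1168]    # num_iter=30 (default)
gka_iter1 = [8177, 4678, 2593, 1210]     # num_iter=1

# Speedup over the Transformer baseline: higher is better.
speedup_default = [round(h / b, 2) for h, b in zip(gka_iter30, baseline)]
print(dict(zip(contexts, speedup_default)))

# Throughput gained by dropping from 30 solver iterations to 1.
iter_gain = [h1 / h30 for h1, h30 in zip(gka_iter1, gka_iter30)]
print(dict(zip(contexts, (round(g, 2) for g in iter_gain))))
```

The recomputed speedups match the table, and the gain from reducing num_iter falls from roughly 20% at 16K to under 4% at 128K, where Attention dominates the decode cost.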

Mean TTFT at the Transformer's saturated batch size (Hybrid model has memory to spare):

| Model | 16K | 32K | 64K | 128K |
|-------|-----|-----|-----|------|
| GKA-primed-HQwen3-32B-Reasoner (num_iter=30, default) | 52,053 ms (1.32×) | 58,613 ms (1.21×) | 68,241 ms (1.05×) | 84,935 ms (0.90×) |
| GKA-primed-HQwen3-32B-Reasoner (num_iter=10) | 48,560 ms (1.23×) | 55,039 ms (1.13×) | 64,766 ms (0.99×) | 81,410 ms (0.86×) |
| GKA-primed-HQwen3-32B-Reasoner (num_iter=5) | 47,958 ms (1.22×) | 54,320 ms (1.12×) | 63,826 ms (0.98×) | 80,369 ms (0.85×) |
| GKA-primed-HQwen3-32B-Reasoner (num_iter=1) | 46,726 ms (1.19×) | 53,061 ms (1.09×) | 62,645 ms (0.96×) | 79,321 ms (0.84×) |
| GDN-primed-HQwen3-32B | 42,492 ms (1.08×) | 48,417 ms (1.00×) | 57,525 ms (0.88×) | 73,145 ms (0.77×) |
| Qwen3-32B (thinking, from HF) | 39,421 ms | 48,527 ms | 65,104 ms | 94,479 ms |
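For TTFT the multiplier reads the other way around: it is the Hybrid's latency divided by the baseline's, so values below 1.00× mean the Hybrid reaches the first token sooner. The following sketch recomputes the default row's ratios from the table and finds the shortest context at which the Hybrid becomes faster (variable names are illustrative):

```python
contexts = ["16K", "32K", "64K", "128K"]

# Mean TTFT in milliseconds, copied from the table.
baseline_ms = [39421, 48527, 65104, 94479]   # Qwen3-32B (thinking, from HF)
gka_ms = [52053, 58613, 68241, 84935]        # GKA, num_iter=30 (default)

# Latency ratio relative to the baseline: lower is better.
ratios = [round(h / b, 2) for h, b in zip(gka_ms, baseline_ms)]
print(dict(zip(contexts, ratios)))

# First context length where the Hybrid has lower TTFT than the baseline.
crossover = next((c for c, r in zip(contexts, ratios) if r < 1.0), None)
print(crossover)  # 128K
```

With the default num_iter=30, the Hybrid has higher TTFT than the baseline up to 64K and lower TTFT at 128K. The rows with fewer iterations cross over earlier, at 64K.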

The decode throughput advantage grows with context length — from 1.29× at 16K to 1.99× at 128K (2.06× with num_iter=1) — thanks to GKA layers...

Excerpt shown — open the source for the full document.

Notability

notability 6.0/10

Amazon releases fine-tuned 32B reasoning model, modest traction.