Model: Sarvam AI, published Mar 3, 2026

sarvamai/sarvam-105b

Task: text-generation | License: apache-2.0 | Library: transformers | Params: 106B | Downloads: 13k | Likes: 273


Want a smaller model? Download Sarvam-30B!

Index

1. [Introduction](#introduction)
2. [Architecture](#architecture)
3. [Benchmarks](#benchmarks)
  • Knowledge & Coding
  • Reasoning & Math
  • Agentic
4. [Inference](#inference)
5. [Footnote](#footnote)
6. [Citation](#citation)

Introduction

Sarvam-105B is a Mixture-of-Experts (MoE) model with 10.3B active parameters. It is optimized for complex reasoning, with particular strength in agentic tasks, mathematics, and coding.

Sarvam-105B is a top-tier performer, consistently matching or surpassing several major closed-source models and staying within a narrow margin of frontier models across diverse reasoning and agentic benchmarks. It demonstrates exceptional agentic and reasoning capabilities in real-world applications such as web search and technical troubleshooting.

A major focus during training was the Indian context and languages, resulting in state-of-the-art performance across 22 Indian languages for its model size.

Sarvam-105B is open-sourced under the Apache 2.0 license. For more details, see our blog.

Architecture

The 105B model adopts an MLA-style attention stack with decoupled QK head dimensions (q_head_dim=192, split into RoPE and noPE components; v_head_dim=128) and a large head_dim of 576, enabling higher representational bandwidth per head while keeping the hidden size at 4096. This approach improves attention expressivity and long-context extrapolation (via YaRN scaling with a factor of 40 and a 128K context). The feed-forward layers use an intermediate_size of 16384 and a moe_intermediate_size of 2048, combined with top-8 routing over 128 experts, which increases per-token active capacity while keeping activation cost manageable. The model has one shared expert, a routed scaling factor of 2.5, and auxiliary-loss-free router balancing.
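To make the routing numbers above concrete, here is a minimal NumPy sketch of top-8 selection over 128 experts with the routed scaling factor applied to the mixing weights. It is an illustration only, not the model's actual router: the softmax scoring, the renormalization over the selected experts, and the names `route_tokens`, `logits`, `expert_ids`, and `expert_weights` are assumptions made for this example. The shared expert and the auxiliary-loss-free balancing are not modeled.

```python
import numpy as np

# Values quoted in the architecture description.
NUM_EXPERTS = 128
TOP_K = 8
ROUTED_SCALING_FACTOR = 2.5


def route_tokens(router_logits, top_k=TOP_K, scale=ROUTED_SCALING_FACTOR):
    """Pick top_k experts per token and return (expert indices, mixing weights).

    Assumption: softmax scoring with renormalization over the selected
    experts; the real router may score and normalize differently.
    """
    # Numerically stable softmax over the expert axis.
    shifted = router_logits - router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)

    # argpartition puts the top_k largest probabilities in the first
    # top_k slots (in no particular order).
    top_idx = np.argpartition(-probs, top_k - 1, axis=-1)[..., :top_k]
    top_probs = np.take_along_axis(probs, top_idx, axis=-1)

    # Renormalize over the chosen experts, then apply the routed scaling factor.
    weights = scale * top_probs / top_probs.sum(axis=-1, keepdims=True)
    return top_idx, weights


rng = np.random.default_rng(0)
logits = rng.normal(size=(4, NUM_EXPERTS))  # 4 hypothetical tokens
expert_ids, expert_weights = route_tokens(logits)
print(expert_ids.shape, expert_weights.sum(axis=-1))
```

With these settings each token activates 8 of the 128 routed experts (6.25%), and under the normalization assumed here the routed weights for each token sum to the scaling factor of 2.5.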

Benchmarks

Knowledge & Coding

| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
|---|---|---|---|---|
| Math500 | 98.6 | 97.2 | 97.0 | 98.2 |
| Live Code Bench v6 | 71.7 | 59.5 | 72.3 | 68.7 |
| MMLU | 90.6 | 87.3 | 90.0 | 90.0 |
| MMLU Pro | 81.7 | 81.4 | 80.8 | 82.7 |
| Writing Bench | 80.5 | 83.8 | 86.5 | 84.6 |
| Arena Hard v2 | 71.0 | 68.1 | 88.5 | 68.2 |
| IF Eval | 84.8 | 83.5 | 85.4 | 88.9 |

Reasoning & Math

| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
|---|---|---|---|---|
| GPQA Diamond | 78.7 | 75.0 | 80.1 | 77.2 |
| AIME 25 (w/ Tools) | 88.3 (96.7) | 83.3 | 90.0 | 87.8 |
| Beyond AIME | 69.1 | 61.5 | 51.0 | 68.0 |
| HMMT (Feb 25) | 85.8 | 69.2 | 90.0 | 73.9 |
| HMMT (Nov 25) | 85.8 | 75.0 | 90.0 | 80.0 |

Agentic

| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
|---|---|---|---|---|
| BrowseComp | 49.5 | 21.3 | - | 38.0 |
| SWE Bench Verified (SWE-Agent Harness) | 45.0 | 57.6 | 50.6 | 60.9 |
| τ² Bench (avg.) | 68.3 | 53.2 | 65.8 | 55.0 |

> See footnote for evaluation details.

Inference

Hugging Face

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "sarvamai/sarvam-105b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" places the weights across the available devices.
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="auto")


def generate_text(
    prompt: str,
    max_new_tokens: int = 2048,
    temperature: float = 0.8,
    top_p: float = 0.95,
    repetition_penalty: float = 1.0,
) -> str:
    # Send the inputs to the model's device instead of hard-coding "cuda:0".
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    generation_config = GenerationConfig(
        max_new_tokens=max_new_tokens,
        repetition_penalty=repetition_penalty,
        temperature=temperature,
        top_p=top_p,
        do_sample=True,
    )

    with torch.no_grad():
        output_ids = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            generation_config=generation_config,
        )
    # The decoded string contains the prompt followed by the completion.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


prompts = [
    "Which country won the FIFA World Cup in 2012?",
]

for prompt in prompts:
    # Wrap the raw prompt in the model's chat template, with thinking enabled.
    templated_prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True,
    )
    output = generate_text(templated_prompt, max_new_tokens=512)
    print("Prompt: ", prompt)
    print("Generated text: ", output)
    print("=" * 100)
```

SGLang

Install latest SGLang from source

```shell
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```

Instantiate the model and run

```python
import sglang as sgl
from transformers import AutoTokenizer

model_path = "sarvamai/sarvam-105b"
# The tokenizer is needed below to apply the chat template.
tokenizer = AutoTokenizer.from_pretrained(model_path)

engine = sgl.Engine(
    model_path=model_path,
    tp_size=4,
    mem_fraction_static=0.70,
    trust_remote_code=True,
    dtype="bfloat16",
    moe_runner_backend="flashinfer_cutedsl",
    prefill_attention_backend="fa3",
    decode_attention_backend="flashmla",
    disable_radix_cache=False,
)

sampling_params = {
    "temperature": 0.8,
    "max_new_tokens": 2048,
    "repetition_penalty": 1.0,
}

prompts = [
    "Which band released the album Dark Side of the Moon in 1973?",
]

# Apply the chat template to every prompt before handing the batch to the engine.
templated_prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True,
    )
    for prompt in prompts
]

outputs = engine.generate(templated_prompts, sampling_params)
for p, o in zip(prompts, outputs):
    print("Prompt: ", p)
    print("Generated text: ", o["text"])
    print("=" * 100)
```
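As a rough sizing aid for the `tp_size=4` and `dtype="bfloat16"` settings used above, the following back-of-the-envelope sketch estimates the memory taken by the weights alone. It assumes the 106B parameter count listed for this model, 2 bytes per bfloat16 parameter, and an even split across tensor-parallel ranks. It ignores the KV cache, activations, and framework overhead, so treat the result as a lower bound rather than a hardware requirement. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int, tp_size: int) -> tuple[float, float]:
    """Return (total, per-GPU) weight memory in decimal gigabytes.

    Assumption: weights are split evenly across tensor-parallel ranks.
    """
    total_gb = num_params * bytes_per_param / 1e9
    return total_gb, total_gb / tp_size


# 106B parameters, bfloat16 (2 bytes each), 4-way tensor parallelism.
total_gb, per_gpu_gb = weight_memory_gb(106e9, 2, 4)
print(f"total: {total_gb:.0f} GB, per GPU: {per_gpu_gb:.0f} GB")
```

Under these assumptions the weights occupy about 212 GB in total, or about 53 GB per GPU, before any memory is reserved for the KV cache.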

vLLM

Note: a PR adding native support for the Sarvam models to vLLM is currently open (link). In the meantime, there are two options.

Option 1: install from source (hard)

  • Use the custom fork here: link
  • Follow the instructions here to install from source: link

Option 2: hot-patch (easy)

  • Run [hotpatch_vllm.py](./hotpatch_vllm.py)
  • This will do the following:
      • install vllm=0.15.0
      • add 2 model entries to registry.py
      • download the model executors for sarvam-105b and sarvam-30b

Once this is done, you can run vLLM as usual:

from vllm…

Excerpt shown — open the source for the full document.
