Model · IBM (Granite) · published Oct 7, 2025 · seen 5d

ibm-granite/granite-4.0-350m-base

Captured source

published Oct 7, 2025 · seen 5d · captured 11h · http 200 · method plain · task text-generation · license apache-2.0 · library transformers · params 352M · downloads 13k · likes 26

Granite-4.0-350M-Base

Model Summary: Granite-4.0-350M-Base is a lightweight decoder-only language model designed for scenarios where efficiency and speed are critical. It can run on resource-constrained devices such as smartphones or IoT hardware, enabling offline and privacy-preserving applications. It also supports Fill-in-the-Middle (FIM) code completion through the use of specialized prefix and suffix tokens. The model is trained from scratch on approximately 15 trillion tokens following a four-stage training strategy: 10 trillion tokens in the first stage, 2 trillion in the second, another 2 trillion in the third, and 0.5 trillion in the final stage.
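To make the FIM mechanism concrete, here is a minimal sketch of how a fill-in-the-middle prompt is commonly assembled: the code before the gap follows a prefix token, the code after the gap follows a suffix token, and a trailing middle token asks the model to generate the missing span. The token strings below and the helper name `build_fim_prompt` are assumptions for illustration only; the summary does not spell out the exact tokens, so check the tokenizer's special tokens before relying on them.

```python
# Hypothetical FIM token strings (assumption); verify against the model's tokenizer.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange the code around a gap in prefix, suffix, middle order."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# The gap to fill is the expression after "return".
prompt = build_fim_prompt(
    prefix="def add(a, b):\n    return ",
    suffix="\n\nprint(add(1, 2))\n",
)
print(prompt)
```

The resulting string would be passed to the tokenizer in place of an ordinary prompt, and the text the model generates after the middle token is the candidate fill for the gap.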

Supported Languages: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may fine-tune Granite 4.0 Nano models to support languages beyond those included in this list.

Intended Use: Prominent use cases of LLMs in text-to-text generation include summarization, text classification, extraction, question-answering and code-completion (including FIM) tasks. Moreover, these lightweight models can serve as a baseline for creating task-specific models for different applications.

Generation: This is a simple example of how to use the Granite-4.0-350M-Base model.

Install the following libraries:

pip install torch torchvision torchaudio
pip install accelerate
pip install transformers

Then, copy the code snippet below to run the example.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# use the GPU when one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

model_path = "ibm-granite/granite-4.0-350M-base"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device)
model.eval()
# change input text as desired
input_text = "The capital of France is"
# tokenize the text and move the tensors to the same device as the model
input_tokens = tokenizer(input_text, return_tensors="pt").to(device)
# generate output tokens; max_length counts the prompt tokens as well
output = model.generate(**input_tokens, max_length=10)
# decode output tokens into text
output = tokenizer.batch_decode(output)
# print output
print(output[0])

Expected output:

The capital of France is Paris.

Evaluation Results:

| Benchmarks | Metric | 350M Dense | H 350M Dense | 1B Dense | H 1B Dense |
|---|---|---|---|---|---|
| General Tasks | | | | | |
| MMLU | 5-shot | 33.08 | 36.07 | 59.82 | 58.71 |
| MMLU-Pro | 5-shot, CoT | 11.29 | 10.08 | 29.96 | 23.45 |
| BBH | 3-shot, CoT | 32.19 | 29.96 | 57.73 | 48.45 |
| AGI EVAL | 3-shot | 28.97 | 29.2 | 48.95 | 47.46 |
| DROP | 5-shot | 29.77 | 28.56 | 58.18 | 57.18 |
| Math Tasks | | | | | |
| GSM8K | 8-shot | 24.11 | 24.41 | 62.4 | 57.39 |
| Minerva Math | 4-shot | 9.96 | 11.5 | 30.3 | 21.3 |
| Code Tasks | | | | | |
| HumanEval | pass@1 [StarCoder Prompt] | 34.6 | 35.61 | 68.08 | 68.26 |
| HumanEval | pass@1 | 32 | 34 | 60 | 59 |
| HumanEval+ | pass@1 | 29 | 29 | 57 | 56 |
| MBPP | pass@1 | 45 | 17 | 72 | 65 |
| MBPP+ | pass@1 | 38 | 16 | 60 | 54 |
| Multilingual Tasks | | | | | |
| MMMLU | 5-shot | 30.93 | 31.02 | 46.73 | 48.55 |
| INCLUDE | 5-shot | 27.32 | 29.26 | 42.6 | 43.8 |
| MGSM | 8-shot | 13.92 | 15.12 | 46.96 | 41.52 |

Multilingual Benchmarks and their included languages:

| Benchmarks | # Langs | Languages |
|---|---|---|
| MMMLU | 11 | ar, de, en, es, fr, ja, ko, pt, zh, bn, hi |
| INCLUDE | 14 | hi, bn, ta, te, ar, de, es, fr, it, ja, ko, nl, pt, zh |
| MGSM | 5 | en, es, fr, ja, zh |

Model Architecture:

Granite-4.0-350M-Base is based on a decoder-only dense transformer architecture. Core components of this architecture are: GQA, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.
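The components named above can be illustrated with a few lines of plain Python. This is a minimal sketch of the general techniques (RMSNorm, the SwiGLU gate, and the query-to-KV head grouping used by GQA), not the model's actual implementation: the function names, the `eps` value, and the omission of the learned projection matrices are all simplifying assumptions.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root-mean-square, then scale by a learned weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu(gate, up):
    """SwiGLU gating: SiLU(gate) multiplied elementwise with the up projection."""
    return [silu(g) * u for g, u in zip(gate, up)]


def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """GQA: each group of consecutive query heads shares one key/value head."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


print(rms_norm([3.0, 4.0], [1.0, 1.0]))
print(swiglu([0.0, 1.0], [2.0, 2.0]))
# 16 query heads sharing 4 KV heads gives groups of 4 query heads each
print([kv_head_for(h, 16, 4) for h in range(16)])
```

With 16 query heads and 4 KV heads, every KV head serves four query heads, which is what shrinks the key/value cache relative to standard multi-head attention.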

| Model | 350M Dense | H 350M Dense | 1B Dense | H 1B Dense |
|---|---|---|---|---|
| Embedding size | 1024 | 768 | 2048 | 1536 |
| Number of layers | 28 attention | 4 attention / 28 Mamba2 | 40 attention | 4 attention / 36 Mamba2 |
| Attention head size | 64 | 64 | 128 | 128 |
| Number of attention heads | 16 | 12 | 16 | 12 |
| Number of KV heads | 4 | 4 | 4 | 4 |
| Mamba2 state size | - | 128 | - | 128 |
| Number of Mamba2 heads | - | 48 | - | 48 |
| MLP / Shared expert hidden size | 2048 | 2048 | 4096 | 4096 |
| Num. Experts | - | - | - | - |
| Num. active Experts | - | - | - | - |
| Expert hidden size | - | - | - | - |
| MLP activation | SwiGLU | SwiGLU | SwiGLU | SwiGLU |
| Sequence length | 32K | 32K | 128K | 128K |
| Position embedding | RoPE | NoPE | RoPE | NoPE |
| Parameters | 350M | 340M | | |
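One practical reading of the architecture figures is the size of the attention key/value cache at full context. The sketch below is a rough estimate under stated assumptions: 2-byte cache values (for example bf16), a single sequence, "32K" taken as 32,768 tokens, and only attention layers counted (Mamba2 state and model weights are ignored). The helper name `kv_cache_bytes` is illustrative, not part of any library.

```python
def kv_cache_bytes(attention_layers, kv_heads, head_size, seq_len, bytes_per_value=2):
    """Bytes needed for keys and values across all attention layers, one sequence."""
    # factor 2: one tensor for keys and one for values
    return 2 * attention_layers * kv_heads * head_size * seq_len * bytes_per_value


SEQ_LEN = 32 * 1024  # assumes "32K" means 32,768 tokens

# Layer, KV-head, and head-size values taken from the 350M columns of the table
dense = kv_cache_bytes(attention_layers=28, kv_heads=4, head_size=64, seq_len=SEQ_LEN)
hybrid = kv_cache_bytes(attention_layers=4, kv_heads=4, head_size=64, seq_len=SEQ_LEN)

print(f"350M Dense:   {dense / 2**20:.0f} MiB")   # 896 MiB
print(f"H 350M Dense: {hybrid / 2**20:.0f} MiB")  # 128 MiB
```

Under these assumptions the hybrid variant's four attention layers need one seventh of the cache that the 28-layer dense variant does, since the cache grows linearly with the number of attention layers.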

Notability

notability 6.0/10

IBM small base model, modest downloads