ModelIBM (Granite)IBM (Granite)published Mar 3, 2026seen 5d

ibm-granite/granite-4.0-3b-vision

Open original ↗

Captured source

source ↗
published Mar 3, 2026seen 5dcaptured 11hhttp 200method plaintask image-text-to-textlicense apache-2.0library transformersparams 4Bdownloads 83klikes 111

Granite 4.0 3B Vision

Model Summary: Granite-4.0-3B-Vision is a vision-language model (VLM) designed for enterprise-grade document data extraction. It focuses on specialized, complex extraction tasks that ultracompact models often struggle with:

  • Chart extraction: Converting charts into structured, machine-readable formats (Chart2CSV, Chart2Summary, and Chart2Code)
  • Table extraction: Accurately extracting tables with complex layouts from document images to JSON, HTML, or OTSL
  • Semantic Key-Value Pair (KVP) extraction: Extracting values based on key names and descriptions across diverse document layouts

The model is delivered as a LoRA adapter on top of Granite 4.0 Micro, with a 3.5B base LLM and 0.5B LoRA adapters. This enables a single deployment to support both multimodal document understanding and text-only workloads — the base model handles text-only requests without loading the adapter. See [Model Architecture](#model-architecture) for details.

The methodology and data (ChartNet) used for this model are described in the paper ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding.

While our focus is on specialized document extraction tasks, the current model preserves and extends the capabilities of Granite-Vision-3.3 2B, ensuring that existing users can adopt it seamlessly with no changes to their workflow. It continues to support vision‑language tasks such as producing detailed natural‑language descriptions from images (image‑to‑text). The model can be used standalone and integrates seamlessly with Docling to enhance document processing pipelines with deep visual understanding capabilities.

  • Developer: IBM Research
  • GitHub Repository: https://github.com/ibm-granite
  • Release Date: March 27th, 2026
  • License: Apache 2.0

Supported Tasks

The model supports specialized extraction tasks, each activated by a simple task tag in the user message. The chat template automatically expands tags into the full prompt — no need to write verbose instructions.

| Tag | Task | Output | |-----|------|--------| | ` | Chart to CSV | CSV table with headers and numeric values | | | Chart to Python code | Python code that recreates the chart | | | Chart to summary | Natural-language description of the chart | | | Table extraction (JSON) | Structured JSON with dimensions and cells | | | Table extraction (HTML) | HTML markup | | ` | Table extraction (OTSL) | OTSL markup with cell/merge tags | | KVP (see prompt instructions below) | Schema based Key-Value pairs extraction | JSON with nested dictionaries and arrays |

Model Performance

Benchmark Results

We compare Granite-4.0-3B-Vision against leading small VLMs across multiple extraction benchmarks.

Chart Extraction

We evalute chart extraction using the human-verified test-set from ChartNet. The models are evaluated using LLM-as-a-judge (GPT4o) comparing the model prediction against the ground-truth. We report the average scores 0-100 on chart2csv and chart2summary extraction tasks.

Table Extraction

To benchmark table extraction, we construct a unified evaluation suite spanning multiple datasets and settings to assess end-to-end table extraction capabilities of vision-language models:

1. [TableVQA-Extract](https://arxiv.org/abs/2404.19205) — Converts the original visual table QA benchmark into a cropped table extraction task. 2. [OmniDocBench-tables](https://arxiv.org/abs/2412.07626) — A document parsing benchmark over diverse PDF types with detailed annotations for layout, text, formulas, and tables. We use the subset of pages that contain one or more tables to evaluate table extraction in full-page settings. 3. [PubTablesV2](https://arxiv.org/abs/2512.10888) — A large-scale table extraction benchmark evaluated in both cropped-table and full-page document settings.

To unify evaluation, we replace each dataset’s original annotations (e.g., Q&A pairs) with a single instruction: *extract the table(s) from the image in HTML format*, using the corresponding HTML as ground truth. For full-page inputs, only tabular elements are considered; when multiple tables appear, they are aggregated into a Python list.

We report results using TEDS (Tree-Edit Distance-based Similarity), which measures structural and content similarity between predicted and ground-truth HTML tables.

Results are presented separately for cropped-table and full-page settings to highlight performance across controlled and realistic document scenarios.

Key-Value Pair (KVP) Extraction

We evaluate on VAREX, a benchmark for multimodal structured extraction from documents. Granite-4.0-3B-Vision achieves 85.5% exact-match accuracy (zero-shot), ranking 3rd among 2–4B parameter models as of March 2026 (view results here).

Setup

Tested with python=3.11

pip install torch==2.10.0 --index-url https://download.pytorch.org/whl/cu128
pip install transformers==4.57.6 peft==0.18.1 tokenizers==0.22.2 pillow==12.1.1

Usage with Transformers

import re
from io import StringIO

import pandas as pd
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
from huggingface_hub import hf_hub_download

model_id = "ibm-granite/granite-4.0-3b-vision"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
model_id,
trust_remote_code=True,
dtype=torch.bfloat16,
device_map=device
).eval()

# Optional: merge LoRA adapters into base weights for faster inference.
# Prefer to skip when using text-only tasks, as the LoRA adapters are vision-specific.
model.merge_lora_adapters()

def run_inference(model, processor, images, prompts):
"""Run batched inference on image+prompt pairs (one image per prompt)."""
conversations = [
[{"role": "user", "content": [
{"type": "image"},
{"type": "text", "text": prompt},
]}]
for prompt in prompts
]
texts = [
processor.apply_chat_template(conv, tokenize=False,…

Excerpt shown — open the source for the full document.

Notability

notability 8.0/10

Strong downloads, notable model from IBM