Repo: Sarvam AI, published Feb 8, 2026

sarvamai/olmOCR-bench-sarvam-api

Python


Language: Python

Stars: 4

Forks: 2

Open issues: 1

Created: 2026-02-08T13:08:27Z

Pushed: 2026-02-09T14:37:09Z

Default branch: main

Fork: no

Archived: no

README:

olmOCR-Bench Evaluation

Run Sarvam Vision OCR on PDFs/images and evaluate the outputs against olmOCR-Bench.

Scripts

| Script | Purpose |
|---|---|
| run_sarvam_vision_inference.py | Run Sarvam Vision OCR on PDFs/images and save .md outputs |
| postproccessing.py | Post-processing helpers (join line-break hyphens, unwrap simple math/LaTeX); imported by the inference script |
| run_eval.py | Evaluate .md outputs against olmOCR-Bench |

Setup

cd olmocr_bench_eval
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# For math tests (KaTeX rendering)
playwright install && playwright install-deps

Download benchmark data

Download one of the following datasets:

huggingface-cli login

# Option A: English-only subset (default for commands below)
huggingface-cli download --repo-type dataset sarvamai/olmOCR-Bench-English --local-dir ./olmocr_bench_english

# OR

# Option B: Full benchmark (all languages)
huggingface-cli download --repo-type dataset allenai/olmOCR-bench --local-dir ./olmocr_bench_full

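The Hugging Face download creates a nested bench_data/ directory inside the local directory. The commands below point at that nested directory (for example, olmocr_bench_english/bench_data), not at the download root.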

Directory layout

olmocr_bench_english/                 # HF download root
└── bench_data/                       # Actual data directory
    ├── pdfs/                         # PDFs from olmOCR-bench
    │   ├── arxiv_math/2503.04048_pg46.pdf
    │   └── ...
    ├── arxiv_math.jsonl              # Test definitions (from HF dataset)
    ├── headers_footers.jsonl
    ├── table_tests.jsonl
    ├── ...
    └── my_model/                     # Your .md outputs (one folder per model)
        ├── arxiv_math/
        │   ├── 2503.04048_pg46_pg1_repeat0.md
        │   └── ...
        └── ...

Naming rule: {pdf_basename}_pg{page}_repeat{N}.md. For a single run per page, use _repeat0. The _repeat suffix is auto-added by run_eval.py if missing.
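The naming rule can be sketched in a few lines of Python. This is an illustration only: `output_name` and `ensure_repeat_suffix` are hypothetical helpers, not functions from this repo, and the way run_eval.py actually adds the missing suffix is not shown in this README.

```python
import re
from pathlib import Path


def output_name(pdf_path: str, page: int = 1, repeat: int = 0) -> str:
    """Build '{pdf_basename}_pg{page}_repeat{N}.md' from a PDF path (hypothetical helper)."""
    return f"{Path(pdf_path).stem}_pg{page}_repeat{repeat}.md"


def ensure_repeat_suffix(md_name: str) -> str:
    """Append '_repeat0' to a bare .md file name that lacks a '_repeat{N}' suffix."""
    stem = Path(md_name).stem
    if re.search(r"_repeat\d+$", stem):
        return md_name  # already carries a repeat index
    return f"{stem}_repeat0.md"
```

For the PDF in the layout above, `output_name("pdfs/arxiv_math/2503.04048_pg46.pdf")` yields `2503.04048_pg46_pg1_repeat0.md`, which matches the example file name in the tree.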

Step 1: Run inference

export SARVAM_API_KEY="your_key_here"

# With post-processing enabled
python run_sarvam_vision_inference.py olmocr_bench_english/bench_data/pdfs/ olmocr_bench_english/bench_data/my_model \
--join-line-break-hyphens --unwrap-simple-math-latex

Already-processed files are automatically skipped.
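To preview which PDFs a rerun would still need to process, a check along the following lines may help. It is a sketch under assumptions: `pending_pdfs` is a hypothetical helper, and it treats a PDF as done when any file following the naming rule exists in the mirrored output folder. The inference script's own skip logic is not documented here and may differ.

```python
import re
from pathlib import Path


def pending_pdfs(pdf_root: Path, out_root: Path) -> list[Path]:
    """Return PDFs under pdf_root that have no matching .md under out_root."""
    todo = []
    for pdf in sorted(pdf_root.rglob("*.pdf")):
        # Outputs are assumed to mirror the PDF's subfolder (e.g. arxiv_math/).
        out_dir = out_root / pdf.parent.relative_to(pdf_root)
        # Accept '{stem}_pg{page}.md' with an optional '_repeat{N}' suffix.
        pattern = re.compile(re.escape(pdf.stem) + r"_pg\d+(_repeat\d+)?\.md")
        done = out_dir.is_dir() and any(
            pattern.fullmatch(p.name) for p in out_dir.iterdir()
        )
        if not done:
            todo.append(pdf)
    return todo
```

Called as `pending_pdfs(Path("olmocr_bench_english/bench_data/pdfs"), Path("olmocr_bench_english/bench_data/my_model"))`, it lists the inputs that have no output yet.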

Post-processing (--join-line-break-hyphens / --unwrap-simple-math-latex)

Two independent transforms (both optional). Logic lives in postproccessing.py.

1. Join line-break hyphens — joins words split by hyphens at line breaks (skips LaTeX/code blocks):

| Before | After |
|---|---|
| experi-\nmental | experimental |
| approx- \nimation | approximation |
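A minimal regex sketch of this transform, written to reproduce the two rows above. It is not the code in postproccessing.py: `join_line_break_hyphens` is a hypothetical name, and this simplified version does not skip LaTeX or code blocks.

```python
import re

# A hyphen directly after a letter, optional trailing spaces, a newline,
# optional leading spaces, then a lowercase continuation of the word.
_LINE_BREAK_HYPHEN = re.compile(r"(?<=[A-Za-z])-[ \t]*\n[ \t]*(?=[a-z])")


def join_line_break_hyphens(text: str) -> str:
    """Join words that were split by a hyphen at a line break."""
    return _LINE_BREAK_HYPHEN.sub("", text)
```

One known limitation of this sketch: a genuine compound that happens to break at its hyphen (for example `well-\nknown`) is also merged, since the pattern cannot tell the two cases apart.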

2. Unwrap simple math/LaTeX — replaces trivial LaTeX with plain text/Unicode; real math is kept:

| Before | After |
|---|---|
| $\alpha$, $\le$, $\times$ | α, ≤, × |
| \underline{Title} | Title |
| $42$, $95\%$, $hello$ | 42, 95%, hello |
| $x^2 + y^2$ | $x^2 + y^2$ *(unchanged)* |
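The rows above can be reproduced with a small sketch like the one below. The symbol table, the regexes, and the name `unwrap_simple_math` are all assumptions made for illustration; the actual rules in postproccessing.py are likely broader.

```python
import re

# Tiny illustrative symbol table; the repo's real mapping is not shown in this README.
SYMBOLS = {r"\alpha": "α", r"\le": "≤", r"\times": "×"}

_INLINE_MATH = re.compile(r"\$([^$\n]+)\$")        # $...$ on a single line
_UNDERLINE = re.compile(r"\\underline\{([^{}]*)\}")  # \underline{...}, no nesting
_PLAIN = re.compile(r"[A-Za-z0-9]+(\\%)?")          # bare word/number, optional \%


def _unwrap(match: re.Match) -> str:
    body = match.group(1).strip()
    if body in SYMBOLS:
        return SYMBOLS[body]
    if _PLAIN.fullmatch(body):
        return body.replace(r"\%", "%")
    return match.group(0)  # anything else is treated as real math and kept


def unwrap_simple_math(text: str) -> str:
    """Replace trivial inline LaTeX with plain text or Unicode."""
    text = _UNDERLINE.sub(r"\1", text)
    return _INLINE_MATH.sub(_unwrap, text)
```

The design choice here is an allowlist: only bodies that are a known symbol or a bare word or number are unwrapped, so expressions with operators or superscripts such as `$x^2 + y^2$` pass through untouched.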

Step 2: Evaluate

# Evaluate all candidate folders
python run_eval.py -d olmocr_bench_english/bench_data

# Evaluate one candidate
python run_eval.py -d olmocr_bench_english/bench_data -c my_model

# With options
python run_eval.py -d olmocr_bench_english/bench_data --force --skip-baseline --test-report report.html

Output: overall score (%), 95% confidence interval, and per-test-type breakdown.
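This README does not say how the 95% confidence interval is computed. As a rough illustration of what such an interval means for a pass/fail benchmark, here is a percentile-bootstrap sketch. The bootstrap method, the function name `bootstrap_ci`, and its parameters are assumptions, not a description of run_eval.py.

```python
import random


def bootstrap_ci(
    passed: list[bool], n_boot: int = 1000, seed: int = 0
) -> tuple[float, float, float]:
    """Return (score, low, high) in percent via a percentile bootstrap."""
    if not passed:
        raise ValueError("need at least one test result")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(passed)
    score = 100.0 * sum(passed) / n
    # Resample the test results with replacement and record each pass rate.
    means = sorted(
        100.0 * sum(rng.choices(passed, k=n)) / n for _ in range(n_boot)
    )
    low = means[int(0.025 * n_boot)]
    high = means[int(0.975 * n_boot) - 1]
    return score, low, high
```

With 80 passing and 20 failing tests, `bootstrap_ci([True] * 80 + [False] * 20)` returns a score of 80.0 together with an interval that brackets it.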
