OpenBMB/MA-ProofBench
Lean
Captured source
source ↗OpenBMB/MA-ProofBench
Language: Lean
License: MIT
Stars: 2
Forks: 0
Open issues: 0
Created: 2026-06-10T09:48:54Z
Pushed: 2026-06-15T02:31:21Z
Default branch: main
Fork: no
Archived: no
README:
MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis
English | 中文
We introduce MA-ProofBench, to the best of our knowledge, the first formal benchmark for evaluating large language models (LLMs) on theorem proving in Mathematical Analysis. It contains 200 rigorously formalized theorem-proving problems in Lean 4 (v4.28.0), split into two difficulty tiers:
| Tier | Description | Source | Count | |--------------|----------------|------------------------------------------|-------| | Level I | Undergraduate | Basic Textbook Exercises | 100 | | Level II | Ph.D. | Exam Problems from Top-Tier Universities | 100 |
The problems span 6 core topics and 27 subcategories, including *measure and integration theory*, *complex analysis*, and *functional analysis*, among other categories. MA-ProofBench targets areas that are underrepresented in prior benchmarks and require deep reasoning about continuity, limits, and topological structures. Each problem is built through a human-led, LLM-assisted formalization pipeline with independent expert blind review to ensure mathematical fidelity.
Category Distribution
Problems are classified following the Mathematics Subject Classification (MSC) scheme:
| Category | Level I | Level II | | ------------------------------- | ------- | -------- | | Real Functions | 44 | 12 | | Functional Analysis | 15 | 31 | | Functions of a Complex Variable | 19 | 16 | | Measure & Integration | 13 | 17 | | Operator Theory | 4 | 23 | | Sequences, Series, Summability | 5 | 1 |
Evaluation
The evaluation/ package generates proofs via an OpenAI-compatible API and verifies them with Kimina Lean Server.
First, install the evaluation dependencies:
pip install -r requirements.txt
Then set up and start the Lean server (make sure it uses Lean v4.28.0):
cd kimina-lean-server cp .env.template .env # Optional bash setup.sh # Installs Lean, repl and mathlib4 pip install -r requirements.txt pip install . prisma generate python -m server
With the Lean server running, launch the evaluation script with an OpenAI-compatible API endpoint:
EXTERNAL_API_URL=https://your-api/v1/chat/completions \ EXTERNAL_API_KEY=sk-xxx \ EVAL_MODEL_PATH=your-model-name \ NUM_SAMPLES=8 \ EVAL_MAX_TOKENS=32768 \ EVAL_TEMPERATURE=1.0 \ bash evaluation/run.sh
To run a stage manually:
# Inference only EXTERNAL_API_URL=https://your-api/v1/chat/completions \ EXTERNAL_API_KEY=sk-xxx \ python -m evaluation.main --mode infer \ --model_path "$EVAL_MODEL_PATH" \ --output_path predictions.jsonl \ --split level1 level2 --num_samples 8 \ --eval_max_tokens 32768 --eval_temperature 1.0 # Verification only python -m evaluation.main --mode eval \ --predictions_path predictions.jsonl \ --output_path results.jsonl \ --split level1 level2 --num_samples 8
Citation
@article{ma-proofbench,
title={MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis},
author={Lushi Pu and Weiming Zhang and Xinheng Xie and Zixuan Fu and Bingxiang He and Hongya Lyu and Xin Li and Jie Zhou and Yudong Wang},
year={2026},
eprint={2606.13782},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2606.13782},
}License
This project is released under the [MIT License](LICENSE).
Notability
notability 2.0/10Low traction, new benchmark repo