ByteDance-Seed/AInsteinBench
Language: Python
License: Apache-2.0
Stars: 9
Forks: 0
Open issues: 0
Created: 2025-12-23T03:40:59Z
Pushed: 2026-01-07T06:29:34Z
Default branch: public
Fork: no
Archived: no
README:
AInsteinBench
AInsteinBench is a benchmark for evaluating how well AI agents solve scientific computing (a.k.a. scico) problems. It currently supports coding questions in two formats: Einstein Toolkit and Multi-SWE-bench.
Prerequisites
- Python 3.8+
- Docker
- Required packages:
pip install -r requirements.txt
Optional (for Multi-SWE-bench type questions)
# install multi-swe-bench
git clone https://github.com/multi-swe-bench/multi-swe-bench.git && cd multi-swe-bench && make install
# the docker images for the questions are available upon request
Optional (for Einstein Toolkit type questions)
cd curation/et
# install Einstein Toolkit following https://einsteintoolkit.org/download.html
curl -kLO https://raw.githubusercontent.com/gridaphobe/CRL/ET_2025_05/GetComponents
chmod a+x GetComponents
./GetComponents --shallow https://bitbucket.org/einsteintoolkit/manifest/raw/ET_2025_05/einsteintoolkit.th
# The curation/et/Cactus folder should now be populated with the Einstein Toolkit code and tests.
# The ADMConstraints thorn is outdated but still used in the Docker image; get it with
git clone https://bitbucket.org/einsteintoolkit/einsteinanalysis.git
cp -r einsteinanalysis/ADMConstraints .
# prepare the docker environment
docker pull rynge/einsteintoolkit
Quick Start
As a quick start, you can prepare the Docker environments and evaluate the Einstein Toolkit questions with:
python scripts/extract_dockerhub_images.py
bash pull_images.sh --sample
python evaluate_questions.py --questions data/questions/et_converted.jsonl --answers data/answers
Results are saved to evaluation_results.json.
Evaluation
Docker Setup
Before running the evaluation, pull the required images once:
# this may take a few minutes or hours depending on network, make sure you have 50GB+ free storage
bash pull_images.sh --all
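The storage note above can be checked before starting a long pull. The following is a minimal sketch using only the Python standard library; the helper names and the default path are hypothetical, and it assumes the path you pass is on the same filesystem where Docker stores its images (often /var/lib/docker on Linux, but this depends on your setup).

```python
import shutil

# Threshold taken from the "50GB+ free storage" note; counted in units of 10**9 bytes.
REQUIRED_GB = 50


def free_gigabytes(path="."):
    """Return the free space, in gigabytes, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return True when the filesystem containing `path` has at least `required_gb` free."""
    return free_gigabytes(path) >= required_gb
```

For example, `has_enough_space("/var/lib/docker")` reports whether that filesystem currently meets the 50 GB threshold, assuming that directory exists on your machine.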
You can also pull only the images required for the questions you want to test from the data/questions folder:
# extract the image names, this will be written to `sample_images.txt`
python scripts/extract_dockerhub_images.py
# pull only the sample images
bash pull_images.sh --sample
You can also specify the Docker Hub username by adding --username to the command.
Prepared Docker Images
All evaluations in AInsteinBench are executed inside pre-built Docker containers to ensure environment consistency, reproducibility, and isolation across different scientific codebases.
We pre-build and publish all required Docker images on Docker Hub for both supported question types:
Einstein Toolkit questions
Images contain a fully configured Einstein Toolkit environment, including the required thorns, build system, and test dependencies.
Multi-SWE-bench questions
Images correspond to specific repositories and pull requests (e.g., PySCF, AMReX), with the codebase, dependencies, and test harness pre-installed.
Questions/Answers Preparation
Prepare the questions and answers in the required format. Each answer filename should match the corresponding "question_id". For Einstein Toolkit type questions, the answer should be a single file of C/C++/Fortran code. For Multi-SWE-bench type questions, the answer should be a patch file.
For reference, once you have downloaded the questions into the data/questions folder, you can extract the reference answers by running python scripts/extract_answers.py. The answers will be written to the data/answers folder.
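The filename convention above can be checked before running an evaluation. The sketch below is not part of the repository; it assumes that each non-empty line of the questions file is a JSON object with a "question_id" key, and that an answer matches when its filename, with or without an extension, equals that ID. The function name is hypothetical.

```python
import json
from pathlib import Path


def missing_answers(questions_file, answers_dir):
    """Return the question IDs that have no matching file in `answers_dir`.

    Assumption: every non-empty JSONL line is an object with a "question_id" key.
    """
    answers_dir = Path(answers_dir)
    answer_names = set()
    if answers_dir.is_dir():
        for path in answers_dir.iterdir():
            if path.is_file():
                # Accept both "ID" and "ID.<extension>" as a match.
                answer_names.add(path.name)
                answer_names.add(path.stem)

    missing = []
    with open(questions_file, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            question_id = json.loads(line).get("question_id")
            if question_id not in answer_names:
                missing.append(question_id)
    return missing
```

Calling `missing_answers("data/questions/msb_converted.jsonl", "data/answers")` would then list the questions that still need an answer file, provided the file layout matches the assumptions above.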
As an example, the reference answer to the question MSB_pyscf_pyscf_pr2373 is:
diff --git a/pyscf/tdscf/rhf.py b/pyscf/tdscf/rhf.py
index b1b680b69f..d2cb63e086 100644
--- a/pyscf/tdscf/rhf.py
+++ b/pyscf/tdscf/rhf.py
@@ -530,16 +530,20 @@ def _charge_center(mol):
     return numpy.einsum('z,zr->r', charges, coords)/charges.sum()
 def _contract_multipole(tdobj, ints, hermi=True, xy=None):
+    '''ints is the integral tensor of a spin-independent operator'''
     if xy is None: xy = tdobj.xy
+    nstates = len(xy)
+    pol_shape = ints.shape[:-2]
+    nao = ints.shape[-1]
+
+    if not tdobj.singlet:
+        return numpy.zeros((nstates,) + pol_shape)
+
     mo_coeff = tdobj._scf.mo_coeff
     mo_occ = tdobj._scf.mo_occ
     orbo = mo_coeff[:,mo_occ==2]
     orbv = mo_coeff[:,mo_occ==0]
-    nstates = len(xy)
-    pol_shape = ints.shape[:-2]
-    nao = ints.shape[-1]
-
     #Incompatible to old numpy version
     #ints = numpy.einsum('...pq,pi,qj->...ij', ints, orbo.conj(), orbv)
     ints = lib.einsum('xpq,pi,qj->xij', ints.reshape(-1,nao,nao), orbo.conj(), orbv)
You can use your favorite agent to work on questions and provide answers. We provide a minimal working agent in scripts/run_agent.py. You can run it by:
python scripts/run_agent.py \
  --question-file data/questions/msb_converted.jsonl \
  --api-key $OPENAI_API_KEY \
  --output-dir outputs/
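Because a Multi-SWE-bench answer is a patch file, it can be useful to sanity-check what an agent produced before evaluating it. The following sketch is not part of the repository: the function name is hypothetical, and it assumes a git-style unified diff with `diff --git a/... b/...` headers and file paths without spaces. It reports which files a patch touches and how many lines it adds and removes.

```python
import re

# Matches git-style file headers such as "diff --git a/pkg/mod.py b/pkg/mod.py".
DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$")


def summarize_patch(patch_text):
    """Map each file touched by a git-style unified diff to [added, removed] line counts."""
    summary = {}
    current = None
    in_hunk = False
    for line in patch_text.splitlines():
        header = DIFF_HEADER.match(line)
        if header:
            # A new file section starts; counting resumes at its first hunk.
            current = header.group(2)
            summary[current] = [0, 0]
            in_hunk = False
        elif line.startswith("@@"):
            in_hunk = current is not None
        elif in_hunk and line.startswith("+"):
            summary[current][0] += 1
        elif in_hunk and line.startswith("-"):
            summary[current][1] += 1
    return summary
```

An empty result for a non-empty answer suggests the agent did not emit a git-style diff, which is worth checking by hand before spending time on a Docker evaluation run.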
Run Evaluations
Evaluate the answers by running python evaluate_questions.py --questions --answers with your questions file and answers directory. For example, to evaluate the example questions in data/questions and the reference answers in data/answers, run:
# Evaluate Einstein Toolkit questions
python evaluate_questions.py \
  --questions data/questions/et_converted.jsonl \
  --answers data/answers \
  --output data/eval/et_eval.json
# Evaluate Multi-SWE-bench questions
python evaluate_questions.py \
  --questions data/questions/msb_converted.jsonl \
  --answers data/answers \
  --output data/eval/msb_eval.json
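If you run both evaluations regularly, the two commands can be driven from a small script. This is a minimal sketch rather than a tool shipped with the repository: the helper names are hypothetical, it uses only the --questions, --answers, and --output flags shown above, and it assumes it is launched from the repository root.

```python
import subprocess
import sys

# (questions file, output file) pairs copied from the example commands.
RUNS = [
    ("data/questions/et_converted.jsonl", "data/eval/et_eval.json"),
    ("data/questions/msb_converted.jsonl", "data/eval/msb_eval.json"),
]


def build_eval_command(questions, output, answers="data/answers"):
    """Build the argument list for one evaluate_questions.py invocation."""
    return [
        sys.executable, "evaluate_questions.py",
        "--questions", questions,
        "--answers", answers,
        "--output", output,
    ]


def run_all(runs=RUNS):
    """Run each evaluation in turn and return the exit code per questions file."""
    codes = {}
    for questions, output in runs:
        completed = subprocess.run(build_eval_command(questions, output), check=False)
        codes[questions] = completed.returncode
    return codes
```

Passing the arguments as a list avoids shell quoting issues, and `check=False` lets the second evaluation run even if the first exits with a non-zero code.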
Repository Structure
AInsteinBench/
├── evaluate_questions.py      # Unified evaluator
├── ainsteinbench/
│   ├── question.py            # Question class
│   └── utils/                 # Utility modules
├── curation/
│   ├── et/                    # ET data curation
│   │   ├── ET_evaluator.py    # Original ET evaluator
│   │   └── config_server.json # ET configuration
│   └── msb/                   # MSB data curation
│       ├── log_parser.py      # log parser
│       ├── AMReXCodes/        # AMReX questions
│       └── pyscf/             # PySCF questions
├── scripts/
├── data/
│   ├── questions/             # Unified format questions
│   ├── answers/               # Reference answers
│   ├── demo/                  # Demo questions (non-runnable)
│   └── raw/                   # Raw datasets
└── tests/
    ├── test_question.py       # Question class tests
    └── test_config.py         # Config tests
Data Curation
Question Format
Both question types share a common structure with various question "content", "environment", "answer",…