Repo: ByteDance (Doubao/Seed), published Dec 23, 2025

ByteDance-Seed/AInsteinBench

Python

Language: Python

License: Apache-2.0

Stars: 9

Forks: 0

Open issues: 0

Created: 2025-12-23T03:40:59Z

Pushed: 2026-01-07T06:29:34Z

Default branch: public

Fork: no

Archived: no

README:

AInsteinBench

AInsteinBench is a benchmark for evaluating how well AI agents solve scientific computing (a.k.a. scico) problems. It currently supports coding questions in two formats: Einstein Toolkit and Multi-SWE-bench.

Prerequisites

  • Python 3.8+
  • Docker
  • Required packages: pip install -r requirements.txt

Optional (for multi-swe-bench type questions)

# install multi-swe-bench
git clone https://github.com/multi-swe-bench/multi-swe-bench.git && cd multi-swe-bench && make install
# the docker images for the questions are available upon request

Optional (for Einstein Toolkit type questions)

cd curation/et
# install Einstein Toolkit following https://einsteintoolkit.org/download.html
curl -kLO https://raw.githubusercontent.com/gridaphobe/CRL/ET_2025_05/GetComponents
chmod a+x GetComponents
./GetComponents --shallow https://bitbucket.org/einsteintoolkit/manifest/raw/ET_2025_05/einsteintoolkit.th
# The curation/et/Cactus folder should now be populated with the Einstein Toolkit code and tests.
# The ADMConstraints thorn is outdated but still used in the Docker image; get it with:
git clone https://bitbucket.org/einsteintoolkit/einsteinanalysis.git
cp -r einsteinanalysis/ADMConstraints .
# prepare the docker environment
docker pull rynge/einsteintoolkit

Quick Start

To get started quickly, prepare the Docker environments and evaluate the Einstein Toolkit questions with:

python scripts/extract_dockerhub_images.py
bash pull_images.sh --sample
python evaluate_questions.py --questions data/questions/et_converted.jsonl --answers data/answers

Results are saved to evaluation_results.json.
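The layout of evaluation_results.json is not described in this README, so the following sketch makes no assumption about its schema. It only reports the top-level type, its size, and (for objects) the first few keys. The helper names summarize_json and summarize_file are hypothetical and are not part of the repository.

```python
import json


def summarize_json(obj, max_keys=5):
    """Describe the top-level shape of parsed JSON without assuming a schema."""
    if isinstance(obj, dict):
        keys = list(obj)[:max_keys]
        return f"object with {len(obj)} keys, first keys: {keys}"
    if isinstance(obj, list):
        return f"array with {len(obj)} items"
    return f"scalar of type {type(obj).__name__}"


def summarize_file(path="evaluation_results.json"):
    """Load a results file (default path taken from the Quick Start) and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return summarize_json(json.load(handle))
```

Calling summarize_file() from the repository root after the Quick Start run gives a first look at the file before writing any schema-specific analysis.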

Evaluation

Docker Setup

Before running an evaluation, pull the required images once:

# this may take a few minutes or hours depending on network, make sure you have 50GB+ free storage
bash pull_images.sh --all

To pull only the images required by the questions you want to test in the data/questions folder:

# extract the image names, this will be written to `sample_images.txt`
python scripts/extract_dockerhub_images.py
# pull only the sample images
bash pull_images.sh --sample

You can also specify the Docker Hub username by adding --username to the command.
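After pulling, it can be useful to confirm that every listed image is actually present locally. The sketch below assumes that sample_images.txt holds one image reference per line (the README does not specify the file format) and relies on the standard `docker image inspect` command, which exits non-zero for an unknown image. The function names are hypothetical helpers, not part of the repository.

```python
import subprocess


def parse_image_list(text):
    """Return image references, skipping blank lines and '#' comments.

    Assumes one reference per line; the real file format may differ.
    """
    names = []
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            names.append(line)
    return names


def image_is_local(name):
    """True if `docker image inspect` succeeds for this reference."""
    result = subprocess.run(
        ["docker", "image", "inspect", name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def missing_images(names, is_local=image_is_local):
    """Return the references for which the local check fails."""
    return [name for name in names if not is_local(name)]
```

A typical use would be `missing_images(parse_image_list(open("sample_images.txt").read()))`; an empty list means all sampled images are available. The is_local parameter exists only so the check can be replaced in tests or on machines without Docker.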

Prepared Docker Images

All evaluations in AInsteinBench are executed inside pre-built Docker containers to ensure environment consistency, reproducibility, and isolation across different scientific codebases.

We pre-build and publish all required Docker images on Docker Hub for both supported question types:

Einstein Toolkit questions

Images contain a fully configured Einstein Toolkit environment, including the required thorns, build system, and test dependencies.

Multi-SWE-bench questions

Images correspond to specific repositories and pull requests (e.g., PySCF, AMReX), with the codebase, dependencies, and test harness pre-installed.

Questions/Answers Preparation

Prepare the questions and answers in the required format. The answer filename should match the corresponding "question_id". For Einstein Toolkit type questions, the answer should be a single file of C/C++/Fortran code. For Multi-SWE-bench type questions, the answer should be a patch file.

For reference, once you have downloaded the questions into the data/questions folder, you can extract the reference answers by running python scripts/extract_answers.py. The answers will be written to the data/answers folder.
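Because answers are looked up by "question_id", a quick pre-flight check can catch missing files before an evaluation run. The sketch below assumes each line of the questions file is a JSON object with a "question_id" key and that an answer matches when its filename, with or without extension, equals that id. Both assumptions go beyond what the README states, and the helper names are hypothetical.

```python
import json
from pathlib import Path


def load_question_ids(jsonl_path):
    """Read one JSON object per non-empty line and collect its question_id."""
    ids = []
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                ids.append(json.loads(line)["question_id"])
    return ids


def find_missing_answers(question_ids, answers_dir):
    """Return ids with no answer file named after them (extension ignored)."""
    present = set()
    for path in Path(answers_dir).iterdir():
        if path.is_file():
            present.add(path.name)  # exact filename match
            present.add(path.stem)  # match without the final extension
    return [qid for qid in question_ids if qid not in present]
```

For example, `find_missing_answers(load_question_ids("data/questions/msb_converted.jsonl"), "data/answers")` lists the questions that still need an answer file.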

As an example, the reference answer to the question MSB_pyscf_pyscf_pr2373 is:

diff --git a/pyscf/tdscf/rhf.py b/pyscf/tdscf/rhf.py
index b1b680b69f..d2cb63e086 100644
--- a/pyscf/tdscf/rhf.py
+++ b/pyscf/tdscf/rhf.py
@@ -530,16 +530,20 @@ def _charge_center(mol):
     return numpy.einsum('z,zr->r', charges, coords)/charges.sum()
 
 def _contract_multipole(tdobj, ints, hermi=True, xy=None):
+    '''ints is the integral tensor of a spin-independent operator'''
     if xy is None: xy = tdobj.xy
+    nstates = len(xy)
+    pol_shape = ints.shape[:-2]
+    nao = ints.shape[-1]
+
+    if not tdobj.singlet:
+        return numpy.zeros((nstates,) + pol_shape)
+
     mo_coeff = tdobj._scf.mo_coeff
     mo_occ = tdobj._scf.mo_occ
     orbo = mo_coeff[:,mo_occ==2]
     orbv = mo_coeff[:,mo_occ==0]
 
-    nstates = len(xy)
-    pol_shape = ints.shape[:-2]
-    nao = ints.shape[-1]
-
     #Incompatible to old numpy version
     #ints = numpy.einsum('...pq,pi,qj->...ij', ints, orbo.conj(), orbv)
     ints = lib.einsum('xpq,pi,qj->xij', ints.reshape(-1,nao,nao), orbo.conj(), orbv)
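
Agent-generated patches are often malformed in subtle ways, so a cheap structural check before evaluation can save a container run. The sketch below is not part of AInsteinBench; it is a hypothetical helper that verifies, for each hunk in a unified diff like the one above, that the line counts declared in the `@@ -start,count +start,count @@` header agree with the hunk body. Lines with no prefix are treated as context, which tolerates blank context lines whose leading space was stripped.

```python
import re

# Captures the optional old and new line counts of a unified-diff hunk header.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def check_hunks(patch_text):
    """Return (header, counts_match) for every hunk in a unified diff."""
    results = []
    lines = patch_text.splitlines()
    i = 0
    while i < len(lines):
        match = HUNK_HEADER.match(lines[i])
        if not match:
            i += 1
            continue
        header = lines[i]
        # A count omitted from the header means 1.
        old_left = int(match.group(1)) if match.group(1) is not None else 1
        new_left = int(match.group(2)) if match.group(2) is not None else 1
        i += 1
        while i < len(lines) and (old_left > 0 or new_left > 0):
            line = lines[i]
            if HUNK_HEADER.match(line) or line.startswith("diff --git"):
                break  # the hunk ended before the declared counts were reached
            if line.startswith("+"):
                new_left -= 1
            elif line.startswith("-"):
                old_left -= 1
            elif not line.startswith("\\"):  # skip "\ No newline at end of file"
                old_left -= 1  # context lines count on both sides
                new_left -= 1
            i += 1
        results.append((header, old_left == 0 and new_left == 0))
    return results
```

A patch passes this check when every returned pair has True as its second element. The check says nothing about whether the patch applies to the target repository; it only detects truncated or miscounted hunks.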

You can use your favorite agent to work on questions and provide answers. We provide a minimal working agent in scripts/run_agent.py. You can run it by:

python scripts/run_agent.py \
--question-file data/questions/msb_converted.jsonl \
--api-key $OPENAI_API_KEY \
--output-dir outputs/

Run Evaluations

Evaluate answers by running python evaluate_questions.py, passing your questions file to --questions and your answers directory to --answers. For example, to evaluate the example questions in data/questions against the reference answers in data/answers, run:

# Evaluate Einstein Toolkit questions
python evaluate_questions.py \
--questions data/questions/et_converted.jsonl \
--answers data/answers \
--output data/eval/et_eval.json

# Evaluate Multi-SWE-bench questions
python evaluate_questions.py \
--questions data/questions/msb_converted.jsonl \
--answers data/answers \
--output data/eval/msb_eval.json
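
The two commands above differ only in the questions file and output path, so they can be driven from a small script. This is a minimal sketch that uses only the flags shown in this README (--questions, --answers, --output); the function names and the runner parameter are hypothetical conveniences, and the script is assumed to be run from the repository root.

```python
import subprocess
import sys

# (questions file, output file) pairs taken from the examples above.
RUNS = [
    ("data/questions/et_converted.jsonl", "data/eval/et_eval.json"),
    ("data/questions/msb_converted.jsonl", "data/eval/msb_eval.json"),
]


def build_eval_command(questions, answers, output):
    """Build the argument list for one evaluate_questions.py invocation."""
    return [
        sys.executable, "evaluate_questions.py",
        "--questions", questions,
        "--answers", answers,
        "--output", output,
    ]


def run_all(answers="data/answers", runner=subprocess.run):
    """Run every configured evaluation, stopping at the first failure."""
    for questions, output in RUNS:
        runner(build_eval_command(questions, answers, output), check=True)
```

Passing check=True makes a failing evaluation raise subprocess.CalledProcessError instead of silently continuing to the next question set.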

Repository Structure

AInsteinBench/
├── evaluate_questions.py        # Unified evaluator
├── ainsteinbench/
│   ├── question.py              # Question class
│   └── utils/                   # Utility modules
├── curation/
│   ├── et/                      # ET data curation
│   │   ├── ET_evaluator.py      # Original ET evaluator
│   │   └── config_server.json   # ET configuration
│   └── msb/                     # MSB data curation
│       ├── log_parser.py        # Log parser
│       ├── AMReXCodes/          # AMReX questions
│       └── pyscf/               # PySCF questions
├── scripts/
├── data/
│   ├── questions/               # Unified format questions
│   ├── answers/                 # Reference answers
│   ├── demo/                    # Demo questions (non-runnable)
│   └── raw/                     # Raw datasets
└── tests/
    ├── test_question.py         # Question class tests
    └── test_config.py           # Config tests

Data Curation

Question Format

Both question types share a common structure with various question "content", "environment", "answer",…

Notability

notability 1.0/10

New repo, very low traction