reka-ai/research-eval
Python
Captured source
source ↗reka-ai/research-eval
Description: A benchmark to evaluate search-augmented LLMs
Language: Python
License: NOASSERTION
Stars: 17
Forks: 2
Open issues: 1
Created: 2025-08-27T19:52:31Z
Pushed: 2025-08-28T22:10:54Z
Default branch: main
Fork: no
Archived: no
README:
Research-Eval
A benchmark to evaluate search-augmented LLMs.
In addition to the dataset, the repository provides a simple evaluation script that supports most frontier models out of the box (including Reka, Google, OpenAI, Perplexity, Anthropic, XAI, and Mistral).
Check out our blogpost for more details.
Setup
Install dependencies as follows:
pip install -r requirements.txt
API keys are read from environment variables, set them as needed:
export REKA_API_KEY='YOUR_API_KEY' export FIREWORKS_API_KEY='YOUR_API_KEY' export GOOGLE_API_KEY='YOUR_API_KEY' export OPENAI_API_KEY='YOUR_API_KEY' export PERPLEXITY_API_KEY='YOUR_API_KEY' export ANTHROPIC_API_KEY='YOUR_API_KEY' export XAI_API_KEY='YOUR_API_KEY' export MISTRAL_API_KEY='YOUR_API_KEY'
We recommend setting all relevant environment variables in .env, which the evaluation code loads automatically.
Not all API keys are required, only set the relevant ones for the models you want to evaluate. As the only exception, FIREWORKS_API_KEY is always required for scoring, as it is used to run the recommended deepseek-v3-0324 judge.
Usage
1. Generate responses
python research-eval.py generate \ --api reka \ --model reka-flash-research \ --concurrency 20 \ -o experiments/reka-flash-research.out.jsonl
2. Judge responses
python research-eval.py judge \ -i experiments/reka-flash-research.out.jsonl \ -o experiments/reka-flash-research.eval.jsonl
This uses the deepseek-v3-0324 model through Fireworks by default, which is our recommended judge. While other judge models can be used, we strongly discourage publishing results obtained with them, as this would make numbers non-comparable.
3. Report results
python research-eval.py report \ -i experiments/reka-flash-research.eval.jsonl
Leaderboard
| Model | Score | Cost | |-------------------------------------------|-------|--------| | Reka Research | 59.1 | 25.00 | | Gemini 2.0 Flash (w/ Google Search) | 54.2 | 35.05 | | GPT-4o search preview | 53.0 | 37.14 | | Claude Sonnet 4 (w/ web search) | 44.8 | 162.01 | | Sonar Reasoning Pro | 44.6 | 14.75 | | Mistral Medium 2505 (w/ web search agent) | 31.5 | 32.51 | | Grok 4 (w/ live search) | 26.7 | 85.88 |
We report average accuracy across 5 runs. The cost refers to USD per 1,000 requests in the Research-Eval distribution.
Reproducing results
To reproduce the results in the leaderboard, run the following script:
./reproduce-results.sh
License
The content of this repository is licensed under a [Modified MIT License](LICENSE).
The dataset is distributed in encrypted form to prevent contamination, and our license prohibits redistributing it in decrypted or plain-text form.
Notability
notability 3.0/10Low stars, minor repo.