NousResearch/eqbench3
forked from EQ-bench/eqbench3
License: MIT
Stars: 3
Forks: 1
Open issues: 0
Created: 2025-08-11T06:51:26Z
Pushed: 2025-08-12T06:24:59Z
Default branch: main
Fork: yes
Parent repository: EQ-bench/eqbench3
Archived: no
README:
EQ-Bench 3
EQ-Bench 3 is a multi-turn emotional intelligence benchmark. It assesses active EQ skills, interpersonal skills, psychological insight and analytical depth. It challenges language models with role-play or analysis tasks that require empathy, depth of insight, social dexterity, and more. An auxiliary judge model (by default, Claude Sonnet 3.7) scores or pairwise-compares the outputs.
For full details on the benchmark, including methodology, criteria, bias analysis, repeatability experiments and more, click here.
Features
- Role-Play Scenarios: The tested model is placed in conversation-based scenarios (e.g., parenting, relationship conflict, workplace tension). It must articulate what it (and others) feel/think before delivering its final response.
- Analysis Scenarios: The model is asked to read or interpret a transcript and perform an in-depth analysis of human dynamics and subtext.
- Rubric Scoring: A judge LLM assigns a multi-criteria rubric score (0–20, scaled up to 0–100) to each scenario outcome, focusing on empathy, emotional reasoning, social skill, etc.
- Pairwise ELO Analysis: The judge compares transcripts from two different models for the same scenario, awarding a “win” margin for each of several criteria. A TrueSkill/ELO-like solver aggregates all pairwise comparisons to produce a final ranking.
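As a rough illustration of the rubric scaling mentioned above, the sketch below averages per-criterion scores on the 0–20 scale and multiplies by 5 to reach 0–100. The criterion names and the equal-weight mean are assumptions made for this example; the README states only the two ranges, not how criteria are combined.

```python
from statistics import mean


def scale_rubric_score(criterion_scores: dict[str, float]) -> float:
    """Average 0-20 criterion scores and rescale the result to 0-100.

    Assumption: every criterion carries equal weight. The benchmark's
    actual aggregation may differ.
    """
    for name, score in criterion_scores.items():
        if not 0 <= score <= 20:
            raise ValueError(f"{name} score {score} is outside 0-20")
    # 0-20 maps onto 0-100 with a constant factor of 5.
    return mean(criterion_scores.values()) * 5


# Hypothetical criterion names, chosen only for this example.
example = {"empathy": 16, "emotional_reasoning": 14, "social_skill": 18}
print(scale_rubric_score(example))
```

Validating the range before averaging keeps a malformed judge response from silently skewing the scaled score.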
Leaderboard
The full EQ-Bench leaderboard is viewable at: https://eqbench.com/
Table of Contents
1. [Project Overview](#project-overview)
2. [Installation & Requirements](#installation--requirements)
3. [Quickstart](#quickstart)
4. [Running the Benchmark](#running-the-benchmark)
   - [Command-Line Arguments](#command-line-arguments)
   - [Example Commands](#example-commands)
   - [Rubric vs ELO](#rubric-vs-elo)
   - [Local vs Canonical Results Files](#local-vs-canonical-results-files)
5. [Example Use Case: Compare a single model to a baseline with Elo](#5-example-use-case-compare-a-single-model-to-a-baseline-with-elo)
6. [Merging Local Results into the Canonical Leaderboard](#merging-local-results-into-the-canonical-leaderboard)
7. [Folder and File Structure](#folder-and-file-structure)
8. [Limitations and Notes](#limitations-and-notes)
9. [Contact and License](#contact-and-license)
10. [Citation](#citation)
---
1. Project Overview
EQ-Bench 3 aims to measure active emotional intelligence abilities in LLMs. Rather than knowledge-based or short-answer questions, tasks here are multi-turn dialogues or analysis questions that test empathy, social dexterity, and psychological insight. The evaluated model’s responses are then graded by a judge model:
- Rubric pass: The judge model issues a numerical score for each scenario.
- ELO pass: The judge model performs pairwise comparisons of transcripts from different models, resulting in an overall ELO ranking (via TrueSkill).
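To give a feel for how pairwise verdicts can turn into ratings, here is a minimal sketch of the classic Elo update. It is a simplified stand-in, not the TrueSkill-based solver the benchmark uses, and it ignores the per-criterion win margins. The function names, the K-factor of 32, and the starting rating of 1500 are all assumptions for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, outcome_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one comparison.

    outcome_a is 1.0 if the judge preferred A, 0.0 if it preferred B,
    and 0.5 for a tie.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta


# Hypothetical sequence of judge verdicts for model A against model B.
a, b = 1500.0, 1500.0
for outcome in (1.0, 1.0, 0.0):
    a, b = update_elo(a, b, outcome)
print(round(a, 1), round(b, 1))
```

A sequential update like this depends on the order of comparisons, which is one reason a batch solver over all matchups is a more natural fit for a leaderboard.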
Key Points
- Scenarios vary from relationship drama to conflict mediation, pushing the tested model to reason about others’ emotions.
- Analysis tasks require deeper reflection on a provided transcript or scenario.
- Judge model: By default, a Claude model (Sonnet 3.7) is used, but any LLM accessible via an OpenAI-compatible endpoint can serve as judge.
- Truncation: Pairwise judgments truncate outputs to level the playing field. Rubric judgments typically do not truncate (to preserve detail).
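The truncation step can be pictured with the sketch below, which cuts a response to a fixed character budget before it is shown to the judge. The README does not say how truncation is measured, so the character-based limit, the 4000-character default, and the function name are all assumptions.

```python
def truncate_for_pairwise(text: str, max_chars: int = 4000) -> str:
    """Cut a response to at most max_chars characters.

    Assumption: the limit is counted in characters. The real benchmark
    may count tokens or words instead.
    """
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # Back up to the last space so the final word is not split in half.
    boundary = cut.rfind(" ")
    return cut[:boundary] if boundary > 0 else cut


# Both transcripts get the same budget, so a longer answer gains no edge.
response_a = "short answer"
response_b = "a much longer answer " * 500
print(len(truncate_for_pairwise(response_a)), len(truncate_for_pairwise(response_b)))
```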
---
2. Installation & Requirements
1. Clone this repository:
   git clone https://github.com/EQ-bench/eqbench3.git
   cd eqbench3
2. (Optional) Create a virtual environment and activate it:
   python -m venv venv
   source venv/bin/activate
3. Install dependencies:
pip install -r requirements.txt
4. Configure your API keys in .env:
   cp .env.example .env
   # then edit .env with your API keys, e.g.:
   # TEST_API_KEY=sk-...
   # JUDGE_API_KEY=sk-...
- TEST_API_KEY & TEST_API_URL are used when calling the tested model.
- JUDGE_API_KEY & JUDGE_API_URL are used by the judge model.
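Before launching a run, it can help to confirm that the four settings named above are present. The sketch below is one way to do that, assuming the values from .env have already been loaded into the process environment. The helper name is hypothetical, and treating all four variables as mandatory is an assumption rather than something the README states.

```python
import os

# The four settings named in the README.
REQUIRED_VARS = ("TEST_API_KEY", "TEST_API_URL", "JUDGE_API_KEY", "JUDGE_API_URL")


def missing_settings(env=None) -> list[str]:
    """Return the names of required variables that are unset or empty.

    Assumption: all four variables are mandatory. The actual script may
    supply defaults for some of them.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All API settings are present.")
```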
---
3. Quickstart
1. Run a single iteration of EQ-Bench (Rubric Scoring only, no Elo):
   python eqbench3.py \
     --test-model openai/gpt-4.1-mini \
     --model-name gpt-4.1-mini-demo-run \
     --judge-model anthropic/claude-3.7-sonnet \
     --no-elo \
     --iterations 1
- Runs 1 iteration of every scenario, scoring them with the rubric.
- Data is recorded in eqbench3_runs.json, and results are displayed on the console.
2. Full benchmark run with Elo to place the model on the leaderboard:
   python eqbench3.py \
     --test-model openai/gpt-4.1-mini \
     --model-name my-gpt4-run \
     --judge-model anthropic/claude-3.7-sonnet
- After the roleplay scenarios are completed, it does a multi-stage pairwise pass (comparing the evaluated model to known models in local+leaderboard data).
- Matchup results and ELO ratings are stored in elo_results_eqbench3.json and displayed on the console.
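When scripting several runs, the two invocations above can be assembled programmatically. The sketch below builds the argument list using only the flags shown in this Quickstart; the helper name and its parameters are hypothetical and are not part of eqbench3.py.

```python
import shlex


def build_command(test_model: str, model_name: str, judge_model: str,
                  rubric_only: bool = False, iterations=None) -> list[str]:
    """Assemble an eqbench3.py invocation as an argument list."""
    cmd = [
        "python", "eqbench3.py",
        "--test-model", test_model,
        "--model-name", model_name,
        "--judge-model", judge_model,
    ]
    if rubric_only:
        # Skip the pairwise pass and score with the rubric only.
        cmd.append("--no-elo")
    if iterations is not None:
        cmd += ["--iterations", str(iterations)]
    return cmd


# Reproduces the rubric-only demo run from step 1.
cmd = build_command(
    "openai/gpt-4.1-mini",
    "gpt-4.1-mini-demo-run",
    "anthropic/claude-3.7-sonnet",
    rubric_only=True,
    iterations=1,
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` from the repository root would launch the run; keeping the arguments as a list avoids shell-quoting problems with model names.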
---
4. Running the Benchmark
You interact with the main script [eqbench3.py](./eqbench3.py). It orchestrates:
- Roleplay scenarios (multi-turn, plus self-debrief).
- Rubric scoring (judge LLM reads final transcripts and grades on several criteria).
- ELO analysis (judge LLM does pairwise comparisons, then solves rating with TrueSkill).
Command-Line Arguments
| Argument | Description |
|----------|-------------|
| --test-model (required) | API model identifier for the tested model (e.g. openai/gpt-4.1-mini). |
| --model-name | Logical name used for storing ELO data (defaults to --test-model if not supplied). It must be unique to avoid collisions in ELO data. |
| --judge-model | Model to be used as the judge for ELO and/or rubric scoring. If --no-elo and --no-rubric are both set, you can skip this. |
| --runs-file | Local runs data file (.json or .json.gz) storing scenario transcripts & statuses. Default: eqbench3_runs.json. |
| --elo-results-file | Local ELO data file for storing pairwise… |
Excerpt shown — open the source for the full document.
Notability
notability 1.0/10: Trivial fork with minimal traction