What does this fork signal mean?

Nous Research forked EQ-bench/eqbench3 as NousResearch/eqbench3. The fork points to upstream code the lab may be inspecting, patching, or building on. High-signal details: repo NousResearch/eqbench3 · parent EQ-bench/eqbench3 · trivial fork with minimal traction. onlylabs links this event to 1 captured evidence page and 6 related fork signals.

Nous Research Fork: NousResearch/eqbench3

Captured source

GitHub/github.com/NousResearch/eqbench3

NousResearch/eqbench3 repository metadata

published Aug 11, 2025 · seen Jun 6 · captured Jun 11 · http 200 · method plain

NousResearch/eqbench3

License: MIT

Stars: 3

Forks: 1

Open issues: 0

Created: 2025-08-11T06:51:26Z

Pushed: 2025-08-12T06:24:59Z

Default branch: main

Fork: yes

Parent repository: EQ-bench/eqbench3

Archived: no

README:

EQ-Bench 3

EQ-Bench 3 is a multi-turn emotional intelligence benchmark. It assesses active EQ skills, interpersonal skills, psychological insight and analytical depth. It challenges language models with role-play or analysis tasks that require empathy, depth of insight, social dexterity, and more. An auxiliary judge model (by default, Claude Sonnet 3.7) scores or pairwise-compares the outputs.

For full details on the benchmark, including methodology, criteria, bias analysis, repeatability experiments and more, click here.

Features

Role-Play Scenarios: The tested model is placed in conversation-based scenarios (e.g., parenting, relationship conflict, workplace tension). It must articulate what it (and others) feel/think before delivering its final response.
Analysis Scenarios: The model is asked to read or interpret a transcript and perform an in-depth analysis of human dynamics and subtext.
Rubric Scoring: A judge LLM assigns a multi-criteria rubric score (0–20, scaled up to 0–100) to each scenario outcome, focusing on empathy, emotional reasoning, social skill, etc.
Pairwise ELO Analysis: The judge compares transcripts from two different models for the same scenario, awarding a “win” margin for each of several criteria. A TrueSkill/ELO-like solver aggregates all pairwise comparisons to produce a final ranking.
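The rubric scaling described above (0–20 rescaled to 0–100) can be sketched as follows. This is a minimal illustration, not the repository's scoring code: the function name `scale_rubric`, the criterion names, and the choice to average criteria before scaling are all assumptions.

```python
from statistics import mean

RUBRIC_MAX = 20  # raw rubric ceiling stated in the README


def scale_rubric(raw_scores):
    """Average raw 0-20 criterion scores and rescale the result to 0-100.

    Averaging across criteria is an assumption made for this sketch.
    """
    if not raw_scores:
        raise ValueError("need at least one criterion score")
    for score in raw_scores:
        if not 0 <= score <= RUBRIC_MAX:
            raise ValueError(f"score {score} is outside 0-{RUBRIC_MAX}")
    return mean(raw_scores) * 100 / RUBRIC_MAX


# Hypothetical criterion scores for one scenario.
example = {"empathy": 16, "emotional_reasoning": 14, "social_skill": 18}
scaled = scale_rubric(list(example.values()))
print(scaled)  # 80.0
```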

Leaderboard

The full EQ-Bench leaderboard is viewable at: https://eqbench.com/

1. [Project Overview](#project-overview)
2. [Installation & Requirements](#installation--requirements)
3. [Quickstart](#quickstart)
4. [Running the Benchmark](#running-the-benchmark)
   - [Command-Line Arguments](#command-line-arguments)
   - [Example Commands](#example-commands)
   - [Rubric vs ELO](#rubric-vs-elo)
   - [Local vs Canonical Results Files](#local-vs-canonical-results-files)
5. [Example Use Case: Compare a single model to a baseline with Elo](#5-example-use-case-compare-a-single-model-to-a-baseline-with-elo)
6. [Merging Local Results into the Canonical Leaderboard](#merging-local-results-into-the-canonical-leaderboard)
7. [Folder and File Structure](#folder-and-file-structure)
8. [Limitations and Notes](#limitations-and-notes)
9. [Contact and License](#contact-and-license)
10. [Citation](#citation)

---

1. Project Overview

EQ-Bench 3 aims to measure active emotional intelligence abilities in LLMs. Rather than knowledge-based or short-answer questions, tasks here are multi-turn dialogues or analysis questions that test empathy, social dexterity, and psychological insight. The evaluated model’s responses are then graded by a judge model:

Rubric pass: The judge model issues a numerical score for each scenario.
ELO pass: The judge model performs pairwise comparisons of transcripts from different models, resulting in an overall ELO ranking (via TrueSkill).
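To make the idea of turning pairwise outcomes into ratings concrete, here is a sketch using the classic Elo update rule. Note that this is a swapped-in simplification: the README says the benchmark solves ratings with TrueSkill, which this code does not implement. The model names, starting rating of 1500, and K-factor of 32 are illustrative assumptions.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0):
    """Apply one pairwise result in place; k is an assumed K-factor."""
    surprise = 1.0 - expected_score(ratings[winner], ratings[loser])
    delta = k * surprise
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical models, both starting from the same rating.
ratings = {"model-a": 1500.0, "model-b": 1500.0}
update_elo(ratings, "model-a", "model-b")
print(ratings)  # {'model-a': 1516.0, 'model-b': 1484.0}
```

Each update moves the same number of points from loser to winner, so the total rating mass stays constant. An upset (a low-rated model beating a high-rated one) moves more points than an expected win.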

Key Points

Scenarios vary from relationship drama to conflict mediation, pushing the tested model to reason about others’ emotions.
Analysis tasks require deeper reflection on a provided transcript or scenario.
Judge model: By default, a Claude model (Sonnet 3.7) is used, but any LLM accessible via an OpenAI-compatible endpoint can serve as judge.
Truncation: Pairwise judgments truncate outputs to level the playing field. Rubric judgments typically do not truncate (to preserve detail).
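The truncation step for pairwise judging might look like the sketch below. The README does not say how or where outputs are cut, so the character-based limit, the word-boundary behavior, and the `truncate` helper itself are assumptions for illustration only.

```python
def truncate(text: str, max_chars: int) -> str:
    """Cap text at max_chars, backing up to the last space when possible."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # Avoid ending mid-word: keep everything before the last space, if any.
    head, sep, _ = cut.rpartition(" ")
    return head if sep else cut


# Both transcripts get the same budget before the judge compares them.
budget = 12  # hypothetical limit, far smaller than a real one would be
print(truncate("alpha beta gamma", budget))  # alpha beta
print(truncate("short", budget))             # short
```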

---

2. Installation & Requirements

1. Clone this repository:

git clone https://github.com/EQ-bench/eqbench3.git
cd eqbench3

2. (Optional) Create a virtual environment and activate it:

python -m venv venv
source venv/bin/activate

3. Install dependencies:

pip install -r requirements.txt

4. Configure your API keys in .env:

cp .env.example .env
# then edit .env with your API keys, e.g.:
# TEST_API_KEY=sk-...
# JUDGE_API_KEY=sk-...

TEST_API_KEY & TEST_API_URL are used when calling the tested model.
JUDGE_API_KEY & JUDGE_API_URL are used by the judge model.
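Before a run, it can help to confirm that the four variables named above are set. The check below is a sketch, not part of the repository: it assumes the variables are already exported into the process environment (it does not parse `.env` itself), and treating all four as required is an assumption.

```python
import os

# Variable names come from the README; requiring all four is an assumption.
REQUIRED = ("TEST_API_KEY", "TEST_API_URL", "JUDGE_API_KEY", "JUDGE_API_URL")


def missing_settings(env=None):
    """Return the names of required settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


missing = missing_settings()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All API settings are present.")
```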

---

3. Quickstart

1. Run a single iteration of EQ-Bench (Rubric Scoring only, no Elo):

python eqbench3.py \
--test-model openai/gpt-4.1-mini \
--model-name gpt-4.1-mini-demo-run \
--judge-model anthropic/claude-3.7-sonnet \
--no-elo \
--iterations 1

Runs 1 iteration of every scenario, scoring them with the rubric.
Data is recorded in eqbench3_runs.json, and results are displayed on the console.

2. Full benchmark run with Elo to place the model on the leaderboard:

python eqbench3.py \
--test-model openai/gpt-4.1-mini \
--model-name my-gpt4-run \
--judge-model anthropic/claude-3.7-sonnet

After the roleplay scenarios are completed, it does a multi-stage pairwise pass (comparing the evaluated model to known models in local+leaderboard data).
Matchup results and the ELO rating are stored in elo_results_eqbench3.json and displayed on the console.
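Both output files named in this section are JSON, and the runs file may also be gzip-compressed (.json.gz). A small loader for inspecting them could look like this. The helper `load_results` is hypothetical, and the sketch deliberately makes no assumption about the schema inside the files.

```python
import gzip
import json
from pathlib import Path


def load_results(path):
    """Load a results file, transparently handling .json and .json.gz."""
    p = Path(path)
    opener = gzip.open if p.suffix == ".gz" else open
    with opener(p, "rt", encoding="utf-8") as fh:
        return json.load(fh)
```

For example, `load_results("eqbench3_runs.json")` returns the parsed contents after a run has produced that file; printing the type or the top-level keys of the result is a reasonable first look at its structure.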

---

4. Running the Benchmark

You interact with the main script [eqbench3.py](./eqbench3.py). It orchestrates:

Roleplay scenarios (multi-turn, plus self-debrief).
Rubric scoring (judge LLM reads final transcripts and grades on several criteria).
ELO analysis (judge LLM does pairwise comparisons, then solves rating with TrueSkill).

Command-Line Arguments

| Argument | Description |
|---|---|
| --test-model (required) | API model identifier for the tested model (e.g. openai/gpt-4.1-mini). |
| --model-name | Logical name used for storing ELO data (defaults to --test-model if not supplied). It must be unique to avoid collisions in ELO data. |
| --judge-model | Model to be used as the judge for ELO and/or rubric scoring. If --no-elo and --no-rubric are both set, you can skip this. |
| --runs-file | Local runs data file (.json or .json.gz) storing scenario transcripts & statuses. Default: eqbench3_runs.json. |
| --elo-results-file | Local ELO data file for storing pairwise...

Excerpt shown — open the source for the full document.

Notability

notability 1.0/10

Trivial fork with minimal traction