What does this fork signal mean?

Nous Research created NousResearch/lm-evaluation-harness-pretraining, a fork of EleutherAI/lm-evaluation-harness. This fork signal points to upstream code the lab may be inspecting, patching, or building on. High-signal details: repo NousResearch/lm-evaluation-harness-pretraining · parent EleutherAI/lm-evaluation-harness · low-star fork, routine event. onlylabs links this event to 1 captured evidence page and 6 related fork signals.

Nous Research Fork: NousResearch/lm-evaluation-harness-pretraining

Captured source


GitHub: github.com/NousResearch/lm-evaluation-harness-pretraining

NousResearch/lm-evaluation-harness-pretraining repository metadata


Published Jan 8, 2026 · seen Jun 6 · captured Jun 11 · HTTP 200 · method: plain

NousResearch/lm-evaluation-harness-pretraining

Description: A framework for few-shot evaluation of language models.

License: MIT

Stars: 4

Forks: 1

Open issues: 0

Created: 2026-01-08T21:22:06Z

Pushed: 2026-01-09T18:39:26Z

Default branch: main

Fork: yes

Parent repository: EleutherAI/lm-evaluation-harness

Archived: no

README:

Language Model Evaluation Harness

DOI: [10.5281/zenodo.10256836](https://doi.org/10.5281/zenodo.10256836)

---

Latest News 📣

[2025/12] CLI refactored with subcommands (run, ls, validate) and YAML config file support via --config. See the [CLI Reference](./docs/interface.md) and [Configuration Guide](./docs/config_files.md).
[2025/12] Lighter install: Base package no longer includes transformers/torch. Install model backends separately: pip install lm_eval[hf], lm_eval[vllm], etc.
[2025/07] Added a `think_end_token` argument to the hf (token or string), vllm, and sglang (string) backends for stripping chain-of-thought (CoT) reasoning traces from models that produce them.
[2025/03] Added support for steering HF models!
[2025/02] Added SGLang support!
[2024/09] We are prototyping support for creating and evaluating tasks with text+image multimodal input and text output, and have just added the `hf-multimodal` and `vllm-vlm` model types and the `mmmu` task as a prototype feature. We welcome users to try out and stress-test this in-progress feature, and suggest they check out `lmms-eval`, a wonderful project originally forked from lm-evaluation-harness, for a broader range of multimodal tasks, models, and features.
[2024/07] [API model](docs/API_guide.md) support has been updated and refactored, adding batched and async requests and making it significantly easier to customize for your own purposes. To run Llama 405B, we recommend hosting the model with vLLM's OpenAI-compatible API and evaluating it with the `local-completions` model type.
[2024/07] New Open LLM Leaderboard tasks have been added! You can find them under the [leaderboard](lm_eval/tasks/leaderboard/README.md) task group.

---

Announcement

A new v0.4.0 release of lm-evaluation-harness is available!

New updates and features include:

New Open LLM Leaderboard tasks have been added! You can find them under the [leaderboard](lm_eval/tasks/leaderboard/README.md) task group.
Internal refactoring
Config-based task creation and configuration
Easier import and sharing of externally-defined task config YAMLs
Support for Jinja2 prompt design, easy modification of prompts + prompt imports from Promptsource
More advanced configuration options, including output post-processing, answer extraction, multiple LM generations per document, configurable few-shot settings, and more
Speedups and new modeling libraries supported, including: faster data-parallel HF model usage, vLLM support, MPS support with HuggingFace, and more
Logging and usability changes
New tasks including CoT BIG-Bench-Hard, Belebele, user-defined task groupings, and more

Please see our updated documentation pages in docs/ for more details.

Development will continue on the main branch. We encourage you to give us feedback on which features you want and how to improve the library, or to ask questions, either in issues or PRs on GitHub or in the EleutherAI Discord!

---

Overview

This project provides a unified framework to test generative language models on a large number of different evaluation tasks.

Features:

Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.
Support for models loaded via transformers (including quantization via GPTQModel and AutoGPTQ), GPT-NeoX, and Megatron-DeepSpeed, with a flexible tokenization-agnostic interface.
Support for fast and memory-efficient inference with vLLM.
Support for commercial APIs including OpenAI and TextSynth.
Support for evaluating adapters (e.g. LoRA) from Hugging Face's PEFT library.
Support for local models and benchmarks.
Evaluation with publicly available prompts ensures reproducibility and comparability between papers.
Easy support for custom prompts and evaluation metrics.

The Language Model Evaluation Harness is the backend for 🤗 Hugging Face's popular Open LLM Leaderboard, has been used in hundreds of papers, and is used internally by dozens of organizations including NVIDIA, Cohere, BigScience, BigCode, Nous Research, and Mosaic ML.

Install

To install the lm-eval package from the GitHub repository, run:

git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

Installing Model Backends

The base installation provides the core evaluation framework. Model backends must be installed separately using optional extras:

For HuggingFace transformers models:

pip install "lm_eval[hf]"

For vLLM inference:

pip install "lm_eval[vllm]"

For API-based models (OpenAI, Anthropic, etc.):

pip install "lm_eval[api]"

Multiple backends can be installed together:

pip install "lm_eval[hf,vllm,api]"

A detailed table of all optional extras is available at the end of this document.

Basic Usage

Documentation

| Guide | Description |
|-------|-------------|
| [CLI Reference](./docs/interface.md) | Command-line arguments and subcommands |
| [Configuration Guide](./docs/config_files.md) | YAML config file format and examples |
| [Python API](./docs/python-api.md) | Programmatic usage with `simple_evaluate()` |
|...

Excerpt only; the full README is available in the source repository.

Notability

Score: 2.0/10

Low-star fork, routine event.