Fork · DeepInfra · published Apr 11, 2024

deepinfra/lm-evaluation-harness

forked from EleutherAI/lm-evaluation-harness

Description: A framework for few-shot evaluation of language models.

License: MIT

Stars: 0

Forks: 0

Open issues: 0

Created: 2024-04-11T16:50:08Z

Pushed: 2024-04-29T19:46:52Z

Default branch: main

Fork: yes

Parent repository: EleutherAI/lm-evaluation-harness

Archived: no

README:

Language Model Evaluation Harness

[DOI: 10.5281/zenodo.10256836](https://doi.org/10.5281/zenodo.10256836)

Announcement

A new v0.4.0 release of lm-evaluation-harness is available!

New updates and features include:

  • Internal refactoring
  • Config-based task creation and configuration
  • Easier import and sharing of externally-defined task config YAMLs
  • Support for Jinja2 prompt design, easy modification of prompts + prompt imports from Promptsource
  • More advanced configuration options, including output post-processing, answer extraction, multiple LM generations per document, configurable few-shot settings, and more
  • Speedups and new modeling libraries supported, including: faster data-parallel HF model usage, vLLM support, MPS support with HuggingFace, and more
  • Logging and usability changes
  • New tasks including CoT BIG-Bench-Hard, Belebele, user-defined task groupings, and more

Please see our updated documentation pages in docs/ for more details.

Development will continue on the main branch. We encourage you to tell us which features you would like and how the library could be improved, or to ask questions, either in issues or PRs on GitHub or in the EleutherAI Discord!

Overview

This project provides a unified framework to test generative language models on a large number of different evaluation tasks.

Features:

  • Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.
  • Support for models loaded via transformers (including quantization via AutoGPTQ), GPT-NeoX, and Megatron-DeepSpeed, with a flexible tokenization-agnostic interface.
  • Support for fast and memory-efficient inference with vLLM.
  • Support for commercial APIs including OpenAI and TextSynth.
  • Support for evaluation on adapters (e.g. LoRA) supported in HuggingFace's PEFT library.
  • Support for local models and benchmarks.
  • Evaluation with publicly available prompts ensures reproducibility and comparability between papers.
  • Easy support for custom prompts and evaluation metrics.

The Language Model Evaluation Harness is the backend for 🤗 Hugging Face's popular Open LLM Leaderboard, has been used in hundreds of papers, and is used internally by dozens of organizations including NVIDIA, Cohere, BigScience, BigCode, Nous Research, and Mosaic ML.

Install

To install the lm-eval package from the GitHub repository, run:

git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

We also provide a number of optional dependencies for extended functionality. A detailed table is available at the end of this document.

Basic Usage

Hugging Face transformers

To evaluate a model hosted on the Hugging Face Hub (e.g. GPT-J-6B) on hellaswag, you can use the following command (this assumes you are using a CUDA-compatible GPU):

lm_eval --model hf \
--model_args pretrained=EleutherAI/gpt-j-6B \
--tasks hellaswag \
--device cuda:0 \
--batch_size 8
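
If you launch evaluations from a script rather than typing the command by hand, the same invocation can be assembled programmatically. The following is a minimal sketch: `build_lm_eval_argv` is a hypothetical helper (not part of the harness), and it only uses the flags that appear in this README.

```python
import shlex


def build_lm_eval_argv(model, model_args, tasks, device=None, batch_size=None):
    """Assemble an lm_eval command line as a list of arguments.

    Hypothetical helper for illustration; it is not part of lm-eval and
    only knows about the flags shown in this README.
    """
    argv = ["lm_eval", "--model", model]
    if model_args:
        # --model_args is written as comma-separated key=value pairs.
        joined = ",".join(f"{key}={value}" for key, value in model_args.items())
        argv += ["--model_args", joined]
    # --tasks is a comma-separated list of task names.
    argv += ["--tasks", ",".join(tasks)]
    if device is not None:
        argv += ["--device", device]
    if batch_size is not None:
        argv += ["--batch_size", str(batch_size)]
    return argv


argv = build_lm_eval_argv(
    model="hf",
    model_args={"pretrained": "EleutherAI/gpt-j-6B"},
    tasks=["hellaswag"],
    device="cuda:0",
    batch_size=8,
)
# Print the command in a form that can be pasted into a shell.
print(shlex.join(argv))
```

The resulting list can be handed to `subprocess.run(argv, check=True)` on a machine where lm-eval is installed.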

Additional arguments can be provided to the model constructor using the --model_args flag. Most notably, this supports the common practice of using the revisions feature on the Hub to store partially trained checkpoints, or to specify the datatype for running a model:

lm_eval --model hf \
--model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float" \
--tasks lambada_openai,hellaswag \
--device cuda:0 \
--batch_size 8
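
To make the shape of the `--model_args` value concrete, here is a small sketch of how a comma-separated `key=value` string can be turned into keyword arguments. `parse_model_args` is a hypothetical function written for this example; how the harness itself parses the string is not shown in this README, so treat this as an illustration of the format only. Note that the shell strips the quotes around `"float"` before the program sees the value.

```python
def parse_model_args(arg_string):
    """Split 'k1=v1,k2=v2' into a dict of string values.

    Illustrative only: this is not the harness's own parser.
    """
    parsed = {}
    for pair in arg_string.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing comma
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        parsed[key] = value
    return parsed


args = parse_model_args(
    "pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float"
)
print(args)
```

Each key in the resulting dict corresponds to one argument passed on to the model constructor.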

Models loaded via either transformers.AutoModelForCausalLM (autoregressive, decoder-only GPT-style models) or transformers.AutoModelForSeq2SeqLM (encoder-decoder models such as T5) are supported.

Batch size selection can be automated by setting the --batch_size flag to auto. This automatically detects the largest batch size that will fit on your device. On tasks where the longest and shortest examples differ greatly in length, it can help to periodically recompute the largest batch size to gain a further speedup. To do this, append :N to the flag above to recompute the largest batch size N times. For example, to recompute the batch size 4 times, the command would be:

lm_eval --model hf \
--model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float" \
--tasks lambada_openai,hellaswag \
--device cuda:0 \
--batch_size auto:4
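
The three forms of the `--batch_size` value described above (a fixed integer, `auto`, and `auto:N`) can be told apart with a few lines of code. This is a sketch under stated assumptions: `parse_batch_size` and the keys of the dict it returns are hypothetical names chosen for this example, not the harness's internals.

```python
def parse_batch_size(spec):
    """Interpret a --batch_size value: an integer, 'auto', or 'auto:N'.

    Hypothetical helper for illustration; the returned keys are assumptions.
    """
    head, sep, tail = str(spec).partition(":")
    if head == "auto":
        # 'auto:N' asks for the largest batch size to be recomputed N times;
        # plain 'auto' leaves the recompute count unspecified (None).
        return {
            "auto": True,
            "size": None,
            "recompute_times": int(tail) if sep else None,
        }
    if sep:
        raise ValueError(f"':N' is only valid after 'auto', got {spec!r}")
    return {"auto": False, "size": int(head), "recompute_times": None}


print(parse_batch_size("auto:4"))
print(parse_batch_size("auto"))
print(parse_batch_size(8))
```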

The full list of supported arguments is provided [here](./docs/interface.md), and on the terminal by calling lm_eval -h. Alternatively, you can use lm-eval instead of lm_eval.

> [!Note]
> Just as you can provide a local path to transformers.AutoModel, you can also provide a local path to lm_eval via --model_args pretrained=/path/to/model

Multi-GPU Evaluation with Hugging Face accelerate

We support two main ways of using Hugging Face's accelerate 🚀 library for multi-GPU evaluation.

To perform *data-parallel evaluation* (where each GPU loads a separate full copy of the model), we leverage the accelerate launcher as follows:

accelerate launch -m lm_eval --model hf \
--tasks lambada_openai,arc_easy \
--batch_size 16

(or via accelerate launch --no-python lm_eval).
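
In data-parallel evaluation, each GPU holds a full copy of the model and, presumably, works on its own share of the evaluation documents. This README does not spell out how documents are divided, so the sketch below assumes a simple round-robin split; `shard_for_rank` is a hypothetical helper, not a function of the harness or of accelerate.

```python
def shard_for_rank(docs, rank, world_size):
    """Return the documents one worker would evaluate (round-robin split).

    Assumption for illustration: worker `rank` takes every
    `world_size`-th document, starting at index `rank`.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return docs[rank::world_size]


docs = [f"doc{i}" for i in range(10)]
world_size = 4
shards = [shard_for_rank(docs, rank, world_size) for rank in range(world_size)]

for rank, shard in enumerate(shards):
    print(rank, shard)
```

Every document lands in exactly one shard, and shard sizes differ by at most one, so each worker handles roughly 1/K of the documents when K GPUs are used.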

For cases where your model can fit on a single GPU, this allows you to evaluate on K GPUs K times faster than…
