fw-ai/alpaca_eval
forked from tatsu-lab/alpaca_eval
Description: An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
Language: Jupyter Notebook
License: Apache-2.0
Stars: 0
Forks: 0
Open issues: 0
Created: 2024-07-26T16:11:21Z
Pushed: 2024-07-26T17:34:43Z
Default branch: main
Fork: yes
Parent repository: tatsu-lab/alpaca_eval
Archived: no
README:
AlpacaEval : An Automatic Evaluator for Instruction-following Language Models
AlpacaEval 2.0 with length-controlled win-rates (paper) has a Spearman correlation of 0.98 with ChatBot Arena, while costing less than $10 of OpenAI credits to run and taking less than 3 minutes. Our goal is to have a benchmark for chat LLMs that is: fast (
---
Updates:
:tada: Length-controlled Win Rates are out and used by default! This increases the correlation with ChatBot Arena from 0.93 to 0.98, while significantly decreasing length gameability. The raw win rates are still shown on the website and the CLI. More details [here](#length-controlled-win-rates).
:tada: AlpacaEval 2.0 is out and used by default! We improved the auto-annotator (better and cheaper) and use GPT-4 preview as baseline. More details [here](#alpacaeval-20). For the old version, set your environment variable IS_ALPACA_EVAL_2=False.
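The correlation figures quoted above are rank correlations between two leaderboards. As a minimal sketch of how such a number can be computed, the snippet below implements Spearman correlation in plain Python. The scores are made up for illustration and are not taken from AlpacaEval or ChatBot Arena; the function names are hypothetical and not part of the `alpaca_eval` package.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up scores for five hypothetical models (illustration only).
benchmark_win_rates = [50.0, 35.1, 22.4, 15.8, 9.3]
arena_ratings = [1250, 1190, 1120, 1135, 1040]

rho = spearman(benchmark_win_rates, arena_ratings)
print(f"Spearman correlation: {rho:.2f}")  # prints "Spearman correlation: 0.90"
```

Only the ordering of the models matters here: the two lists disagree on a single adjacent pair, which is what pulls the value below 1.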
---
Table of Contents
1. [Overview](#overview)
2. [Quick Start](#quick-start)
3. [Leaderboards and how to interpret them](#leaderboards-and-how-to-interpret-them)
    - [Models](#models)
    - [Evaluators](#evaluators)
4. [Use-cases](#use-cases)
    - [Evaluating a model](#evaluating-a-model)
    - [Making a new leaderboard](#making-a-new-leaderboard)
    - [Making a new evaluator](#making-a-new-evaluator)
5. [Contributing](#contributing)
    - [Contributing a model](#contributing-a-model)
    - [Contributing an evaluator](#contributing-an-evaluator)
    - [Contributing an eval set](#contributing-an-eval-set)
    - [Contributing a completion function](#contributing-a-completion-function)
6. [Limitations](#limitations)
7. [Analysis](#additional-analysis-and-plots)
    - [Analyzing an evaluator](#analyzing-an-evaluator)
    - [Analyzing an eval set](#analyzing-an-eval-set)
8. [Citation](#citation)
9. [Additional information](#additional-information)
    - [Length-controlled win rates](#length-controlled-win-rates)
    - [AlpacaEval 2.0](#alpacaeval-20)
    - [Data Release](#data-release)
    - [Differences with AlpacaFarm](#differences-with-alpacafarm)
    - [Related work](#related-work)
    - [Interpreting annotations](#interpreting-annotations)
    - [Major updates](#major-updates)
Overview
Evaluation of instruction-following models (e.g., ChatGPT) typically requires human interactions. This is time-consuming, expensive, and hard to replicate. AlpacaEval is an LLM-based automatic evaluation that is fast, cheap, replicable, and validated against 20K human annotations. It is particularly useful for model development. Although we improved over prior automatic evaluation pipelines, there are still fundamental [limitations](#limitations) like the preference for longer outputs. AlpacaEval provides the following:
- **Leaderboard**: a leaderboard of common models on the AlpacaEval evaluation set. Caution: Automatic evaluators (e.g. GPT-4) may be biased towards models that generate longer outputs and/or that were fine-tuned on the model underlying the evaluator (e.g. GPT-4).
- [Automatic evaluator](#evaluators): an automatic evaluator that has high agreement with humans (validated on 20K annotations). We evaluate a model by measuring the fraction of times a powerful LLM (e.g. GPT-4) prefers the outputs from that model over outputs from a reference model. Our evaluators enable caching and output randomization by default.
- [Toolkit for building automatic evaluators](#analysis): a simple interface for building advanced automatic evaluators (e.g. with caching, batching, or multi-annotators) and analyzing them (quality, price, speed, statistical power, bias, variance, etc.).
- [Human evaluation data](#data-release): 20K human preferences between a given and reference model on the AlpacaFarm evaluation set. 2.5K of these are cross-annotations (4 humans annotating the same 650 examples).
- **AlpacaEval dataset**: a simplification of AlpacaFarm's evaluation set, where "instructions" and "inputs" are merged into one field, and reference outputs are longer. [Details here](#data-release).
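The evaluation described above (the fraction of comparisons in which a judge prefers the model's output over the reference's, with the presentation order randomized) can be sketched as follows. This is an illustrative sketch and not AlpacaEval's implementation: `toy_judge` is a hypothetical stand-in for an LLM annotator that simply prefers the longer text, and the encoding of a win as 1.0 and a loss as 0.0 is an assumption made for this example.

```python
import random


def compare_with_random_order(judge, instruction, model_output, reference_output, rng):
    """Show both outputs to `judge` in random order; return 1.0 if the model wins."""
    model_first = rng.random() < 0.5
    if model_first:
        first, second = model_output, reference_output
    else:
        first, second = reference_output, model_output
    # `judge` returns 0 if it prefers the first text shown, 1 otherwise.
    picked_first = judge(instruction, first, second) == 0
    return 1.0 if picked_first == model_first else 0.0


def win_rate(preferences):
    """Percentage of comparisons won by the model."""
    if not preferences:
        raise ValueError("need at least one comparison")
    return 100.0 * sum(preferences) / len(preferences)


def toy_judge(instruction, first, second):
    """Hypothetical stand-in for an LLM judge: prefers the longer text."""
    return 0 if len(first) >= len(second) else 1


# (instruction, model output, reference output); made-up examples.
examples = [
    ("Name a prime number.", "7 is a prime number.", "7"),
    ("Say hi.", "Hi!", "Hello there, nice to meet you!"),
    ("What is 2 + 2?", "2 + 2 equals 4.", "4"),
    ("Name a color.", "Blue is a color.", "Blue"),
]

rng = random.Random(0)
preferences = [
    compare_with_random_order(toy_judge, instr, model_out, ref_out, rng)
    for instr, model_out, ref_out in examples
]
print(f"{win_rate(preferences):.2f}%")  # prints "75.00%"
```

Randomizing which output is shown first guards against a judge that favors a fixed position. The toy judge also illustrates the length bias mentioned in the caution above: it rewards verbosity regardless of quality.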
When to use and not use AlpacaEval?
When to use AlpacaEval? Our automatic evaluator is a quick and cheap proxy for human evaluation of simple instruction-following tasks. It is useful if you have to run many evaluations quickly, e.g., during model development.
When not to use AlpacaEval? Like any other automatic evaluator, AlpacaEval should not replace human evaluation in high-stakes decision-making, e.g., to decide on model release. In particular, AlpacaEval is limited by the fact that (1) the instructions in the eval set might not be representative of advanced usage of LLMs; (2) automatic evaluators may have biases such as favoring style over factuality of the answer; and (3) AlpacaEval does not measure the risks that a model could cause. Details in [limitations](#limitations).
Quick Start
To install the stable release, run
pip install alpaca-eval
To install the nightly version, run
pip install git+https://github.com/tatsu-lab/alpaca_eval
Then you can use it as follows:
export OPENAI_API_KEY=<your_api_key> # for more complex configs, e.g. using Azure or switching clients see client_configs/README.md
alpaca_eval --model_outputs 'example/outputs.json'
This will print the leaderboard to the console, and save both the leaderboard and the annotations to the same directory as the model_outputs file. Important parameters are the following:
- model_outputs: A path to a JSON file for the outputs of the model to add to the leaderboard. Each dictionary should contain the keys instruction and output.
- annotators_config: This is the annotator to use. We recommend using weighted_alpaca_eval_gpt4_turbo (default for AlpacaEval 2.0), which…
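As a minimal sketch of preparing the model_outputs file described above, the snippet below checks that every record has the instruction and output keys and writes the records as a JSON list. The top-level list layout is inferred from the phrase "each dictionary", and the helper name `write_model_outputs` is hypothetical (it is not part of the `alpaca_eval` package). Any further fields the tool may accept are not shown because this excerpt does not list them.

```python
import json
import tempfile
from pathlib import Path

# The two keys this section says each dictionary should contain.
REQUIRED_KEYS = {"instruction", "output"}


def write_model_outputs(records, path):
    """Validate the records, then write them to `path` as a JSON list."""
    for i, record in enumerate(records):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"record {i} is missing keys: {sorted(missing)}")
    Path(path).write_text(json.dumps(records, indent=2), encoding="utf-8")


# Made-up records for illustration.
records = [
    {"instruction": "Name a prime number.", "output": "7 is a prime number."},
    {"instruction": "What is 2 + 2?", "output": "2 + 2 equals 4."},
]

# A temporary directory keeps the demo from touching real files.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "outputs.json"
    write_model_outputs(records, out_path)
    loaded = json.loads(out_path.read_text(encoding="utf-8"))

print(len(loaded), "records written")
```

The resulting file is the kind of path one would pass to `--model_outputs` in the command above.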
Notability: 2.0/10. Routine fork of evaluation repo.