
friendliai/lm-evaluation-harness

forked from EleutherAI/lm-evaluation-harness


Description: A framework for few-shot evaluation of autoregressive language models.

Language: Python

License: MIT

Stars: 1

Forks: 0

Open issues: 0

Created: 2023-08-18T07:03:21Z

Pushed: 2025-01-02T01:29:12Z

Default branch: periflow-eval-master

Fork: yes

Parent repository: EleutherAI/lm-evaluation-harness

Archived: no

README:

PeriFlow evaluation harness

Overview

  • Only a subset of the lm-evaluation-harness tasks is supported. See [periflow-supported-task-table](periflow_supported_task_table.md) for the list.

Install

To install the lm-eval refactor branch from the GitHub repository, run:

git clone https://github.com/friendliai/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .

To install additional multilingual tokenization and text segmentation packages, you must install the package with the multilingual extra:

pip install -e ".[multilingual]"

To install the package with all extras, run:

pip install -e ".[all]"

Evaluation Command

1. Sequential Request Evaluation

python main.py \
--model periflow \
--model_args model_name_or_path={model name or path from Hugging Face Hub},req_url={engine request URL} \
--tasks {evaluation task} \
--num_fewshot {number of few-shot samples} \
--no_cache \
--average_acc_tasks

Options:

  • --num_fewshot: optional. Defaults to 0 when omitted.
  • --no_cache: optional. Without this option, evaluation is skipped when a result already exists and the cached result is returned.
  • --average_acc_tasks: for MMLU tasks only. In lm-evaluation-harness, the MMLU dataset is split into many separate datasets. With this option, the average acc across all of those datasets is added to the result table.
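The averaging described for --average_acc_tasks can be illustrated with a minimal Python sketch. It assumes an unweighted mean over the per-dataset acc values, which the README does not confirm. The task names, the `mmlu_` prefix, the result layout, and the numbers are hypothetical and made up for illustration; they are not output from this harness.

```python
from statistics import mean

# Hypothetical per-task results. Names and values are made up for
# illustration only; they are not produced by the harness.
results = {
    "mmlu_subtask_a": {"acc": 0.50},
    "mmlu_subtask_b": {"acc": 0.25},
    "mmlu_subtask_c": {"acc": 0.75},
    "other_task": {"acc_norm": 0.90},  # skipped: name does not match the prefix
}


def average_acc(results, prefix):
    """Return the unweighted mean of "acc" over tasks whose name starts with prefix."""
    accs = [
        metrics["acc"]
        for name, metrics in results.items()
        if name.startswith(prefix) and "acc" in metrics
    ]
    if not accs:
        raise ValueError(f"no task with prefix {prefix!r} reports 'acc'")
    return mean(accs)


print(f"{average_acc(results, 'mmlu_'):.2%}")  # prints 50.00%
```

An unweighted mean treats every sub-dataset equally regardless of its size. If the harness weights by the number of examples instead, the reported average would differ from this sketch.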

2. Async Request Evaluation

python main.py \
--model periflow_async \
--model_args model_name_or_path={model name or path from Hugging Face Hub},req_url={engine request URL} \
--tasks {evaluation task} \
--num_fewshot {number of few-shot samples} \
--no_cache \
--average_acc_tasks

Options:

  • --num_fewshot: optional. Defaults to 0 when omitted.
  • --no_cache: optional. Without this option, evaluation is skipped when a result already exists and the cached result is returned.
  • --average_acc_tasks: for MMLU tasks only. In lm-evaluation-harness, the MMLU dataset is split into many separate datasets. With this option, the average acc across all of those datasets is added to the result table.
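The sequential and async commands differ only in the --model value, so a small helper can assemble either one. The sketch below is illustrative and not part of the repository: `build_eval_command` is a hypothetical name, the request URL is a placeholder, and it assumes --model_args takes comma-separated key=value pairs as in upstream lm-evaluation-harness. It only builds the argument list and does not run anything.

```python
import shlex


def build_eval_command(model, model_name_or_path, req_url, tasks,
                       num_fewshot=0, no_cache=False, average_acc_tasks=False):
    """Build the argv list for main.py (hypothetical helper, not in the repo)."""
    if model not in ("periflow", "periflow_async"):
        raise ValueError(f"unknown model type: {model!r}")
    # Assumption: model_args is a comma-separated list of key=value pairs.
    model_args = f"model_name_or_path={model_name_or_path},req_url={req_url}"
    cmd = [
        "python", "main.py",
        "--model", model,
        "--model_args", model_args,
        "--tasks", tasks,
        "--num_fewshot", str(num_fewshot),
    ]
    if no_cache:
        cmd.append("--no_cache")
    if average_acc_tasks:
        cmd.append("--average_acc_tasks")
    return cmd


cmd = build_eval_command(
    "periflow_async",
    "meta-llama/Llama-2-7b-chat-hf",
    "http://localhost:8000",  # placeholder engine request URL
    "hellaswag",
    num_fewshot=10,
)
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

Passing the returned list to `subprocess.run(cmd, check=True)` from the repository root would launch the evaluation. Building a list rather than a single string avoids shell-quoting problems in the model path or URL.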

Evaluation Result

1. Sequential Request Evaluation Result

|Model|ARC acc_norm|Hellaswag acc_norm|MMLU average acc of all|TruthfulQA mc2|
|---|---|---|---|---|
|Llama-2-7b-chat-hf|52.65|78.52|48.10|45.31|

2. Async Request Evaluation Result

|Model|ARC acc_norm|Hellaswag acc_norm|MMLU average acc of all|TruthfulQA mc2|
|---|---|---|---|---|
|Llama-2-7b-chat-hf|52.82|78.51|48.20|45.32|
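To see how closely the two request modes agree, the scores from the two tables can be compared directly. The short Python sketch below uses only the Llama-2-7b-chat-hf numbers reported above; the variable names are illustrative.

```python
# Scores for Llama-2-7b-chat-hf, copied from the two result tables.
metrics = ["ARC acc_norm", "Hellaswag acc_norm", "MMLU average acc of all", "TruthfulQA mc2"]
sequential = [52.65, 78.52, 48.10, 45.31]
asynchronous = [52.82, 78.51, 48.20, 45.32]

# Async minus sequential, rounded to the two decimals the tables report.
deltas = {
    name: round(async_score - seq_score, 2)
    for name, seq_score, async_score in zip(metrics, sequential, asynchronous)
}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")

largest = max(deltas.values(), key=abs)
print(f"largest absolute difference: {abs(largest):.2f}")  # prints 0.17
```

On these numbers, the largest gap between the two modes is 0.17 points (ARC acc_norm), and the other three metrics differ by 0.10 points or less.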