RepoZhipu AI (GLM)Zhipu AI (GLM)published Jan 16, 2025seen 5d

zai-org/ComplexFuncBench

Python

Open original ↗

Captured source

source ↗
published Jan 16, 2025seen 5dcaptured 15hhttp 200method plain

zai-org/ComplexFuncBench

Description: Complex Function Calling Benchmark.

Language: Python

Stars: 180

Forks: 31

Open issues: 8

Created: 2025-01-16T11:32:30Z

Pushed: 2025-01-20T03:04:21Z

Default branch: main

Fork: no

Archived: no

README:

Complex Function Calling Benchmark (ComplexFuncBench)

📄 Arxiv Paper • 🤗 HF Paper • 📊 Dataset

Table of Contents

  • [Introduction](#introduction)
  • [Leaderboard](#Leaderboard)
  • [Method](#Method)
  • [How to evaluate on ComplexFuncBench](#how-to-evaluate-on-complexfuncbench)
  • [Citation](#citation)

Introduction

Complex Function Calling Benchmark (ComplexFuncBench) is specillly designed for complex function calling evaluation. The ComplexFuncBench dataset encompass 1,000 complex function calling samples from five aspects: (1) Function calling with multi-step in single turn; (2) Function calling with user-provided constraints; (3) Function calling that requires parameter value reasoning from implicit information; (4) Function calling with long parameter values that exceed 500 tokens; and (5) Function calling with 128k long-context length.

![Complex Example](./resources/complex-example.png)

The difference between ComplexFuncBench and other function calling benchmarks is shown in the following table.

| | Real API Response | Multi-Step | Constraints | Parameter Value Reasoning | Long Parameter Reasoning | Long-Context | | :----------------: | :---------------: | :--------: | :---------: | :-----------------------: | :----------------------: | :----------: | | API-Bench | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | | ToolBench | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | | T-Eval | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | | BFCL | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | | Tool Sandbox | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | | ComplexFuncBench | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

Leaderboard

| Model | Overall Success Rate | Overall Call Acc. | Completeness | Correctness | | :--------------------------- | :------------------: | :---------------: | :----------: | :---------: | | Claude-3.5-Sonnet (20241022) | 61.00 | 79.27 | 1.84 | 1.85 | | GPT-4o (2024-08-06) | 60.50 | 80.55 | 1.66 | 1.75 | | GLM-4-Long | 57.10 | 76.35 | 1.72 | 1.74 | | GPT-4-Turbo (2024-04-09) | 49.50 | 71.38 | 1.72 | 1.81 | | Claude-3.5-Haiku (20241022) | 45.80 | 69.50 | 1.79 | 1.71 | | Qwen2.5-72B | 40.10 | 58.32 | 1.80 | 1.75 | | Mistral Large 2 | 20.10 | 48.78 | 0.94 | 1.0 | | GLM-4-9B | 9.40 | 27.97 | 1.15 | 1.03 | | Qwen2.5-7B | 5.0 | 18.19 | 1.5 | 1.47 | | Llama-3.1-405B | 4.00 | 11.87 | 0.43 | 0.30 | | Llama-3.1-70B | 2.70 | 8.17 | 0.67 | 0.36 | | Llama-3.1-8B | 0.10 | 1.34 | 0.18 | 0.09 |

Method

Data Collection

The collection of the ComplexFuncBench dataset consists of three stages: coarse generation, fine-grained annotation, and generalization. The dataset contains 1,000 complex function-calling samples, which comprise 600 single-domain samples and 400 cross-domain samples.

![Data Collection](./resources/data-collection.png)

Automated Evaluation

The automated evaluation framework \texttt{ComplexEval} evaluates models' complex function calling ability and response generation ability simultaneously.

![Evaluation Pipeline](./resources/evaluation-pipeline.png)

How to evaluate on ComplexFuncBench

Preparation

First, download the repository and dataset. You can download the benchmarkd dataset through the HuggingFace datasets.

git clone https://github.com/THUDM/ComplexFuncBench.git
cd ComplexFuncBench

Then, install the dependencies.

pip install -r requirements.txt

Serve Model

  • For close source models, make sure the corresponding model API keys are included in your evironments .env . To enable response-based evaluation, you need to subscribe the Booking API from RapidAPI.
OPENAI_API_KEY=sk-XXXXXX

RAPID_API_KEY=
  • For open source models, you need to deploy your model via vLLM. Run the following command to serve the model. Take THUDM/glm-4-9b-chat for example:
vllm serve THUDM/glm-4-9b-chat --api-key token-abc123 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --max_model_len 131072 --trust-remote-code

Run Model Inference

python evaluation.py --model_name {model_name} --proc_num {proc_num}

Take gpt-4o-2024-08-06 and THUDM/glm-4-9b-chat for example,

python evaluation.py --model_name gpt-4o-2024-08-06 --proc_num 50
python evaluation.py --model_name THUDM/glm-4-9b-chat --proc_num 50 --vllm_url http://xx.xx.xx.xx:8000/v1

The evaluation results is saved in result/{model_name}

Export Results

python print_results.py --result_dir {result_dir}

Citation

If you find our work helpful for your research, please consider citing our work.

@misc{zhong2025complexfuncbench,
title={ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario},
author={Lucen Zhong and Zhengxiao Du and Xiaohan Zhang and Haiyi Hu and Jie Tang},
year={2025},
eprint={2501.10132},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.10132},
}

Excerpt shown — open the source for the full document.

Notability

notability 6.0/10

Solid new benchmark for function calling.