siliconflow/LiveCodeBench
forked from LiveCodeBench/LiveCodeBench
Description: Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"
License: MIT
Stars: 0
Forks: 0
Open issues: 1
Created: 2026-06-01T07:15:20Z
Pushed: 2026-06-01T07:40:01Z
Default branch: main
Fork: yes
Parent repository: LiveCodeBench/LiveCodeBench
Archived: no
README:
LiveCodeBench
Official repository for the paper "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code"
🏠 Home Page • 💻 Data • 🏆 Leaderboard • 🔍 Explorer
Introduction
LiveCodeBench provides holistic and contamination-free evaluation of the coding capabilities of LLMs. In particular, LiveCodeBench continuously collects new problems over time from contests across three competition platforms: LeetCode, AtCoder, and CodeForces. LiveCodeBench also covers a broader range of code-related capabilities beyond code generation, such as self-repair, code execution, and test output prediction. Currently, LiveCodeBench hosts four hundred high-quality coding problems that were published between May 2023 and March 2024.
Installation
You can clone the repository using the following command:
git clone https://github.com/LiveCodeBench/LiveCodeBench.git
cd LiveCodeBench
We recommend using uv for managing dependencies, which can be installed in a number of ways.
Verify that uv is installed on your system by running:
uv --version
Once uv has been installed, use it to create a virtual environment for LiveCodeBench and install its dependencies with the following commands:
uv venv --python 3.11
source .venv/bin/activate
uv pip install -e .
Data
We provide a benchmark for different code capability scenarios
Inference and Evaluation
Dataset Versions
Since LiveCodeBench is a continuously updated benchmark, we provide the following versions of the dataset:
release_v1: The initial release of the dataset with problems released between May 2023 and Mar 2024 containing 400 problems.
release_v2: The updated release of the dataset with problems released between May 2023 and May 2024 containing 511 problems.
release_v3: The updated release of the dataset with problems released between May 2023 and Jul 2024 containing 612 problems.
release_v4: The updated release of the dataset with problems released between May 2023 and Sep 2024 containing 713 problems.
release_v5: The updated release of the dataset with problems released between May 2023 and Jan 2025 containing 880 problems.
release_v6: The updated release of the dataset with problems released between May 2023 and Apr 2025 containing 1055 problems.
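To make the growth of the benchmark concrete, the following sketch computes how many problems each release adds over the previous one. The counts are copied from the list above; the helper name `problems_added_per_release` is hypothetical, and the calculation assumes each release is a superset of the previous one (all releases start in May 2023, which suggests this, but the list does not state it explicitly).

```python
# Cumulative problem counts per release, copied from the version list above.
RELEASE_SIZES = {
    "release_v1": 400,
    "release_v2": 511,
    "release_v3": 612,
    "release_v4": 713,
    "release_v5": 880,
    "release_v6": 1055,
}


def problems_added_per_release(sizes):
    """Return how many problems each release adds over the previous one.

    Assumes `sizes` is ordered from oldest to newest release and that
    each release contains all problems of the one before it.
    """
    added = {}
    previous_total = 0
    for name, total in sizes.items():
        added[name] = total - previous_total
        previous_total = total
    return added


if __name__ == "__main__":
    for name, count in problems_added_per_release(RELEASE_SIZES).items():
        print(f"{name}: +{count} problems")
```

Under that assumption, the increments sum back to the size of the newest release.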
You can use the --release_version flag to specify the dataset version you wish to use; it defaults to release_latest. We have also introduced fine-grained release versions such as v1, v2, v1_v3, v4_v5 for specific versions of the dataset. For example, the following command runs the evaluation on the release_v2 dataset:
python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --evaluate --release_version release_v2
Code Generation
We use vllm for inference with open models. By default, we use tensor_parallel_size=${num_gpus} to parallelize inference across all available GPUs. It can be configured using the --tensor_parallel_size flag as required.
To run inference, please provide the model_name based on the [./lcb_runner/lm_styles.py](./lcb_runner/lm_styles.py) file. The --scenario flag (here codegeneration) specifies the scenario to run.
python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration
Additionally, the --use_cache flag can be used to cache the generated outputs, and the --continue_existing flag can be used to reuse existing dumped results. If you wish to use a model from a local path, you can additionally provide the --local_model_path flag with the path to the model. We use n=10 and temperature=0.2 for generation. Please check the [./lcb_runner/runner/parser.py](./lcb_runner/runner/parser.py) file for more details on the flags.
For closed API models, the --multiprocess flag can be used to parallelize queries to API servers (adjustable according to rate limits).
Evaluation
We compute pass@1 and pass@5 metrics for model evaluations. We use a modified version of the checker released with the `apps` benchmark to compute the metrics. In particular, we identified and fixed some unhandled edge cases in the original checker, and simplified the checker based on our collected dataset. To run the evaluation, you can add the --evaluate flag:
python -m lcb_runner.runner.main --model {model_name} --scenario codegeneration --evaluate
Note that time limits can cause slight (`
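For readers unfamiliar with pass@k, the sketch below shows the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples per problem and c is the number that pass all tests. This README does not state which estimator the runner uses, so treat this as an illustration under that assumption rather than the repository's implementation. The function names are hypothetical; n=10 matches the generation setting mentioned in the Code Generation section.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: number of correct samples,
    k: budget of attempts. Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k draws include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given the correct count per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Illustrative counts only, not real benchmark results.
    counts = [0, 3, 10]
    print("pass@1:", mean_pass_at_k(counts, n=10, k=1))
    print("pass@5:", mean_pass_at_k(counts, n=10, k=5))
```

With k=1 the estimator reduces to the fraction of correct samples, c / n.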
Next, we evaluate models on different code capabilities and find that the relative performance of models does change across tasks (left). This highlights the need for holistic evaluation of LLMs for code.
We also find evidence of possible overfitting on HumanEval (right). Particularly, models that perform well on HumanEval do not necessarily perform well on LiveCodeBench. In the scatterplot above, we find the models get clustered into two groups, shaded in red and green. The red group contains models that perform well on HumanEval but poorly on LiveCodeBench, while the green group contains models that perform well on both.
For more details, please refer to our website at livecodebench.github.io.
Citation
@article{jain2024livecodebench,
author = {Naman Jain and King Han and Alex Gu and Wen-Ding Li and Fanjia Yan and Tianjun Zhang and Sida Wang and Armando Solar-Lezama and Koushik Sen and Ion Stoica},
title = {LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code},
year = {2024},
journal =…
Notability: 2.0/10 (routine fork, low traction)