meituan-longcat/General365
Python
Description: This is the official repo for the paper "General365: Benchmarking General Reasoning in LLMs under High Difficulty and Diversity".
Language: Python
License: MIT
Stars: 79
Forks: 3
Open issues: 1
Created: 2026-04-08T03:47:55Z
Pushed: 2026-04-14T02:46:52Z
Default branch: main
Fork: no
Archived: no
README:
📃 Paper • 🌐 Project Page • 🏆 Leaderboard • 🤗 Dataset
📖 Introduction
We present General365, a highly challenging and diverse benchmark for evaluating the general reasoning capabilities in LLMs.
"General Reasoning" refers to reasoning tasks that depend exclusively on general knowledge. We define general knowledge as knowledge within the K-12 scope (such as common sense, fundamental linguistics, and basic subject matter), excluding university-level academic knowledge. Compared to domain-specific reasoning (e.g., Math Reasoning), general reasoning evaluation better decouples a model’s reasoning capability from its knowledge dependence. This enables a more precise assessment of reasoning skills rather than rote memorization, while testing the generalization of a model's reasoning abilities across broader scenarios. Current benchmarks for general reasoning face several challenges: a lack of difficulty, insufficient diversity, or overly synthetic characteristics. Consequently, we introduce General365, a manually curated benchmark characterized by high challenge and high diversity, aiming to facilitate more effective evaluation of reasoning capabilities in frontier models.
> To ensure the impartiality of the evaluation, we have released only half of the total questions. The remaining questions are maintained as a held-out test set to track potential data contamination within the open-source part.
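The README does not say how the held-out set is used to track contamination. One simple possibility, offered purely as an illustrative assumption, is to compare a model's accuracy on the public half with its accuracy on the held-out half. The function names below are hypothetical and are not part of the repository.

```python
def accuracy(results):
    """Fraction of True entries in a list of per-question correctness flags."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


def contamination_gap(public_results, held_out_results):
    """Public-split accuracy minus held-out accuracy.

    Assumption: a large positive gap may hint that the public questions
    leaked into training data. This is not the authors' stated procedure.
    """
    return accuracy(public_results) - accuracy(held_out_results)
```

A gap near zero would suggest similar performance on both halves; how large a gap counts as suspicious is a judgment call that this sketch does not settle.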
🌟 Key Features
- High Diversity: It contains 365 manually crafted, highly diverse seed problems, specifically designed to cover a wide range of reasoning challenges and avoid repetitive features or patterns. By altering surface semantics or constraints while preserving core reasoning skills, these seed problems were further expanded into 1,095 variants.
- Challenging Boundaries: General365 covers 8 challenging categories, as detailed in Section 2.1 of the paper. Even state-of-the-art models barely achieve a "passing" level of performance on these challenging tasks.
- Focus on Reasoning over Knowledge: The knowledge required is strictly confined to the K-12 scope, ensuring the dataset measures a model’s reasoning capabilities rather than knowledge retrieval.
- Rigorous Quality Control: All instances have undergone manual review to ensure the highest standards of quality.
- Accurate Scoring: We implemented a hybrid scoring algorithm combining rule-based and model-based approaches, achieving a manually verified scoring accuracy of 99.6%.
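To illustrate the general idea of hybrid scoring, here is a minimal sketch that tries a cheap rule-based comparison first and consults a model-based judge only when the rule fails. It is not the algorithm in grading.py: the normalization rule is an assumption, and `judge` is a hypothetical callable standing in for whatever LLM call a grader might make.

```python
import re
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace so trivial differences do not matter."""
    return re.sub(r"\s+", " ", text.strip().lower())


def hybrid_grade(predicted: str, reference: str,
                 judge: Callable[[str, str], bool]) -> bool:
    """Rule-based check first; fall back to the judge only on a mismatch.

    `judge` is a hypothetical stand-in for a model-based grader. It takes
    (predicted, reference) and returns True if it considers them equivalent.
    """
    if normalize(predicted) == normalize(reference):
        return True
    return judge(predicted, reference)
```

Ordering the checks this way keeps model calls to the cases the rule cannot settle, which is one plausible motivation for a hybrid design.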
🏆 Leaderboard
📊 Main Results
🛠️ Quick Start
Installation
Clone the repository:
git clone https://github.com/meituan-longcat/General365.git
cd General365
Install dependencies:
pip install -r requirements.txt
Running evaluations
Step 1: Prepare the Model Response File
After obtaining model responses, format them as follows (one JSON object per line):
{"question_id": 1, "model_response": "..."}
{"question_id": 2, "model_response": "..."}
...
Save this file in the ./model_responses/ directory.
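A response file in this shape could be produced and sanity-checked with a short script like the one below. This is a sketch, not part of the repository: the helper names `write_responses` and `check_responses` are hypothetical, and the only fields assumed are the two shown in the format above.

```python
import json
from pathlib import Path


def write_responses(responses, out_path):
    """Write {question_id: response_text} pairs as one JSON object per line."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        for question_id, text in sorted(responses.items()):
            record = {"question_id": question_id, "model_response": text}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def check_responses(path):
    """Return the number of records, raising ValueError on a malformed line."""
    seen = set()
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # JSONDecodeError is a ValueError
            if not isinstance(record, dict):
                raise ValueError(f"line {line_no}: expected a JSON object")
            missing = {"question_id", "model_response"} - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing {sorted(missing)}")
            if record["question_id"] in seen:
                raise ValueError(f"line {line_no}: duplicate question_id")
            seen.add(record["question_id"])
    return len(seen)
```

For example, `write_responses({1: "...", 2: "..."}, "./model_responses/my_model.jsonl")` would create the directory if needed and write two lines; the file name here is only a placeholder.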
Step 2: Grading Responses
Set your API key and URL in lines 10-11 of grading.py. Then run:
python grading.py --response_file example_responses.jsonl
Evaluation results will be saved under the ./grading_results/ directory.
🔎 Citation
If you find our work helpful or relevant to your research, please kindly cite our paper:
@misc{general365benchmark,
title={General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks},
author={Junlin Liu and Shengnan An and Shuang Zhou and Dan Ma and Shixiong Luo and Ying Xie and Yuan Zhang and Wenling Yuan and Yifan Zhou and Xiaoyu Li and Ziwen Wang and Xuezhi Cao and Xunliang Cai},
year={2026},
eprint={2604.11778},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.11778},
}
🤗 Acknowledgement
The evaluation script utilizes Math-Verify to parse and verify model outputs. We greatly appreciate the contributors' efforts in providing this valuable tool.
📜 License
This project is licensed under the MIT License - see the [LICENSE](./LICENSE) file for details.
📪 Support
For questions and support, please open an issue on GitHub or contact the maintainers.
Notability
5.0/10. New repo, moderate stars.