What does this repo signal mean?

Meituan (LongCat) published meituan-longcat/General365 (Python). Repository signals like this one expose tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo meituan-longcat/General365 · language Python · benchmark for general reasoning in LLMs by Meituan. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Meituan (LongCat) Repo: meituan-longcat/General365

Captured source

GitHub/github.com/meituan-longcat/General365

meituan-longcat/General365 repository metadata

published Apr 8, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

meituan-longcat/General365

Description: This is the official repo for the paper "General365: Benchmarking General Reasoning in LLMs under High Difficulty and Diversity".

Language: Python

License: MIT

Stars: 79

Forks: 3

Open issues: 1

Created: 2026-04-08T03:47:55Z

Pushed: 2026-04-14T02:46:52Z

Default branch: main

Fork: no

Archived: no

README:

📃 Paper • 🌐 Project Page • 🏆 Leaderboard • 🤗 Dataset

📖 Introduction

We present General365, a highly challenging and diverse benchmark for evaluating the general reasoning capabilities of LLMs.

"General Reasoning" refers to reasoning tasks that depend exclusively on general knowledge. We define general knowledge as knowledge within the K-12 scope (such as common sense, fundamental linguistics, and basic subject matter), excluding university-level academic knowledge. Compared to domain-specific reasoning (e.g., Math Reasoning), general reasoning evaluation better decouples a model’s reasoning capability from its knowledge dependence. This enables a more precise assessment of reasoning skills rather than rote memorization, while testing the generalization of a model's reasoning abilities across broader scenarios. Current benchmarks for general reasoning face several challenges: a lack of difficulty, insufficient diversity, or overly synthetic characteristics. Consequently, we introduce General365, a manually curated benchmark characterized by high challenge and high diversity, aiming to facilitate more effective evaluation of reasoning capabilities in frontier models.

> To ensure the impartiality of the evaluation, we have released only half of the total questions. The remaining questions are maintained as a held-out test set to track potential data contamination within the open-source part.
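
One way to read the held-out design: if a model scores noticeably higher on the released half than on the held-out half, that gap hints at contamination of the public questions. The sketch below illustrates the comparison only. The function names and the boolean per-question result lists are assumptions for illustration, not part of the repository, and the README does not say how the maintainers measure contamination.

```python
def accuracy(results):
    """Fraction of correct items in a list of per-question booleans."""
    if not results:
        raise ValueError("no results to score")
    return sum(results) / len(results)


def contamination_gap(public_results, held_out_results):
    """Public-split accuracy minus held-out accuracy.

    A large positive gap is a warning sign, not proof: the two halves
    may also differ slightly in difficulty. (Hypothetical helper.)
    """
    return accuracy(public_results) - accuracy(held_out_results)


if __name__ == "__main__":
    public = [True, True, True, False]      # made-up example outcomes
    held_out = [True, False, True, False]   # made-up example outcomes
    print(f"gap = {contamination_gap(public, held_out):.2f}")
```

Any threshold for what counts as a "large" gap would have to be chosen by the evaluator; none is given in the source.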

🌟 Key Features

High Diversity: It contains 365 manually crafted, highly diverse seed problems, specifically designed to cover a wide range of reasoning challenges and avoid repetitive features or patterns. By altering surface semantics or constraints while preserving core reasoning skills, these seed problems were further expanded into 1,095 variants.
Challenging Boundaries: General365 covers 8 challenging categories, as detailed in Section 2.1 of the paper. Even state-of-the-art models barely achieve a "passing" level of performance on these challenging tasks.
Focus on Reasoning over Knowledge: The knowledge required is strictly confined to the K-12 scope, ensuring the dataset measures a model’s reasoning capabilities rather than knowledge retrieval.
Rigorous Quality Control: All instances have undergone manual review to ensure the highest standards of quality.
Accurate Scoring: We implemented a hybrid scoring algorithm combining rule-based and model-based approaches, achieving a manually verified scoring accuracy of 99.6%.
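
To make the idea of hybrid scoring concrete, here is a minimal sketch of one common pattern: try a cheap rule first and defer to a model-based judge only when the rule cannot decide. This is not the repository's algorithm. `normalize`, `rule_based_match`, `hybrid_score`, and the `judge` callable are hypothetical names, and the rule shown is plain normalized string equality.

```python
import re
from typing import Callable, Optional


def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse whitespace so trivial differences do not matter."""
    return re.sub(r"\s+", " ", answer.strip().lower())


def rule_based_match(prediction: str, reference: str) -> Optional[bool]:
    """Return True on a normalized exact match, or None when the rule cannot decide."""
    if normalize(prediction) == normalize(reference):
        return True
    # Defer instead of marking the answer wrong: it may just be phrased differently.
    return None


def hybrid_score(
    prediction: str,
    reference: str,
    judge: Callable[[str, str], bool],
) -> bool:
    """Use the rule when it is decisive; otherwise ask the (hypothetical) model judge."""
    verdict = rule_based_match(prediction, reference)
    if verdict is not None:
        return verdict
    return judge(prediction, reference)
```

In practice `judge` would wrap a call to a grading model. Passing it in as a plain callable keeps the sketch testable without any API access.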

🏆 Leaderboard

📊 Main Results

🛠️ Quick Start

Installation

Clone the repository:

git clone https://github.com/meituan-longcat/General365.git
cd General365

Install dependencies:

pip install -r requirements.txt

Running evaluations

Step 1: Prepare the Model Response File

After obtaining model responses, format them as follows (one JSON object per line):

{"question_id": 1, "model_response": "..."}
{"question_id": 2, "model_response": "..."}
...

Save this file in the ./model_responses/ directory.
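
A minimal sketch of producing such a file, assuming your responses are already in a Python dict keyed by question ID. The helper names and the `demo.jsonl` file name are illustrative and not part of the repository. Only the two field names and the `./model_responses/` directory come from the instructions above.

```python
import json
from pathlib import Path


def write_responses(responses, out_path):
    """Write {question_id: model_response} pairs as one JSON object per line."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        for question_id, text in sorted(responses.items()):
            record = {"question_id": question_id, "model_response": text}
            # ensure_ascii=False keeps non-English responses readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


def read_responses(path):
    """Read the file back and check that each line has the two expected fields."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = {"question_id", "model_response"} - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing {sorted(missing)}")
            records.append(record)
    return records


if __name__ == "__main__":
    path = write_responses({1: "...", 2: "..."}, "./model_responses/demo.jsonl")
    print(f"wrote {len(read_responses(path))} records to {path}")
```

Reading the file back before grading is a cheap way to catch malformed lines early.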

Step 2: Grading Responses

Set your API key and URL in lines 10-11 of grading.py. Then run:

python grading.py --response_file example_responses.jsonl

Evaluation results will be saved under the ./grading_results/ directory.

🔎 Citation

If you find our work helpful or relevant to your research, please kindly cite our paper:

@misc{general365benchmark,
title={General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks},
author={Junlin Liu and Shengnan An and Shuang Zhou and Dan Ma and Shixiong Luo and Ying Xie and Yuan Zhang and Wenling Yuan and Yifan Zhou and Xiaoyu Li and Ziwen Wang and Xuezhi Cao and Xunliang Cai},
year={2026},
eprint={2604.11778},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.11778},
}

🤗 Acknowledgement

The evaluation script utilizes Math-Verify to parse and verify model outputs. We greatly appreciate the contributors' efforts in providing this valuable tool.

📜 License

This project is licensed under the MIT License - see the [LICENSE](./LICENSE) file for details.

📪 Support

For questions and support, please open an issue on GitHub or contact the maintainers.

Notability

notability 5.0/10

New repo, moderate stars.