What does this repo signal mean?

StreamLake (Kuaishou) published kwaipilot/SWE-Compass, a Python repository. A repository signal like this one surfaces tooling, evaluation, infrastructure, or model-adjacent work before it appears in a launch post. High-signal details: repo kwaipilot/SWE-Compass · language Python · new repo, low stars, routine. onlylabs links this event to 1 captured evidence page and 2 related repo signals.

StreamLake (Kuaishou) Repo: kwaipilot/SWE-Compass

Captured source

GitHub: github.com/kwaipilot/SWE-Compass

kwaipilot/SWE-Compass repository metadata

published Dec 3, 2025 · seen Jun 5 · captured Jun 11 · http 200 · method plain

kwaipilot/SWE-Compass

Language: Python

License: Apache-2.0

Stars: 18

Forks: 2

Open issues: 3

Created: 2025-12-03T07:47:56Z

Pushed: 2026-03-28T10:37:57Z

Default branch: main

Fork: no

Archived: no

README:

[🇺🇸 English ](README.md) [🇨🇳 简体中文](README_CN.md)

---

🧠 SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models

Current evaluations of LLMs for software engineering are limited by a narrow range of task categories, a Python-centric bias, and insufficient alignment with real-world development workflows. To bridge these gaps, SWE-Compass establishes a high-coverage, multi-dimensional, and production-aligned evaluation framework:

✨ Covers 8 software engineering task types, 8 programming scenarios, and 10 programming languages
✨ Contains 2000 high-quality instances sourced from real GitHub pull requests
✨ Supports multi-dimensional performance comparison across task types, languages, and scenarios

By integrating heterogeneous code tasks with real engineering practices, SWE-Compass provides a reproducible, rigorous, and production-oriented benchmark for diagnosing and improving the software engineering capabilities of large language models.

---

✨ Key Features

⚙️ Automated Docker-based evaluation environment
📦 Multi-project, multi-task, multi-language
🤖 Supports execution and evaluation of model-generated patches
📊 Multi-dimensional performance metrics: task type, scenario, language
🌟 Optional integration with an LLM judge for code understanding tasks
🔄 Highly reproducible, designed for research and production applications
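
To make "multi-dimensional performance metrics" concrete, the sketch below groups per-instance outcomes by one dimension and reports a pass rate for each group. It is an illustration only, not the repository's scoring code: the function name `pass_rate_by` and the record fields `task_type`, `language`, `scenario`, and `resolved` are assumptions, since the README does not document the per-instance schema.

```python
from collections import defaultdict


def pass_rate_by(records, dimension):
    """Return {group value: fraction of records marked resolved} for one dimension.

    `records` is a list of dicts. The field names used here are hypothetical
    and are NOT taken from the SWE-Compass output format.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for record in records:
        key = record[dimension]
        totals[key] += 1
        passed[key] += bool(record["resolved"])  # True counts as 1, False as 0
    return {key: passed[key] / totals[key] for key in totals}
```

Calling the same function with `"task_type"`, `"language"`, or `"scenario"` gives one breakdown per dimension, which is the kind of comparison the feature list describes.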

---

📦 1. Environment Setup

1.1 Install Docker

Refer to the official documentation: https://docs.docker.com/engine/install/

1.2 Install Python 3.11 and Dependencies

Enter the project directory and run:

cd swe-compass
pip install -e .
pip install -r requirements.txt

---

🐳 2. Download Required Docker Images and Supplementary Data

Enter the project directory and run:

cd swe-compass
bash pull_docker.sh
python download_all_data.py

The scripts will automatically download the evaluation environment from DockerHub.

---

📄 3. Prepare Prediction Data

You need to prepare a JSON file that maps each instance_id to its corresponding patch and metadata.

Example format (see swe-compass/data/example.json):

{
  "": {
    "model_name_or_path": "",
    "instance_id": "",
    "model_patch": ""
  }
}

> Each prediction entry only requires three fields: model_name_or_path, instance_id, model_patch
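
The mapping described above can be assembled programmatically. The following is a minimal sketch rather than code from the repository: `build_predictions` and `write_predictions` are hypothetical helper names. Only the three field names and the layout keyed by instance_id come from the README.

```python
import json
from pathlib import Path


def build_predictions(patches: dict[str, str], model_name: str) -> dict[str, dict[str, str]]:
    """Map each instance_id to an entry holding the three required fields.

    `patches` maps instance_id -> patch text produced by the model.
    """
    return {
        instance_id: {
            "model_name_or_path": model_name,
            "instance_id": instance_id,
            "model_patch": patch,
        }
        for instance_id, patch in patches.items()
    }


def write_predictions(predictions: dict[str, dict[str, str]], path: str) -> None:
    """Serialize the predictions mapping to a JSON file."""
    Path(path).write_text(json.dumps(predictions, indent=2), encoding="utf-8")
```

For example, `write_predictions(build_predictions({"some-instance-id": patch_text}, "my-model"), "predictions.json")` would produce a file in the shape shown above; the instance ID, model name, and file name here are placeholders.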

---

▶️ 4. Run Evaluation

4.1 Basic Command

cd swe-compass
python validation.py \
--dataset_name ./data/swecompass_all_2000.jsonl \
--predictions_path \
--max_workers \
--run_id \
--model_name \
--api_key \
--base_url \
--proxy

4.2 Example

python validation.py \
--dataset_name ./data/swecompass_all_2000.jsonl \
--predictions_path ./data/example.json \
--max_workers 10 \
--run_id test \
--model_name deepseek_v3 \
--api_key xxx \
--base_url xxx \
--proxy http ...
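
When runs are launched from a script, the same command line can be built as an argument list. This is a hedged sketch: `build_validation_command` and `run_validation` are hypothetical helpers, and dropping flags that are not supplied is an assumption, because the README does not say which flags validation.py treats as optional. The flag names themselves are the ones listed in the README.

```python
import subprocess
import sys

# Flag names as listed in the README, in the order they appear there.
README_FLAGS = (
    "dataset_name",
    "predictions_path",
    "max_workers",
    "run_id",
    "model_name",
    "api_key",
    "base_url",
    "proxy",
)


def build_validation_command(options: dict[str, object]) -> list[str]:
    """Turn {"run_id": "test", ...} into an argv list for validation.py."""
    unknown = set(options) - set(README_FLAGS)
    if unknown:
        raise ValueError(f"flags not listed in the README: {sorted(unknown)}")
    command = [sys.executable, "validation.py"]
    for name in README_FLAGS:
        if name in options:  # assumption: flags that are not supplied are left out
            command += [f"--{name}", str(options[name])]
    return command


def run_validation(options: dict[str, object], repo_dir: str = "swe-compass") -> None:
    """Run the evaluation from the project directory; raises on a non-zero exit."""
    subprocess.run(build_validation_command(options), cwd=repo_dir, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues with values such as API keys or proxy URLs, and `sys.executable` keeps the run inside the interpreter where the dependencies were installed.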

---

📊 5. Evaluation Outputs

---

5.1 Work Logs Directory

swe-compass/output/work//

Contains execution traces and logs for each instance.

---

5.2 Evaluation Results Directory

swe-compass/output/result//

Contains two files:

| File | Content |
| ---------------- | ------------------------------------------------- |
| raw_data.jsonl | Raw evaluation results for each instance |
| result.json | Aggregated scores by task, language, and scenario |
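
Both files can be read with the standard library. The sketch below assumes only what the file extensions imply: raw_data.jsonl holds one JSON object per line and result.json holds a single JSON document. The helper name `load_results` is hypothetical, the results directory is passed in by the caller, and no field names are assumed because the README does not document them.

```python
import json
from pathlib import Path


def load_results(result_dir):
    """Return (per-instance records, aggregated scores) from a results directory."""
    result_dir = Path(result_dir)
    # JSON Lines: parse each non-empty line as its own JSON value.
    with open(result_dir / "raw_data.jsonl", encoding="utf-8") as handle:
        raw_records = [json.loads(line) for line in handle if line.strip()]
    aggregated = json.loads((result_dir / "result.json").read_text(encoding="utf-8"))
    return raw_records, aggregated
```

Inspecting the keys of the first record and of the aggregated document is a reasonable first step before writing any analysis against them.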

---

⚙️ 6. Common Arguments

| Argument | Description |
| -------------------- | ------------------------------ |
| --dataset_name | Path to dataset |
| --predictions_path | Model predictions JSON file |
| --max_workers | Number of worker processes |
| --run_id | Unique identifier for this run |
| --model_name | Judge LLM model name |
| --api_key | Judge LLM API key |
| --base_url | Judge LLM API URL |
| --proxy | Proxy address |

🤝 7. Contributions

We welcome contributions from the research community in NLP, Machine Learning, and Software Engineering. Researchers are encouraged to submit issues or pull requests that extend, evaluate, or refine the benchmark.

For collaboration or inquiries, please contact:

Xujingxuan — xujingxuan2002@163.com
Ken Deng — dengken@kuaishou.com
Jiaheng Liu — liujiaheng@nju.edu.cn

We appreciate constructive engagement and look forward to further improvements driven by the community.

📄 8. Citation

@article{xu2025SWECompass,
title={SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models},
author={Xu, Jingxuan and Deng, Ken and Li, Weihao and Yu, Songwei and others},
journal={arXiv preprint arXiv:2511.05459},
year={2025}
}

Notability

notability 3.0/10

New repo, low stars, routine.