inclusionAI/SWE-CARE
Python
Language: Python
License: Apache-2.0
Stars: 15
Forks: 13
Open issues: 0
Created: 2025-10-10T07:36:39Z
Pushed: 2026-04-21T07:29:00Z
Default branch: main
Fork: no
Archived: no
README:
SWE-CARE: A Comprehensiveness-aware Benchmark for Evaluation of Code Review
A comprehensiveness-aware benchmark for repository-level CR evaluation.
📝 Overview
Code review (CR) is the process in which other developers on a team check the code written by a colleague. It aims to improve code quality and find defects, and it plays an important role in maintaining software quality. Prior research has proposed CR benchmarks and automatic CR approaches. However, existing benchmarks and approaches lack comprehensiveness and therefore do not reflect real-world review scenarios. The rapid growth of Large Language Model (LLM) capabilities has made comprehensive CR possible. To evaluate LLM performance on comprehensive CR, we construct SWE-CARE, a comprehensiveness-aware CR dataset in Python. The dataset is categorized into nine types, and each instance covers the full code review process. Each instance also includes repository-level features. Based on the dataset, we design a framework to evaluate LLM performance on CR.
🛠️ Set Up
Follow these steps to set up the project locally.
1. Clone the repository:
git clone https://github.com/your-username/SWE-CARE.git
cd SWE-CARE
2. Install dependencies: This project uses uv for package management. Make sure you have Python 3.10 or higher.
pip install uv
uv sync
Alternatively, you can use pip:
pip install -e .
3. Set up pre-commit hooks (for development): This project uses ruff for linting and formatting. The pre-commit hooks will run these checks automatically before each commit.
pre-commit install
🚀 Quick Start: Evaluation Pipeline
For a streamlined evaluation workflow, use the bootstrap script in scripts/run_eval_pipeline.py:
# Set up environment variables
export OPENAI_API_KEY="your-openai-api-key"
export LLM_EVALUATOR_OPENAI_API_KEY="your-evaluation-api-key"

# Run the complete pipeline (uses default Hugging Face dataset)
python scripts/run_eval_pipeline.py \
  --output-dir results/pipeline_output \
  --model gpt-4o \
  --model-provider openai \
  --file-source oracle

# Run with local dataset file
python scripts/run_eval_pipeline.py \
  --dataset-name-or-path results/dataset/code_review_task_instances.jsonl \
  --output-dir results/pipeline_output \
  --model gpt-4o \
  --model-provider openai \
  --file-source oracle

# Use skeleton stubs for Python files (optional)
python scripts/run_eval_pipeline.py \
  --dataset-name-or-path results/dataset/code_review_task_instances.jsonl \
  --output-dir results/pipeline_output \
  --model gpt-4o \
  --model-provider openai \
  --file-source bm25 \
  --k 10 \
  --retrieval-output-dir results/retrieval_output \
  --use-skeleton
This script automates the entire evaluation process: text generation → inference → evaluation. The LLM evaluator defaults to OpenAI o3 (override with --evaluator-model). See [scripts/README.md](scripts/README.md) for detailed usage.
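When several model and file-source settings need to be evaluated, it can help to assemble the pipeline command programmatically. The following is a minimal sketch, not part of the repository: `build_pipeline_command` is a hypothetical helper, and it only uses the flags shown in the commands above. Whether `--k` and `--retrieval-output-dir` are mandatory for `bm25` is not stated here, so the sketch passes them only when they are provided.

```python
import shlex


def build_pipeline_command(output_dir, model, model_provider, file_source,
                           dataset=None, k=None, retrieval_output_dir=None,
                           use_skeleton=False):
    """Assemble an argument list for scripts/run_eval_pipeline.py.

    Hypothetical helper: it mirrors the flags shown in the README and
    does not validate them against the real CLI.
    """
    cmd = ["python", "scripts/run_eval_pipeline.py"]
    if dataset is not None:
        # Omitting this flag falls back to the default Hugging Face dataset.
        cmd += ["--dataset-name-or-path", dataset]
    cmd += [
        "--output-dir", output_dir,
        "--model", model,
        "--model-provider", model_provider,
        "--file-source", file_source,
    ]
    if k is not None:
        cmd += ["--k", str(k)]
    if retrieval_output_dir is not None:
        cmd += ["--retrieval-output-dir", retrieval_output_dir]
    if use_skeleton:
        cmd.append("--use-skeleton")
    return cmd


if __name__ == "__main__":
    command = build_pipeline_command(
        "results/pipeline_output", "gpt-4o", "openai", "oracle"
    )
    # Print a shell-safe command line; pass `command` to subprocess.run to execute it.
    print(shlex.join(command))
```

Returning a list rather than a string keeps the command safe to hand to `subprocess.run` without shell quoting issues.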
Analysis and Reporting
After running evaluations, you can generate comprehensive analysis reports:
# Generate evaluation report from pipeline results
python scripts/eval_report.py \
  --dataset-name-or-path results/dataset/code_review_task_instances.jsonl \
  --eval-output-dir results/pipeline_output/evaluation/o3 \
  --report-output-file results/evaluation_report.json

# Or use default Hugging Face dataset
python scripts/eval_report.py \
  --eval-output-dir results/pipeline_output/evaluation/o3 \
  --report-output-file results/evaluation_report.json
This generates detailed statistics including:
- Model performance across different file source settings (none, oracle, bm25 with k)
- Performance breakdown by evaluator type (RuleBasedEvaluator, LLMEvaluator)
- Performance analysis by metadata categories (problem domain, difficulty, estimated review effort)
- Ranking of all model-setting configurations by average score
- Identification of missing instances (assigned score of 0 for fair comparison)
The output is a comprehensive JSON report that can be used for further analysis and visualization.
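The handling of missing instances affects how averages compare across models. The sketch below illustrates the idea of scoring a missing instance as 0 so that every configuration is averaged over the same set of instances. It is an illustration only: `average_with_missing` and its input shapes are assumptions, not the report script's actual code or schema.

```python
def average_with_missing(scores, expected_ids):
    """Average scores over all expected instances.

    scores: dict mapping instance id -> score for evaluated instances.
    expected_ids: every instance id in the dataset.
    Instances without a score contribute 0, so a model cannot raise its
    average by skipping hard instances.
    """
    if not expected_ids:
        return 0.0, []
    missing = sorted(i for i in expected_ids if i not in scores)
    total = sum(scores.get(i, 0.0) for i in expected_ids)
    return total / len(expected_ids), missing


if __name__ == "__main__":
    average, missing = average_with_missing(
        {"inst-1": 0.8, "inst-2": 0.4}, ["inst-1", "inst-2", "inst-3"]
    )
    print(f"average={average:.3f} missing={missing}")
```

Dividing by the number of expected instances, rather than the number of scored ones, is what makes rankings comparable when some runs fail on part of the dataset.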
📊 Data Collection
The data collection process involves several steps to gather and process data from GitHub. The main scripts for this process are located in src/swe_care/collect.
Here's an example of the command-line usage for each step:
1. Get Top Repositories: Find the most starred repositories for a given language.
python -m swe_care.collect get_top_repos \
  --language "Python" \
  --top-n 100 \
  --output-dir "results/top_repos" \
  --tokens "your_github_pat"
2. Get Pull Request Data: Fetch PR data from a specific repository using the GitHub GraphQL API.
python -m swe_care.collect get_graphql_prs_data \
  --repo "/" \
  --output-dir "results/graphql_prs_data" \
  --tokens "your_github_pat" \
  --max-number 20
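For readers unfamiliar with the GitHub GraphQL API, the sketch below shows roughly what a paginated pull request request looks like. It is not the query that `get_graphql_prs_data` sends: the selected fields, the `MERGED` filter, and the helper name `build_prs_request` are assumptions made for illustration. The endpoint, the bearer-token header, and the `repository.pullRequests` connection are part of GitHub's public API.

```python
GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Illustrative query: the fields and the MERGED filter are assumptions.
PRS_QUERY = """
query($owner: String!, $name: String!, $first: Int!, $after: String) {
  repository(owner: $owner, name: $name) {
    pullRequests(first: $first, after: $after, states: MERGED) {
      nodes { number title url }
      pageInfo { hasNextPage endCursor }
    }
  }
}
"""


def build_prs_request(repo, token, page_size=20, cursor=None):
    """Return (url, headers, payload) for one page of pull requests.

    repo must have the form "owner/name". cursor is the endCursor of the
    previous page, or None for the first page.
    """
    owner, sep, name = repo.partition("/")
    if not (owner and sep and name):
        raise ValueError(f"expected 'owner/name', got {repo!r}")
    headers = {"Authorization": f"bearer {token}"}
    payload = {
        "query": PRS_QUERY,
        "variables": {
            "owner": owner,
            "name": name,
            "first": page_size,
            "after": cursor,
        },
    }
    return GITHUB_GRAPHQL_URL, headers, payload
```

The payload can be sent as JSON with any HTTP client via a POST request; subsequent pages are fetched by passing `pageInfo.endCursor` back as `cursor` while `hasNextPage` is true.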
3. Classify PRs Data: Analyze and classify PR data by evaluating commits and labeling review comments.
Single file processing:
python -m swe_care.collect classify_prs_data \
  --graphql-prs-data-file "results/graphql_prs_data/___graphql_prs_data.jsonl" \
  --output-dir "./results/classify_prs_data" \
  --tokens "your_github_pat"
Batch processing (multiple repositories):
python -m swe_care.collect classify_prs_data \
  --graphql-prs-data-file "results/graphql_prs_data/" \
  --output-dir "./results/classify_prs_data" \
  --tokens "your_github_pat" \
  --jobs 4
This step combines two important analyses:
- Commit Evaluation: Uses heuristic rules to score commits based on quality indicators (message clarity, size, review activity, etc.)
- Review Comment Classification: Extracts and labels review comments based on whether referenced lines were actually changed in the merged commit, or the review thread is resolved, outdated, or collapsed.
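The review comment labeling described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation: the `ReviewThread` class, the `changed_lines` input shape, and the label strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReviewThread:
    """Hypothetical, simplified view of one review thread."""
    path: str
    line: int
    is_resolved: bool = False
    is_outdated: bool = False
    is_collapsed: bool = False


def label_thread(thread, changed_lines):
    """Return the labels that apply to a review thread.

    changed_lines maps a file path to the set of line numbers modified
    by the merged commit (an assumed input shape).
    """
    labels = []
    if thread.line in changed_lines.get(thread.path, set()):
        labels.append("referenced_line_changed")
    if thread.is_resolved:
        labels.append("resolved")
    if thread.is_outdated:
        labels.append("outdated")
    if thread.is_collapsed:
        labels.append("collapsed")
    return labels


if __name__ == "__main__":
    thread = ReviewThread("src/app.py", 10, is_resolved=True)
    print(label_thread(thread, {"src/app.py": {10, 11}}))
```

Keeping the labels as a list, rather than a single category, preserves cases where a thread is both resolved and tied to a line that was actually changed.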
4. Build Code Review Dataset: Build the final dataset for the code review task. This step requires an LLM to classify metadata such as problem domain, difficulty, and review effort for each task instance.
Single file processing:
# Example with OpenAI GPT-4o
export OPENAI_API_KEY=
python -m swe_care.collect build_code_review_dataset \
  --graphql-prs-data-file…