Repo · Meituan (LongCat) · published Sep 3, 2025 · seen 5d

meituan-longcat/SOP-Maze

Python

Description: SOP (Standard Operating Procedure)-Maze: benchmarking LLMs' performance on extremely complicated business tasks.

Language: Python

Stars: 7

Forks: 2

Open issues: 1

Created: 2025-09-03T09:05:53Z

Pushed: 2025-10-07T05:25:48Z

Default branch: main

Fork: no

Archived: no

README:

SOP-Maze

SOP-Maze is a benchmark designed to evaluate the comprehensive capabilities of large language models (LLMs) in executing tasks that follow Standard Operating Procedures (SOPs).

🧩 Overview

SOP-Maze presents complex, structured tasks that mimic real-world procedural workflows. It tests an LLM's ability to:

  • Understand and follow SOPs.
  • Reason through multi-step operations.
  • Produce accurate, context-aware outputs.

📁 Directory Structure

.
├── raw_data/ # Original data samples (JSON)
├── data_with_model_response/ # Populated with model-augmented samples
└── quick_start.py # Script to run evaluation

🛠️ Setup Instructions

1. Prepare the Data

Before evaluation, add a new key to each JSON file in raw_data/ to hold your model's response:

"model_response": ""

  • Important: Clear the data_with_model_response/ directory before copying in new files.
  • Copy the updated files into the data_with_model_response/ directory.

You can refer to the examples already in data_with_model_response/ for formatting guidance.
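
The preparation step above can be sketched as a short script. This is a minimal sketch under stated assumptions, not the repository's own tooling: it assumes each file in raw_data/ holds either one JSON object or a list of objects, and `get_model_response` is a hypothetical placeholder for whatever call produces your model's output. Check the examples in data_with_model_response/ to confirm where the key belongs before relying on it.

```python
import json
import shutil
from pathlib import Path


def get_model_response(sample: dict) -> str:
    """Hypothetical placeholder: replace with a call to your model."""
    return ""


def add_model_response(payload, respond=get_model_response):
    """Add a "model_response" key to one object, or to each object in a list.

    Assumption: samples are JSON objects at the top level of the file,
    or items of a top-level list. Other shapes are returned unchanged.
    """
    if isinstance(payload, list):
        return [add_model_response(item, respond) for item in payload]
    if isinstance(payload, dict):
        enriched = dict(payload)
        enriched["model_response"] = respond(payload)
        return enriched
    return payload


def prepare(raw_dir="raw_data", out_dir="data_with_model_response",
            respond=get_model_response):
    """Clear out_dir, then write an enriched copy of every raw JSON file."""
    raw, out = Path(raw_dir), Path(out_dir)
    # Clear the output directory first, as the README asks.
    if out.exists():
        shutil.rmtree(out)
    out.mkdir(parents=True)
    written = []
    for src in sorted(raw.glob("*.json")):
        payload = json.loads(src.read_text(encoding="utf-8"))
        enriched = add_model_response(payload, respond)
        dst = out / src.name
        dst.write_text(
            json.dumps(enriched, ensure_ascii=False, indent=2),
            encoding="utf-8",
        )
        written.append(dst)
    return written
```

Calling `prepare()` from the repository root uses the two directory names from the structure listed earlier. Note that it deletes everything in data_with_model_response/, including the bundled examples, so keep a copy if you still need them for reference.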

2. Run Evaluation

To begin evaluation, run:

python quick_start.py

This will execute the evaluation pipeline on the updated dataset.
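
Before running the evaluation, it may help to confirm that every sample actually carries a response. The check below is a hedged sketch, not part of the repository: `find_unfilled` is a hypothetical helper, and it assumes that each file holds one JSON object or a list of objects and that an empty "model_response" string means the response has not been filled in yet.

```python
import json
from pathlib import Path


def iter_samples(payload):
    """Yield each JSON object in a file (assumed: one object or a list of them)."""
    if isinstance(payload, dict):
        yield payload
    elif isinstance(payload, list):
        for item in payload:
            yield from iter_samples(item)


def find_unfilled(out_dir="data_with_model_response"):
    """Return (file name, sample index) pairs with a missing or empty response."""
    problems = []
    for path in sorted(Path(out_dir).glob("*.json")):
        payload = json.loads(path.read_text(encoding="utf-8"))
        for index, sample in enumerate(iter_samples(payload)):
            response = sample.get("model_response")
            # Treat a missing key, a non-string value, or blank text as unfilled.
            if not isinstance(response, str) or not response.strip():
                problems.append((path.name, index))
    return problems
```

An empty list from `find_unfilled()` means every sample found has a non-empty response; otherwise the returned pairs point to the files and sample positions to fix.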
