Repo · OpenBMB (MiniCPM) · published Jun 7, 2026

OpenBMB/AceBench

Python


Language: Python

License: MIT

Stars: 5

Forks: 0

Open issues: 0

Created: 2026-06-07T07:34:14Z

Pushed: 2026-06-10T13:56:38Z

Default branch: main

Fork: no

Archived: no

README:

---

AceBench is a benchmark for edge-cloud collaboration in LLM agents. Cloud models reason best but see all your data; on-device edge models keep data local but are weaker — collaboration promises the best of both, *if* you organize it well. AceBench measures exactly that, but in a setting prior edge-cloud studies skip: real agent execution, where agents work over live workspaces (files, tools, commands, APIs, app states) and every cloud call *mid-trajectory* can expose accumulated local context.

We evaluate six execution strategies — pure edge, pure cloud, and four edge-cloud collaboration patterns — across 128 executable tasks (100 with fine-grained privacy annotations) on an OpenClaw harness, scoring every run on three axes at once: task utility, resource cost, and privacy exposure. The result exposes how *when* the cloud is invoked and *what* context is sent trade capability against cost and leakage.

Design Highlights

| | What we test | Why it matters |
| --- | --- | --- |
| 🦞 OpenClaw-native | The real OpenClaw agent loop (bash, browser, file ops, APIs, and reusable SKILL.md skills) driving a live local workspace | Tasks need long-horizon planning, state tracking, and error recovery; cloud calls land *mid-trajectory* over accumulated workspace context, not on a static prompt |
| 🔐 Privacy-aware | 100 tasks annotated with sensitivity units (PII + org secrets) | Every cloud invocation is a potential leakage channel, so we audit what crosses the boundary |
| ⚖️ Multi-dimensional | Utility · Cost · Privacy, reported jointly | No single number hides the trade-off; you see the whole Pareto picture |
| 🔀 Strategy-centric | 6 edge / cloud / edge-cloud strategies, one task suite | Isolates *how* collaboration is organized from *which* models are used |
| 📦 Reproducible | Each task runs in its own Docker container | Graders are injected only after the agent finishes and are never visible during execution |
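The "whole Pareto picture" mentioned in the table can be illustrated with a small dominance check. This is a minimal sketch, not AceBench's reporting code: the `RunSummary` fields, the `dominates` and `pareto_front` helpers, and all numbers are hypothetical placeholders. It assumes utility and privacy are higher-is-better and cost is lower-is-better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical per-strategy summary; field names are assumptions."""
    strategy: str
    utility: float   # %, higher is better
    cost_usd: float  # USD, lower is better
    privacy: float   # %, higher is better (assumed: 100 = nothing sensitive sent)


def dominates(a: RunSummary, b: RunSummary) -> bool:
    """True if `a` is at least as good as `b` on every axis and better on one."""
    at_least_as_good = (
        a.utility >= b.utility
        and a.cost_usd <= b.cost_usd
        and a.privacy >= b.privacy
    )
    strictly_better = (
        a.utility > b.utility
        or a.cost_usd < b.cost_usd
        or a.privacy > b.privacy
    )
    return at_least_as_good and strictly_better


def pareto_front(runs: list[RunSummary]) -> list[RunSummary]:
    """Keep only the runs that no other run dominates."""
    return [r for r in runs if not any(dominates(o, r) for o in runs)]


# Placeholder values for illustration only; these are not benchmark results.
runs = [
    RunSummary("strategy_a", utility=40.0, cost_usd=0.0, privacy=100.0),
    RunSummary("strategy_b", utility=70.0, cost_usd=5.0, privacy=20.0),
    RunSummary("strategy_c", utility=60.0, cost_usd=6.0, privacy=20.0),
]
front_names = [r.strategy for r in pareto_front(runs)]
print(front_names)
```

Here `strategy_c` drops out because `strategy_b` is at least as good on every axis, while `strategy_a` and `strategy_b` both stay: neither is better than the other on all three dimensions, which is the situation a single aggregate score would hide.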

---

Tasks

128 executable tasks across 6 categories (Chinese & English); 100 carry fine-grained privacy annotations. Each is a self-contained Markdown file under [tasks/ACE_Bench/](tasks/ACE_Bench/) — a prompt, an inline grade() verifier, and a workspace path.
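To make the idea of an inline verifier concrete, here is a minimal sketch of what a `grade()` function could look like. Everything beyond the name `grade` is an assumption for illustration: the signature, the `summary.json` file name, the required keys, and the returned dictionary are hypothetical, and the real contract is whatever the task files under `tasks/ACE_Bench/` define.

```python
import json
import tempfile
from pathlib import Path


def grade(workspace: Path) -> dict:
    """Hypothetical verifier: checks that the agent wrote a valid summary file.

    The signature, the `summary.json` file name, and the returned keys are
    assumptions for illustration, not the benchmark's actual contract.
    """
    target = workspace / "summary.json"
    if not target.is_file():
        return {"score": 0.0, "reason": "summary.json missing"}
    try:
        data = json.loads(target.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return {"score": 0.0, "reason": "summary.json is not valid JSON"}
    if not isinstance(data, dict):
        return {"score": 0.0, "reason": "summary.json is not a JSON object"}
    # Partial credit: an equal share for each required key that is present.
    required = ("title", "action_items")
    found = sum(1 for key in required if key in data)
    return {"score": found / len(required), "reason": f"{found}/{len(required)} keys"}


# Simulate a finished run in a throwaway workspace directory.
with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    empty_result = grade(ws)
    (ws / "summary.json").write_text(json.dumps({"title": "Sync"}), encoding="utf-8")
    partial_result = grade(ws)

print(empty_result["score"], partial_result["score"])
```

The sketch inspects only the final workspace state, which matches the README's description of graders that run after the agent has finished.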

| Category | # | Example tasks | Core challenges |
| --- | --- | --- | --- |
| Office & Daily Tasks | 36 | ambiguous contact email, meeting notes, expense report, daily summary | Multi-source aggregation, clarification, structured output |
| Information Search & Gathering | 34 | email search, competitive intelligence, paper affiliation lookup, CRM bug hunt | Web + local data reconciliation, source verification |
| Safety & Security | 21 | leaked API-key detection, prompt injection, malicious skill, HIPAA/PHI referral | Adversarial robustness, credential awareness, refusal |
| Data Analysis | 14 | order-profit analysis, month-end reconciliation, quarterly business insight | Spreadsheet reasoning, state verification |
| Development & Operations | 13 | system health check, automation-failure recovery, LLM API gateway skill | Undocumented setups, debugging, skill creation |
| Automation | 10 | flight booking, n8n workflow report, scheduled-briefing skill | Long-horizon orchestration, recovery |

Scoring. Every run is graded on three dimensions at once:

  • Utility — completion score + Pass³ (3-trial consistency), from each task's own verifier.
  • Cost — cloud tokens & USD, plus edge-side FLOPs.
  • Privacy — how much annotated sensitive context (PII / org secrets) reaches the cloud.
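The utility and privacy dimensions above can be sketched in a few lines. This is an illustrative reading, not the benchmark's scoring code: it assumes Pass³ counts a task only when all three trials pass, and that the privacy score is the share of annotated sensitivity units that never appear verbatim in cloud-bound text. The function names and sample inputs are hypothetical.

```python
def pass_cubed(trials_by_task: dict[str, list[bool]]) -> float:
    """Percent of tasks that pass in all three trials (assumed Pass³ reading)."""
    if not trials_by_task:
        return 0.0
    consistent = sum(
        1 for trials in trials_by_task.values() if len(trials) == 3 and all(trials)
    )
    return 100.0 * consistent / len(trials_by_task)


def privacy_score(sensitive_units: list[str], cloud_payloads: list[str]) -> float:
    """Percent of annotated units that never appear verbatim in cloud-bound text.

    Verbatim substring matching is a simplification chosen for this sketch.
    """
    if not sensitive_units:
        return 100.0
    sent = "\n".join(cloud_payloads)
    leaked = sum(1 for unit in sensitive_units if unit in sent)
    return 100.0 * (1 - leaked / len(sensitive_units))


# Made-up inputs for illustration only.
trials = {
    "task_1": [True, True, True],
    "task_2": [True, False, True],
}
units = ["alice@example.com", "sk-demo-123", "Project Falcon", "555-0100"]
payloads = ["Draft a reply to alice@example.com about the schedule."]

p3 = pass_cubed(trials)
priv = privacy_score(units, payloads)
print(p3, priv)
```

With these inputs only `task_1` is consistent across trials, and one of the four annotated units crosses the boundary. A real audit would likely need fuzzier matching than exact substrings, since an agent can paraphrase or partially quote sensitive context.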

---

Leaderboard

Edge = Qwen3.5-9B / 27B, Cloud = GPT-5.4, judge = GPT-5.4-mini, averaged over 3 runs. Cloud Tok. = raw / cache / output (millions); Cost in USD; Edge FLOPs in PetaFLOPs; Utility & Privacy in %.
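To show how the units in this legend relate, the following sketch converts token counts (in millions) to USD and estimates edge compute in PetaFLOPs. It is a rough illustration under stated assumptions: the per-million-token prices are placeholders rather than real pricing, "raw" is assumed to mean uncached input tokens, and the FLOPs estimate uses the common approximation of 2 × parameters × tokens for a forward pass, which may differ from how AceBench measures edge FLOPs.

```python
import math

PETA = 1e15


def cloud_cost_usd(raw_m: float, cache_m: float, output_m: float, prices: dict) -> float:
    """Cost in USD from token counts given in millions.

    `prices` holds USD per million tokens; the keys are hypothetical.
    """
    return (
        raw_m * prices["input"]
        + cache_m * prices["cached_input"]
        + output_m * prices["output"]
    )


def edge_petaflops(params: float, tokens: float) -> float:
    """Approximate forward-pass compute: 2 * parameters * tokens, in PetaFLOPs."""
    return 2 * params * tokens / PETA


# Placeholder prices and token counts, not actual pricing or results.
prices = {"input": 2.0, "cached_input": 0.5, "output": 8.0}
cost = cloud_cost_usd(raw_m=1.0, cache_m=4.0, output_m=0.5, prices=prices)
pflops = edge_petaflops(params=9e9, tokens=1e6)
print(cost, pflops)
```

Under these placeholder prices the cached tokens cost as much as the raw ones despite the lower rate, which is one reason to report the three token counts separately.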

Edge-cloud collaboration beats both single-side extremes on the utility–privacy trade-off; Sketch-Guided keeps privacy at 100%, Task-Routing is the most balanced, and Adaptive Assistance gets the best Pass³ at [...]

[...] `//...` (scores, token/cost usage, agent trace, produced files), with a per-category and global summary generated automatically once the run finishes.

---

Acknowledgements

AceBench stands on the shoulders of a remarkable open-source agent community, and we are deeply grateful for it. The OpenClaw harness gives us a real, full-featured agent runtime — tools, skills, and a live workspace — to build on. Our tasks and evaluation design draw inspiration and adapted material from a series of outstanding agent benchmarks: Claw-Eval, WildClawBench, QwenClawBench, LiveClawBench, PinchBench, and ClawBench. Their meticulous task curation, rigorous grading, and reproducible harness design set the bar for trustworthy agent evaluation, and made the privacy-aware, edge-cloud extension in AceBench possible.

License

Released under the [MIT License](LICENSE).
