Star 历史趋势
数据来源: GitHub API · 生成自 Stargazers.cn
README.md

superpowers-bench

X Discord

Can your agent pick the right skill without being told?

You give your agent a set of skills. You describe a task. Does it figure out which skills to use - or does it need hand-holding?

superpowers-bench measures how well coding agents (Claude Code, Codex, OpenCode) discover and invoke skills from obra/superpowers based on context alone. It tests two conditions: no hints (just the task prompt) and with hints (task prompt + a natural-language hint that implies the right workflow without naming it).

superpowers-bench F1 scores

  • Skill selection as the unit of measure - not code quality, not pass/fail on tests. Did the agent reach for the right tools?
  • Apples-to-apples comparison - same 20 tasks, same skills, same grading. Just swap the agent.
  • Baseline vs triggered - quantifies how much a well-written hint improves skill discovery without explicit instruction.

Quick Start

$ npm install $ npm run fetch-skills # download skill definitions $ npm run bench -- run --condition claude --task explain_code # ... agent runs, result appended to results/results.jsonl $ npm run bench -- matrix --parallel 4 # run full benchmark $ npm run bench -- report # generate report

Install

From source (only method - this is a benchmarking tool, not a published package):

git clone https://github.com/kunchenguid/superpowers-bench.git cd superpowers-bench npm install npm run fetch-skills

Prerequisites: Node.js 16+, and at least one supported agent CLI on your PATH with valid API credentials: claude (Claude Code CLI), codex (OpenAI Codex CLI), or opencode (OpenCode CLI).

How It Works

┌──────────────────────────────────────────────────┐
│ CLI: run / matrix / report                       │
└──────────────────┬───────────────────────────────┘
                   ▼
┌──────────────────────────────────────────────────┐
│ Runner                                           │
│  1. Clone target repo into isolated workspace    │
│  2. Copy skill definitions into workspace        │
│  3. Build prompt (+ trigger hint if triggered)   │
│  4. Execute agent (multi-turn, auto-respond x3)  │
│  5. Detect which skills were invoked             │
│  6. Grade: precision / recall / F1 vs expected   │
│  7. Append result to results.jsonl               │
└──────────────────────────────────────────────────┘
                   ▼
┌──────────────────────────────────────────────────┐
│ Reporter                                         │
│  Load results.jsonl ► markdown tables            │
│  Summary by condition, per-task, per-skill       │
└──────────────────────────────────────────────────┘
  • Isolated workspaces - each parallel worker gets its own git copy of the target repo to prevent interference
  • Detection method - Claude: parsed from Skill tool_use events. Codex/OpenCode: fingerprint matching against distinctive phrases from skill bodies
  • Grading - set comparison between detected and expected skills. Pass = no missing, no extra

CLI Reference

CommandDescription
npm run bench runExecute a single condition x task combination
npm run bench matrixRun full benchmark matrix across all combos
npm run bench reportGenerate markdown report from results.jsonl
npm run fetch-skillsDownload skill definitions from superpowers

run and matrix start from a fresh results/ directory. report always reflects the latest benchmark batch, not an append-only history.

Flags

CommandFlagDescription
run--condition <id>Condition to run (required)
run--task <id>Task to run (required)
run--repeat <n>Number of repetitions (default: 1)
matrix--parallel <n>Max concurrent workers (default: 4)
matrix--condition <id>Filter to one condition
matrix--task <id>Filter to one task
matrix--repeat <n>Repeats per combination (default: 1)

Configuration

All config lives in config/:

  • conditions.yaml - defines agent variants (agent, model, triggered flag)
  • tasks.yaml - 20 tasks with prompts, expected skills, and trigger hints
  • fingerprints.yaml - distinctive phrases for Codex/OpenCode skill detection

Conditions

IDAgentModelTriggered
claudeClaudeclaude-opus-4-6no
claude-triggeredClaudeclaude-opus-4-6yes
claude-opus-4-7Claudeclaude-opus-4-7no
claude-opus-4-7-triggeredClaudeclaude-opus-4-7yes
codexCodexgpt-5.4no
codex-triggeredCodexgpt-5.4yes
opencode-gpt-5-4OpenCodeopenai/gpt-5.4no
opencode-gpt-5-4-triggeredOpenCodeopenai/gpt-5.4yes

Development

npm install # install dependencies npm run fetch-skills # download skill definitions npm run bench # run CLI (see subcommands above)

关于 About

Can your agent use the right skills for the right tasks?

语言 Languages

TypeScript98.8%
Shell1.2%

提交活跃度 Commit Activity

代码提交热力图
过去 52 周的开发活跃度
4
Total Commits
峰值: 3次/周
Less
More

核心贡献者 Contributors