# Contributing to ClawBench **First-time open-source contributor?** We specifically want to hear from you. Most of ClawBench is plain JSON + a short schema, and the thing that makes the benchmark valuable is breadth of everyday tasks — which means the best people to contribute are the people doing everyday things on the web. If you've booked a doctor's appointment, ordered groceries, filed a government form, joined a book club, or returned an Amazon package, you already know how to write a test case. ## What we're looking for | Contribution | Effort | Impact | |---|---|---| | Add one new test case | ~30 min | Every additional site expands coverage | | Add a batch of test cases in one new category (e.g., healthcare, K-12, real estate) | ~2-3 hours | Unlocks a new evaluation axis | | Add a new model config (in `models/models.yaml`) | ~1 hour | Get your model on the leaderboard | | Fix a flaky task (find a broken task, propose a fix) | ~20 min | Keeps the leaderboard fair | | Translate docs into a new language | ~1 hour | Chinese / Japanese / Korean / Spanish welcomed | | Report a bug via [issue template](https://github.com/reacher-z/ClawBench/issues/new/choose) | ~5 min | Helps us prioritize | ## Recognition for contributors - **Every merged PR** — your GitHub avatar appears on the repo contributor wall automatically. - **3+ merged test cases** — you're listed in the Community Contributors section of the README and thanked in the next arXiv paper revision's acknowledgments (with your consent). - **A whole new category (10+ tasks)** — proposed as a co-author on the relevant follow-up paper / workshop submission. - **Model runner** — your model ships on the live leaderboard at [claw-bench.com](https://claw-bench.com) with a direct link to your org. ## Good first issues The issue tracker has a [`good first issue`](https://github.com/reacher-z/ClawBench/labels/good%20first%20issue) label for contributions sized at "30 minutes, no container experience required." Typical entries: - Add a new test case for a site we don't yet cover (list in the issue description) - Verify a flagged-flaky task still works - Add an `extra_info/` file to enrich an existing task If no `good first issue` is currently open, just open one with your idea and we'll label it and help you land the PR. ## Adding a new test case ClawBench currently has two full task corpora: V1 lives in `test-cases/v1/` with 153 tasks, and V2 lives in `test-cases/v2/` with 130 tasks. Both use `test-cases/task.schema.json`. Unless a maintainer asks for a V2-only contribution, add new tasks to V1. 1. **Pick a task ID** — find the next available number by checking existing directories in the target corpus (`test-cases/v1/` for V1, or `test-cases/v2/` for V2). 2. **Create the directory** following the naming convention: ``` test-cases/v1/---/ ``` Example: `test-cases/v1/887-daily-life-food-grubhub/` V2 directories use the `v2-` prefix: ``` test-cases/v2/v2----/ ``` 3. **Create `task.json`** in the directory. It must conform to `test-cases/task.schema.json`: ```json { "$schema": "../../task.schema.json", "metadata": { "task_id": 887, "metaclass": "daily-life", "class": "food", "description": "Order a burger on Grubhub", "sites_involved": ["grubhub.com"], "platform": "grubhub", "common_info": { "email_credentials": "credentials to use the assigned disposable email account", "user_info": "alex_green_personal_info.json; the dummy user's personal information", "user_resume": "PDF resume with disposable email account injected" } }, "instruction": "On Grubhub, order delivery: one cheeseburger to home address", "eval_schema": { "url_pattern": "__PLACEHOLDER_WILL_NOT_MATCH__", "method": "POST" }, "time_limit": 30 } ``` For V2 tasks, use the same shared schema reference: `"$schema": "../../task.schema.json"`. 4. **Set the eval_schema** — this tells the interceptor which HTTP request to block: - If the task has an irreversible action (form submit, email send, application submit): set `url_pattern` to a regex matching the submission endpoint, and `method` to the HTTP method. - If the task is behind a payment wall (agent has no valid credit card): use `"url_pattern": "__PLACEHOLDER_WILL_NOT_MATCH__"` — the interceptor will never fire and the session runs until timeout. - See [`test-cases/task.schema.json`](test-cases/task.schema.json) for the supported `body` and `params` filters. 5. **Test with human mode** to verify the task is completable: ```bash uv run clawbench-run test-cases/v1/887-daily-life-food-grubhub --human ``` For V2: ```bash uv run clawbench-run test-cases/v2/v2-887-daily-life-food-grubhub --human ``` 6. **Submit a PR** with your new test case directory. Include in the PR description: - Which site and category - The URL where the "Submit" action fires (so reviewers can confirm the interceptor config) - A one-line note on whether the task is solvable by a human in under a minute ## Extra info files If the task requires additional context (e.g., a pre-filled profile, a specific document), add files under `extra_info/` in the test case directory and reference them in `task.json`: ```json "extra_info": [ { "path": "extra_info/cover_letter.txt", "description": "Cover letter to attach with the application" } ] ``` ## Code changes For changes to the framework itself (test driver, runtime server, Chrome extension, container): 1. Read the relevant sub-README for component-specific documentation: - [README.md#-cli](README.md#-cli) - [src/clawbench/runtime/runtime-server/README.md](src/clawbench/runtime/runtime-server/README.md) - [src/clawbench/runtime/chrome-extension/README.md](src/clawbench/runtime/chrome-extension/README.md) 2. Open an issue first for anything beyond a small bug fix so we can align on approach before you spend time. 3. Open a PR with a clear description of the change and how you tested it. ## Reporting issues Please use the [issue templates](https://github.com/reacher-z/ClawBench/issues/new/choose) to report bugs or propose new test cases. ## Community Questions, task ideas, model submissions, or just want to chat about browser agents? - **Discord:** [discord.gg/clawbench](https://discord.gg/clawbench) — English, agent-builder-friendly, `#contributors` channel - **微信群:** use the [QR code](./assets/community/wechat_grp_422.jpg) - **GitHub Discussions:** [github.com/reacher-z/ClawBench/discussions](https://github.com/reacher-z/ClawBench/discussions) — asynchronous, searchable - **Email:** open an issue; we respond there first ## Code of conduct Be decent to each other. No harassment, no personal attacks, no spam. We follow the spirit of the [Contributor Covenant](https://www.contributor-covenant.org/version/2/1/code_of_conduct/). Maintainers reserve the right to close PRs / remove comments / ban accounts that harm the community.