Harness Primitives for Long-Running Claude Agents

Claude Code's built-in /goal command gives you a generator/evaluator loop out of the box: set a completion condition and a separate fast model checks it after every turn until it's met. This repo ships the same underlying primitives as short, readable hooks and a subagent, so you can see how each mechanism works and assemble a harness tuned to your project. The patterns come from Effective Harnesses for Long-Running Agents (Nov 2025) and Harness Design for Long-Running Application Development (Mar 2026). We recommend trying both the in-product features and a custom harness to see which fits your workflow.

|  | In-product | Custom harness (this repo) |
| --- | --- | --- |
| What runs the loop | /goal | the primitives below + a loop you write |
| Who judges "done" | a separate fast model checking your condition | your agents/evaluator.md with your prompt |
| Where it works | Claude Code interactive, -p, Remote Control | Claude Code, headless, or Agent SDK |

Three primitives form the quality loop:

  • Default-FAIL contract. Every criterion starts false; the agent can't mark it passing without opening evidence first.
  • Fresh-context evaluator. A separate agent with no Write/Edit tools grades the work from a context window that never saw the build.
  • Agent-maintained handoff. The agent writes its own progress notes and commits to git so the next session picks up cleanly.

Two more operator-control hooks are included for when you want to watch or intervene. The same patterns translate directly to PreToolUse/Stop callbacks in the Agent SDK.

Built as the take-home for the Long-Running Agents station at Code with Claude 2026. These are example ingredients, not a turnkey harness. Event demo; not maintained and not accepting contributions.

How to use this repo

Read and cherry-pick. Each primitive is one standalone file with no dependency on the others. The quality-loop table below maps every one to its example here and its Agent SDK equivalent. Open the file, see how the mechanism works, copy what fits.

Or copy all of them as a starting point. In your project, run claude and paste:

Clone github.com/anthropics/cwc-long-running-agents into /tmp, copy its claude-code-config/.claude/ directory into this project's root, make the hook scripts executable, then walk me through what each hook and the evaluator subagent does and what I'll need to adapt for this project.

or do it yourself:

    cp -r claude-code-config/.claude /path/to/your/project/
    chmod +x /path/to/your/project/.claude/hooks/*.sh

Either way, this gives you all the examples wired into .claude/ at once. Before relying on them: point RESULTS_FILE at your project's actual results file, adjust the evidence-file pattern in track-read.sh, and run claude from the directory that contains .claude/ (hooks are not loaded when launching from a subdirectory).
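
The checks above can be scripted before a long run. The following is a minimal sketch, not part of the repo: the function name `preflight` is hypothetical, and it takes the results file path as an argument rather than reading the RESULTS_FILE setting from the hook scripts.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; not shipped with the repo.
# Usage: preflight [results-file]   (defaults to test-results.json)
preflight() {
  local results_file="${1:-test-results.json}"
  local problems=0
  local hook

  # Hooks are only loaded when claude starts in the directory holding .claude/.
  if [ ! -d .claude ]; then
    echo "no .claude/ in $(pwd); run claude from the project root" >&2
    problems=$((problems + 1))
  fi

  # The setup step makes hook scripts executable; flag any that are not.
  for hook in .claude/hooks/*.sh; do
    [ -e "$hook" ] || continue   # the glob matched nothing
    if [ ! -x "$hook" ]; then
      echo "not executable: $hook (fix with chmod +x)" >&2
      problems=$((problems + 1))
    fi
  done

  # The contract file that the gate protects has to exist.
  if [ ! -f "$results_file" ]; then
    echo "results file not found: $results_file" >&2
    problems=$((problems + 1))
  fi

  # Succeed only when nothing was flagged.
  [ "$problems" -eq 0 ]
}
```

Run `preflight` from the project root; a non-zero status means at least one problem was printed to stderr. It does not check the evidence-file pattern in track-read.sh, which still needs a manual look.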

If you're on the Agent SDK, this repo is a pattern reference. The shell hooks here translate one-to-one to PreToolUse/Stop callbacks. To scaffold an SDK agent from inside Claude Code, install the agent-sdk-dev plugin and ask Claude to build an agent that implements whichever of these primitives you want. For a hand-written starting point, see the autonomous-coding quickstart or the agent-sdk-workshop curriculum.

The quality loop

| Primitive | Claude Code example | Agent SDK equivalent | Enforcement |
| --- | --- | --- | --- |
| Default-FAIL contract | hooks/track-read.sh + hooks/verify-gate.sh | PreToolUse callback | hook |
| Fresh-context evaluator | agents/evaluator.md subagent (no Write/Edit) | evaluator_optimizer.ipynb | you invoke it |
| Agent-maintained handoff | CLAUDE.md + hooks/commit-on-stop.sh | system prompt + Stop callback | convention + hook backstop |

Default-FAIL contract

Agents will mark a feature "passing" after a unit test or a curl when the UI is visibly broken. Asking nicely in the prompt doesn't reliably stop this. The harness makes "done" structural. Every feature is a row in a test-results.json file you create in your project:

    {
      "feature-1": { "passes": false },
      "feature-2": { "passes": false }
    }

The only evidence that counts is a file matching the patterns in track-read.sh (screenshots, console logs, result files), and a PreToolUse hook denies any write to the results file unless the agent has first opened one with the Read tool. The agent can't claim success it hasn't observed. (The shipped hook is intentionally simple; see the comments in verify-gate.sh for the gaps a production version would close.)
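
The contract and its gate can be sketched in a few functions. This is an illustration under assumptions, not the shipped track-read.sh or verify-gate.sh: the function names, the evidence patterns, and the `EVIDENCE_LOG` default are hypothetical, and a real hook would first extract the tool's file path from the JSON that Claude Code passes on stdin. The return code 2 mirrors the exit code a Claude Code hook uses to block a tool call.

```shell
#!/usr/bin/env bash
# Simplified sketch of the default-FAIL gate. Names and patterns are assumptions.
RESULTS_FILE="${RESULTS_FILE:-test-results.json}"
EVIDENCE_LOG="${EVIDENCE_LOG:-.claude/.evidence-reads}"

# Write a contract in which every named feature starts out failing.
init_contract() {
  local name first=1
  printf '{\n' > "$RESULTS_FILE"
  for name in "$@"; do
    if [ "$first" -eq 1 ]; then first=0; else printf ',\n' >> "$RESULTS_FILE"; fi
    printf '  "%s": { "passes": false }' "$name" >> "$RESULTS_FILE"
  done
  printf '\n}\n' >> "$RESULTS_FILE"
}

# Call on every Read: log the path only if it looks like evidence.
record_read() {
  case "$1" in
    *.png|*.log|*results*.txt)   # hypothetical evidence patterns
      mkdir -p "$(dirname "$EVIDENCE_LOG")"
      printf '%s\n' "$1" >> "$EVIDENCE_LOG"
      ;;
  esac
}

# Call before every Write/Edit: return 2 (deny) for the results file
# unless evidence has been read since the last gated write.
gate_write() {
  [ "$(basename "$1")" = "$(basename "$RESULTS_FILE")" ] || return 0
  if [ ! -s "$EVIDENCE_LOG" ]; then
    echo "Denied: read a screenshot or log before updating $RESULTS_FILE" >&2
    return 2
  fi
  : > "$EVIDENCE_LOG"   # reset so the next write needs fresh evidence
  return 0
}
```

For example, `init_contract feature-1 feature-2` produces the file shown above; `gate_write test-results.json` then returns 2 until `record_read screenshots/home.png` has logged a read. Like the shipped hook, this sketch only proves that some evidence file was opened, not that it relates to the feature being marked.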

Fresh-context evaluator

The builder shouldn't grade its own work. After each feature, you (or your wrapper script) invoke a separate subagent (agents/evaluator.md) with no Write/Edit tools that reviews the diff and the screenshots from a context window that never saw the build, then returns PASS or NEEDS_WORK with specific findings. On NEEDS_WORK the findings become the next builder session's starting prompt, closing the build/evaluate/rebuild loop. Invoke it from your wrapper with claude --agent evaluator -p "<review prompt>"; what your loop does with the verdict is up to you.
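
One way a wrapper might act on the verdict is sketched below. It assumes the evaluator prints PASS or NEEDS_WORK as the first line of its output, which depends on how your evaluator prompt is written; the function name `handle_verdict` and the default findings file name are choices made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: route an evaluator verdict. Assumes the first output line is the verdict.
# Usage: handle_verdict "$(claude --agent evaluator -p "<review prompt>")"
handle_verdict() {
  local output="$1"
  local findings_file="${2:-NEXT_FINDINGS.md}"
  local verdict

  # First line only, with surrounding whitespace removed.
  verdict=$(printf '%s\n' "$output" | sed -n '1p' | tr -d '[:space:]')

  case "$verdict" in
    PASS)
      rm -f "$findings_file"   # nothing to carry into the next session
      return 0
      ;;
    NEEDS_WORK)
      # Everything after the verdict line becomes the next builder prompt.
      printf '%s\n' "$output" | tail -n +2 > "$findings_file"
      return 1
      ;;
    *)
      # Treat anything unexpected as a failure rather than a pass.
      printf '%s\n' "$output" > "$findings_file"
      return 1
      ;;
  esac
}
```

Failing closed on an unrecognized first line keeps a malformed or truncated evaluator response from being counted as a pass.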

Agent-maintained handoff

A fresh session has no memory of what the previous one did, and when a long session fills its context window Claude Code summarizes the history, which loses detail. So the agent maintains the handoff itself: it scopes each session to one feature, writes to a structured PROGRESS.md as it works, re-reads that file first thing on every restart, and runs git add and git commit at meaningful checkpoints so that git log serves as a second record. commit-on-stop.sh is the backstop that catches whatever's still uncommitted at session end. This is the layer most sensitive to model capability; newer models drift less and self-scope better, so re-evaluate how much of CLAUDE.md you still need after each model release (see Re-simplify on model upgrades).
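
The backstop idea can be sketched as follows. This is not the shipped commit-on-stop.sh; the function name and commit message are illustrative, and it assumes a git work tree with a commit identity already configured.

```shell
#!/usr/bin/env bash
# Sketch of a commit backstop: save whatever is uncommitted at session end.
commit_backstop() {
  # Do nothing outside a git work tree.
  git rev-parse --is-inside-work-tree >/dev/null 2>&1 || return 0

  # --porcelain prints one line per changed or untracked path; empty means clean.
  if [ -z "$(git status --porcelain)" ]; then
    return 0
  fi

  # Stage everything not excluded by .gitignore and record a checkpoint.
  git add -A
  git commit --quiet -m "checkpoint: uncommitted work at session end"
}
```

Because `git add -A` stages every untracked file, keep .gitignore current so build output and secrets stay out of the checkpoint commits.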

A fourth core piece, a rubric for subjective work, isn't shipped here because it's project-specific; see Going further for how to add one to the evaluator.

Running the loop

Two ways to keep the build → evaluate → rebuild cycle going. /goal is built into Claude Code and works with or without this repo's primitives; the second path wires the contract file and your own evaluator.md directly into the loop.

/goal: built-in completion checker

/goal every feature in PROGRESS.md is implemented, committed, and its tests pass

After every turn a separate fast model checks the condition and keeps the session going until it's met. One line, no contract file or hooks. Works the same in interactive Claude Code, claude -p, and Remote Control. See the docs for writing an effective condition and how /goal compares to /loop and Stop hooks.

Your evaluator.md as the gate

When you want the default-FAIL contract (test-results.json + verify-gate) enforcing evidence and your own evaluator prompt deciding PASS/NEEDS_WORK, the loop lives in your code. How depends on where you're running:

| Surface | How |
| --- | --- |
| Claude Code | custom Stop hook runs the evaluator as a fresh claude process after each turn and blocks on anything but PASS |
| Headless (claude -p) | wrapper script calls claude --agent evaluator -p between builds (example below) |
| Agent SDK | separate generator and evaluator query() calls; see evaluator_optimizer |

    # Keep cycling while any feature in the contract is still failing.
    while grep -q '"passes": false' test-results.json; do
      # Builder: fresh session, one feature.
      claude -p "Read PROGRESS.md and build the next unfinished feature per CLAUDE.md."
      # Evaluator: fresh context, read-only tools.
      VERDICT=$(claude --agent evaluator -p "Review the most recent commit against its spec.")
      # On anything but PASS, save the findings for the next builder session.
      [ "$(echo "$VERDICT" | head -1)" = "PASS" ] || echo "$VERDICT" > NEXT_FINDINGS.md
    done

Each pass is a fresh context. Exit when the contract file has nothing left failing, a cycle makes no changes, or a budget is hit; touch AGENT_STOP to stop early. The builder writes test-results.json (verify-gate enforces evidence-read first) and the evaluator returns a separate verdict; have your wrapper write the contract on PASS instead if you want "all true" to mean "independently confirmed."
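
The wrapper example above only checks the contract file. The other exit conditions can be sketched as a helper that the loop calls at the top of each cycle. All names here (`should_stop`, `MAX_CYCLES`) are hypothetical, and "no changes" is approximated by comparing the git HEAD before and after a cycle, which assumes every productive cycle ends in a commit.

```shell
#!/usr/bin/env bash
# Sketch: decide whether the outer loop should exit. Prints the reason.
# Usage: should_stop <cycle-number> <head-before-last-cycle> <head-after-last-cycle>
# In the loop:
#   before=$(git rev-parse HEAD); ...build and evaluate...; after=$(git rev-parse HEAD)
#   should_stop "$cycle" "$before" "$after" && break
MAX_CYCLES="${MAX_CYCLES:-20}"                     # hypothetical budget
RESULTS_FILE="${RESULTS_FILE:-test-results.json}"

should_stop() {
  local cycle="$1" head_before="$2" head_after="$3"

  if [ -e AGENT_STOP ]; then
    echo "operator stop"; return 0
  fi
  if [ ! -f "$RESULTS_FILE" ]; then
    echo "contract file missing"; return 0
  fi
  if ! grep -q '"passes": false' "$RESULTS_FILE"; then
    echo "contract complete"; return 0
  fi
  if [ "$cycle" -ge "$MAX_CYCLES" ]; then
    echo "budget exhausted"; return 0
  fi
  # Skip the comparison on cycle 0, before any build has run.
  if [ "$cycle" -gt 0 ] && [ "$head_before" = "$head_after" ]; then
    echo "no changes in last cycle"; return 0
  fi
  return 1
}
```

The `grep` pattern matches the spacing used in the contract example, so it depends on the file keeping the `"passes": false` form with a single space after the colon.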

Operator controls

Two more hooks for when you want to watch the loop run or intervene mid-stream:

| Hook | What it does |
| --- | --- |
| hooks/kill-switch.sh | Halts every tool call while an AGENT_STOP file exists at the project root. |
| hooks/steer.sh | Surfaces the contents of STEER.md (project root) to the agent once and clears it, so you can redirect mid-run without restarting. |
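
Both controls are small enough to sketch. The versions below are illustrations, not the shipped scripts: the function names are hypothetical, clearing STEER.md by truncating it is one choice among several, and how the printed text reaches the agent depends on the hook event you attach it to, which this sketch leaves out. The return code 2 mirrors the exit code a Claude Code hook uses to block a tool call.

```shell
#!/usr/bin/env bash
# Sketches of the two operator controls. Run from the project root.

# Deny the pending tool call while AGENT_STOP exists.
kill_switch() {
  if [ -e AGENT_STOP ]; then
    echo "AGENT_STOP present: tool calls are halted until it is removed" >&2
    return 2
  fi
  return 0
}

# Print STEER.md once, then clear it so the note is not repeated.
steer() {
  [ -s STEER.md ] || return 0     # missing or empty: nothing to surface
  echo "Operator note:"
  cat STEER.md
  : > STEER.md                    # truncate rather than delete
}
```

From another terminal, `touch AGENT_STOP` pauses the run and `rm AGENT_STOP` resumes it; `echo "Skip the settings page for now" > STEER.md` queues a one-time redirect.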

Watching it work

Everything the agent does lands on disk, so you can observe a long run from a terminal without any dashboard code:

    watch -n 2 'tail -20 PROGRESS.md'                           # its own notes
    watch -n 5 'git log --oneline -8'                           # work saved
    watch -n 5 'find screenshots -name "*.png" | tail -5'       # what it sees
    watch -n 2 'wc -l < .claude/.evidence-reads 2>/dev/null'    # evidence reads (resets to 0 after each gated write)

Run them in a tmux grid and you have a live view of every primitive.

Going further

The next layer of patterns, each covered in depth in Harness Design for Long-Running Application Development:

| Pattern | What it adds | Reference |
| --- | --- | --- |
| Unattended loop | Cap session length and have an outer script start the next one (pick next feature, build, evaluate, reset) | Why naive implementations fall short; ralph-loop plugin |
| Planner agent | A first session that expands a one-line ask into a BUILD_PLAN.md the loop then runs against | The architecture |
| Sprint contracts | Builder and evaluator agree per-feature on what "done" means and write it to a file the hook enforces | The architecture and Removing the sprint construct |
| Grading rubrics | Give the evaluator scoring principles (functionality, design, craft, originality) with few-shot examples instead of binary pass/fail | Frontend design: making subjective quality gradable; frontend-design plugin |
| Browser-verified evaluator | Let the evaluator open the running app itself instead of trusting the builder's screenshots | Frontend design (Playwright MCP usage); add @playwright/mcp or Claude in Chrome to tools: in agents/evaluator.md |
| Re-simplify on model upgrades | After each model release, comment out harness pieces one at a time and see what's still load-bearing | What comes next; Harnessing Claude's Intelligence |
| Hosted runtime | Anthropic hosts the loop, sandbox, and scheduling so you don't run any of this yourself | Claude Managed Agents |
