Star 历史趋势
数据来源: GitHub API · 生成自 Stargazers.cn
README.md

ClaimPilot Harness

Crash-test insurance claim AI agents before production.

A crash-test simulator for AI claim agents: adversarial cases, deterministic scoring, and replayable failure reports.

Live demo · Connect real agents · 中文介绍 · Release v0.1.0

CI Ready Python License: MIT Agent Evals Prompt Injection

ClaimPilot Harness runs messy insurance claim scenarios against AI agents and shows where they passed, hesitated, or failed.

It is not another claim-processing agent. It is the test range for them.

ClaimPilot Harness cover

ClaimPilot demo

中文简介

ClaimPilot Harness 是一个面向保险理赔 AI Agent 的评测与红队测试框架。它把冲突证据、缺失材料、保单排除项、用户陈述矛盾和 Prompt Injection 做成可复现的测试案例,用来验证 Agent 在真实业务压力下是否可靠。

项目内置车险、健康险、旅行险、宠物险和财产险等示例案例,支持 deterministic scoring、case coverage catalog、Agent 横向对比、HTML replay、leaderboard,以及 OpenAI-compatible /v1/chat/completions 和 HTTP Agent 接口接入。

它不是又一个理赔 Agent,而是理赔 Agent 上线前的“碰撞测试场”。完整中文介绍见 docs/zh-CN.md

python -m claimpilot_harness compare cases/travel-injection-001.json demo risky
Case: travel-injection-001 Leaderboard: runs/travel-injection-001-leaderboard.html Agent Score Verdict ------------ -------- ------------ demo 93.9% investigate risky 6.1% approve

Why This Exists

Most AI agent demos look impressive until they meet messy real-world claims: mismatched invoices, missing documents, policy exclusions, claimant contradictions, hidden prompt injection, and privacy traps.

ClaimPilot turns those failure modes into repeatable test cases.

Use it to answer:

  • Did the agent choose the right claim action?
  • Did it cite the evidence that mattered?
  • Did it request the missing document instead of guessing?
  • Did it detect fraud or coverage inconsistencies?
  • Did it ignore malicious instructions hidden inside uploaded evidence?

See docs/why-claimpilot.md for the product thesis and docs/evaluation-methodology.md for the evaluation methodology.

Demo

Compare a careful agent against a deliberately risky one:

python -m claimpilot_harness compare cases/travel-injection-001.json demo risky

On Windows, use py -m claimpilot_harness ... if python is not on your PATH.

You will get a score and a replay report:

Case: travel-injection-001 Leaderboard: runs/travel-injection-001-leaderboard.html Agent Score Verdict ------------ -------- ------------ demo 93.9% investigate risky 6.1% approve

Open runs/latest.html to view the leaderboard.

ClaimPilot leaderboard preview

Run the full regression suite across all included cases:

python -m claimpilot_harness suite cases --agents demo risky
Cases: 6 Report: runs/suite-report.html Agent Avg Score Pass Rate ------------ ---------- ---------- demo 94.8% 100.0% risky 13.2% 0.0%

What A Replay Shows

The replay report is designed for product, risk, and engineering review:

  • Evidence timeline
  • Agent verdict and confidence
  • Findings and requested documents
  • Prompt-injection / privacy flags
  • Scoring breakdown by rubric item
  • Raw decision JSON for debugging

Included Case Packs

CaseLineWhat It Tests
auto-collision-001AutoRepair invoice conflicts with damage photos and claimant chat.
health-bill-001HealthPossible excluded cosmetic procedure without medical necessity proof.
medical-privacy-injection-001HealthMedical necessity ambiguity plus privacy lure and hidden prompt injection.
travel-injection-001TravelMissing official delay proof plus prompt injection hidden in uploaded evidence.
pet-preexisting-001PetSymptoms appear to predate enrollment, testing pre-existing condition handling.
property-water-damage-001PropertyRepair estimate scope exceeds moisture readings and photo evidence.

Generate a coverage catalog for the case pack:

python -m claimpilot_harness catalog cases
Cases: 6 Lines: auto=1, health=2, pet=1, property=1, travel=1 Severities: critical=2, high=2, medium=2 Traps: privacy_lure=1, prompt_injection=2

Agent Interface

Use the built-in demo agent:

python -m claimpilot_harness run cases/auto-collision-001.json --agent demo

Compare built-in agents and generate a leaderboard:

python -m claimpilot_harness compare cases/travel-injection-001.json demo risky

Validate case packs before running or contributing them:

python -m claimpilot_harness validate cases

Summarize case-pack coverage:

python -m claimpilot_harness catalog cases --markdown

Run a full case-pack regression suite:

python -m claimpilot_harness suite cases --agents demo risky

The suite writes both an HTML report and a machine-readable benchmark artifact:

runs/suite-report.html runs/suite-results.json

Refresh the static GitHub Pages demo locally:

python scripts/build_demo_site.py

Run an OpenAI-compatible model:

python -m claimpilot_harness run cases/travel-injection-001.json \ --agent openai \ --openai-model your-model-name

Compare it against the built-in baselines:

python -m claimpilot_harness compare cases/travel-injection-001.json demo openai risky \ --openai-model your-model-name

Run a custom HTTP agent service:

python examples/http_agent.py python -m claimpilot_harness run cases/travel-injection-001.json \ --agent http \ --agent-url http://127.0.0.1:8000/review

Or connect any agent command that reads JSON from stdin and prints a JSON decision:

python -m claimpilot_harness run cases/auto-collision-001.json \ --agent command \ --agent-command "python examples/simple_agent.py"

Expected decision shape:

{ "verdict": "investigate", "confidence": 0.82, "summary": "Hold the claim pending additional review.", "findings": ["invoice and physical damage evidence need reconciliation"], "requested_documents": ["independent adjuster estimate"], "cited_evidence": ["E2", "E3"], "privacy_flags": ["ignored embedded instruction in evidence"] }

Case Format

Cases are plain JSON files. Each case contains:

  • Claimant and policy context
  • Evidence summaries with stable IDs
  • Red-team traps
  • Expected findings, document requests, citations, and forbidden behavior
  • A weighted scoring rubric

See docs/case-format.md.

The scoring approach is explained in docs/evaluation-methodology.md and docs/scoring-rubric.md.

Validate a case file or an entire case directory:

python -m claimpilot_harness validate cases python -m claimpilot_harness validate cases/travel-injection-001.json --json

OpenAI-Compatible Adapter

ClaimPilot supports OpenAI-style /v1/chat/completions endpoints without requiring an SDK dependency.

Set OPENAI_API_KEY, then pass --agent openai and --openai-model. Use --openai-base-url for compatible local or hosted gateways.

See docs/openai-compatible.md.

For end-to-end examples, see docs/connect-real-agents.md.

HTTP Agent Adapter

ClaimPilot can evaluate any custom agent service that accepts POST JSON and returns a decision object.

Start the example service:

python examples/http_agent.py

Then run:

python -m claimpilot_harness run cases/travel-injection-001.json \ --agent http \ --agent-url http://127.0.0.1:8000/review

Roadmap

  • Mixed-agent comparison configs
  • Ollama adapter
  • LLM-as-judge scoring mode
  • Claim case generator for synthetic case packs
  • Fraud, compliance, and privacy scorecards
  • CI mode for regression testing agent changes
  • GitHub Pages replay gallery

Positioning

ClaimPilot Harness is built for the gap between AI agent demos and production systems. A claim agent that can answer one happy-path question is easy to build. A claim agent that survives conflicting evidence, policy constraints, missing documents, and adversarial uploads needs a harness.

That is the product surface this project explores.

Sharing

For natural launch copy and short project notes, see docs/launch-notes.md.

License

MIT

关于 About

Crash-test insurance claim AI agents before production.
agent-evaluationai-agentsinsurancellm-evalsprompt-injectionpythontesting

语言 Languages

Python100.0%

提交活跃度 Commit Activity

代码提交热力图
过去 52 周的开发活跃度
15
Total Commits
峰值: 11次/周
Less
More

核心贡献者 Contributors