🌩️ T3MP3ST

 ▄▄▄█████▓▓█████  ███▄ ▄███▓ ██▓███  ▓█████   ██████ ▄▄▄█████▓
 ▓  ██▒ ▓▒▓█   ▀ ▓██▒▀█▀ ██▒▓██░  ██▒▓█   ▀ ▒██    ▒ ▓  ██▒ ▓▒
 ▒ ▓██░ ▒░▒███   ▓██    ▓██░▓██░ ██▓▒▒███   ░ ▓██▄   ▒ ▓██░ ▒░
 ░ ▓██▓ ░ ▒▓█  ▄ ▒██    ▒██ ▒██▄█▓▒ ▒▒▓█  ▄   ▒   ██▒░ ▓██▓ ░
   ▒██▒ ░ ░▒████▒▒██▒   ░██▒▒██▒ ░  ░░▒████▒▒██████▒▒  ▒██▒ ░
   ▒ ░░   ░░ ▒░ ░░ ▒░   ░  ░▒▓▒░ ░  ░░░ ▒░ ░▒ ▒▓▒ ▒ ░  ▒ ░░
     ░     ░ ░  ░░  ░      ░░▒ ░      ░ ░  ░░ ░▒  ░ ░    ░
   ░         ░   ░      ░   ░░          ░   ░  ░  ░    ░
             ░  ░       ░               ░  ░      ░

A multi-agent offensive-security framework, built to turn the AI coding agent you already run into a zero-day hunter.

Point T3MP3ST at an authorized target and the kill chain is yours — recon → exploit → report, from a web War Room or the CLI, driven by the AI coding agent you're already signed into (Claude Code, Codex, Hermes). No new API keys, no cloud, no second bill. Your agent is the brain; T3MP3ST is the war machine you bolt around it. Self-hosted storm, keyless warfare. ⚡

The recon engine is live and tool-backed, and the exploit loop is benchmark-proven: 90.1% pass@1 on XBEN — XBOW's own 104-challenge suite — every solve graded against a committed flag oracle that verify-claims recomputes on demand (reproducible below). Then it went hunting COLD on a held-out set of 10 real CVEs disclosed in 2026, across 7 languages — post-cutoff bugs the hardened prompts were never tuned on. A single agent pinned 8/10 to the exact file, line, and CWE (verified all-exact, stable under re-scoring), and the full pack surfaced all 10 (verify-claims recomputes it from the raw findings; small n, reported honestly as directional). Memorization and overfitting, both off the table. The full 8-operator swarm is the architecture it grows into; the status table is exact about what's live, what's scaffolding, and what's still roadmap. Loud about the mission, honest about where the build is.

Three things set it apart:

Reproducible. Every number in this README recomputes from committed data — npm run verify-claims re-derives all of them, 24/24 green. A claim that can't be reproduced doesn't ship. No trust-me numbers, ever.
Keyless. The AI coding agent already on your machine is the backbone. No API keys, no second bill, no gatekeeper.
Honest about scope. The status table marks exactly what's stable, experimental, or roadmap — because red-teaming shouldn't be a priesthood, and it damn sure shouldn't run on vibes.

⚠️ Authorized use only

T3MP3ST is an offensive security tool, built for authorized testing, research, and education. Point it only at systems you own or have explicit, written permission to test. Unauthorized access to computers, networks, or data is illegal in most jurisdictions — you alone are responsible for how you use this software and for staying inside the law and your rules of engagement. Bring the storm to your targets, not someone else's.

T3MP3ST is provided as-is under the AGPL-3.0 license, with no warranty and no liability for any damage, loss, or misuse. The authors do not endorse, support, or condone unauthorized activity. Get permission. Stay in scope. Don't be a menace. 🫡

Why it exists

Offensive security sits behind years of practice and expensive tooling. The bet behind T3MP3ST is that a coordinated agent swarm puts real bug-hunting in reach of people who never got the invite, across web apps, CTFs, smart contracts, source code, and embedded/robotics OSS. That is an ambitious bet, and the sections below are careful to separate what already works from what is still a bet.

What it hunts

Domain	What it does	Status
🕸️ Web apps	Black-box, external-attacker recon → exploit (XBEN suite)	✅ Stable
🚩 CTF	Hint-free, sandbox-jailed solves (Cybench)	✅ Stable
🤖 Robotics / OT / embedded	Coordinated-disclosure pipeline for OSS vuln hunting (OSV + live-PoC + refuter)	✅ Pipeline stable
📂 Source code	White-box repo analysis with blind master-builder decomposition	⚠️ Python-only ingest
💰 Smart contracts	Damn Vulnerable DeFi	⚠️ reproduction, not novel discovery

Quick start

Fastest path to a running War Room (keyless, ~2 min to set up; mission time depends on the target):

npm install
npm run server        # War Room → http://127.0.0.1:3333/ui/

In the War Room, open Settings and connect a local agent (Claude Code / Codex / Hermes). Then describe a target to Op Admiral in plain English and launch. The agent you connected is the brain. No key required.

Prefer to bring a key? Set one and skip the connect step:

export OPENROUTER_API_KEY=...     # or VENICE_API_KEY / ANTHROPIC_API_KEY

Check the numbers for yourself:

npm run verify-claims             # re-derives every headline from committed JSON in bench/

Library/SDK usage, the full HTTP API, and MCP setup live in docs/.

What ships today

The framework is an 8-operator kill chain, and this table won't blow smoke about it. Recon is a live, tool-backed engine — and the teeth are already real: 90.1% pass@1 on XBEN, 8/10 held-out post-cutoff CVEs pinned to exact file/line/CWE, and a coordinated-disclosure pipeline that's live enough to have drafts held for vendor coordination right now. What's not proven is the swarm. Each downstream operator — Exploiter, Infiltrator, Exfiltrator, Ghost — runs the same real, tool-backed ReAct loop as recon (real exploit tools, not stubs), but the headline numbers came from a single agent, not the coordinated 8-operator cell, and end-to-end swarm exploitation is unbenchmarked and still unreliable. The engine is real; the swarm is the part still earning its stripes. Loud where we've earned it, blunt about the rest.

Component	Status	Notes
Re-derivable measurement (`verify-claims`)	✅ Stable	every headline recomputes from committed artifacts
Recon engine	✅ Stable	drives nmap / DNS / HTTP / fingerprinting; every finding traces to real tool output
Mission engine + War Room + Op Admiral	✅ Stable	keyless through a connected local agent
Arsenal, MCP server, HTTP API	✅ Stable	35 built-in tools by default; 83 with the opt-in `T3MP3ST_FULL_ARSENAL` (+48 adapters, with the dangerous post-ex drivers — metasploit, hydra — behind a human-approval gate) — both counts re-derive via `verify-claims`. `security_recon` over MCP
Egress-scope containment	✅ Stable (on by default)	once a mission target is set, built-in networked tools refuse off-scope public hosts — not the target/subdomains, not loopback/private (`SCOPE DENIED`) — a tightened default, not a bare tool runner
Coordinated-disclosure pipeline	✅ Stable	OSV novelty + live PoC + refuter panel + CVSS; drafts only, a human sends
White-box source analysis	⚠️ Experimental	Python-only regex ingest; multi-model decomposition costs more tokens, not fewer
DeFi (Damn Vulnerable DeFi)	⚠️ Experimental	reproduces known exploit classes; not novel discovery
Exploiter / Infiltrator / Exfiltrator / Ghost	⚠️ Experimental	run the real tool-backed ReAct loop (same engine as recon); unproven as a coordinated swarm — single-agent is the benchmarked path, live swarm exploitation still unreliable
Advanced modules (cloud, persistence, swarm, cognition)	🚧 Planned	interface-only in `src/stubs/`
Self-improvement loop	🧪 Research	records lessons + proposals today; feeding them back into planning is roadmap

Full feature-by-feature breakdown: FEATURES.md.

Coverage by domain

Where the storm reaches today — and where it's headed. Same discipline as everything else: a domain is ✅ only when there's a receipt behind it.

Domain	What it covers	Status
🕸️ Web	apps, APIs, auth flows, OWASP Top 10	✅ Core — XBEN 90.1% pass@1
📂 Code	white-box source audits, SAST-style vuln hunting	✅ Proven (hunt result) — held-out CVE-Zero: single-agent 8/10 exact file/line/CWE, 10/10 found (7 languages); the repo-ingest engine itself is still ⚠️ experimental
🚩 CTF	wargames, practice ranges, challenges	✅ Proven — Cybench 23/40 hint-free
🔌 Network / Infra	recon, service/stack fingerprinting; lateral + privesc	✅ recon (live nmap/DNS/HTTP engine) · ⚠️ lateral/privesc experimental
🤖 Embedded / IoT / OT	firmware, robotics, ICS/SCADA OSS	✅ CVE pipeline live — coordinated-disclosure drafts held for vendors
📦 Supply chain	dependency audits, install-without-confirmation	⚠️ Real — dedicated class; hit a CWE-829 on the held-out set
💰 Blockchain	smart contracts, DeFi, Solidity	⚠️ Reproduction only — Damn Vulnerable DeFi, not novel discovery
☁️ Cloud	AWS/GCP/Azure misconfig, IAM, serverless	🚧 In development
📱 Mobile	Android/iOS app security	🚧 In development
🏢 Identity / AD	Kerberos, pass-the-hash, AD attacks	🚧 In development
🔐 Binary / RE	overflows, ROP, exploit dev	🚧 In development — needs specialized tooling

The class/squad architecture means new domains compose rather than fork — each is a loadout (specialist classes + arsenal + target adapter + a benchmark). 🚧 domains ship dark until they have a number.

Benchmarks

Headline results. Each recomputes from the committed JSON with npm run verify-claims; full methodology and caveats are in the linked docs.

Suite	Result	Context
XBEN — XBOW's 104-challenge suite, black-box	pass@1 mean 90.1% (Wilson-95 86.2–92.9), floor 91/104 · gpt-5.5	XBOW self-reports 85% on the same suite; ours re-derives the graded verdict from committed artifacts (raw transcripts stripped for privacy)
XBEN — white-box (reported separately)	pass@1 98.7%, best-ball 104/104 · gpt-5.5	never blended with the black-box number
Cybench — 40-task academic bench, Opus 4.8, no hints	23/40 (58%) hint-free, single-run pass@1 (`verify-claims`-enforced)	not the raw-score record (Anthropic: 76.5% pass@10); every flag graded against the committed oracle
CVE-Zero — 10 real post-cutoff (2026) CVEs, held-out, 7 languages	single-agent 8/10 exact file/line/CWE (verified all-exact, stable) · 10/10 found (full pack)	memorization- & fitting-proof: post-cutoff, and the hardened prompts were never tuned on these; `verify-claims` recomputes it. n=10, directional; the swarm's edge here is recall, not a coordination-beats-solo proof

How to read these:

Every solved flag is graded against a committed ground-truth oracle — not a self-report — and verify-claims recomputes the pass/fail. Raw per-step transcripts are stripped for operator privacy, so you re-check the graded verdict, not the raw tool output. Zero fabricated, enforced by an anti-fitting guard that runs on every push.
Black-box (source withheld) and white-box (source staged) are reported separately and never blended.
These ran a single-agent ReAct loop, not the 8-operator swarm. The swarm is framework architecture; it is not what scored these numbers.
Results are system-vs-system: this harness driving a strong current model, not an isolated-harness claim.

XBOW self-reports 85% on its own suite; T3MP3ST scores 90.1% on that same suite — mean, with a Wilson-95 floor of 86.2%. But the number isn't the flex — the receipt is. Every point recomputes from committed artifacts with one command: npm run verify-claims, each solve graded against a committed flag oracle (raw transcripts are stripped for operator privacy, so you re-check the verdict, not the tool output). A keyless, open-source harness that hands you the re-run instead of asking you to trust it. Clone it, run verify-claims, and re-check every verdict against its committed oracle yourself (verdicts recompute from committed data; raw transcripts are stripped for privacy).

Deeper reading: WALL_FORENSICS (per-challenge misses), CYBENCH, INTEGRITY_LEDGER (contamination audit and every retraction), OBSIDIVM (our own live web range).

Documentation

Doc	Contents
FEATURES.md	feature-by-feature status (`[x]` shipped / `[~]` partial / `[ ]` planned)
SCOPE_AND_AUTHORIZATION	authority model, scope receipts, evidence and retest rules
TEAM_PREVIEW	first-run path and review script
INSTALL_MATRIX	macOS / Linux readiness table
ARSENAL_ACTIVATION_PLAN	optional external-tool setup
CYBENCH · WALL_FORENSICS · INTEGRITY_LEDGER · COGNITIVE_ARCHITECTURE	benchmark methodology
RELEASE_CHECKLIST	the gates a release must pass

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        T3MP3ST COMMAND                          │
├─────────────────────────────────────────────────────────────────┤
│   MISSION CONTROL  ◄──  TARGET MODEL  ──►  ARSENAL (TOOLS)       │
│                          ▲                                       │
│   AGENT CELL:  RECON · SCANNER · EXPLOITER · INFILTRATOR ·       │
│                EXFILTRATOR · GHOST · COORDINATOR · ANALYST       │
│                          ▲                                       │
│   EVIDENCE VAULT  ·  CREDENTIAL STORE  ·  FINDINGS LEDGER        │
│                          ▲                                       │
│   OPSEC LAYER  ·  COMMS CHANNEL  ·  LLM BACKBONE                 │
└─────────────────────────────────────────────────────────────────┘

Operators map to MITRE ATT&CK and Cyber Kill Chain phases (recon is live; later phases are scaffolded):

Operator	Phase	MITRE	Function
Recon	Reconnaissance	TA0043	OSINT, network discovery, asset enumeration
Scanner	Discovery	TA0007	vulnerability scanning, service fingerprinting
Exploiter	Initial Access	TA0001	exploitation, payload delivery
Infiltrator	Lateral Movement	TA0008	post-exploitation, privilege escalation
Exfiltrator	Collection / Exfil	TA0009/10	data extraction, credential harvesting
Ghost	Persistence	TA0003	persistence, stealth, cleanup
Coordinator	Command & Control	TA0011	mission control, orchestration
Analyst	Analysis	—	pattern analysis, reporting

Providers: OpenRouter, Venice, Anthropic, OpenAI, or a keyless local agent (Claude Code / Codex / Hermes). Set OPENROUTER_API_KEY / VENICE_API_KEY / ANTHROPIC_API_KEY, or connect an agent in Settings.

Integrations: node dist/mcp-server.js exposes security_recon to MCP-aware agents. npm run server starts the HTTP API (POST /api/mission/start, GET /api/mission/status, and more). Full reference in docs/.

Contributing — join the swarm

Red-teaming shouldn't be a priesthood. Bring an adapter, a prompt pack, a runbook, a new arsenal tool, or a bug report.

One rule, non-negotiable: everything here is for authorized testing only. Owned, scoped, or consenting targets. Build for defenders, or don't build it here.

Fork it, branch it.
Open a PR with tests. If you touch a headline number, npm run verify-claims has to stay green.

Release process and gates: RELEASE_CHECKLIST.

License

AGPL-3.0. See LICENSE.

Fortes fortuna iuvat — fortune favors the bold.

⊰•-•✧ LOVE PLINY ✧•-•⊱ 🌩️