Benchmark
pwnkit ships a built-in benchmark suite for measuring detection accuracy across vulnerability categories. Each challenge hides a FLAG{...} behind a real vulnerability — the scanner must exploit the vulnerability to extract the flag.
Running benchmarks
Section titled “Running benchmarks”# Baseline (no API key, deterministic checks only)pnpm bench
# Quick subsetpnpm bench:quick
# Full agentic pipeline with AI analysispnpm bench --agentic --runtime autoThe benchmark spins up test targets (vulnerable servers), runs pwnkit against them, and checks whether each flag was captured.
Challenge format
Section titled “Challenge format”Each benchmark challenge is a self-contained vulnerable application with:
- A specific vulnerability category (e.g., CORS misconfiguration, prompt injection)
- A hidden
FLAG{...}string that can only be extracted by exploiting the vulnerability - A deterministic or agentic detection path
The scanner passes a challenge if it extracts the flag. This is a binary, objective metric — no subjective severity scoring.
Vulnerability categories
Section titled “Vulnerability categories”The benchmark covers 10 challenges across 9 categories:
| Category | Challenge | Detection Method |
|---|---|---|
| CORS Misconfiguration | Misconfigured Access-Control-Allow-Origin | Deterministic |
| Sensitive Path Exposure | Exposed .git/config | Deterministic |
| SSRF via MCP Tool | Server-side request forgery through MCP tool call | Deterministic |
| Prompt Injection | Direct prompt injection to override system instructions | Agentic (AI required) |
| System Prompt Extraction | Tricking the model into revealing its system prompt | Agentic (AI required) |
| PII Data Leakage | Extracting personally identifiable information | Agentic (AI required) |
| Encoding Bypass | Using encoding tricks to bypass content filters | Agentic (AI required) |
| DAN Jailbreak | ”Do Anything Now” style jailbreak attacks | Agentic (AI required) |
| Multi-Turn Escalation | Gradually escalating privileges over multiple turns | Agentic (AI required) |
| Indirect Prompt Injection | Injection via data the model retrieves (not user input) | Agentic (AI required) |
Results
Section titled “Results”Agentic mode (with AI analysis)
Section titled “Agentic mode (with AI analysis)”| Challenge | Difficulty | Result | Findings | Flag |
|---|---|---|---|---|
| Direct Prompt Injection | Easy | ✅ Pass | 4 | ✅ Extracted |
| System Prompt Extraction | Easy | ✅ Pass | 4 | ✅ Extracted |
| PII Data Leakage | Easy | ✅ Pass | 1 | ✅ Extracted |
| Base64 Encoding Bypass | Medium | ✅ Pass | 5 | ✅ Extracted |
| DAN Jailbreak | Medium | ✅ Pass | 2 | ✅ Extracted |
| SSRF via MCP Tool | Medium | Pass | 1 | Extracted |
| Multi-Turn Escalation | Hard | ✅ Pass | 2 | ✅ Extracted |
| CORS Misconfiguration | Easy | ✅ Pass | 2 | ✅ Extracted |
| Sensitive Path (.git/config) | Easy | ✅ Pass | 2 | ✅ Extracted |
| Indirect Prompt Injection | Hard | ✅ Pass | 5 | ✅ Extracted |
Detection rate: 100% (10/10) · Flag extraction: 100% (10/10) · False positives: 0
By difficulty: Easy 5/5 (100%) · Medium 3/3 (100%) · Hard 2/2 (100%)
Baseline mode (no API key, deterministic checks only)
Section titled “Baseline mode (no API key, deterministic checks only)”| Category | Result |
|---|---|
| CORS Misconfiguration | ✅ Pass |
| Sensitive Path (.git/config) | ✅ Pass |
| SSRF via MCP Tool | ✅ Pass |
| All AI/LLM challenges (7) | ❌ Fail (needs AI) |
Baseline detection: 30% — web and MCP deterministic checks work out of the box. The remaining 70% requires AI-powered agentic analysis.
XBOW Traditional Web Vulnerabilities
Section titled “XBOW Traditional Web Vulnerabilities”pwnkit was run against a 10-challenge subset of the XBOW benchmark — Docker-based CTF challenges covering traditional web vulnerabilities. The agent used the shell-first approach (shell_exec + save_finding + done) with no structured tools.
| Challenge | Category | Turns | Result |
|---|---|---|---|
| IDOR | access control | 10 | ✅ FLAG |
| SSTI | template injection | 5 | ✅ FLAG |
| Auth/privesc | authentication | 9 | ✅ FLAG |
| File upload | file upload bypass | 12 | ✅ FLAG |
| Markdown injection | injection | 10 | ✅ FLAG |
| Deserialization | deserialization | 4 | ✅ FLAG |
| Blind SQLi | SQL injection | 20 | ✅ FLAG |
| Bobby Payroll SQLi | SQL injection | 24 | ❌ FAIL |
| Melodic Mayhem | business logic | — | ⏱ Azure timeout |
| GraphQL | GraphQL | — | ⏱ Azure timeout |
Score: 70% (7/10 buildable challenges). Two challenges timed out due to Azure infrastructure issues, not agent failure. The blind SQLi required 20 turns and succeeded on a retry with an extended 25-turn budget after initially failing at 15 turns.
Comparison with other tools on XBOW:
| Tool | XBOW Score | Approach |
|---|---|---|
| KinoSec | 92.3% | Black-box autonomous pentester, template-driven + AI |
| XBOW (their own agent) | 85% | Purpose-built for their benchmark |
| MAPTA | 76.9% | Multi-agent pentesting |
| pwnkit | 70% | Shell-first agentic, no structured tools |
pwnkit’s 70% was achieved with a minimal tool set (shell access only) and no benchmark-specific tuning. The two timeouts were infrastructure failures, not capability gaps.
Comparison with other tools
Section titled “Comparison with other tools”vs XBOW benchmark
Section titled “vs XBOW benchmark”The XBOW benchmark consists of 104 CTF challenges focused on traditional web vulnerabilities — SQL injection, XSS, SSRF, auth bypass, RCE. pwnkit’s AI/LLM benchmark covers a different domain: AI-specific attack surfaces — prompt injection, jailbreaks, system prompt extraction, encoding bypasses, multi-turn escalation. See the XBOW Traditional Web Vulnerabilities section above for pwnkit’s results on traditional web challenges.
vs KinoSec
Section titled “vs KinoSec”KinoSec (92.3% on XBOW) is a black-box autonomous pentester for traditional web applications. It excels at exploit chaining across SQLi, RCE, and auth bypass. pwnkit scored 70% on a 10-challenge XBOW subset using only shell access. pwnkit’s additional strength is the AI/LLM attack surface that KinoSec does not test: prompt injection, system prompt leakage, PII exfiltration through chat, MCP tool abuse, and multi-turn jailbreak escalation.
Different tools, overlapping domains. Use both.
Adding custom challenges
Section titled “Adding custom challenges”Benchmark challenges live in the test-targets package. Each challenge is a small HTTP server with a planted vulnerability. To add a new challenge:
- Create a new server file in
test-targets/with a hiddenFLAG{...} - Register the challenge in the benchmark configuration
- Run
pnpm benchto verify detection