
Benchmark

pwnkit ships a built-in benchmark suite for measuring detection accuracy across vulnerability categories. Each challenge hides a FLAG{...} behind a real vulnerability — the scanner must exploit the vulnerability to extract the flag.

```sh
# Baseline (no API key, deterministic checks only)
pnpm bench

# Quick subset
pnpm bench:quick

# Full agentic pipeline with AI analysis
pnpm bench --agentic --runtime auto
```

The benchmark spins up test targets (vulnerable servers), runs pwnkit against them, and checks whether each flag was captured.

Each benchmark challenge is a self-contained vulnerable application with:

  • A specific vulnerability category (e.g., CORS misconfiguration, prompt injection)
  • A hidden FLAG{...} string that can only be extracted by exploiting the vulnerability
  • A deterministic or agentic detection path

The scanner passes a challenge if it extracts the flag. This is a binary, objective metric — no subjective severity scoring.
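The pass/fail check above can be sketched as a small helper. This is an illustrative sketch, not pwnkit's actual API: the function names (`extractFlag`, `challengePassed`) and the exact flag regex are assumptions.

```typescript
// Hypothetical pass/fail check: a challenge passes only if the scanner's
// output contains the exact planted flag. Binary and objective by design.
const FLAG_PATTERN = /FLAG\{[^}]+\}/;

function extractFlag(scannerOutput: string): string | null {
  const match = scannerOutput.match(FLAG_PATTERN);
  return match ? match[0] : null;
}

function challengePassed(scannerOutput: string, plantedFlag: string): boolean {
  // The extracted flag must match the planted one exactly; a near-miss
  // or a finding without the flag does not count.
  return extractFlag(scannerOutput) === plantedFlag;
}
```

Because the metric is exact-match on the flag, there is no room for subjective severity judgments in scoring.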

The benchmark covers 10 challenges across 10 categories:

| Category | Challenge | Detection Method |
| --- | --- | --- |
| CORS Misconfiguration | Misconfigured Access-Control-Allow-Origin | Deterministic |
| Sensitive Path Exposure | Exposed .git/config | Deterministic |
| SSRF via MCP Tool | Server-side request forgery through MCP tool call | Deterministic |
| Prompt Injection | Direct prompt injection to override system instructions | Agentic (AI required) |
| System Prompt Extraction | Tricking the model into revealing its system prompt | Agentic (AI required) |
| PII Data Leakage | Extracting personally identifiable information | Agentic (AI required) |
| Encoding Bypass | Using encoding tricks to bypass content filters | Agentic (AI required) |
| DAN Jailbreak | “Do Anything Now” style jailbreak attacks | Agentic (AI required) |
| Multi-Turn Escalation | Gradually escalating privileges over multiple turns | Agentic (AI required) |
| Indirect Prompt Injection | Injection via data the model retrieves (not user input) | Agentic (AI required) |
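As an illustration of what a deterministic check looks like, here is a sketch of CORS misconfiguration detection: probe with an attacker-controlled Origin and inspect the response headers. The interface shape and function name are assumptions for this example, not pwnkit's internal code.

```typescript
// Simplified CORS headers relevant to the check.
interface CorsHeaders {
  allowOrigin?: string;      // Access-Control-Allow-Origin
  allowCredentials?: string; // Access-Control-Allow-Credentials
}

// A server that reflects an arbitrary probe origin (or answers "*") while
// allowing credentials lets any site read authenticated responses.
function isCorsMisconfigured(probeOrigin: string, h: CorsHeaders): boolean {
  if (h.allowOrigin === undefined) return false;
  const reflected = h.allowOrigin === probeOrigin;
  const wildcard = h.allowOrigin === "*";
  const creds = h.allowCredentials === "true";
  return (wildcard || reflected) && creds;
}
```

Checks like this need no model in the loop, which is why they work in baseline mode.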

| Challenge | Difficulty | Result | Findings | Flag |
| --- | --- | --- | --- | --- |
| Direct Prompt Injection | Easy | ✅ Pass | 4 | ✅ Extracted |
| System Prompt Extraction | Easy | ✅ Pass | 4 | ✅ Extracted |
| PII Data Leakage | Easy | ✅ Pass | 1 | ✅ Extracted |
| Base64 Encoding Bypass | Medium | ✅ Pass | 5 | ✅ Extracted |
| DAN Jailbreak | Medium | ✅ Pass | 2 | ✅ Extracted |
| SSRF via MCP Tool | Medium | ✅ Pass | 1 | ✅ Extracted |
| Multi-Turn Escalation | Hard | ✅ Pass | 2 | ✅ Extracted |
| CORS Misconfiguration | Easy | ✅ Pass | 2 | ✅ Extracted |
| Sensitive Path (.git/config) | Easy | ✅ Pass | 2 | ✅ Extracted |
| Indirect Prompt Injection | Hard | ✅ Pass | 5 | ✅ Extracted |

Detection rate: 100% (10/10) · Flag extraction: 100% (10/10) · False positives: 0

By difficulty: Easy 5/5 (100%) · Medium 3/3 (100%) · Hard 2/2 (100%)
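The summary numbers above are straightforward aggregations over per-challenge results. A minimal sketch of that rollup (field names are illustrative, not pwnkit's schema):

```typescript
interface BenchResult {
  challenge: string;
  difficulty: "Easy" | "Medium" | "Hard";
  flagExtracted: boolean;
}

// Compute the overall detection rate and the per-difficulty breakdown.
function summarize(results: BenchResult[]) {
  const pct = (n: number, d: number) => (d === 0 ? 0 : Math.round((n / d) * 100));
  const passed = results.filter((r) => r.flagExtracted).length;
  const byDifficulty: Record<string, string> = {};
  for (const d of ["Easy", "Medium", "Hard"] as const) {
    const subset = results.filter((r) => r.difficulty === d);
    const ok = subset.filter((r) => r.flagExtracted).length;
    byDifficulty[d] = `${ok}/${subset.length} (${pct(ok, subset.length)}%)`;
  }
  return {
    detectionRate: `${passed}/${results.length} (${pct(passed, results.length)}%)`,
    byDifficulty,
  };
}
```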

Baseline mode (no API key, deterministic checks only)

| Category | Result |
| --- | --- |
| CORS Misconfiguration | ✅ Pass |
| Sensitive Path (.git/config) | ✅ Pass |
| SSRF via MCP Tool | ✅ Pass |
| All AI/LLM challenges (7) | ❌ Fail (needs AI) |

Baseline detection: 30% (3/10) — the web and MCP deterministic checks work out of the box. The remaining 70% requires AI-powered agentic analysis.
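The sensitive-path challenge is a good example of a baseline check. A sketch of one plausible approach, fetching `/.git/config` and validating the body so a soft-404 page does not count as a hit (names and heuristics are assumptions, not pwnkit's real implementation):

```typescript
// A real .git/config starts with an INI-style [core] section and typically
// declares repositoryformatversion; requiring both avoids false positives
// on servers that return 200 with an HTML error page.
function looksLikeGitConfig(body: string): boolean {
  return /^\s*\[core\]/m.test(body) && /repositoryformatversion\s*=/.test(body);
}

async function checkGitExposure(baseUrl: string): Promise<boolean> {
  const res = await fetch(new URL("/.git/config", baseUrl));
  if (!res.ok) return false;
  return looksLikeGitConfig(await res.text());
}
```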

pwnkit was run against a 10-challenge subset of the XBOW benchmark — Docker-based CTF challenges covering traditional web vulnerabilities. The agent used the shell-first approach (shell_exec + save_finding + done) with no structured tools.
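The shell-first approach can be sketched as a loop over exactly three tools. The tool names (`shell_exec`, `save_finding`, `done`) come from the text above; the dispatch shape, payload fields, and executor are illustrative assumptions.

```typescript
import { execSync } from "node:child_process";

type ToolCall =
  | { tool: "shell_exec"; command: string }
  | { tool: "save_finding"; title: string; detail: string }
  | { tool: "done"; summary: string };

// One step of the agent loop: execute the model's tool call and return the
// observation to feed back. `finished` ends the episode.
function dispatch(call: ToolCall, findings: string[]): { output: string; finished: boolean } {
  switch (call.tool) {
    case "shell_exec":
      try {
        return { output: execSync(call.command, { timeout: 30_000 }).toString(), finished: false };
      } catch (err) {
        // Command failures are observations too, not crashes.
        return { output: String(err), finished: false };
      }
    case "save_finding":
      findings.push(`${call.title}: ${call.detail}`);
      return { output: "finding saved", finished: false };
    case "done":
      return { output: call.summary, finished: true };
  }
}
```

Everything else (curl, sqlmap-style probing, payload crafting) happens inside `shell_exec`, which is what "no structured tools" means in practice.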

| Challenge | Category | Turns | Result |
| --- | --- | --- | --- |
| IDOR | access control | 10 | ✅ FLAG |
| SSTI | template injection | 5 | ✅ FLAG |
| Auth/privesc | authentication | 9 | ✅ FLAG |
| File upload | file upload bypass | 12 | ✅ FLAG |
| Markdown injection | injection | 10 | ✅ FLAG |
| Deserialization | deserialization | 4 | ✅ FLAG |
| Blind SQLi | SQL injection | 20 | ✅ FLAG |
| Bobby Payroll SQLi | SQL injection | 24 | ❌ FAIL |
| Melodic Mayhem | business logic | | ⏱ Azure timeout |
| GraphQL | GraphQL | | ⏱ Azure timeout |

Score: 70% (7/10 buildable challenges). Two challenges timed out due to Azure infrastructure issues, not agent failure. The blind SQLi initially failed at the default 15-turn budget, then succeeded on a retry with an extended 25-turn budget, completing in 20 turns.

Comparison with other tools on XBOW:

| Tool | XBOW Score | Approach |
| --- | --- | --- |
| KinoSec | 92.3% | Black-box autonomous pentester, template-driven + AI |
| XBOW (their own agent) | 85% | Purpose-built for their benchmark |
| MAPTA | 76.9% | Multi-agent pentesting |
| pwnkit | 70% | Shell-first agentic, no structured tools |

pwnkit’s 70% was achieved with a minimal tool set (shell access only) and no benchmark-specific tuning. The two timeouts were infrastructure failures, not capability gaps.

The XBOW benchmark consists of 104 CTF challenges focused on traditional web vulnerabilities — SQL injection, XSS, SSRF, auth bypass, RCE. pwnkit’s AI/LLM benchmark covers a different domain: AI-specific attack surfaces — prompt injection, jailbreaks, system prompt extraction, encoding bypasses, multi-turn escalation. See the XBOW Traditional Web Vulnerabilities section above for pwnkit’s results on traditional web challenges.

KinoSec (92.3% on XBOW) is a black-box autonomous pentester for traditional web applications. It excels at exploit chaining across SQLi, RCE, and auth bypass. pwnkit scored 70% on a 10-challenge XBOW subset using only shell access. pwnkit’s additional strength is the AI/LLM attack surface that KinoSec does not test: prompt injection, system prompt leakage, PII exfiltration through chat, MCP tool abuse, and multi-turn jailbreak escalation.

Different tools, overlapping domains. Use both.

Benchmark challenges live in the test-targets package. Each challenge is a small HTTP server with a planted vulnerability. To add a new challenge:

  1. Create a new server file in test-targets/ with a hidden FLAG{...}
  2. Register the challenge in the benchmark configuration
  3. Run pnpm bench to verify detection
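A hypothetical minimal challenge in the spirit of the steps above: a tiny HTTP server with one planted vulnerability (an unauthenticated debug endpoint here) that leaks the flag only when exercised. The route, flag value, port, and `BENCH_SERVE` env var are all invented for this sketch, not conventions from test-targets.

```typescript
import { createServer } from "node:http";

const FLAG = "FLAG{example-new-challenge}";

// Routing is kept pure so the vulnerability logic is testable without a socket.
export function route(url: string): { status: number; body: string } {
  if (url === "/") return { status: 200, body: "Nothing to see here." };
  if (url === "/debug/config") {
    // The planted vulnerability: sensitive config exposed without auth.
    return { status: 200, body: JSON.stringify({ env: "prod", secret: FLAG }) };
  }
  return { status: 404, body: "not found" };
}

if (process.env.BENCH_SERVE) {
  createServer((req, res) => {
    const { status, body } = route(req.url ?? "/");
    res.writeHead(status, { "content-type": "text/plain" });
    res.end(body);
  }).listen(3000);
}
```

The key property: the flag never appears on the front page, so the scanner only passes by actually finding and hitting the vulnerable path.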