Adversarial evals
pwnkit already behaves like an adversarial evaluator in practice.
It attacks systems, attempts exploitation, and only reports what it can support with evidence.
This document makes that category explicit.
What is already shipped
Section titled “What is already shipped”The category is not hypothetical anymore.
Today the benchmark package already includes concrete adversarial-eval artifacts for:
- tool misuse through attacker-controlled tool parameters
- indirect prompt injection through untrusted tool output
Those harnesses are synthetic and deterministic on purpose. They give
pwnkit a repeatable way to score whether the scanner catches realistic
agent-control failures before we widen the surface further.
Why this matters
Section titled “Why this matters”Most AI eval tooling answers:
- did the model produce the expected output?
- did a judge model score the answer well?
- did the trace stay within policy?
Those are useful questions. They are not enough for high-stakes AI systems.
The harder question is:
can this system be pushed into unsafe or unauthorized behavior under realistic pressure?
That is where pwnkit has a structural advantage.
Target classes
Section titled “Target classes”An adversarial eval mode should focus on:
- LLM / agent HTTP APIs
- MCP servers
- tool-using agent backends
- authenticated staging applications with AI features enabled
What makes this different from generic evals
Section titled “What makes this different from generic evals”- attack-driven, not judge-driven
- exploit/evidence based, not vibes
- built for repeated pressure, not one-shot scoring
- capable of finding real security and control-boundary failures
Proposed surface
Section titled “Proposed surface”The dedicated adversarial eval surface still builds on the existing wedge.
That means:
- no separate product that ignores the pentest engine
- no generic dashboard-first abstraction
- no degradation into “prompt tests with nicer charts”
Instead, the mode should define:
- a target model for AI systems
- a report format that emphasizes evidence and recurrence
- attack classes and success criteria tuned for agentic systems
Output differences from a traditional pentest
Section titled “Output differences from a traditional pentest”A traditional vuln report centers on exploitability and severity.
An adversarial eval report should also capture:
- target class and environment
- attack objective
- recurrence across runs
- whether the failure is specific to agent/tool composition
- whether the issue reflects authorization, tool-use, or instruction-hijack failure
Relationship to pwnkit cloud
Section titled “Relationship to pwnkit cloud”pwnkit is the public execution wedge.
pwnkit cloud is the managed orchestration and recurring-run surface.
The adversarial eval category should be legible on both:
- locally and in CI through
pwnkit - as a managed recurring product through
pwnkit cloud