Skip to content

Adversarial evals

pwnkit already behaves like an adversarial evaluator in practice. It attacks systems, attempts exploitation, and only reports what it can support with evidence.

This document makes that category explicit.

The category is not hypothetical anymore.

Today the benchmark package already includes concrete adversarial-eval artifacts for:

  • tool misuse through attacker-controlled tool parameters
  • indirect prompt injection through untrusted tool output

Those harnesses are synthetic and deterministic on purpose. They give pwnkit a repeatable way to score whether the scanner catches realistic agent-control failures before we widen the surface further.

Most AI eval tooling answers:

  • did the model produce the expected output?
  • did a judge model score the answer well?
  • did the trace stay within policy?

Those are useful questions. They are not enough for high-stakes AI systems.

The harder question is:

can this system be pushed into unsafe or unauthorized behavior under realistic pressure?

That is where pwnkit has a structural advantage.

An adversarial eval mode should focus on:

  • LLM / agent HTTP APIs
  • MCP servers
  • tool-using agent backends
  • authenticated staging applications with AI features enabled

What makes this different from generic evals

Section titled “What makes this different from generic evals”
  • attack-driven, not judge-driven
  • exploit/evidence based, not vibes
  • built for repeated pressure, not one-shot scoring
  • capable of finding real security and control-boundary failures

The dedicated adversarial eval surface still builds on the existing wedge.

That means:

  • no separate product that ignores the pentest engine
  • no generic dashboard-first abstraction
  • no degradation into “prompt tests with nicer charts”

Instead, the mode should define:

  • a target model for AI systems
  • a report format that emphasizes evidence and recurrence
  • attack classes and success criteria tuned for agentic systems

Output differences from a traditional pentest

Section titled “Output differences from a traditional pentest”

A traditional vuln report centers on exploitability and severity.

An adversarial eval report should also capture:

  • target class and environment
  • attack objective
  • recurrence across runs
  • whether the failure is specific to agent/tool composition
  • whether the issue reflects authorization, tool-use, or instruction-hijack failure

pwnkit is the public execution wedge.

pwnkit cloud is the managed orchestration and recurring-run surface.

The adversarial eval category should be legible on both:

  • locally and in CI through pwnkit
  • as a managed recurring product through pwnkit cloud