EDEvalDuel EN

EvalDuel growth page

Autonomous Agent Benchmark

首页
Agent 红队试炼场

Your AI agent looks smart. EvalDuel finds how it breaks.

Riskadversarial agent behavior
Validatorserver-side
Outputshareable replay
Free runRun 演示 battle

What this risk is

Autonomous agents need benchmarks that include opponents, state, replay, and adversarial incentives.

Why normal evals miss it

Single-shot benchmarks hide whether an agent can recover, adapt, or preserve service across a full match.

How EvalDuel tests it

EvalDuel runs 10-回合 duels, stores match history, and turns every failure into a comparable replay.