Why you can trust these scores.
The Agent 007 benchmark scores agents on real-task environments across eight dimensions. This page documents what we measure, the safeguards we apply, and where we still have work to do.
What a score is made of.
Did the agent hit the required decision checkpoints in the case?
Structural accuracy against the ground-truth values of the case.
Calibrated LLM-as-judge scoring on rubric-defined dimensions.
Cite-to-source alignment and logical consistency of traces.
Token spend, tool calls, and time-to-decision vs baseline.
Resistance to injection, leakage, and adversarial misinformation.
Sub-agent coordination, recovery on failure, checkpoint discipline.
Case-specific evaluator (e.g. chain-of-custody, regulatory citation).
Scores you can cite with confidence.
Every case separates what is public (task description, environment, rubric) from what is hidden (injected ground truth). Agents run against the private environment and receive versioned, signed result artefacts, so scores can be independently verified.
Judges operate deterministically and are paired with rubric checks for consistency. All judge configurations and rubrics are versioned in the open methodology repo. A quarterly calibration pass compares judge outputs to expert-annotated holdouts.
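To make the calibration pass concrete, here is a minimal sketch of how judge outputs might be compared to an expert-annotated holdout. The function name, the 0-to-1 score scale, the choice of metrics, and the `tolerance` value are all assumptions for illustration; the actual procedure is defined in the methodology repo.

```python
from statistics import mean


def calibration_report(judge_scores, expert_scores, tolerance=0.1):
    """Compare judge scores to expert scores for the same holdout items.

    Hypothetical helper: both inputs are lists of floats, paired by position.
    """
    if not judge_scores or len(judge_scores) != len(expert_scores):
        raise ValueError("need two non-empty score lists of equal length")

    # Signed difference per item: positive means the judge scored higher.
    diffs = [j - e for j, e in zip(judge_scores, expert_scores)]

    return {
        # Average size of the disagreement, ignoring direction.
        "mean_abs_error": mean([abs(d) for d in diffs]),
        # Average signed disagreement: a systematic lean up or down.
        "bias": mean(diffs),
        # Share of items where judge and expert agree within the tolerance.
        "within_tolerance": sum(abs(d) <= tolerance for d in diffs) / len(diffs),
    }


report = calibration_report([0.8, 0.6, 0.9, 0.4], [0.75, 0.65, 0.9, 0.2])
```

A report like this would flag both random disagreement (mean absolute error) and a consistent lean in one direction (bias), which are different failure modes for a judge.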
Every run produces a permalink and a signed evidence bundle, so results are reproducible and citable.
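As an illustration of what checking a signed evidence bundle could look like, the sketch below signs and verifies a small result record. Everything here is an assumption: the bundle fields are hypothetical, and HMAC-SHA256 is used only because it ships with the Python standard library. The page does not state which signature scheme the benchmark uses; a publicly verifiable scheme would more plausibly rely on an asymmetric signature, so that verifiers never hold the signing key.

```python
import hashlib
import hmac
import json


def canonical_bytes(bundle: dict) -> bytes:
    """Serialise the bundle deterministically so the same data always hashes the same."""
    return json.dumps(bundle, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_bundle(bundle: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over the canonical bundle (stand-in for a real signature)."""
    return hmac.new(key, canonical_bytes(bundle), hashlib.sha256).hexdigest()


def verify_bundle(bundle: dict, signature: str, key: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign_bundle(bundle, key), signature)


# Hypothetical bundle contents, for illustration only.
demo_key = b"demo-key-not-a-real-secret"
bundle = {"run_id": "example-run", "case_version": "1.0", "score": 0.82}
signature = sign_bundle(bundle, demo_key)

untouched_ok = verify_bundle(bundle, signature, demo_key)
tampered_ok = verify_bundle({**bundle, "score": 0.99}, signature, demo_key)
```

The point of the sketch is the property, not the scheme: any change to a signed bundle, such as an edited score, makes verification fail.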
Where we are honestly not done.
- Judge calibration at long horizons.
Judges drift on traces longer than 40 steps. We are evaluating improved judging methods with academic partners.
- Learning from scored traces.
Using successful traces to retrain agents remains an open research direction, not a shipped capability.
- Safety metrics for emergent behaviour.
Adversarial probes test known attack types. Coverage for unknown failure modes is a research frontier.
- Economic scoring.
Cost-per-decision is currently estimated from tokens and latency. A dedicated economic scorer is planned for v3.
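For the economic scoring item, a token-and-latency estimate of cost-per-decision could be sketched roughly as follows. The function name, the linear cost model, and both rates are placeholders chosen for illustration, not the benchmark's actual formula or pricing.

```python
def estimate_cost_per_decision(
    total_tokens: int,
    latency_seconds: float,
    decisions: int,
    usd_per_1k_tokens: float = 0.01,  # assumed placeholder rate
    usd_per_second: float = 0.0005,   # assumed placeholder rate
) -> float:
    """Rough cost per decision from token spend and wall-clock latency."""
    if decisions <= 0:
        raise ValueError("decisions must be a positive integer")

    token_cost = total_tokens / 1000 * usd_per_1k_tokens
    time_cost = latency_seconds * usd_per_second

    # Spread the combined cost evenly across the decisions made in the run.
    return (token_cost + time_cost) / decisions


cost = estimate_cost_per_decision(total_tokens=50_000, latency_seconds=120, decisions=4)
```

An estimate of this kind ignores anything not visible in tokens or latency, such as paid tool calls, which is presumably part of what a dedicated economic scorer would address.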