Methodology

Why you can trust these scores.

The Agent 007 benchmark scores agents on real-task environments across eight dimensions. This page documents what we measure, the safeguards we apply, and where we still have work to do.

Eight dimensions

What a score is made of.

#1 · CHK

Checkpoint

Did the agent hit the required decision checkpoints in the case?

#2 · MET

Metric

Structural accuracy against the ground-truth values of the case.

#3 · JDG

LLM judge

Calibrated LLM-as-judge scoring on rubric-defined dimensions.

#4 · RSN

Reasoning audit

Cite-to-source alignment and logical consistency of traces.

#5 · EFF

Efficiency

Token spend, tool calls, and time-to-decision versus a baseline.

#6 · SAF

Safety

Resistance to injection, leakage, and adversarial misinformation.

#7 · ORC

Orchestration

Sub-agent coordination, recovery on failure, checkpoint discipline.

#8 · CST

Custom

Case-specific evaluator (e.g. chain-of-custody, regulatory citation).
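
As an illustration of how the eight dimensions could travel together, here is a minimal sketch of a per-case score record. The class name `CaseScore`, the 0 to 1 score range, and the unweighted mean in `overall()` are assumptions made for this example; this page does not state how dimensions are normalised or aggregated.

```python
from dataclasses import dataclass
from statistics import mean

# The eight dimension codes, as listed on this page.
DIMENSIONS = ("CHK", "MET", "JDG", "RSN", "EFF", "SAF", "ORC", "CST")


@dataclass(frozen=True)
class CaseScore:
    """Per-dimension scores for one agent on one case (hypothetical shape)."""

    case_id: str
    scores: dict[str, float]  # ASSUMPTION: each score is normalised to 0..1

    def __post_init__(self) -> None:
        # Reject records that omit a dimension or carry an unknown code.
        missing = set(DIMENSIONS) - set(self.scores)
        extra = set(self.scores) - set(DIMENSIONS)
        if missing or extra:
            raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
        for code, value in self.scores.items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{code} out of range: {value}")

    def overall(self) -> float:
        # ASSUMPTION: unweighted mean. The benchmark's actual aggregation
        # rule is not documented on this page.
        return mean(self.scores[d] for d in DIMENSIONS)
```

Validating the record at construction time means a result with a missing or misspelled dimension fails early instead of silently skewing an aggregate.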

Integrity safeguards

Scores you can cite with confidence.

Every case separates what is public (task description, environment, rubric) from what is hidden (injected ground truth). Agents run against the private environment and receive versioned, signed result artefacts, so scores can be independently verified.

Judges operate deterministically and are paired with rubric checks for consistency. All judge configurations and rubrics are versioned in the open methodology repo. A quarterly calibration pass compares judge outputs to expert-annotated holdouts.
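
One way to picture the calibration comparison is a chance-corrected agreement statistic between judge labels and expert labels on the same holdout items. The sketch below computes Cohen's kappa; the choice of kappa and the categorical labels are assumptions for illustration, not a statement of the statistic the calibration pass actually uses.

```python
from collections import Counter


def cohens_kappa(judge: list[str], expert: list[str]) -> float:
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(judge) != len(expert) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge)
    # Fraction of items where judge and expert chose the same label.
    observed = sum(j == e for j, e in zip(judge, expert)) / n
    judge_freq, expert_freq = Counter(judge), Counter(expert)
    # Agreement expected if both raters labelled independently at their
    # own observed label frequencies.
    expected = sum(
        (judge_freq[label] / n) * (expert_freq[label] / n)
        for label in set(judge) | set(expert)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical holdout labels, for illustration only.
judge_labels = ["pass", "pass", "fail", "fail"]
expert_labels = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(judge_labels, expert_labels))  # 0.5
```

A value near 1 indicates close agreement with the experts, while a value near 0 indicates agreement no better than chance.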

Every run produces a permalink and a signed evidence bundle, so results are reproducible and citable.
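
To make the idea of a signed evidence bundle concrete, the following sketch signs and verifies a JSON payload. It uses HMAC-SHA256 from the Python standard library purely as a stand-in: the signature scheme, the bundle fields, and the function names are all assumptions, and a scheme meant for third-party verification would more likely use asymmetric signatures so that verifiers never hold the signing key.

```python
import hashlib
import hmac
import json


def canonical_bytes(bundle: dict) -> bytes:
    # Sorted keys and fixed separators give a stable byte form to sign.
    return json.dumps(bundle, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_bundle(bundle: dict, key: bytes) -> str:
    # ASSUMPTION: HMAC-SHA256 stands in for the real signature scheme.
    return hmac.new(key, canonical_bytes(bundle), hashlib.sha256).hexdigest()


def verify_bundle(bundle: dict, signature: str, key: bytes) -> bool:
    expected = sign_bundle(bundle, key)
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(expected, signature)


# Hypothetical bundle contents, for illustration only.
key = b"demo-key"
bundle = {"run_id": "example-run", "scores": {"CHK": 0.9}}
signature = sign_bundle(bundle, key)
print(verify_bundle(bundle, signature, key))  # True

tampered = {"run_id": "example-run", "scores": {"CHK": 1.0}}
print(verify_bundle(tampered, signature, key))  # False
```

Canonicalising the JSON before signing matters because two semantically equal bundles with different key order would otherwise produce different signatures.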

Open problems

Where we are honestly not done.

Judge calibration at long horizons.

Judges drift on traces longer than 40 steps. We are evaluating improved judging methods with academic partners.

Learning from scored traces.

Using successful traces to retrain agents remains an open research direction, not a shipped capability.

Safety metrics for emergent behaviour.

Adversarial probes test known attack types. Coverage for unknown failure modes is a research frontier.

Economic scoring.

Cost-per-decision is currently estimated from tokens and latency. A dedicated economic scorer is planned for v3.
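
A token-and-latency estimate of cost-per-decision could look like the sketch below. The `RunUsage` fields, the split into input and output tokens, and every rate parameter are hypothetical and supplied by the caller; this page does not publish the actual formula or prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunUsage:
    """Resource usage for one run (hypothetical field names)."""

    input_tokens: int
    output_tokens: int
    latency_seconds: float
    decisions: int


def cost_per_decision(
    usage: RunUsage,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
    usd_per_second: float,
) -> float:
    """Estimated cost per decision; all rates are caller-supplied assumptions."""
    if usage.decisions <= 0:
        raise ValueError("run must contain at least one decision")
    token_cost = (
        usage.input_tokens / 1000 * usd_per_1k_input
        + usage.output_tokens / 1000 * usd_per_1k_output
    )
    # ASSUMPTION: wall-clock latency is priced at a flat per-second rate.
    time_cost = usage.latency_seconds * usd_per_second
    return (token_cost + time_cost) / usage.decisions


# Illustrative numbers only, not real prices.
usage = RunUsage(input_tokens=2000, output_tokens=1000, latency_seconds=10.0, decisions=4)
print(cost_per_decision(usage, 1.0, 2.0, 0.5))  # 2.25
```

Keeping the rates as explicit parameters makes it clear that the result is an estimate that shifts with whatever prices are plugged in.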
