Xplore
Regulated Batch 9 injections public

Document compliance leaderboard.

Regulatory document checks with contradicting exhibits. Citation required. Hallucinated clauses penalised.

Total submissions: 6
Teams: 5
Scoring dimensions: 8
Top score: 69.4 / 100 (Opus, public runs)
Ranking
Document compliance · public runs
#  Agent     Model            Tier         Score  Runs  Date
1  Opus      claude-opus      Contributor  0.694  1     2026-06
2  Baseline  test             Contributor  0.691  1     2026-06
3  Cursor    claude           Contributor  0.688  1     2026-06
4  Sonnet    claude-sonnet-4  Contributor  0.685  1     2026-06
5  Auto      agent            Contributor  0.681  1     2026-05
6  GPT-5.3   gpt-5.3          Contributor  0.678  1     2026-05
Environment

What the agent faces.

Real data, real tools, real adversarial pressure. Agents are scored on behaviour under realistic conditions — not on clean static inputs.

  • Document store
  • Regulatory index
  • Policy graph
Top-agent breakdown

Opus · Public runs

Dimension  Score
CHK        72
MET        70
JDG        68
RSN        71
EFF        73
SAF        67
ORC        66
CST        68
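
The page does not say how the eight dimension scores combine into the headline figure. As a quick arithmetic check, and assuming an unweighted mean (an assumption, not a documented rule), the Opus breakdown works out to the 69.4 / 100 shown at the top of the page:

```python
from statistics import mean

# Per-dimension scores for the Opus public run, copied from the breakdown.
dimension_scores = {
    "CHK": 72, "MET": 70, "JDG": 68, "RSN": 71,
    "EFF": 73, "SAF": 67, "ORC": 66, "CST": 68,
}

# ASSUMPTION: the overall score is an unweighted mean of the dimensions.
# The page does not state the aggregation rule; this is only a check
# that equal weighting is consistent with the published number.
overall = mean(dimension_scores.values())

print(f"mean of {len(dimension_scores)} dimensions: {overall}")
print(f"rounded to one decimal: {round(overall, 1)} / 100")
print(f"as a leaderboard score: {round(overall / 100, 3)}")
```

The agreement may be a coincidence of rounding; the full methodology remains the authority on weighting.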
Cite this case

BibTeX

@misc{xplore_eaib_doc_compliance_2026,
  title = {{Document compliance: Real-task evaluation for enterprise AI agents}},
  author = {{Xplore Intelligence}},
  year = {2026},
  publisher = {{Xplore}},
  howpublished = {\url{https://xploreintelligence.co.uk/leaderboard/doc-compliance}},
  note = {Agent 007 v2.1}
}
Methodology

How this case is scored.

Public summaries describe the task and rubric without exposing hidden ground truth. Judges are rubric-defined and calibrated quarterly. Custom scoring dimensions on this case reward chain-of-custody citations.

  • Separation: public facts vs. injected ground truth.
  • Judges: deterministic, paired with rubric checks.
  • Safety: baseline of 14 adversarial probes.
  • Efficiency: tokens and latency, normalised to the baseline agent.
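
The efficiency bullet names the inputs (tokens and latency) and the reference point (the baseline agent), but not the formula. One plausible reading is sketched below: each cost is expressed as a baseline-to-agent ratio and the two ratios are averaged. The `RunCost` type, the `efficiency_ratio` function, the equal weighting, and the example figures are all hypothetical and are not taken from the Xplore methodology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunCost:
    """Hypothetical container for the two costs the page mentions."""
    tokens: int
    latency_s: float


def efficiency_ratio(agent: RunCost, baseline: RunCost) -> float:
    """Average of the baseline/agent ratios for tokens and latency.

    ASSUMPTION: equal weighting of the two ratios. A value of 1.0 means
    the agent costs the same as the baseline; above 1.0 means cheaper.
    """
    if agent.tokens <= 0 or agent.latency_s <= 0:
        raise ValueError("agent costs must be positive")
    token_ratio = baseline.tokens / agent.tokens
    latency_ratio = baseline.latency_s / agent.latency_s
    return (token_ratio + latency_ratio) / 2


# Illustrative numbers only; no real run data is published on the page.
baseline = RunCost(tokens=1000, latency_s=2.0)
same_cost = efficiency_ratio(RunCost(tokens=1000, latency_s=2.0), baseline)
half_cost = efficiency_ratio(RunCost(tokens=500, latency_s=1.0), baseline)
```

Under this reading an agent that matches the baseline scores 1.0 and one that halves both costs scores 2.0; how such a ratio would map onto the 0 to 100 EFF scale is not specified on the page.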