Xplore
Forge · Benchmarks

If the world is wrong, the score is noise.

A benchmark environment is a controlled simulation: the connectors, tools, clock, and friction your agent would see live, reset on every run. Evaluators are separate: they score the agent's behaviour inside that world, not whether the agent memorised a clean fixture.

Public Agent 007 cases show finished examples; numbers marked “Example” or “Illustrative” are illustrations, not product limits.

Reproducibility: 100%. Isolated runs with reset state, no cross-run contamination.
Platform scale: 50+ connectors. Data and system patterns you can compose into a benchmark.
Scoring model: Composable. Evaluator chains with domain-specific weights, not one opaque score.
Time in the loop: Multi-day. Horizons from batch jobs to multi-day simulations, as the workload needs.

Simulation

The benchmark is the world plus the workload.

You define data planes, tool registrations, policies, and a clock. The agent reads, writes, and calls tools the way it would in production. Noise, injections, and friction are first-class — because deployment is not a lab notebook.
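As a rough picture of what "data planes, tool registrations, policies, and a clock" could look like in code, here is a minimal sketch. Every name in it (`EnvironmentSpec`, `SimClock`, `call_tool`, `lookup_shipment`, the sample shipment row) is hypothetical and chosen for illustration; it is not the Forge API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any, Callable


@dataclass
class SimClock:
    """Simulated clock owned by the environment; it never reads wall time."""
    now: datetime

    def advance(self, delta: timedelta) -> datetime:
        self.now += delta
        return self.now


@dataclass
class EnvironmentSpec:
    """Hypothetical container for the four pieces named in the text."""
    data_planes: dict[str, list[dict]]       # source name -> seeded records
    tools: dict[str, Callable[..., Any]]     # tool name -> callable
    policies: list[str]                      # guardrails, as plain text here
    clock: SimClock
    call_log: list[tuple[datetime, str]] = field(default_factory=list)

    def call_tool(self, name: str, **kwargs: Any) -> Any:
        """Dispatch a registered tool and log the call with simulated time."""
        if name not in self.tools:
            raise KeyError(f"tool not registered: {name}")
        self.call_log.append((self.clock.now, name))
        return self.tools[name](self, **kwargs)


def lookup_shipment(env: EnvironmentSpec, shipment_id: str):
    """Example tool: read one record from the 'shipments' data plane."""
    for row in env.data_planes["shipments"]:
        if row["id"] == shipment_id:
            return row
    return None


def build_env() -> EnvironmentSpec:
    """Build a fresh environment; calling this again yields a clean copy."""
    return EnvironmentSpec(
        data_planes={"shipments": [{"id": "S-1", "status": "delayed"}]},
        tools={"lookup_shipment": lookup_shipment},
        policies=["never reroute without a logged reason"],
        clock=SimClock(datetime(2024, 1, 1, tzinfo=timezone.utc)),
    )
```

Because the agent only reaches data through `call_tool`, the log records each call against simulated time, which is the kind of trace an evaluator could later score.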

Provisioned state

Databases, sandboxes, and credentials per run — comparable scores and straightforward audits.

Realistic coupling

Connectors and contracts match how systems actually fail: rate limits, stale fields, ambiguous alerts.

Examples you can open

Agent 007 publishes specs and leaderboards for selected industries so teams inspect traces before they commit.

Logistic Shocks (supply chain) · All public benchmarks

Instruments

Evaluators score dimensions; they do not replace the simulation.

Each evaluator measures one thing well. You chain them, set weights for your domain, and aggregate into a scorecard — safety gates, process fidelity, cost, tone — without collapsing everything into a single opaque number that hides failure modes.

For the full composable-chain story, see evaluation.

Forge eval chain · logistics-shock-v1
Scores: Evidence 0.78 · Impact 0.82 · Process 0.88 · Safety 0.94 · Efficiency 0.76 · Tool usage 0.81
Weighted: 0.83
Weights shown: evidence w = 0.22 · impact w = 0.22 · safety w = 0.18
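To make the weighting idea concrete, the sketch below aggregates the six example scores into one weighted figure while keeping safety as a separate pass/fail gate. The page shows only three weights (evidence, impact, safety); the weights for process, efficiency, and tool usage, the `safety_floor` value, and all function and class names are assumptions made so the example runs. They should not be read as the real configuration.

```python
from dataclasses import dataclass

# Example scores as displayed for logistics-shock-v1.
SCORES = {
    "evidence": 0.78,
    "impact": 0.82,
    "process": 0.88,
    "safety": 0.94,
    "efficiency": 0.76,
    "tool_usage": 0.81,
}

WEIGHTS = {
    "evidence": 0.22,    # shown on the page
    "impact": 0.22,      # shown on the page
    "safety": 0.18,      # shown on the page
    "process": 0.14,     # ASSUMED: not shown on the page
    "efficiency": 0.12,  # ASSUMED: not shown on the page
    "tool_usage": 0.12,  # ASSUMED: not shown on the page
}


@dataclass(frozen=True)
class Scorecard:
    scores: dict[str, float]   # per-dimension scores stay visible
    weighted: float            # weighted mean across dimensions
    gate_passed: bool          # safety gate, reported separately


def aggregate(scores: dict[str, float], weights: dict[str, float],
              safety_floor: float = 0.9) -> Scorecard:
    """Weighted sum plus a hard safety gate (floor value is hypothetical)."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    weighted = sum(scores[k] * weights[k] for k in scores)
    return Scorecard(dict(scores), round(weighted, 4),
                     scores["safety"] >= safety_floor)
```

Keeping the gate outside the weighted sum is the point of the design: a run with a poor safety score can still post a respectable weighted number, so the scorecard reports both rather than letting the average hide the failure.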
Example topology

Why we model graphs, not spreadsheets.

Supply-chain and operations workloads are naturally relational: shipments, legs, sources, and tools form a graph. The counts below illustrate one representative environment design: enough to picture coupling and provenance pressure, but not a guarantee that every customer topology matches these numbers.

Vertices: 33. Single-environment graph (example topology). Illustrative.
Relations: 63. Cross-source links in that example. Illustrative.
Sources: 7. Representative data planes wired in. Illustrative.
Tools: 29. Tool registrations exposed to the agent. Illustrative.
Example node-resolution graph for a supply-chain style benchmark environment
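One way to picture the "provenance pressure" a graph creates: answering a question about a single shipment can mean following relations contributed by several different sources. The sketch below is a toy graph, far smaller than the 33-vertex example, with made-up vertex ids and source names (`tms`, `carrier_feed`, `port_feed`); it only illustrates the traversal idea.

```python
from collections import deque

# Each relation: (from_vertex, relation_type, to_vertex, contributing_source).
# All ids and source names here are hypothetical.
RELATIONS = [
    ("shipment:S-1", "has_leg", "leg:L-1", "tms"),
    ("leg:L-1", "departs", "port:ROT", "carrier_feed"),
    ("port:ROT", "has_alert", "alert:A-9", "port_feed"),
    ("shipment:S-2", "has_leg", "leg:L-2", "tms"),
]


def build_adjacency(relations):
    """Map each vertex to its outgoing (relation, target, source) edges."""
    adj = {}
    for src_v, rel, dst_v, source in relations:
        adj.setdefault(src_v, []).append((rel, dst_v, source))
        adj.setdefault(dst_v, [])  # make sure leaf vertices are present too
    return adj


def sources_touched(adj, start):
    """Breadth-first walk from `start`, collecting every source relied on."""
    seen = {start}
    sources = set()
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for _rel, nxt, source in adj[vertex]:
            sources.add(source)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sources
```

In this toy data, resolving `shipment:S-1` crosses three sources while `shipment:S-2` touches one. A flat spreadsheet row would hide that difference, which is the coupling the example topology is meant to expose.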
Noise: Contradictions, stale fields, and partial records, by design.
Injections: Prompt and document-level adversarial content in the stream.
Friction: Latency, rate limits, and tool errors like real integrations.
Replay: Deterministic replay for debugging and regression comparisons.
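Friction and deterministic replay can coexist if every injected failure is drawn from a seeded random generator. The sketch below wraps a tool so that some calls fail and some return stale data, and the same seed reproduces the same sequence. The wrapper, its rates, and the `port_status` tool are hypothetical, and seeding is only one possible way to get replay; the source does not say how it is implemented.

```python
import random


class ToolError(Exception):
    """Simulated integration failure (e.g. a rate limit or upstream error)."""


class FrictionTool:
    """Wrap a tool with seeded, replayable friction. Rates are illustrative."""

    def __init__(self, fn, seed, error_rate=0.2, stale_rate=0.2):
        self._fn = fn
        self._rng = random.Random(seed)  # private RNG: no global state
        self.error_rate = error_rate
        self.stale_rate = stale_rate
        self.trace = []                  # outcome of every call, in order

    def __call__(self, *args):
        roll = self._rng.random()
        if roll < self.error_rate:
            self.trace.append("error")
            raise ToolError("simulated upstream failure")
        result = dict(self._fn(*args))
        is_stale = roll < self.error_rate + self.stale_rate
        result["stale"] = is_stale
        self.trace.append("stale" if is_stale else "ok")
        return result


def port_status(port):
    """Hypothetical tool returning a fixed record."""
    return {"port": port, "congestion": "high"}


def run(seed, calls=20):
    """Call the wrapped tool repeatedly and return the outcome trace."""
    tool = FrictionTool(port_status, seed)
    for _ in range(calls):
        try:
            tool("ROT")
        except ToolError:
            pass  # a real agent would retry or escalate here
    return tool.trace
```

Running `run(7)` twice yields identical traces, so a regression comparison between two agent versions sees the same failures in the same order.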
Stress

Stress the agent where production will stress it.

Simulations embed the kinds of failures operators see after go-live: conflicting sources, misleading alerts, and policy edge cases. Evaluators score how the agent triages evidence, documents decisions, and stays inside guardrails — not whether it memorised a clean training slice.

Operations

Every run starts clean. Every result is reproducible.

Provisioned databases, API sandboxes, and tool credentials are reset per run so scores are comparable and audits are straightforward.

Isolated

Dedicated state per run — no shared caches or leaked rows between agents.

Reproducible

Replay the same workload to verify fixes and compare versions fairly.

Scalable

Spin up many environments in parallel for training sweeps and CI gates.
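The three properties above (isolated, reproducible, scalable) can be sketched together: each run gets its own freshly seeded database, and several runs execute in parallel without seeing each other's writes. This is a minimal illustration using an in-memory SQLite database per run; the table, seed rows, and agent functions are hypothetical stand-ins for provisioned state.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical seed data loaded into every fresh environment.
SEED_ROWS = [("S-1", "delayed"), ("S-2", "on_time")]


def fresh_db():
    """Create a private in-memory database seeded with the same rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE shipments (id TEXT PRIMARY KEY, status TEXT)")
    conn.executemany("INSERT INTO shipments VALUES (?, ?)", SEED_ROWS)
    conn.commit()
    return conn


def run_once(agent):
    """Run one agent against its own database and return the final state."""
    conn = fresh_db()
    try:
        agent(conn)
        return conn.execute(
            "SELECT id, status FROM shipments ORDER BY id"
        ).fetchall()
    finally:
        conn.close()  # state is discarded, so nothing leaks to the next run


def rerouting_agent(conn):
    """Stand-in agent that writes to its environment."""
    conn.execute("UPDATE shipments SET status = 'rerouted' WHERE id = 'S-1'")


def noop_agent(conn):
    """Stand-in agent that changes nothing."""


def run_many(agents):
    """Run agents in parallel; each connection stays inside its own thread."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(run_once, agents))
```

Here the write made by `rerouting_agent` never appears in the run of `noop_agent`, and repeating `rerouting_agent` gives the same final state, which is what makes scores comparable across runs.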

Domains

Environment templates by industry shape.

Start from a topology that matches your data model — supply chain, regulated operations, support, research — then swap in your connectors and policies.

Supply chain

Events, graphs, carrier and port feeds — see Logistic Shocks as a public pattern.

Clinical / regulated

Documents, cohorts, lab and regulatory references with citation rules.

Financial compliance

Transactions, sanctions and KYC feeds, policy graphs.

Customer support

Tickets, knowledge bases, product data, escalation paths.

Research & OSINT

Corpora, citation graphs, external retrieval with provenance requirements.

Custom

Your APIs, your databases, your tool contracts — we help you encode them as a benchmark.