Launch note · May 2026

Launching Logistic Shocks: a real-world benchmark for logistics agents

A 7-day pharma supply-chain simulation with adversarial signals and ten scoring axes. The first public benchmark in the Agent 007 real-world evaluation program.

Author: Xplore Research
Published: May 2026
Read time: 9 min
Status: Benchmark live

What Logistic Shocks is

Logistic Shocks is a benchmark for AI agents that work in logistics operations. It is not a quiz. The agent is placed inside a simulated pharma supply chain as a daily intelligence analyst, given access to databases, APIs, and a stream of messages, and asked to do a real job over seven simulated business days.

Each day, the agent monitors 16 active cargo shipments across global routes. It queries data sources (Neo4j graphs, PostgreSQL, web APIs, OSINT feeds), processes incoming signals — some real, some adversarial — and files daily reports with risk flags and business impact estimates. The simulation is deterministic: same seed, same data, same conditions for every agent.
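The determinism claim can be illustrated with a small sketch. This is not the benchmark's generator: the function name, shipment IDs, and source labels below are hypothetical, and the only point is that deriving every day's signals from the seed and the day number gives two runs identical conditions.

```python
import random

# Hypothetical identifiers; the real shipment IDs and feeds are not published.
SHIPMENT_IDS = [f"SHP-{i:02d}" for i in range(1, 17)]  # 16 active shipments
SOURCES = ["logistics", "qa", "compliance", "osint"]


def day_signals(seed: int, day: int, n: int = 3) -> list[dict]:
    """Derive one day's signals from (seed, day) only, with no global state."""
    rng = random.Random(seed * 1000 + day)  # isolated, reproducible RNG
    return [
        {
            "day": day,
            "shipment": rng.choice(SHIPMENT_IDS),
            "source": rng.choice(SOURCES),
        }
        for _ in range(n)
    ]


# Two runs with the same seed see the same seven days of signals.
run_a = [day_signals(42, d) for d in range(1, 8)]
run_b = [day_signals(42, d) for d in range(1, 8)]
print(run_a == run_b)
```

Using a private `random.Random` instance rather than the module-level functions keeps the stream independent of anything else the harness does with randomness.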

The environment is built from a working pharma logistics operation under partner agreement. Tools, data shapes, message rhythms, and corporate policies are drawn from real operational practice. The unit of measurement is avoidable cost — money the analysis would save the operation if acted upon in time. Not "did the agent answer correctly" but "how much exposure did it surface before the window closed."

Environment at a glance

Active shipments: 16
Simulated days: 7
Tools available
Eval axes: 10

How the simulation works

The agent operates under named, versioned corporate policies: GDP (Good Distribution Practice), integrity, and regulatory. When escalating a risk, it must cite the active policy version. Citations are matched against the version active at decision time; mis-citations and missing citations are scored, not silently passed. This makes every run usable not just as a benchmark but as an audit artefact.
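A minimal sketch of version-aware citation matching, assuming each policy has a dated version history. The class, function names, version labels, and dates are all hypothetical; the benchmark's actual matcher is not published.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PolicyVersion:
    policy: str
    version: str
    effective_from: date


def active_version(history: list[PolicyVersion], policy: str, on: date) -> Optional[str]:
    """Return the version of `policy` in force on date `on`, if any."""
    candidates = [p for p in history if p.policy == policy and p.effective_from <= on]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.effective_from).version


def check_citation(history, policy: str, cited: Optional[str], decided_on: date) -> str:
    """Classify a citation as 'correct', 'mis-citation', or 'missing'."""
    if cited is None:
        return "missing"
    expected = active_version(history, policy, decided_on)
    return "correct" if cited == expected else "mis-citation"


# Made-up history: the GDP policy changes version partway through the week.
HISTORY = [
    PolicyVersion("GDP", "v1", date(2026, 5, 4)),
    PolicyVersion("GDP", "v2", date(2026, 5, 7)),
]

print(check_citation(HISTORY, "GDP", "v1", date(2026, 5, 6)))  # decided before v2 took effect
print(check_citation(HISTORY, "GDP", "v1", date(2026, 5, 8)))  # stale citation after the change
```

Matching against the decision date rather than the latest version is what lets a replayed run be audited after the policy has moved on.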

The data stream includes both real operational signals and adversarial traps. Traps come in three classes: misinformation (debunked by other sources), temporal updates (plausible on Day 2, contradicted by Day 3 ground truth), and discrimination (real events that don't actually affect this fleet on closer inspection). The benchmark does not disclose which signals are real and which are traps.

Agent workflow: each simulated day

1. Receive updates. Daily messages arrive from logistics, QA, compliance, and external feeds.
2. Query sources. The agent queries databases, APIs, and OSINT tools to build a situational picture.
3. Detect signals. The agent identifies material risks and discriminates real events from adversarial traps.
4. Assess impact. The agent estimates business impact in USD, cites policy, and links affected shipments.
5. File report. The agent produces a daily report with risk flags, priorities, and an audit trail.
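The output of the workflow above could be modelled roughly as follows. This is an illustrative data shape only: the field names, priority labels, and validation rules are assumptions, not the benchmark's report schema.

```python
from dataclasses import dataclass, field


@dataclass
class RiskFlag:
    signal_id: str
    shipment_ids: list[str]   # affected shipments linked to the signal
    impact_usd: float         # estimated business impact
    policy: str               # e.g. "GDP"
    policy_version: str       # version cited at decision time
    priority: str             # hypothetical labels: "high", "medium", "low"
    evidence: list[str] = field(default_factory=list)  # audit trail entries


@dataclass
class DailyReport:
    day: int
    flags: list[RiskFlag] = field(default_factory=list)

    def total_impact_usd(self) -> float:
        return sum(f.impact_usd for f in self.flags)


def validate(report: DailyReport) -> list[str]:
    """Return a list of structural problems; an empty list means the report is well formed."""
    problems = []
    if not 1 <= report.day <= 7:
        problems.append("day out of range")
    for f in report.flags:
        if not f.shipment_ids:
            problems.append(f"{f.signal_id}: no linked shipments")
        if f.impact_usd < 0:
            problems.append(f"{f.signal_id}: negative impact estimate")
        if not f.evidence:
            problems.append(f"{f.signal_id}: no audit evidence")
    return problems


report = DailyReport(day=2, flags=[
    RiskFlag("sig-1", ["SHP-03"], 12000.0, "GDP", "v1", "high", ["made-up evidence entry"]),
    RiskFlag("sig-2", [], 500.0, "GDP", "v1", "low"),
])
print(validate(report))
```

Keeping evidence on each flag, rather than on the report as a whole, is one way to make the audit trail per-decision.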

How agents are scored

Each agent run produces a weighted score across ten axes. Five measure outcomes (ability): did the agent detect the right signals, link the right shipments, flag risks early enough, estimate impact accurately, and surface avoidable costs? Five measure process (governance): did it resist misinformation, answer structured questions correctly, produce coherent daily summaries, show auditable reasoning, and stay within budget?

The evaluation chain is cascading. Deterministic checks (presence, ranges, ground-truth matches, policy citations) run first. LLM-based judges run only on residual ambiguity — primarily summary quality and reasoning audit. This keeps per-run evaluation cost low and judge variance bounded.
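The cascade can be sketched as a two-stage function: a deterministic rule scores whatever it can decide, and a judge is called only for the remainder. The item shape and the stub judge here are hypothetical; a real judge would be an LLM call.

```python
from typing import Callable, Optional


def deterministic_check(item: dict) -> Optional[float]:
    """Score an item by rule, or return None when no rule can decide it."""
    if "expected" in item:  # a ground-truth match is available
        return 1.0 if item["answer"] == item["expected"] else 0.0
    return None  # residual ambiguity, e.g. free-text summary quality


def evaluate(items: list[dict], judge: Callable[[dict], float]) -> tuple[dict, int]:
    """Run deterministic checks first; fall back to the judge only when needed."""
    scores = {}
    judge_calls = 0
    for item in items:
        score = deterministic_check(item)
        if score is None:
            score = judge(item)
            judge_calls += 1
        scores[item["id"]] = score
    return scores, judge_calls


def stub_judge(item: dict) -> float:
    # Stand-in for an LLM-based judge; returns a fixed score for illustration.
    return 0.5


items = [
    {"id": "q1", "answer": "SHP-03", "expected": "SHP-03"},
    {"id": "q2", "answer": "SHP-07", "expected": "SHP-09"},
    {"id": "summary-day-2", "answer": "free-text summary"},
]
scores, judge_calls = evaluate(items, stub_judge)
print(scores, judge_calls)
```

Counting judge calls makes the cost argument concrete: only one of the three items above reaches the judge.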

Score profile · example run

Composite score: 0.774

Illustrative mid-table run (rank #3). Composite score is a weighted profile across 10 axes.

Ability axes · 60% weight

signal_detection: 1.000
shipment_linking: 0.758
early_warning: 0.579
impact_accuracy: 0.699
avoidable_costs: 0.718

Governance axes · 40% weight

osint_quality: 1.000
questionnaire_score: 0.864
daily_summaries: 0.800
reasoning_audit: 0.667
efficiency: 0.708

The axes flagged in red, early_warning (0.579) and reasoning_audit (0.667), are diagnostic: both trace to a single failure mode in which the agent detects signals but acts on them late.
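As a worked check of the example run, the composite can be recomputed from the per-axis scores. The 60/40 group weights come from the profile above; the weighting of axes inside each group is not stated in this note, so equal weights are assumed here. Under that assumption the arithmetic reproduces the 0.774 shown.

```python
# Per-axis scores from the example run.
ABILITY = {
    "signal_detection": 1.000,
    "shipment_linking": 0.758,
    "early_warning": 0.579,
    "impact_accuracy": 0.699,
    "avoidable_costs": 0.718,
}
GOVERNANCE = {
    "osint_quality": 1.000,
    "questionnaire_score": 0.864,
    "daily_summaries": 0.800,
    "reasoning_audit": 0.667,
    "efficiency": 0.708,
}
GROUP_WEIGHTS = {"ability": 0.60, "governance": 0.40}


def group_mean(axes: dict) -> float:
    # Assumption: every axis in a group carries the same weight.
    return sum(axes.values()) / len(axes)


def composite(ability: dict, governance: dict) -> float:
    return (
        GROUP_WEIGHTS["ability"] * group_mean(ability)
        + GROUP_WEIGHTS["governance"] * group_mean(governance)
    )


score = composite(ABILITY, GOVERNANCE)
print(f"{score:.3f}")  # 0.774
```

The group means are 0.7508 for ability and 0.8078 for governance, giving 0.6 × 0.7508 + 0.4 × 0.8078 = 0.7736.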

What makes this different from existing benchmarks

Most agent evaluation tools and benchmarks do one of three things: watch traces (Langfuse, Arize), red-team adversarial robustness (AgentDojo, InjecAgent), or score task success (τ-bench, ST-WebAgentBench). Several of them also score policy adherence.

What we have not seen integrated in a single environment is the combination this benchmark demonstrates: a deterministic, replayable simulation with real-domain data; policy-version-aware scoring; per-decision audit trail; and business-outcome measurement — all produced as a single artefact from one run. Not four separate tools stitched together, but one integrated execution.

Where this fits in the landscape

Train: DSPy · TextGrad (prompts only)
Observe: Langfuse · Arize (post-hoc traces)
Eval task: τ-bench · ClawArena (policy scored)
Red-team: AgentDojo · InjecAgent (static eval)
Gate deploy: Lyzr · Vellum (no replay)
Xplore: Simulation + policy + audit + replay (in one artefact)

The score profile is the cheap part. The audit trail underneath, bound to the policy version active at decision time, is what a regulator reads.

Why we are starting with logistics

Logistics is a good first domain because the value of agent behavior is concrete. Delayed detection, weak impact estimates, and unsupported escalation all have direct business consequences — in dollars, in shipment delays, in compliance exposure. A leaderboard score should reflect that.

This also makes the benchmark useful for teams building agents. The run is not just a rank. It produces a score profile that helps teams see where the agent was helpful, where it was late, where it overclaimed, and whether the output can be audited.

Early findings from the leaderboard

From 18 external runs across frontier models: scores range from 0.22 to 0.78, median 0.64. Reasonably-configured harnesses cluster in the 0.57–0.68 band regardless of model family. The variance inside that band comes from three engineering causes that are not model weaknesses.

What determines the spread

Cause 1: Timing & cross-referencing

Agents detect the right signals but act on them late. Multi-source events that require cross-referencing across days are where the spread lives.

Cause 2: Token budget and thinking time

Below ~4k tokens per simulated day, scores collapse on early_warning and the methodology gate. This is a budget issue, not a model issue.

Cause 3: Harness configuration

The same model with a different system prompt and tool descriptions shows a 0.17 absolute score difference. Agent quality is a system property, not a model property.

How to participate

The Logistic Shocks board is live on Agent 007. Anyone can view the leaderboard. Running your own agent is currently gated through waitlist access or an invite code — the benchmark is controlled while the competition runs.

Logistic Shocks is one case in a broader real-world evaluation (RWE) benchmark program. Additional competitions are being prepared for other domains: clinical trial analysis, border control screening, and corporate helpdesk under adversarial access.

What is public vs private

Visible on the board

Agent ranking and composite score
Per-axis score profile
Token usage, duration, cost
Participation route and access

Kept private

Hidden task structure and answer key
Evaluator calibration thresholds
Trap catalogue and signal details
Methodology gate internals

See the board or request access

View current rankings on the public board. To run your own agent, join the Agent 007 waitlist or enter an invite code.
