Xplore — Agent [re]training platform

Agent [re]training platform

Evaluate[Re]trainDeployControl Evaluate
production AI agents.

Your workflows. Your data. Your success metrics. One platform for the full agent lifecycle — from first scorecard to production confidence.

Book a demo How it works

logistics-agent · ticket SH-2607

Live reliability loop

live

TASK · 09:41 support · pharma logistics

“What's the new ETA for SH-2607? Delayed in Rotterdam, urgent for the clinical trial.”

Agent drafting v6 replied v6 under review applying patch replied v7

agent · drafting

agent v6 “ETA: 14 May.”

agent v6 · review “ETA: 14 May.”

agent · patch v6 → v7

agent v7 “ETA 16 May, per port log (09:38).”

Benchmark idle scoring v6 2 axes failing idle re-scoring v7

awaiting run…

grounding

0.54

policy

0.48

× shipment_status lookup skipped

× no source citation

awaiting v7…

grounding

0.92

policy

0.84

Trainer standby standby analyzing alert writing patch deployed v7

no alerts

⚠ Unsupported ETA · run_c4c0a479

+ require shipment_status.lookup before ETA
+ cite source timestamp in reply
~ escalate when delay > 24h

✓ AGENTS.md v7 deployed · gate passed

Task

Score

Alert

Patch

Recover

Faster dev cycles

From first evaluation to production-ready agent in days, not months of prompt engineering.

Safe in production

Every agent scored live. Regressions caught and fixed before users notice.

Simulate before you ship

Build realistic simulations of your agent's work. Score every dimension. Ship only what passes.

Evaluate

Know exactly where your agents fail.

Build a benchmark that simulates your agent's real work — then score it with composable evaluators. Safety, accuracy, tool usage, process compliance, cost, reasoning — 8 evaluator types, each weighted for your domain.

See Forge evaluation →

Forge eval chain · supply-chain-v7

Safety gate

0.94

Route accuracy

0.82

Process

0.75

Completeness

0.88

Efficiency

0.80

Tool usage

0.90

Weighted 0.85

safety_gate

w = 0.30

route_acc

w = 0.25

process

w = 0.25

Forge training run — IS/OS/meta fitness over automated iterations with per-category breakdown

Config diff

Iteration 14 → 15

trainer: openclaw · meta_mutations

+ tool: verify_source_citation

"Cross-check facts against original document"

~ rule: escalation_policy

threshold: 0.4 → 0.6

~ instruction: verification section

added: "Always cite page number"

score: 0.31 → 0.35 +0.04 promoted

[Re]train

Fix what's broken. Automatically.

Forge optimizes your agent against the scores that matter to you. Every iteration is measured on tasks from your environment — not generic benchmarks. Every change tracked: tools added, rules rewritten, prompts adjusted.

·Automated cycles, tested on out-of-sample tasks from your environment
·Full diff between versions — see what changed and why scores moved
·Scores drop → retraining starts on its own

See how training works →

Deploy

Agents go live when they're ready. Not before.

Four gates between training and production. Set a score threshold, let the best candidate win, or review manually. Every version tracked, every promotion reversible.

·Best performer goes live automatically — or you approve
·Score threshold gates — nothing below your bar ships
·One-click rollback to any previous version

See deployment controls →

Forge training run configuration — promote policy, trainer strategy, mutation knobs

Forge agent overview — Logistics v7, 148 runs, 6 versions, performance over time

Production controls · live

● Accuracy & Safety enabled

agent: logistics-v7

last 24h: 148 runs · avg: 0.71

⚠ drift: context_adherence 0.78 → 0.61

● RAG quality enabled

agent: chatbot-prod

last 24h: 2,341 runs · avg: 0.84

alerts: 0

● Cost guard enabled

agent: bi-analyst-v1

last 24h: 412 runs · avg: 0.91

alerts: 0

Cost Over Time — stacked area chart showing agent, eval, trainer, cert costs

Control

You know before your users do.

Every agent scored live in production. Drift detected the moment quality drops. Cost tracked per dollar — agent, eval, trainer, certification. When something breaks, retraining triggers automatically.