Train agents against reality.
Forge measures, improves, deploys, and monitors AI agents across real tasks, real tools, and real business constraints. Any model. Any agent framework. One loop from simulation to live performance.
Xplore builds the operating layer for enterprise agents.
Two product surfaces: Agent 007 for public benchmark cases and Forge for private agent training, evaluation, deployment, and control.
Public benchmark layer
Open cases, leaderboard, adversarial tasks, and market signal for agent quality.
Private training layer
Company data, policies, tools, evals, trainers, promotion gates, and live monitoring.
Prompt engineering is not an operating model.
Teams can describe the desired behaviour. They cannot reliably produce it by hand: hallucinations, unstable tool use, regression after model updates, and hidden failure modes compound across multi-turn workflows.
Prompt tweak
Instruction change without objective evidence.
Ad hoc test
A few examples, usually from memory.
Ship or rollback
No durable proof that the agent improved outside the examples.
What is a trained agent?
Not a trained model. A matched configuration that performs reliably in a target environment.
OpenAI · Anthropic · Gemini · Mistral · Llama/local · future frontier
LangGraph · CrewAI · AutoGen · OpenAI Agents SDK · Claude Agent SDK · OpenClaw · custom
Prompting
Goal, role, constraints, decision rules.
Skills
Reusable procedures for research, dialogue, triage, escalation.
Tools
APIs, databases, browsers, CRMs, ticket systems, OSINT sources.
Evals
Objective, model-based, and human checks over output and trace.
Agent structure
Subagents, policies, memory, routing, tool permissions, runtime.
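To make "matched configuration" concrete, here is a minimal sketch of how the components above could be held in one record. The class name `AgentConfig`, its field names, and the example values are hypothetical and chosen only for illustration; they are not Forge's actual schema.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical container for the components listed above."""
    model: str                     # vendor or local model identifier
    framework: str                 # agent framework or custom harness
    prompt: str                    # goal, role, constraints, decision rules
    skills: tuple[str, ...] = ()   # reusable procedures
    tools: tuple[str, ...] = ()    # APIs, databases, ticket systems
    evals: tuple[str, ...] = ()    # checks over output and trace
    structure: dict = field(default_factory=dict)  # subagents, routing, memory


# Illustrative values only.
baseline = AgentConfig(
    model="model-a",
    framework="custom",
    prompt="Resolve billing tickets; escalate when policy requires it.",
    skills=("triage", "escalation"),
    tools=("crm", "ticket_system"),
    evals=("outcome", "trace", "runtime"),
)

# Swapping the model yields a new candidate; every other component carries over.
candidate = replace(baseline, model="model-b")
```

Treating the model as one field among several reflects the point that the trained artefact is the whole configuration, so a model swap produces a new candidate to evaluate rather than a new agent to build.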
Forge turns agent development into a measured, closed training loop.
Data lake + tools
Docs, CRM, tickets, OSINT, APIs, policies, history.
Generate agent tasks
Requests, hidden state, ground truth, traps.
Output + trace + runtime
Quality, policy, cost, tool correctness.
Case-specific trainers
Prompt, skill, tool, subagent, policy.
Promotion gates
IS / OS pass → production release.
Drift & regression detection
New failures, cost spikes, policy violations → triggers retrain.
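The loop above can be sketched as a toy program. Everything here is an assumption made for illustration: the "agent" is reduced to a set of task kinds it can handle, the trainer simply learns the most frequent in-sample failure, and the promotion gate is a pass-rate threshold on both IS and OS. Live drift monitoring is not modelled.

```python
from collections import Counter


def pass_rate(handled, tasks):
    """Share of tasks whose kind the toy agent handles."""
    return sum(task in handled for task in tasks) / len(tasks)


def train_step(handled, is_tasks):
    """Toy trainer: learn the most frequent failing kind seen in-sample."""
    failures = Counter(task for task in is_tasks if task not in handled)
    if not failures:
        return handled
    worst_kind, _ = failures.most_common(1)[0]
    return handled | {worst_kind}


def run_loop(handled, is_tasks, os_tasks, gate=0.9, max_rounds=10):
    """Evaluate, train on IS failures, and stop once both gates pass."""
    for rounds in range(max_rounds + 1):
        passed = (pass_rate(handled, is_tasks) >= gate
                  and pass_rate(handled, os_tasks) >= gate)
        if passed or rounds == max_rounds:
            return handled, rounds, passed
        handled = train_step(handled, is_tasks)


# Hypothetical task kinds, borrowed from the support example.
is_tasks = ["billing", "billing", "incident", "escalation"]
os_tasks = ["billing", "incident", "incident"]
final, rounds, promoted = run_loop(frozenset(), is_tasks, os_tasks)
```

Note that the trainer only ever sees IS failures; the OS score is used solely as a gate, which is what keeps it a meaningful check on generalization.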
Forge is organised by task family, then trained on concrete cases.
The family defines the eval grammar. The case defines the tools, ground truth, personas, policies, and promotion gate.
Dialogue agents
Multi-turn interaction where tone, outcome, escalation, and policy adherence matter.
Research agents
Long-horizon evidence gathering where coverage, source quality, and grounded synthesis matter.
Operational agents
Tool-using workflows where actions, state changes, timing, and rollback risk matter.
Pilot selection rule: one family, one painful case, one measurable target, one baseline agent.
SimEngine turns real work into tasks the agent must handle.
It generates concrete agent assignments — customer requests, research briefs, dispatch jobs, compliance checks — plus the hidden state and scoring rules needed to know whether the agent succeeded.
| Sample | Purpose | Feedback |
|---|---|---|
| IS in-sample | Train and tune against known failure modes. | Yes |
| OS out-of-sample | Verify generalization on held-out tasks. | No |
| LIVE production stream | Monitor drift, regression, cost and incident risk. | Observed |
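One way to keep the IS and OS samples apart is a deterministic split on a task identifier, with failure detail returned to the trainer only for IS tasks. The sketch below assumes hypothetical function names (`assign_sample`, `feedback_for`) and a 30% held-out fraction; it is not a description of how SimEngine assigns samples, and the LIVE stream is left out.

```python
import hashlib


def assign_sample(task_id: str, os_fraction: float = 0.3) -> str:
    """Assign a task to IS or OS from a stable hash of its identifier."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = digest[0] / 256  # value in [0, 1)
    return "OS" if bucket < os_fraction else "IS"


def feedback_for(sample: str, failure_detail: str):
    """Return failure detail for IS tasks only; OS tasks give no feedback."""
    return failure_detail if sample == "IS" else None


# The same identifier always lands in the same sample across runs.
first = assign_sample("ticket-0001")
second = assign_sample("ticket-0001")
```

Hashing the identifier rather than sampling at random means a task cannot drift from held-out to in-sample between runs, which would leak OS cases into training.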
Example: support-agent simulation for a B2B SaaS helpdesk.
Forge creates realistic support tickets, gives the agent production tools, then grades whether the ticket was resolved.
What the agent sees
What Forge asks the agent
| Ticket | Type |
|---|---|
| "Why was I charged twice?" | billing |
| "Our API is down." | incident |
| "Renewal at risk." | escalation |
How Forge scores it
| Dimension | What is checked |
|---|---|
| Outcome | resolved / escalated |
| Trace | right systems used |
| Runtime | turns · latency · cost |
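The three scoring dimensions above (outcome, trace, runtime) could be checked per run roughly as follows. The `Run` and `Case` records, the tool names, and the budget values are all hypothetical; in particular, reading "right systems used" as "every required tool appears in the trace" is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Run:
    outcome: str            # "resolved" or "escalated"
    tools_used: list[str]   # systems the agent called, in order
    turns: int
    latency_s: float
    cost_usd: float


@dataclass
class Case:
    expected_outcome: str
    required_tools: set[str]
    max_turns: int
    max_latency_s: float
    max_cost_usd: float


def score(run: Run, case: Case) -> dict:
    """Grade one run on outcome, trace, and runtime, each as pass/fail."""
    return {
        "outcome": run.outcome == case.expected_outcome,
        "trace": case.required_tools <= set(run.tools_used),
        "runtime": (run.turns <= case.max_turns
                    and run.latency_s <= case.max_latency_s
                    and run.cost_usd <= case.max_cost_usd),
    }


# Illustrative billing case: "Why was I charged twice?"
case = Case("resolved", {"billing_api"}, max_turns=6,
            max_latency_s=60.0, max_cost_usd=0.50)
good = Run("resolved", ["crm", "billing_api"], 4, 22.5, 0.12)
bad = Run("resolved", ["crm"], 9, 22.5, 0.12)
good_score = score(good, case)
bad_score = score(bad, case)
```

The second run shows why the dimensions are kept separate: it reaches the expected outcome but skips the billing system and exceeds the turn budget, so it passes on outcome and fails on trace and runtime.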
IS: known patterns, with feedback for the trainer.
OS: held-out personas, edge cases, traps.
LIVE: production drift, with new issues and new behaviour.
Forge grades the result, the trace, and the operating cost.
Result: did the agent produce the right answer, structure, evidence, and tone?
Trace: did it use the right tools, respect policy, cite sources, and avoid loops?
Operating cost: how many tokens, seconds, handoffs, retries, and failed calls?
Evals are case-specific. Sales, support, risk monitoring, and OSINT need different graders, success criteria, and failure taxonomies.
Trainers are tuned to the case, not the model vendor.
Forge chooses what to change: prompt, tools, skills, subagents, policies, routing, memory, or runtime parameters. The model remains replaceable.
Accuracy focus
Tighten instructions, task decomposition, source grounding, and validation rules.
Tool-use focus
Change tool descriptions, permission scopes, call order, retries, and fallbacks.
Cost / latency focus
Compress context, route simpler tasks to smaller models, reduce loops.
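As a rough illustration of choosing a focus from observed failures rather than from the model vendor, the sketch below maps failure labels to the three focus areas above and picks the most frequent. The failure labels and the mapping are hypothetical; a real failure taxonomy would be case-specific.

```python
from collections import Counter

# Hypothetical failure labels mapped to the three focus areas above.
FOCUS_BY_FAILURE = {
    "wrong_answer": "accuracy",
    "ungrounded_claim": "accuracy",
    "wrong_tool": "tool_use",
    "failed_call": "tool_use",
    "over_budget": "cost_latency",
    "loop": "cost_latency",
}


def choose_focus(failures: list[str]) -> str:
    """Pick the focus area that accounts for the most observed failures."""
    counts = Counter(
        FOCUS_BY_FAILURE[f] for f in failures if f in FOCUS_BY_FAILURE
    )
    if not counts:
        return "none"
    return counts.most_common(1)[0][0]


focus = choose_focus(["wrong_tool", "failed_call", "wrong_answer"])
```

Nothing in the selection depends on which model is in use, which is the sense in which the model remains replaceable.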
What Forge has that generic agent tools do not.
SimEngine with IS / OS / LIVE
One evaluation design spans training data, held-out verification, and production monitoring.
Case-specific trainers
Improvement strategies map to sales, support, risk, OSINT and other task families.
Task generation
Forge creates concrete agent assignments from data lake, policy, tools, and ground truth.
Any model: OpenAI, Anthropic, Gemini, local models, or future frontier models.
Any agent framework: LangGraph, CrewAI, OpenClaw-style runtimes, custom harnesses, internal frameworks.
The market is converging on the same primitives.
Anthropic defines agent evals around tasks, success criteria, graders, and transcripts/traces. Multi-turn tool use makes mistakes propagate and compound.
Source: Anthropic Engineering, 2025
OpenAI exposes Evals as first-class infrastructure: datasets, graders, testing criteria, and repeatable runs across models and parameters.
Source: OpenAI Platform Evals API
LangSmith puts traces, production monitoring, feedback, and online evaluations at the centre of agent operations.
Source: LangChain / LangSmith docs
Cisco reports 83% of companies plan to deploy agents, while only a minority are fully ready. Pacesetters are 3x more likely to track AI impact.
Source: Cisco AI Readiness Index 2025
We are looking for pilot environments where measured agent training matters.
Give us one workflow, one agent candidate, a data lake or document corpus, and a measurable target. Forge will build the simulation, run baseline evals, train candidate configurations, and show IS / OS / LIVE evidence.
Book pilot workshop
Task family, data access, policies, historical examples, acceptable risk limits.
Benchmark, baseline score, improved agent configuration, traces, deploy gate.
Sales, support, risk monitoring, OSINT, compliance, logistics, regulated workflows.