Xplore
[Re]train

Agents that stop failing on your specific tasks.

Not "smarter agents in general." Targeted optimization — tools, prompts, rules, policies — under your constraints, for your KPIs.

For development teams.

  • No manual prompt iteration

    Forge iterates automatically. You define what "good" looks like (evals), it finds the configuration that gets there.

  • Full-stack agent optimization

    Not just prompts — tools, rules, data access, routing. At L3, the optimizer creates new tools autonomously.

  • Every change visible and reversible

    Git-like diffs of the full agent structure. Score delta per change. Roll back anything.

  • Automatic re-optimization on degradation

    When production scores drop, retraining triggers. No human in the loop unless you want one.
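To make "you define what good looks like" concrete, here is a minimal sketch of an eval suite scored as a pass rate. The `Eval` shape, the `score` function, and the toy agent are illustrative assumptions, not Forge's actual API.

```python
from typing import Callable

# Hypothetical shapes: an agent maps a prompt to an output string,
# and an eval pairs a prompt with a pass/fail check on that output.
Agent = Callable[[str], str]
Eval = tuple[str, Callable[[str], bool]]

def score(agent: Agent, evals: list[Eval]) -> float:
    """Return the fraction of evals whose check passes on the agent's output."""
    passed = sum(1 for prompt, check in evals if check(agent(prompt)))
    return passed / len(evals)

evals: list[Eval] = [
    ("Summarize the contract", lambda out: "page" in out.lower()),
    ("What is the refund window?", lambda out: "30 days" in out),
]

def toy_agent(prompt: str) -> str:
    # Stand-in for a real agent call.
    return "Refunds are accepted within 30 days (see page 4)."

print(score(toy_agent, evals))  # 1.0
```

An optimizer can then treat `score` as the target and compare candidate configurations against it.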

Fitness chart: IS, OS, and meta scores (0.0 to 1.0) over iterations.
Latest values: IS 0.374 · OS 0.326 · meta 0.349 · gap −0.048
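The gap figure in the fitness readout is consistent with OS minus IS (0.326 − 0.374 = −0.048). Assuming IS and OS stand for in-sample and out-of-sample scores, a minimal sketch of that arithmetic follows. The function name is hypothetical.

```python
def generalization_gap(in_sample: float, out_of_sample: float) -> float:
    """Return OS minus IS; a negative value means lower scores on unseen tasks."""
    return round(out_of_sample - in_sample, 3)

gap = generalization_gap(0.374, 0.326)
print(gap)  # -0.048
```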
Config diff · iteration 14 → 15
trainer: openclaw · meta_mutations
+ tool: verify_source_citation
    "Cross-check facts against original document"
~ rule: escalation_policy
    threshold: 0.4 → 0.6
~ instruction: verification section
    added: "Always cite page number"
score: 0.31 → 0.35 (+0.04) · promoted
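The diff above pairs a set of changes with a score delta and a promotion decision. A minimal sketch of that bookkeeping, assuming a candidate is promoted only when it beats the last promoted score and that rollback restores the previous promoted iteration. The class and method names are hypothetical, and Forge's actual promotion rule is not stated in the source.

```python
from dataclasses import dataclass, field

@dataclass
class Iteration:
    number: int
    changes: list[str]
    score: float

@dataclass
class History:
    promoted: list[Iteration] = field(default_factory=list)

    def consider(self, candidate: Iteration) -> tuple[bool, float]:
        """Promote the candidate if it beats the last promoted score."""
        baseline = self.promoted[-1].score if self.promoted else 0.0
        delta = round(candidate.score - baseline, 2)
        if delta > 0:
            self.promoted.append(candidate)
            return True, delta
        return False, delta

    def roll_back(self) -> Iteration:
        """Drop the latest promoted iteration and return the one now in effect."""
        if len(self.promoted) < 2:
            raise ValueError("nothing to roll back to")
        self.promoted.pop()
        return self.promoted[-1]

history = History()
history.consider(Iteration(14, [], 0.31))
ok, delta = history.consider(Iteration(15, [
    "+ tool: verify_source_citation",
    "~ rule: escalation_policy threshold 0.4 -> 0.6",
    "~ instruction: verification section",
], 0.35))
print(ok, delta)  # True 0.04
```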

Five strategies — each addresses a different class of weakness.

Agent failures aren't uniform. Some are prompt-level mistakes, some are architectural. A single training strategy can't address both.

Manual

For regulated domains where every change requires human review before it enters the loop.

Self-improvement

Addresses failures the agent can diagnose itself — wrong tool selection, incomplete reasoning, missed edge cases.

Evolutionary

Parallel variants explore a large search space. Finds optima that sequential iteration misses.
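A minimal sketch of the evolutionary idea, assuming a simple elitist scheme: each generation mutates the current best configuration into several variants and keeps the highest scorer. The fitness function, mutation operator, and `threshold` knob are toy stand-ins, not Forge internals. The variants are independent, so a real system could evaluate them in parallel; this sketch evaluates them sequentially.

```python
import random

def evolve(fitness, seed_config, mutate, generations=10, population=8, rng=None):
    """Keep the best variant each generation, then mutate it into new variants."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    best = seed_config
    for _ in range(generations):
        # The current best always competes, so fitness never decreases.
        variants = [best] + [mutate(best, rng) for _ in range(population - 1)]
        best = max(variants, key=fitness)
    return best

def toy_fitness(config):
    # Toy stand-in for an eval suite: the score peaks at threshold = 0.6.
    return 1.0 - abs(config["threshold"] - 0.6)

def toy_mutate(config, rng):
    # Nudge the threshold by a small random step, clamped to [0, 1].
    step = rng.uniform(-0.1, 0.1)
    return {**config, "threshold": min(1.0, max(0.0, config["threshold"] + step))}

start = {"threshold": 0.4}
best = evolve(toy_fitness, start, toy_mutate)
```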

Step-by-step

Optimizes coupled dimensions one at a time. Improving safety often degrades throughput, so tuning both at once makes scores oscillate; changing one dimension per pass prevents that.
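One way to read this strategy is as a guarded, single-dimension search: improve the dimension in focus, and reject any change that costs the other dimensions more than a tolerance. The sketch below assumes that reading. The function, the `checks` knob, and the toy evaluator are hypothetical.

```python
def optimize_one_dimension(config, proposals, evaluate, focus, tolerance=0.02):
    """Accept the proposal that most improves `focus` while every other
    dimension stays within `tolerance` of its baseline score."""
    baseline = evaluate(config)
    best_config, best_scores = config, baseline
    for proposal in proposals:
        candidate = {**config, **proposal}
        scores = evaluate(candidate)
        others_hold = all(
            scores[dim] >= baseline[dim] - tolerance
            for dim in baseline
            if dim != focus
        )
        if others_hold and scores[focus] > best_scores[focus]:
            best_config, best_scores = candidate, scores
    return best_config, best_scores

def toy_evaluate(config):
    # Toy trade-off: more checks raise safety but cost throughput.
    checks = config["checks"]
    return {
        "safety": round(0.5 + 0.1 * checks, 2),
        "throughput": round(1.0 - 0.05 * checks * checks, 2),
    }

best_config, best_scores = optimize_one_dimension(
    {"checks": 1},
    [{"checks": 2}, {"checks": 3}],
    toy_evaluate,
    focus="safety",
    tolerance=0.2,
)
print(best_config)  # {'checks': 2}
```

Here three checks would score higher on safety, but the proposal is rejected because throughput falls further than the tolerance allows.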

Deep training

When the bottleneck isn't the prompt. Tool design, data access patterns, and workflow structure all affect quality.


Depth levels — because the bottleneck changes.

Early iterations fix prompt-level issues (L0). As quality improves, the bottleneck shifts to workflow structure (L2), tool design (L3), or multi-agent coordination (L5). Depth levels ensure the trainer reaches the actual constraint.

Training depth ladder
L0 · Prompt tuning · Edits instructions, rules, priorities · ● live
L1 · Diagnostic training · Reads traces, finds root causes, then edits · ● live
L2 · Architecture training · Changes agent workflow: adds steps, branches · ● live
L3 · Tool creation · Writes new tools, extends agent capabilities · ● live
L4 · Curriculum evolution · Improves the tests themselves · ○ roadmap
L5 · Team synthesis · Creates specialized sub-agents · ○ roadmap
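The source does not say how the trainer decides to go deeper. As a sketch under one plausible assumption, the trainer could move down one level whenever the latest iteration gains less than a minimum amount, stopping at the deepest level marked live. The function name and the `min_gain` parameter are hypothetical.

```python
LIVE_DEPTHS = ["L0", "L1", "L2", "L3"]  # L4 and L5 are listed as roadmap

def next_depth(current: str, recent_scores: list[float], min_gain: float = 0.01) -> str:
    """Move one level deeper when the latest iteration gained less than min_gain."""
    if len(recent_scores) < 2:
        return current  # not enough history to judge stagnation
    gain = recent_scores[-1] - recent_scores[-2]
    index = LIVE_DEPTHS.index(current)
    if gain < min_gain and index < len(LIVE_DEPTHS) - 1:
        return LIVE_DEPTHS[index + 1]
    return current

print(next_depth("L0", [0.30, 0.35]))   # L0: still improving at this depth
print(next_depth("L0", [0.35, 0.352]))  # L1: gains have stalled
```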

The monitoring-to-training loop closes automatically.

Monitoring and training share the same evaluation infrastructure. Drift detection already has the eval chain, the task suite, and the baseline — retraining is another iteration of the same loop.

Auto-retrain trigger
Trigger: cert_drift > 0.10
Strategy: self-improvement
Promote policy: threshold ≥ 0.85
Max iterations: 20
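The four settings above can be read as a small policy object plus a bounded loop. The sketch below assumes that `cert_drift` is a number compared against the trigger and that the promote threshold applies to the iteration's eval score. The class, function names, and `run_iteration` callback are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class RetrainPolicy:
    drift_trigger: float = 0.10       # retrain when cert_drift exceeds this
    strategy: str = "self-improvement"
    promote_threshold: float = 0.85   # assumed to apply to the eval score
    max_iterations: int = 20

def should_retrain(cert_drift: float, policy: RetrainPolicy) -> bool:
    """Return True when measured drift crosses the trigger."""
    return cert_drift > policy.drift_trigger

def retrain(policy: RetrainPolicy,
            run_iteration: Callable[[int], float]) -> Optional[tuple[int, float]]:
    """Run up to max_iterations; return the first (iteration, score) that
    meets the promote threshold, or None if none does."""
    for i in range(1, policy.max_iterations + 1):
        score = run_iteration(i)
        if score >= policy.promote_threshold:
            return i, score
    return None

policy = RetrainPolicy()
toy_scores = [0.70, 0.80, 0.86]  # toy scores for iterations 1 to 3
if should_retrain(0.12, policy):
    result = retrain(policy, lambda i: toy_scores[i - 1])
    print(result)  # (3, 0.86)
```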

For the business.

Time to production: days, not months

Automated optimization cycles start at $4.91 (figure from an example run). Instead of 3 months of manual prompt engineering, Forge optimizes to your targets automatically.

Agents that don't break on edge cases

Out-of-sample testing on tasks the agent hasn't seen. You know it works beyond training data before you ship.

Self-healing in production

When the world changes — new policies, new tools, new data — retraining triggers automatically. The agent adapts.