[Re]train

Agents that stop failing on your specific tasks.

Not "smarter agents in general." Targeted optimization — tools, prompts, rules, policies — under your constraints, for your KPIs.

For development teams.


No manual prompt iteration

Forge iterates automatically. You define what "good" looks like (evals); Forge finds the configuration that gets there.

Full-stack agent optimization

Not just prompts: tools, rules, data access, and routing too. At depth level L3 (tool creation), the optimizer creates new tools autonomously.

Every change visible and reversible

Git-like diffs of the full agent structure. Score delta per change. Roll back anything.

Automatic re-optimization on degradation

When production scores drop, retraining triggers. No human in the loop unless you want one.
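As a rough illustration of "you define the evals, the optimizer finds the configuration", the sketch below scores candidate configurations against a small eval suite and keeps the best one. All names here (`Agent`, `Eval`, `pass_rate`, `best_config`) are hypothetical and are not Forge's API; a real optimizer would generate candidates rather than receive a fixed list.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical types for illustration only; not Forge's actual interfaces.
Config = Dict[str, object]
Agent = Callable[[Config, str], str]        # runs the agent on one task input
Eval = Tuple[str, Callable[[str], bool]]    # (task input, pass/fail check)


def pass_rate(agent: Agent, config: Config, evals: List[Eval]) -> float:
    """Fraction of evals the agent passes under the given configuration."""
    passed = sum(1 for task, check in evals if check(agent(config, task)))
    return passed / len(evals)


def best_config(
    agent: Agent, candidates: Iterable[Config], evals: List[Eval]
) -> Tuple[Config, float]:
    """Score every candidate configuration and return the highest scorer."""
    scored = [(pass_rate(agent, cfg, evals), cfg) for cfg in candidates]
    top_score, top_cfg = max(scored, key=lambda pair: pair[0])
    return top_cfg, top_score
```

The evals are the only place where "good" is defined; the search never inspects the configuration's contents, which is what lets the same loop tune prompts, rules, or tools.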

Fitness · IS / OS / meta

IS 0.374 · OS 0.326 · meta 0.349 · gap −0.048
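The gap in the panel above is consistent with out-of-sample minus in-sample score, assuming IS and OS stand for in-sample and out-of-sample (the page does not spell them out). A minimal sketch of that arithmetic follows; the `overfits` helper and its tolerance are hypothetical additions, and the formula behind the meta score is not given, so it is not reproduced.

```python
def generalization_gap(in_sample: float, out_of_sample: float) -> float:
    """Gap = OS - IS; a negative value means the agent does worse on unseen tasks."""
    return round(out_of_sample - in_sample, 3)


def overfits(gap: float, tolerance: float = 0.05) -> bool:
    """Hypothetical check: flag a run whose OS score trails IS by more than `tolerance`."""
    return gap < -tolerance


gap = generalization_gap(0.374, 0.326)  # the values shown in the panel
```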

Config diff

Iteration 14 → 15
trainer: openclaw · meta_mutations

+ tool: verify_source_citation
  "Cross-check facts against original document"
~ rule: escalation_policy
  threshold: 0.4 → 0.6
~ instruction: verification section
  added: "Always cite page number"

score: 0.31 → 0.35 (+0.04) · promoted
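A diff like the one above can be produced by comparing two snapshots of the agent configuration key by key. The sketch below assumes the configuration flattens to a dictionary; the key names and the promotion rule (promote on any positive score delta) are assumptions for illustration, since the page only shows that a +0.04 change was promoted.

```python
from typing import Dict, List


def diff_config(old: Dict[str, object], new: Dict[str, object]) -> List[str]:
    """Return '+', '-' and '~' lines describing how `new` differs from `old`."""
    lines = []
    for key in sorted(old.keys() | new.keys()):
        if key not in old:
            lines.append(f"+ {key}: {new[key]}")               # added
        elif key not in new:
            lines.append(f"- {key}: {old[key]}")               # removed
        elif old[key] != new[key]:
            lines.append(f"~ {key}: {old[key]} → {new[key]}")  # changed
    return lines


def should_promote(old_score: float, new_score: float, min_delta: float = 0.0) -> bool:
    """Assumed rule: promote only if the score improves by more than `min_delta`."""
    return new_score - old_score > min_delta
```

Rolling back is then a matter of restoring the `old` snapshot, which is why keeping whole-configuration snapshots per iteration is convenient.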


Five strategies — each addresses a different class of weakness.

Agent failures aren't uniform. Some are prompt-level mistakes, some are architectural. A single training strategy can't address both.

Manual

For regulated domains where every change requires human review before it enters the loop.

Self-improvement

Addresses failures the agent can diagnose itself — wrong tool selection, incomplete reasoning, missed edge cases.

Evolutionary

Parallel variants explore a large search space. Finds optima that sequential iteration misses.

Step-by-step

Isolates coupled dimensions. Improving safety often degrades throughput, so optimizing one dimension at a time prevents oscillation.

Deep training

When the bottleneck isn't the prompt. Tool design, data access patterns, and workflow structure all affect quality.
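To make the step-by-step strategy concrete, here is a minimal sketch of one-dimension-at-a-time search: each dimension is tuned while the others are held fixed, and a change is kept only if the overall score strictly improves. The function name, the dimension names in the usage note, and the scoring function are hypothetical; this is a generic coordinate-wise search, not a description of how Forge implements the strategy.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def step_by_step(
    start: Config,
    options: Dict[str, List[float]],
    score: Callable[[Config], float],
    sweeps: int = 2,
) -> Tuple[Config, float]:
    """Tune one dimension at a time, holding every other dimension fixed."""
    best, best_score = dict(start), score(start)
    for _ in range(sweeps):
        for dim, values in options.items():
            for value in values:
                trial = {**best, dim: value}      # change exactly one dimension
                trial_score = score(trial)
                if trial_score > best_score:      # keep only strict improvements
                    best, best_score = trial, trial_score
    return best, best_score
```

For example, calling it with `options={"strictness": [0.4, 0.6], "retries": [0, 2]}` tries each strictness value first, then each retry count, so a regression can always be traced to a single change.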


Depth levels — because the bottleneck changes.

Early iterations fix prompt-level issues (L0). As quality improves, the bottleneck shifts to workflow structure (L2), tool design (L3), or multi-agent coordination (L5). Depth levels ensure the trainer reaches the actual constraint.

Training depth ladder

L0 · Prompt tuning · Edits instructions, rules, priorities · ● live

L1 · Diagnostic training · Reads traces, finds root causes, then edits · ● live

L2 · Architecture training · Changes agent workflow: adds steps, branches · ● live

L3 · Tool creation · Writes new tools, extends agent capabilities · ● live

L4 · Curriculum evolution · Improves the tests themselves · ○ roadmap

L5 · Team synthesis · Creates specialized sub-agents · ○ roadmap
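One way to read the ladder is as an escalation rule: stay at the current depth while it still yields gains, and move to the next live rung once improvement plateaus. The sketch below encodes that reading. The level numbers are inferred from the rung order, and the plateau rule and `min_gain` value are assumptions rather than documented behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Depth:
    level: int
    name: str
    live: bool  # False for rungs marked "roadmap"


# Level numbers are inferred from the order of the ladder.
LADDER = [
    Depth(0, "Prompt tuning", True),
    Depth(1, "Diagnostic training", True),
    Depth(2, "Architecture training", True),
    Depth(3, "Tool creation", True),
    Depth(4, "Curriculum evolution", False),
    Depth(5, "Team synthesis", False),
]


def next_depth(current: int, recent_gains: List[float], min_gain: float = 0.01) -> int:
    """Stay while the current depth still improves; otherwise go to the next live rung."""
    if recent_gains and max(recent_gains) >= min_gain:
        return current
    for depth in LADDER:
        if depth.level > current and depth.live:
            return depth.level
    return current  # no deeper live rung is available
```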


The monitoring-to-training loop closes automatically.

Monitoring and training share the same evaluation infrastructure. Drift detection already has the eval chain, the task suite, and the baseline — retraining is another iteration of the same loop.

Auto-retrain trigger

Trigger: cert_drift > 0.10

Strategy: self-improvement

Promote policy: threshold ≥ 0.85

Max iterations: 20
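The trigger settings above can be read as a small policy object plus a bounded loop. The sketch below is one plausible interpretation: retraining starts when `cert_drift` exceeds 0.10, runs at most 20 iterations, and promotes the first result scoring at least 0.85. The class, the function names, and the `run_iteration` callback are hypothetical, and the reading of "threshold ≥ 0.85" as a score cutoff is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class RetrainPolicy:
    drift_trigger: float = 0.10         # Trigger: cert_drift > 0.10
    strategy: str = "self-improvement"  # Strategy
    promote_threshold: float = 0.85     # Promote policy: threshold >= 0.85
    max_iterations: int = 20            # Max iterations


def should_retrain(cert_drift: float, policy: RetrainPolicy) -> bool:
    """Retraining fires only when drift strictly exceeds the trigger."""
    return cert_drift > policy.drift_trigger


def retrain(
    policy: RetrainPolicy, run_iteration: Callable[[int], float]
) -> Tuple[bool, int, float]:
    """Run up to max_iterations; return (promoted, iterations used, best score)."""
    best = float("-inf")
    for i in range(1, policy.max_iterations + 1):
        best = max(best, run_iteration(i))  # run_iteration is a hypothetical callback
        if best >= policy.promote_threshold:
            return True, i, best
    return False, policy.max_iterations, best
```

Because monitoring and training share the same evals, `run_iteration` can return a score from the same suite that produced the drift signal.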