Xplore
Evaluate · Open Research

Every case is a research contribution.

Open research on evaluation methodology, training dynamics, and agent self-improvement. Cases, metrics, and behavioral analytics as research artifacts — built with academic partners.

3 areas · Evaluation, training, behavioral analytics · Active research
7 cases · Publishable research environments · Agent 007
92+ agents · Behavioral data across runs · Open dataset
8 axes · Composable evaluation dimensions · Research framework
Evaluate

The science behind reliable evaluation.

Cascading evaluation chains. Scoring under non-determinism. LLM-as-judge calibration. Every methodology question faced in production becomes a published research contribution.

Evaluation methodology
  • Composable evaluation chains: cascading architecture
  • Scoring under non-determinism: reproducibility studies
  • LLM-as-judge calibration: alignment with human raters
  • Multi-axis weighting: domain-specific optimization
Agent training
  • Prompt tuning to tool creation: the training spectrum
  • In-sample/out-of-sample (IS/OS) generalization: when does training transfer?
  • Evolutionary strategies for agent-level optimization
  • Promotion policies: when is an agent ready?
[Re]train

Understand why agents improve — and when they don't.

Training curves, mutation strategies, and generalization gaps are open research artifacts. Every Forge training run produces data points for studying how agents learn, plateau, and regress.

Control

Know how agents fail before they fail in production.

Failure mode taxonomies, safety characterization, multi-agent coordination patterns, drift dynamics. Thousands of runs become a research dataset for predicting how agents behave under real conditions.

Behavioral analytics
  • Failure mode taxonomies across domains
  • Safety characterization: injection resistance, access patterns
  • Multi-agent coordination and delegation patterns
  • Drift dynamics: why agents degrade over time
Evaluate

Co-create the next case with us.

Cases are built with academic and industry partners. Each case, metric, or data stream is a publishable artifact. Propose a collaboration.