Evaluate · Open Research

Every case is a research contribution.

Open research on evaluation methodology, training dynamics, and agent self-improvement. Cases, metrics, and behavioral analytics as research artifacts — built with academic partners.

·areas: Evaluation, training, behavioral analytics (Active research)
·cases: Publishable research environments (Agent 007)
·28+ agents: Behavioral data across runs (Open dataset)
·axes: Composable evaluation dimensions (Research framework)

Evaluate

The science behind reliable evaluation.

Cascading evaluation chains. Scoring under non-determinism. LLM-as-judge calibration. Every methodology question faced in production becomes a published research contribution.

Evaluation methodology

·Composable evaluation chains — cascading architecture
·Scoring under non-determinism — reproducibility studies
·LLM-as-judge calibration — alignment with human raters
·Multi-axis weighting — domain-specific optimization
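
As one illustration of scoring under non-determinism, the sketch below reports repeated runs of the same case as a mean with a standard error rather than a single score. The helper name `summarize_runs` and the sample scores are hypothetical and are not taken from any published framework.

```python
import math
import statistics


def summarize_runs(scores):
    """Summarize repeated scores for one case (hypothetical helper).

    Returns (mean, standard error of the mean). At least two runs are
    required, because one run says nothing about run-to-run variance.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr


# Illustrative scores from five runs of the same agent on the same case.
mean, stderr = summarize_runs([0.80, 0.70, 0.90, 0.80, 0.80])
print(f"score = {mean:.2f} +/- {stderr:.2f}")
```

Reporting the spread alongside the mean makes it possible to tell whether a difference between two agents is larger than the noise between runs.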

Agent training

·Prompt tuning to tool creation — the training spectrum
·IS/OS generalization — when does training transfer?
·Evolutionary strategies for agent-level optimization
·Promotion policies — when is an agent ready?

[Re]train

Understand why agents improve — and when they don't.

Training curves, mutation strategies, and generalization gaps are open research artifacts. Every Forge training run produces data points for studying how agents learn, plateau, and regress.
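
A generalization gap can be made concrete as the difference between scores on training cases and on held-out cases. The sketch below assumes that "IS/OS" stands for in-sample/out-of-sample, and the promotion rule and its thresholds are invented for illustration; this is not a description of how Forge decides anything.

```python
from dataclasses import dataclass


@dataclass
class TrainingResult:
    """Scores for one candidate agent (hypothetical structure)."""
    in_sample: float      # mean score on cases used during training
    out_of_sample: float  # mean score on held-out cases


def generalization_gap(result):
    """A positive gap means the agent does worse on unseen cases."""
    return result.in_sample - result.out_of_sample


def ready_to_promote(result, baseline_oos, max_gap=0.10):
    """Assumed rule: beat the baseline on held-out cases without a large gap."""
    return (result.out_of_sample > baseline_oos
            and generalization_gap(result) <= max_gap)


candidate = TrainingResult(in_sample=0.92, out_of_sample=0.85)
overfit = TrainingResult(in_sample=0.95, out_of_sample=0.70)
print(ready_to_promote(candidate, baseline_oos=0.80))
print(ready_to_promote(overfit, baseline_oos=0.60))
```

Here the second candidate beats its baseline but is rejected because its in-sample score is far above its held-out score, which is the pattern a promotion policy would want to catch.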

Control

Know how agents fail before they fail in production.

Failure mode taxonomies, safety characterization, multi-agent coordination patterns, drift dynamics. Thousands of runs become a research dataset for predicting how agents behave under real conditions.

Behavioral analytics

·Failure mode taxonomies across domains
·Safety characterization — injection resistance, access patterns
·Multi-agent coordination and delegation patterns
·Drift dynamics — why agents degrade over time
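
One simple way to look at drift is to compare a recent window of scores with an earlier reference window. The sketch below uses a hypothetical `detect_drift` helper, made-up scores, and an arbitrary threshold; a real study would likely call for a proper statistical test.

```python
import statistics


def detect_drift(scores, window=5, threshold=0.05):
    """Flag drift when the recent mean falls below the reference mean.

    scores: chronological list of per-run scores (hypothetical data).
    Compares the first `window` runs with the last `window` runs and
    returns (drifted, drop), where drop = reference mean - recent mean.
    """
    if len(scores) < 2 * window:
        raise ValueError("need at least 2 * window scores")
    reference = statistics.mean(scores[:window])
    recent = statistics.mean(scores[-window:])
    drop = reference - recent
    return drop > threshold, drop


history = [0.90, 0.88, 0.91, 0.89, 0.90, 0.86, 0.84, 0.80, 0.79, 0.78]
drifted, drop = detect_drift(history)
print(drifted, round(drop, 3))
```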

Evaluate

Co-create the next case with us.

Cases are built with academic and industry partners. Each case, metric, or data stream is a publishable artifact. Propose a collaboration.

Advance agent research together.

For real tasks and real metrics.
