Evaluate

Know exactly where agents fail — and what to fix.

Not a pass/fail checkbox. A multi-dimensional profile of agent behavior against your operational requirements.

Benchmarks

Simulate your agent's real work.

A benchmark is a full simulation: tasks, tools, data sources, and constraints that mirror your production environment. Your agent executes the workflow end to end — the same way it would in the real world.

See benchmark environments →

Evaluators

Score every dimension that matters.

An evaluator measures one dimension of agent output — grounding, policy compliance, tool usage, tone. Chain multiple evaluators together, weight them for your domain, and get a composite score that reflects your priorities.

See eval chains →
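
For a concrete picture of chaining, here is a minimal Python sketch. It assumes an evaluator is just a function that returns a score between 0 and 1. The names `required_fields`, `latency_budget`, and `run_chain`, the field names, and the weights are all hypothetical and are not the product's API.

```python
from typing import Callable

# Hypothetical evaluator shape: takes one agent output, returns a score in [0, 1].
Evaluator = Callable[[dict], float]


def required_fields(output: dict) -> float:
    """Rule-based check: fraction of required fields that are present and non-empty."""
    required = ("answer", "sources")
    return sum(1 for field in required if output.get(field)) / len(required)


def latency_budget(output: dict) -> float:
    """Statistical check: 1.0 within 2 s, falling linearly to 0.0 at 10 s."""
    latency = output.get("latency_s", 0.0)
    if latency <= 2.0:
        return 1.0
    return max(0.0, 1.0 - (latency - 2.0) / 8.0)


def run_chain(output: dict, chain: list[tuple[str, Evaluator, float]]):
    """Run every evaluator, then combine the scores as a weighted mean."""
    scores = {name: evaluate(output) for name, evaluate, _ in chain}
    total_weight = sum(weight for _, _, weight in chain)
    composite = sum(scores[name] * weight for name, _, weight in chain) / total_weight
    return scores, composite


# Illustrative weights and output, not taken from any real run.
chain = [("fields", required_fields, 0.6), ("latency", latency_budget, 0.4)]
output = {"answer": "Refund issued.", "sources": [], "latency_s": 4.0}
scores, composite = run_chain(output, chain)
print(scores, round(composite, 2))
```

The per-evaluator scores stay visible next to the composite, so the empty `sources` list shows up as a low `fields` score rather than disappearing into the average.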

Benchmarks simulate. Evaluators score. Together they tell you exactly what your agent can do.

For development teams.

Pinpoint failures by dimension

Safety at 0.94 but process compliance at 0.55? You know exactly what to fix without guessing.

Regression detection before deployment

Every change is scored. If a prompt edit hurts safety, you see it immediately — not from user complaints.

Composable chain — add domain-specific checks

Chain statistical checks, rule validators, and LLM judges in any order. Weight each for your domain.

Forge eval chain · support-v3

Tone: 0.91
Accuracy: 0.87
Escalation: 0.95
Policy: 0.72
Efficiency: 0.88
Safety: 0.99

Weighted: 0.88
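
To show how per-dimension scores support failure pinpointing and regression detection, here is a minimal Python sketch that uses the support-v3 scores above as a baseline. The candidate scores, the `tolerance` and `floor` thresholds, and the function names are hypothetical and chosen only for illustration.

```python
# Per-dimension scores from the support-v3 chain above, used as the baseline.
baseline = {
    "Tone": 0.91, "Accuracy": 0.87, "Escalation": 0.95,
    "Policy": 0.72, "Efficiency": 0.88, "Safety": 0.99,
}

# Hypothetical scores after a prompt edit (illustrative values, not real results).
candidate = dict(baseline, Safety=0.93, Tone=0.94)


def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.02) -> dict:
    """Return {dimension: drop} for each dimension that fell by more than `tolerance`."""
    drops = {}
    for dimension, old_score in baseline.items():
        drop = old_score - candidate.get(dimension, 0.0)
        if drop > tolerance:
            drops[dimension] = round(drop, 4)
    return drops


def below_floor(scores: dict, floor: float = 0.80) -> list:
    """Return the dimensions scoring under `floor`, weakest first."""
    return sorted((d for d, s in scores.items() if s < floor), key=scores.get)


regressions = find_regressions(baseline, candidate)
weak_spots = below_floor(candidate)
print("regressions:", regressions)
print("below floor:", weak_spots)
```

In this made-up run, the Tone improvement does not mask the Safety drop, and Policy is flagged as the weakest dimension regardless of the edit.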

Four evaluator families. Each covers a blind spot the others miss.

Different domains demand different quality definitions. A clinical agent needs safety weighted at 0.30. A logistics agent needs route accuracy. One fixed scoring system can't serve both.

Statistical

Latency distributions, token usage, cost per run, completion rates

Rule-based

Format compliance, required fields, escalation triggers, safety guardrails

LLM judges

Tone, factual accuracy, hallucination detection, reasoning quality

Domain-specific

Entity resolution, citation verification, financial impact, signal detection

Three aggregation methods — because risk models differ.

A weighted mean can hide a safety failure behind a high average. Minimum aggregation catches it but penalizes breadth: the more dimensions you score, the more likely a single low one sets the result. The right method depends on your risk model.

Weighted mean

Each evaluator contributes proportionally to what matters most for your use case.

Minimum

The chain scores as its weakest dimension, so one failed dimension fails the whole chain. No averaging away safety violations.

Min-of-weighted

Category minimums with weighted rollup. The strictest option for regulated domains.
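
The three methods can be compared on one set of scores. The Python sketch below uses made-up scores, weights, and category groupings. The page does not spell out the min-of-weighted formula, so this sketch assumes it means "weighted mean inside each category, then the minimum across categories". The reverse reading (minimum inside each category, then a weighted rollup) is equally plausible.

```python
# Illustrative scores, weights, and groupings (assumptions, not product defaults).
scores = {"accuracy": 0.90, "tone": 0.95, "safety": 0.40, "policy": 0.80}
weights = {"accuracy": 0.4, "tone": 0.1, "safety": 0.3, "policy": 0.2}
categories = {"quality": ["accuracy", "tone"], "risk": ["safety", "policy"]}


def weighted_mean(scores: dict, weights: dict, names=None) -> float:
    """Weighted average over `names` (all dimensions when omitted)."""
    names = list(names) if names is not None else list(scores)
    total_weight = sum(weights[n] for n in names)
    return sum(scores[n] * weights[n] for n in names) / total_weight


def minimum(scores: dict) -> float:
    """The chain scores as its weakest dimension."""
    return min(scores.values())


def min_of_weighted(scores: dict, weights: dict, categories: dict) -> float:
    """Assumed reading: weighted mean per category, then the lowest category wins."""
    return min(weighted_mean(scores, weights, members) for members in categories.values())


print("weighted mean:  ", round(weighted_mean(scores, weights), 3))
print("minimum:        ", round(minimum(scores), 3))
print("min-of-weighted:", round(min_of_weighted(scores, weights, categories), 3))
```

With these numbers the weighted mean lands well above the failing safety score, the minimum reports the safety score itself, and min-of-weighted reports the weaker of the two category averages.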

Per-task overrides — because agents serve multiple stakeholders.

A single agent may handle compliance reviews and customer queries. These aren't the same job. Override weights merge with the base chain so each task carries its own definition of 'good enough.'

Base chain

Accuracy: w 0.35
Safety: w 0.30
Efficiency: w 0.20

Task override · compliance-review

Accuracy: w 0.25
Safety: w 0.50 ↑
Efficiency: w 0.10
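
One way to picture the merge is a dictionary overlay: start from the base weights and let the task's overrides replace matching keys. The Python sketch below uses the weights shown above. The `weights_for` helper and the `customer-query` task name are hypothetical, and the real merge rules may differ.

```python
# Weights as listed above; only these three dimensions appear on the page.
base_chain = {"accuracy": 0.35, "safety": 0.30, "efficiency": 0.20}

overrides = {
    "compliance-review": {"accuracy": 0.25, "safety": 0.50, "efficiency": 0.10},
}


def weights_for(task: str, base: dict, overrides: dict) -> dict:
    """Overlay the task's overrides on the base chain; unlisted tasks keep the base."""
    return {**base, **overrides.get(task, {})}


compliance = weights_for("compliance-review", base_chain, overrides)
customer = weights_for("customer-query", base_chain, overrides)  # hypothetical task with no override
print(compliance)
print(customer)
```

Because the overlay copies the base before applying overrides, a task that overrides only one weight would still inherit the rest, and the base chain itself is never mutated.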