Xplore
Evaluate

Know exactly where agents fail — and what to fix.

Not a pass/fail checkbox. A multi-dimensional profile of agent behavior against your operational requirements.

Benchmarks
Simulate your agent's real work.

A benchmark is a full simulation: tasks, tools, data sources, and constraints that mirror your production environment. Your agent executes the workflow end to end — the same way it would in the real world.

Evaluators
Score every dimension that matters.

An evaluator measures one dimension of agent output — grounding, policy compliance, tool usage, tone. Chain multiple evaluators together, weight them for your domain, and get a composite score that reflects your priorities.
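As a rough illustration of the idea, the sketch below treats each evaluator as a plain function that returns a score between 0 and 1, then combines the scores into a weighted composite. The names `run_chain`, `has_citation`, and `within_length`, the toy checks themselves, and the weights are all assumptions made for this example; they are not the product's actual API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical evaluator type: takes the agent's output, returns a score in [0, 1].
Evaluator = Callable[[str], float]


def has_citation(output: str) -> float:
    """Toy grounding check: 1.0 if the output carries a source marker."""
    return 1.0 if "[source:" in output else 0.0


def within_length(output: str, max_words: int = 50) -> float:
    """Toy brevity check: 1.0 if the output stays within the word limit."""
    return 1.0 if len(output.split()) <= max_words else 0.0


def run_chain(
    output: str, chain: List[Tuple[str, Evaluator, float]]
) -> Tuple[Dict[str, float], float]:
    """Score one output on every dimension and return (scores, composite).

    The composite is a weighted mean, normalized by the total weight so the
    weights do not have to sum to 1.
    """
    scores = {name: evaluator(output) for name, evaluator, _ in chain}
    total_weight = sum(weight for _, _, weight in chain)
    composite = sum(scores[name] * weight for name, _, weight in chain) / total_weight
    return scores, composite


# Illustrative chain: (dimension name, evaluator, weight). Weights are made up.
chain = [("grounding", has_citation, 0.7), ("brevity", within_length, 0.3)]

scores, composite = run_chain("Refunds take 5 days [source: policy.md]", chain)
print(scores, composite)
```

Keeping the per-dimension scores alongside the composite is what lets a low dimension stay visible even when the overall number looks healthy.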

Benchmarks simulate. Evaluators score. Together they tell you exactly what your agent can do.

For development teams.

  • Pinpoint failures by dimension

    Safety at 0.94 but process compliance at 0.55? You know exactly what to fix without guessing.

  • Regression detection before deployment

    Every change is scored. If a prompt edit hurts safety, you see it immediately — not from user complaints.

  • Composable chain — add domain-specific checks

    Chain statistical checks, rule validators, and LLM judges in any order. Weight each for your domain.
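Regression detection by dimension can be sketched as a comparison of two score profiles. The function name `find_regressions`, the tolerance, and the candidate scores below are hypothetical and chosen only to show the mechanics.

```python
from typing import Dict


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,
) -> Dict[str, float]:
    """Return the dimensions whose score dropped by more than `tolerance`.

    Dimensions missing from the candidate run are skipped here; a stricter
    version might treat a missing dimension as a failure.
    """
    regressions = {}
    for dimension, old_score in baseline.items():
        new_score = candidate.get(dimension)
        if new_score is None:
            continue
        drop = old_score - new_score
        if drop > tolerance:
            regressions[dimension] = round(drop, 4)
    return regressions


# Illustrative profiles before and after a prompt edit (values are made up).
baseline = {"safety": 0.94, "process_compliance": 0.55, "tone": 0.91}
candidate = {"safety": 0.81, "process_compliance": 0.70, "tone": 0.90}

print(find_regressions(baseline, candidate))
```

In this made-up run, process compliance improves while safety drops past the tolerance, so only safety is flagged.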

Forge eval chain · support-v3

  Tone: 0.91
  Accuracy: 0.87
  Escalation: 0.95
  Policy: 0.72
  Efficiency: 0.88
  Safety: 0.99

  Weighted: 0.88

Four evaluator families. Each covers a blind spot the others miss.

Different domains demand different quality definitions. A clinical agent needs safety at 0.30 weight. A logistics agent needs route accuracy. One fixed scoring system can't serve both.

Statistical (12)

Latency distributions, token usage, cost per run, completion rates

Rule-based (11)

Format compliance, required fields, escalation triggers, safety guardrails

LLM judges (10)

Tone, factual accuracy, hallucination detection, reasoning quality

Domain-specific (8+)

Entity resolution, citation verification, financial impact, signal detection


Three aggregation methods — because risk models differ.

Weighted means hide safety failures behind high averages. Minimum aggregation catches them but penalizes breadth. The right method depends on your risk model.

Weighted mean

Each evaluator contributes proportionally to what matters most for your use case.

Minimum

One failed dimension fails the whole chain. No averaging away safety violations.

Min-of-weighted

Category minimums with weighted rollup. The strictest option for regulated domains.
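The three methods can be compared side by side on the support-v3 scores shown earlier. This is a minimal sketch under stated assumptions: the equal weights and the grouping of dimensions into categories are invented for the example, and "min-of-weighted" is read here as "weighted mean within each category, then the minimum across categories", which is only one possible interpretation of the description above.

```python
from typing import Dict, List


def weighted_mean(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the scores, normalized by the weights actually used."""
    total = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total


def minimum(scores: Dict[str, float]) -> float:
    """The chain is only as good as its weakest dimension."""
    return min(scores.values())


def min_of_weighted(
    scores: Dict[str, float],
    weights: Dict[str, float],
    categories: Dict[str, List[str]],
) -> float:
    """Assumed reading: weighted mean inside each category, minimum across them."""
    return min(
        weighted_mean({d: scores[d] for d in dims}, weights)
        for dims in categories.values()
    )


# Per-dimension scores from the support-v3 panel.
scores = {
    "tone": 0.91, "accuracy": 0.87, "escalation": 0.95,
    "policy": 0.72, "efficiency": 0.88, "safety": 0.99,
}
# Hypothetical: equal weights and an invented category grouping.
weights = {dimension: 1.0 for dimension in scores}
categories = {
    "quality": ["tone", "accuracy"],
    "compliance": ["escalation", "policy", "safety"],
    "cost": ["efficiency"],
}

print("weighted mean  :", round(weighted_mean(scores, weights), 3))
print("minimum        :", minimum(scores))
print("min-of-weighted:", round(min_of_weighted(scores, weights, categories), 3))
```

With these inputs the weighted mean sits well above the 0.72 policy score, while the plain minimum reports it directly. How much a category-based method exposes depends on how the categories are drawn.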


Per-task overrides — because agents serve multiple stakeholders.

A single agent may handle compliance reviews and customer queries. These aren't the same job. Override weights merge with the base chain so each task carries its own definition of 'good enough.'

Base chain
  Accuracy: w 0.35
  Safety: w 0.30
  Efficiency: w 0.20

Task override · compliance-review
  Accuracy: w 0.25
  Safety: w 0.50 ↑
  Efficiency: w 0.10
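One way to picture the merge is a dictionary update: the override replaces any weight it names, and every other weight falls through from the base chain. The function names `merge_weights` and `weights_for` and the `customer-query` task are hypothetical; the weights are the ones shown above, and only the three listed dimensions are modeled.

```python
from typing import Dict


def merge_weights(base: Dict[str, float], override: Dict[str, float]) -> Dict[str, float]:
    """Return a new weight map: override values win, base values fill the rest."""
    merged = dict(base)  # copy so the base chain is never mutated
    merged.update(override)
    return merged


# Weights from the base chain and the compliance-review override above.
base_chain = {"accuracy": 0.35, "safety": 0.30, "efficiency": 0.20}
task_overrides = {
    "compliance-review": {"accuracy": 0.25, "safety": 0.50, "efficiency": 0.10},
}


def weights_for(task: str) -> Dict[str, float]:
    """Look up the effective weights for a task; unknown tasks use the base chain."""
    return merge_weights(base_chain, task_overrides.get(task, {}))


print(weights_for("compliance-review"))
print(weights_for("customer-query"))  # hypothetical task with no override
```

A partial override works the same way: `merge_weights(base_chain, {"safety": 0.50})` raises the safety weight and leaves accuracy and efficiency at their base values.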

For the business.

Reduce support cost

Score accuracy and tone before deployment. Catch agents that will generate escalations — before they do.

Compliance confidence

Every agent version scored against your policy requirements. Audit trail for regulators. Certifiable.

Vendor selection

Compare three LLM providers on the same benchmark and decide on data instead of vibes.