Evaluate

You choose what matters. Forge shows where agents fail.

Safety, accuracy, cost, tone — every dimension gets its own score and its own weight. Combine them into a single verdict that reflects your priorities, not ours.

Evaluators score individual dimensions. For full workflow simulations, see benchmark environments.

8+ evaluator types per chain (eval framework)
0–1 normalised score range (all evaluators)
Custom weight combinations (composable chains)
<2s average evaluator latency (statistical evaluators)

Know exactly which dimensions pass and which don't.

Chain evaluators in any order. Weight safety at 0.30 for clinical workflows, or cost efficiency at 0.25 for operations. Aggregate with weighted mean, min, or Pareto — then set floor thresholds so nothing critical slips through.

Forge eval chain · clinical-trial-v4
Safety gate: 0.94
Dose accuracy: 0.82
Biomarker: 0.75
Process: 0.88
Efficiency: 0.80
Tool usage: 0.90
Weighted: 0.85
Weights: safety_gate w = 0.30, dose_rec w = 0.25, pd_biomarker w = 0.25

Eight ways to measure what matters.

Statistical checks run first because they are fast and cheap. LLM judges run only when semantic understanding is required. Every score is normalised to [0, 1].

Checkpoint

Confirms critical steps happened — tools called, states reached, conditions met. Example: in a supply chain case, did the agent query the sanctions database before approving the shipment? Binary pass/fail per checkpoint.
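
A checkpoint like the sanctions example can be sketched as an ordering check over a trace. This is a minimal illustration, not Forge's API: the trace format (an ordered list of tool names), the tool names `query_sanctions_db` and `approve_shipment`, and both helper functions are assumptions made for the example.

```python
from typing import Callable, Dict, List

Trace = List[str]  # ordered tool-call names; a simplified, assumed trace format


def called_before(trace: Trace, required: str, gated: str) -> bool:
    """True if `required` was called before the first `gated` call.

    Assumption: if `gated` never happens there is nothing to violate, so it passes.
    """
    if gated not in trace:
        return True
    return required in trace[: trace.index(gated)]


def run_checkpoints(
    trace: Trace, checkpoints: Dict[str, Callable[[Trace], bool]]
) -> Dict[str, float]:
    """Map each named checkpoint to 1.0 (pass) or 0.0 (fail)."""
    return {name: 1.0 if check(trace) else 0.0 for name, check in checkpoints.items()}


# Hypothetical checkpoint and traces for the supply chain example.
checkpoints = {
    "sanctions_checked_before_approval": lambda t: called_before(
        t, "query_sanctions_db", "approve_shipment"
    ),
}
good_trace = ["lookup_shipment", "query_sanctions_db", "approve_shipment"]
bad_trace = ["lookup_shipment", "approve_shipment", "query_sanctions_db"]

good_result = run_checkpoints(good_trace, checkpoints)
bad_result = run_checkpoints(bad_trace, checkpoints)
```

Here `good_result` passes and `bad_result` fails, because in the second trace the approval happens before the sanctions query.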

Metric

Scores structured outputs against a known correct answer using F1, exact match, or linear decay. Example: the agent returns 14 disrupted shipments when the ground truth is 12; F1 gives partial credit for the overlap instead of a flat fail.
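
To make the partial credit concrete, here is a small sketch of set-based F1. The shipment IDs are made up, and the example assumes all 12 true shipments are among the 14 returned (the text does not say how many overlap); `set_f1` is an illustrative helper, not a Forge function.

```python
from typing import Hashable, Iterable


def set_f1(predicted: Iterable[Hashable], truth: Iterable[Hashable]) -> float:
    """F1 between two sets of items, in [0, 1]."""
    pred, gold = set(predicted), set(truth)
    if not pred and not gold:
        return 1.0  # both empty: treat as perfect agreement (a convention, assumed here)
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical IDs: 12 true disruptions, plus 2 extra ones flagged by the agent.
truth = {f"SHP-{i}" for i in range(12)}
predicted = truth | {"SHP-98", "SHP-99"}

score = set_f1(predicted, truth)  # precision 12/14, recall 12/12, F1 = 24/26
```

Under that overlap assumption the score works out to 24/26, roughly 0.92. With fewer correct items among the 14, both precision and recall drop and the score falls accordingly.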

LLM Judge

Assesses reasoning quality and report depth against your rubrics. Calibrated against human expert ratings and re-validated quarterly to prevent judge drift. Example: rates whether the agent's risk summary covers all required factors.

Safety

Detects injection attempts, PII leaks, access boundary violations, and hallucinated tool calls. Runs both rule-based pattern matching and adversarial probing. Example: the agent receives a prompt injection in a user message. Does it refuse or comply?

Reasoning Audit

Verifies the chain of reasoning: goal decomposition, evidence grounding, retry discipline. Checks whether conclusions follow from cited evidence. Example: agent claims "shipment delayed due to port congestion" — does the trace show it actually checked port status?

Orchestration

Scores sub-agent delegation: did the orchestrator pick the right specialist? Did it manage scope correctly? Did it recover when a sub-agent failed? Applies to multi-agent systems where coordination quality matters.

Custom

Your logic, your thresholds. Write a Python function that returns a 0–1 score. Forge normalises it into the eval chain. Example: a compliance team adds a check that verifies every cited regulation is from the current year's revision.
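
A custom evaluator along the lines of the compliance example might look like the sketch below. The function signature, the citation format `(YYYY rev.)`, and the regulation labels are all hypothetical; how Forge actually registers or calls such a function is not described here.

```python
import re
from typing import List

# Assumed citation format, e.g. "REG-001 (2025 rev.)"; adjust to your own data.
REVISION_YEAR = re.compile(r"\((\d{4}) rev\.\)")


def current_revision_score(citations: List[str], current_year: int) -> float:
    """Fraction of citations whose revision year equals `current_year`.

    Citations without a recognisable year count as failures.
    Assumption: an output with no citations scores 1.0 (nothing is out of date).
    """
    if not citations:
        return 1.0
    up_to_date = 0
    for citation in citations:
        match = REVISION_YEAR.search(citation)
        if match and int(match.group(1)) == current_year:
            up_to_date += 1
    return up_to_date / len(citations)


# Hypothetical citations: one current, one stale.
citations = ["REG-001 (2025 rev.)", "REG-014 (2024 rev.)"]
score = current_revision_score(citations, current_year=2025)
```

This version gives partial credit (0.5 for the example). A stricter team could return 1.0 only when every citation is current and 0.0 otherwise.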

Efficiency

Tracks token usage, tool calls, API cost, and latency against configurable budgets. Example: agent solves the task in 1,800 tokens and 4 tool calls vs. budget of 2,500 tokens and 8 calls — efficiency score 0.88.


One verdict that reflects your priorities.

Weighted mean for balanced trade-offs. Min-of-weighted when no single dimension can fail. Floor thresholds for safety-critical metrics. Per-task overrides merge cleanly with your default chain.

Forge eval chain · trade-screener-v5
Entity resolution: 0.91
Sanctions match: 0.95
HS classification: 0.87
Compliance: 0.93
Weighted: 0.91
Aggregation: weighted_mean
Floor: compliance ≥ 0.90

Per-task override · pharma-compliance
Injection resistance: 0.98
Data exfiltration: 1.00
Access boundary: 0.97
Hallucinated tools: 0.92
Weighted: 0.97
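
The aggregation step can be sketched in a few lines. This is an illustration under stated assumptions, not Forge's implementation: the `aggregate` function and its return shape are invented for the example, the weights are assumed equal because the trade-screener weights are not shown, and `min` is implemented as a plain minimum because the exact definition of min-of-weighted is not given.

```python
from typing import Dict, List, Optional, Union

Verdict = Dict[str, Union[float, bool, List[str]]]


def aggregate(
    scores: Dict[str, float],
    weights: Dict[str, float],
    method: str = "weighted_mean",
    floors: Optional[Dict[str, float]] = None,
) -> Verdict:
    """Combine per-dimension scores into one verdict and enforce floor thresholds."""
    floors = floors or {}
    # A floor fails when that dimension's own score is below its threshold,
    # regardless of how good the combined score looks.
    failed = sorted(name for name, floor in floors.items() if scores[name] < floor)
    if method == "weighted_mean":
        total_weight = sum(weights[name] for name in scores)
        value = sum(scores[name] * weights[name] for name in scores) / total_weight
    elif method == "min":
        value = min(scores.values())
    else:
        raise ValueError(f"unknown aggregation method: {method}")
    return {"score": value, "passed": not failed, "failed_floors": failed}


# Scores from the trade-screener-v5 example; equal weights are an assumption.
scores = {
    "entity_resolution": 0.91,
    "sanctions_match": 0.95,
    "hs_classification": 0.87,
    "compliance": 0.93,
}
weights = {name: 0.25 for name in scores}
default_floors = {"compliance": 0.90}

verdict = aggregate(scores, weights, floors=default_floors)

# A per-task override modelled as a dict merge: task settings win over defaults.
task_floors = {**default_floors, "sanctions_match": 0.97}  # hypothetical override
strict_verdict = aggregate(scores, weights, floors=task_floors)
```

With the default floor the verdict passes; with the hypothetical stricter override, `sanctions_match` at 0.95 falls below its 0.97 floor and the verdict fails even though the weighted score is unchanged. That is the point of floors: a strong average cannot mask a weak critical dimension.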

Every score is traceable. Every threshold is yours.

No black-box quality labels. Each evaluator produces a score you can inspect, override, and trace back to the exact tool call or reasoning step that earned it. Audit-ready by design.