Xplore
Control

Every dollar tracked. Every score live. You know before your users do.

Real-time scoring in production. Cost by surface — agent, eval, trainer, cert. Drift detection that catches degradation the moment it starts.

Line-item economics · Per-cycle cost split across agent, eval, trainer, and cert · Observe
Always-on scoring · Live eval chains applied to production traffic · Observe
<1 min · Threshold breach and drift alert targets · Alerting
4 surfaces · Cost and usage tracked as separate dimensions · Reporting

You know the moment something breaks.

Every production run is scored against your eval chain in real time. Latency, refusals, cost, and quality are all visible from a single panel. When a threshold is breached, an alert fires, with a delivery target of under one minute.

Production controls · live
Tone & Escalation enabled
agent: support-bot-v3
last 24h: 4,782 runs · avg: 0.88
⚠ drift: tone_score 0.91 → 0.79
Query accuracy enabled
agent: bi-analyst-v2
last 24h: 1,203 runs · avg: 0.92
alerts: 0
Citation quality enabled
agent: research-agent-v1
last 24h: 671 runs · avg: 0.79
⚠ drift: source_quality 0.85 → 0.71
[Figure: Cost Over Time. Stacked area chart of agent, eval, trainer, and cert costs.]

See where your budget goes.

A stacked breakdown shows cost by surface and by source: token events, API calls, evaluator runs, and certification overhead. Set cost guard thresholds to get alerted before a budget overruns.
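
To make the per-surface breakdown and cost guard concrete, here is a minimal sketch. It is not the product's API: the function names `cost_by_surface` and `cost_guard`, the event shape, and the 80% warning ratio are all assumptions for illustration. Only the four surface names come from this page.

```python
from collections import defaultdict

# The four surfaces named on this page.
SURFACES = ("agent", "eval", "trainer", "cert")


def cost_by_surface(events):
    """Sum cost per surface from (surface, usd) pairs (hypothetical event shape)."""
    totals = defaultdict(float)
    for surface, usd in events:
        if surface not in SURFACES:
            raise ValueError(f"unknown surface: {surface}")
        totals[surface] += usd
    # Always report all four surfaces, including those with zero spend.
    return {s: round(totals[s], 4) for s in SURFACES}


def cost_guard(totals, budget_usd, warn_ratio=0.8):
    """Return 'ok', 'warn', or 'breach' for combined spend (warn_ratio is an assumption)."""
    spent = sum(totals.values())
    if spent >= budget_usd:
        return "breach"
    if spent >= warn_ratio * budget_usd:
        return "warn"
    return "ok"
```

Calling `cost_guard(cost_by_surface(events), budget_usd=5.0)` returns a warning state before the budget is actually exceeded, which is the "alert before overrun" behavior described above.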


Degradation caught. Retraining triggered. No manual step.

When context adherence drops from 0.78 to 0.61, you don't find out from a user complaint. The alert fires, the drift is logged, and retraining can trigger automatically — closing the loop from detection to improvement.

  • Context adherence, completeness, PII leak, latency SLA, token budget
  • Drift alerts: score 0.78 → 0.61 flagged instantly
  • Auto-retrain when thresholds breach, no manual intervention
Context adherence, before: 0.78 · −22%
Context adherence, after drift: 0.61 · alert
PII leak rate: 0.03 · ✓ pass
Latency p95: 1.8 s · ✓ within SLA

Performance, quality, and safety — all in one view.

No separate dashboards. No manual data pulls. Every metric scored continuously against your thresholds.

Performance

Latency distributions (p50, p95, p99), token usage per run, API cost per call, throughput per minute. Thresholds configurable per agent. Alert fires when p95 latency exceeds your SLA for 3 consecutive runs.
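
The "3 consecutive runs" rule can be sketched as a simple streak counter. This is an illustration under assumptions, not the product's implementation: the class name `ConsecutiveBreachAlert` is hypothetical, each reading is assumed to be one p95 latency value in seconds, and the sketch chooses to fire once per streak rather than on every later breach.

```python
class ConsecutiveBreachAlert:
    """Fire when a reading exceeds the SLA for `required` readings in a row."""

    def __init__(self, sla_seconds, required=3):
        self.sla_seconds = sla_seconds
        self.required = required
        self.streak = 0  # number of consecutive breaches seen so far

    def observe(self, p95_seconds):
        """Record one reading; return True only when the streak first reaches `required`."""
        if p95_seconds > self.sla_seconds:
            self.streak += 1
        else:
            self.streak = 0  # any in-SLA reading resets the streak
        return self.streak == self.required
```

Requiring a streak rather than a single breach trades a little detection delay for fewer alerts on one-off latency spikes.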

Quality

Evaluation scores from your Forge eval chain run on every production request. Drift detection compares rolling 100-run averages against baseline. When quality drops — hallucination rate spikes, accuracy falls — you see it within minutes, not days.
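
As a rough sketch of rolling-window drift detection, the class below compares the mean of the last 100 scores against a fixed baseline. The name `DriftDetector` and the `max_drop` tolerance are assumptions; the page specifies the 100-run window but not how large a drop counts as drift.

```python
from collections import deque


class DriftDetector:
    """Compare a rolling mean of recent scores against a fixed baseline."""

    def __init__(self, baseline, window=100, max_drop=0.05):
        self.baseline = baseline
        self.max_drop = max_drop  # assumed tolerance; not specified by the source
        self.scores = deque(maxlen=window)  # oldest scores fall off automatically

    def observe(self, score):
        """Add a score; return (rolling_avg, drifted), or (None, False) until the window fills."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return None, False
        avg = sum(self.scores) / len(self.scores)
        return avg, (self.baseline - avg) > self.max_drop
```

With a baseline of 0.78, a window that settles at 0.61 is a drop of 0.17, well past the assumed tolerance, so `observe` reports drift.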

Safety

Three detection layers: (1) Rule-based pattern matching catches known injection patterns and PII formats in real time. (2) Boundary enforcement blocks tool calls the agent isn't authorized to make. (3) Adversarial probing runs on a sample of production traffic to test resistance. On violation: block the response, fire an alert, and log the full trace for review.
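
The first two layers can be sketched in a few lines. This is illustrative only: the regex patterns, the function names, and the tool names `lookup_order` and `issue_refund` are hypothetical, and real injection and PII rule sets are far broader. Adversarial probing (layer 3) is out of scope for this sketch.

```python
import re

# Layer 1: illustrative patterns only; a production rule set would be much larger.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
]
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def rule_violations(text):
    """Return labels for every rule-based pattern found in the text."""
    found = []
    if any(p.search(text) for p in INJECTION_PATTERNS):
        found.append("injection")
    found.extend(f"pii:{name}" for name, p in PII_PATTERNS.items() if p.search(text))
    return found


def tool_call_allowed(agent, tool, allowed_tools):
    """Layer 2: a tool call is allowed only if it is on the agent's allowlist."""
    return tool in allowed_tools.get(agent, set())


def check_response(agent, text, tool_calls, allowed_tools):
    """Combine both layers; any violation blocks the response."""
    violations = rule_violations(text)
    violations += [
        f"unauthorized_tool:{t}"
        for t in tool_calls
        if not tool_call_allowed(agent, t, allowed_tools)
    ]
    return {"blocked": bool(violations), "violations": violations}
```

A caller would act on the result as described above: block the response when `blocked` is true, fire an alert, and log the full trace for review.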