Data integrity for AI agents: an old idea, a new producer
Integrity has always meant: does the data still faithfully represent reality? Agents are a new kind of data producer — here is how that classic guarantee can be measured for what they output, and how it compares to today's tooling.
Integrity is not a new word
In computing, data integrity is one of the oldest guarantees we ask for. In a database it means the data stays accurate, consistent, and complete across its whole lifecycle — enforced by things like type constraints, foreign keys, and transactions (the “C” — consistency — in ACID). In security it is the “I” in the CIA triad (confidentiality, integrity, availability): a guarantee that data has not been tampered with in transit or at rest, checked with hashes, checksums, and digital signatures. When you download a file and compare its SHA-256, you are doing an integrity check.
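The download check described above can be sketched in a few lines of Python using only the standard library. The payload and the "published" digest here are stand-ins invented for illustration; in practice the digest would come from the publisher's release page.

```python
import hashlib
import hmac


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_hex: str) -> bool:
    """True only if `data` hashes to the published digest."""
    # compare_digest does a constant-time comparison of the two hex strings.
    return hmac.compare_digest(sha256_of(data), expected_hex.strip().lower())


payload = b"quarterly-report-v1"   # stand-in for the downloaded bytes
published = sha256_of(payload)      # stand-in for the digest the publisher lists

print(verify(payload, published))                 # True
print(verify(payload + b" (edited)", published))  # False
```

A single changed byte produces a different digest, which is why the check catches tampering but says nothing about whether the original content was true.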
Strip away the implementations and they all answer the same question: does this data still faithfully represent what it is supposed to represent? A checksum asks “are these the same bytes the sender had?” A foreign key asks “does this record still point at something real?” Integrity is the discipline of catching the moment the answer becomes no.
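To make the foreign-key question concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `customers` and `orders` tables and the `try_insert_order` helper are hypothetical names chosen for the example; the point is only that the database itself refuses a record that points at nothing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id))"
)
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
conn.commit()


def try_insert_order(customer_id: int) -> bool:
    """Return True if the order was accepted, False if the database refused it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO orders (customer_id) VALUES (?)", (customer_id,))
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert_order(1))   # True: customer 1 exists
print(try_insert_order(99))  # False: no such customer, so the write is rejected
```

Note that the rejected write surfaces as `sqlite3.IntegrityError`: the failure is named after the guarantee it protects.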
A new producer of data
For decades the things producing data were deterministic: forms, sensors, other programs. They could corrupt data, but they did not invent it. AI agents break that assumption. An agent that writes a report, fills a field, or makes a decision is a data producer whose output is fluent, plausible, and, crucially, not guaranteed to correspond to anything real. A grounded sentence and a hallucinated one look identical on the page.
So the classic question returns, pointed at a new source: does an agent’s output faithfully represent reality and the evidence it claims to rely on? And there are now two artefacts to check, not one — the result (the answer) and the trace (which sources it opened, which tools it called, which claims it made along the way). Integrity for agents is the degree to which those two cohere with each other and with the world.
Measuring it — without pretending it’s one number
It is tempting to collapse this to a single “is it hallucinating?” flag. That oversimplifies. Faithful output is a property with several distinct dimensions, and a serious integrity signal has to measure each before it aggregates anything:
- Factual grounding. Decompose the output into atomic claims. For each claim, is there a source the agent actually consulted that supports it? A claim with no supporting evidence is ungrounded — regardless of whether it happens to be true.
- Consistency. Does the output contradict itself, the sources it cites, or the known state of the system? Two figures that don’t reconcile, or an answer that conflicts with a record it just read, are integrity failures even when each part looks fine alone.
- Attribution honesty. Are cited sources real and actually opened, and is each cited figure reported faithfully: not rounded, inverted, or pulled from the wrong record or time window?
- Traceability. Can every claim be linked back to the specific evidence behind it, so a human can audit the chain rather than trust a verdict?
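As a rough illustration of how claim-level, evidence-linked checks might look, the sketch below walks a list of claims and emits one flag per failure, each pointing at the source it failed against. Everything here is an assumption made for the example, not a description of any particular product: the `Claim` and `Flag` types, the `check_claims` function, and the simplification that each opened source holds a single figure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    text: str
    source_id: Optional[str]  # source the agent cites for this claim, if any
    value: Optional[float]    # figure asserted by the claim, if it has one


@dataclass(frozen=True)
class Flag:
    claim: str
    dimension: str            # which dimension the claim failed
    evidence: Optional[str]   # source the flag points at, for auditing
    detail: str


def check_claims(claims: list[Claim], opened: dict[str, float]) -> list[Flag]:
    """`opened` maps source_id -> the figure found in a source the agent actually opened."""
    flags = []
    for c in claims:
        if c.source_id is None:
            flags.append(Flag(c.text, "grounding", None, "no supporting source declared"))
        elif c.source_id not in opened:
            flags.append(Flag(c.text, "attribution", c.source_id, "cited source was never opened"))
        elif c.value is not None and c.value != opened[c.source_id]:
            detail = f"claim says {c.value}, source says {opened[c.source_id]}"
            flags.append(Flag(c.text, "attribution", c.source_id, detail))
    return flags


# Invented sample data: four claims, two sources actually opened.
claims = [
    Claim("Q3 revenue was 4.2M", "ledger-q3", 4.2),
    Claim("Churn fell to 3%", "crm-export", 3.0),
    Claim("Customers prefer the new plan", None, None),
    Claim("Headcount is 120", "hr-report", 120.0),
]
opened = {"ledger-q3": 4.2, "crm-export": 3.4}

flags = check_claims(claims, opened)
for f in flags:
    print(f.dimension, "|", f.claim, "|", f.detail)
```

Because every flag carries the claim text and the source identifier, a reviewer can audit the chain directly instead of trusting a single verdict.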
The measurement principles matter as much as the dimensions. Checks should be deterministic where possible (so the same output always scores the same), claim-level (granular, not a single document-wide guess), and evidence-linked (every flag points at the source it failed against). Only then is it meaningful to roll the dimensions up into one headline score — much like a credit score summarises many signals but keeps the line items underneath.
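One way to roll dimensions up while keeping the line items is a weighted mean that returns both the headline number and the per-dimension breakdown. This is a sketch under stated assumptions: the `LineItem` type, the `headline` function, the equal weights, and the counts are all invented for illustration, and a real scoring scheme may aggregate differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineItem:
    dimension: str
    passed: int  # claims that passed this dimension's checks
    total: int   # claims the dimension applied to

    @property
    def score(self) -> float:
        # A dimension with nothing to check is treated as passing.
        return self.passed / self.total if self.total else 1.0


def headline(items: list[LineItem], weights: dict[str, float]) -> tuple[float, list[LineItem]]:
    """Weighted mean of per-dimension scores, returned alongside the line items."""
    if not items:
        raise ValueError("no dimensions to aggregate")
    total_weight = sum(weights[i.dimension] for i in items)
    score = sum(weights[i.dimension] * i.score for i in items) / total_weight
    return round(score, 3), items


# Invented counts, purely for illustration.
items = [
    LineItem("grounding", 4, 5),
    LineItem("consistency", 5, 5),
    LineItem("attribution", 9, 10),
    LineItem("traceability", 7, 10),
]
equal = {i.dimension: 1.0 for i in items}

score, breakdown = headline(items, equal)
print(score)  # 0.85
for item in breakdown:
    print(f"  {item.dimension}: {item.passed}/{item.total}")
```

The same inputs always produce the same score, and the breakdown travels with it, so the headline never has to be taken on faith.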
What this buys you is concrete: control (a gate that catches ungrounded output before it ships), safety (the confident-but-unsupported failure mode is exactly what does damage in healthcare, finance, and logistics), and compliance (an auditable trail of what was checked, against what).
How this compares to today’s tooling
The closest existing work lives in the RAG-evaluation and hallucination-detection space. Those tools are good, and they share DNA with integrity scoring — but most measure one dimension (faithfulness of a text answer to retrieved context) rather than the integrity of an agent’s full output and its trace.
| Tool | Metric or model | What it checks | Method | Scope |
|---|---|---|---|---|
| Ragas | Faithfulness | Share of answer claims inferable from retrieved context | LLM-as-judge (claim decomposition) | RAG text answers |
| TruLens | Groundedness (RAG Triad) | Answer supported by retrieved context | LLM feedback functions | RAG text answers |
| Vectara | HHEM / Factual Consistency Score | Response factually consistent with source | Fine-tuned classifier (T5) | RAG text answers |
| Patronus AI | Lynx | Hallucination / faithfulness detection | Fine-tuned judge model | LLM & RAG outputs |
| DeepEval | Faithfulness, Hallucination | Whether output contradicts the provided context | LLM-as-judge | RAG & agent outputs |
| Xplore | Data integrity | Every claim vs the agent’s declared evidence and system state; consistency, attribution, and traceability — across actions, not just text | Hybrid: typed deterministic checks where ground truth exists, LLM-as-judge for open claims — claim-level, evidence-linked | Full agent output and trace |
Two differences stand out. First, scope: faithfulness metrics judge a generated paragraph against a retrieved passage; integrity judges what an agent did — the tools it called, the records it changed, the citations it declared — against reality.
Second, method. Most tools rely on an LLM judging another LLM — one probabilistic system grading another. Integrity scoring is layered instead. Where a claim has a checkable ground truth — a number, an ID, a field that must match a record — it is verified with typed, deterministic checks: the value is parsed to its expected type, compared exactly (with tolerances where they make sense), and the result is reproducible with no opinion involved. Only for genuinely open-ended claims, where no hard reference exists, does it fall back to an LLM judge — and even then anchored to the specific evidence the agent declared. You get the rigour of deterministic verification wherever it is possible, and the reach of an LLM only where it is unavoidable.
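The layering can be sketched as a typed check that runs first and a judge that is consulted only when no typed comparison applies. The names `Verdict`, `check_number`, `verify_claim`, and `stub_judge` are hypothetical, and the judge here is a trivial placeholder standing in for an LLM call; this is an illustration of the idea, not Xplore's implementation.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from typing import Callable, Optional


@dataclass(frozen=True)
class Verdict:
    passed: bool
    method: str  # "deterministic" or "judge"
    detail: str


def check_number(claimed: str, reference: str, tolerance: Decimal) -> Optional[Verdict]:
    """Parse both sides as decimals and compare; None means no typed check applies."""
    try:
        c = Decimal(claimed.replace(",", ""))
        r = Decimal(reference.replace(",", ""))
    except InvalidOperation:
        return None
    if not (c.is_finite() and r.is_finite()):
        return None  # "NaN" and "Infinity" parse as Decimal but are not checkable figures
    ok = abs(c - r) <= tolerance
    return Verdict(ok, "deterministic", f"claimed {c}, reference {r}, tolerance {tolerance}")


def verify_claim(
    claimed: str,
    reference: str,
    judge: Callable[[str, str], bool],
    tolerance: Decimal = Decimal("0"),
) -> Verdict:
    verdict = check_number(claimed, reference, tolerance)
    if verdict is not None:
        return verdict  # ground truth existed, so no judge was involved
    return Verdict(judge(claimed, reference), "judge", "no typed reference; deferred to judge")


def stub_judge(claim: str, evidence: str) -> bool:
    """Placeholder for an LLM judge: passes if the claim appears verbatim in the evidence."""
    return claim.lower() in evidence.lower()


print(verify_claim("1,204.50", "1204.5", stub_judge))
print(verify_claim("1204", "1204.5", stub_judge))
print(verify_claim("1204", "1204.5", stub_judge, tolerance=Decimal("0.5")))
print(verify_claim("the shipment was delayed",
                   "Carrier note: the shipment was delayed by weather", stub_judge))
```

Using `Decimal` rather than `float` keeps the exact comparison exact: "1,204.50" and "1204.5" match as numbers, while any tolerance is an explicit parameter and not an accident of floating-point rounding.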
Where we are
Xplore runs integrity scoring in production, and we apply the same principle in our Agent 007 benchmark: an ungrounded answer is treated as a hallucination and penalised, even when it happens to be correct.
If your agents feed real decisions, this is the guarantee your stack is missing. Get in touch — we’ll show you the line items, not just the score.