Data integrity for AI agents: an old idea, a new producer

Integrity has always meant: does the data still faithfully represent reality? Agents are a new kind of data producer. Here is how that classic guarantee can be measured for their output, and how the approach compares to today's tooling.

Integrity is not a new word

In computing, data integrity is one of the oldest guarantees we ask for. In a database it means the data stays accurate, consistent, and complete across its whole lifecycle — enforced by things like type constraints, foreign keys, and transactions (the “C” — consistency — in ACID). In security it is the “I” in the CIA triad (confidentiality, integrity, availability): a guarantee that data has not been tampered with in transit or at rest, checked with hashes, checksums, and digital signatures. When you download a file and compare its SHA-256, you are doing an integrity check.
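
The download check mentioned above fits in a few lines of Python. This is a minimal sketch using only the standard library; the helper names (sha256_hex, verify_download) and the sample payload are made up for illustration, and in practice the expected digest would come from the publisher rather than be computed locally.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_download(data: bytes, expected_hex: str) -> bool:
    """True if data hashes to the digest the publisher listed."""
    # compare_digest compares in constant time; plain == would also work here.
    return hmac.compare_digest(sha256_hex(data), expected_hex.strip().lower())


# Illustrative payload: we compute the "published" digest ourselves as a stand-in.
payload = b"quarterly-report-v1"
published = sha256_hex(payload)

print(verify_download(payload, published))         # True
print(verify_download(payload + b"!", published))  # False: one extra byte changes the hash
```

The check says nothing about whether the report is correct, only that the bytes are the ones the sender had. That limit is what the rest of this piece is about.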

Strip away the implementations and they all answer the same question: does this data still faithfully represent what it is supposed to represent? A checksum asks “are these the same bytes the sender had?” A foreign key asks “does this record still point at something real?” Integrity is the discipline of catching the moment the answer becomes no.

A new producer of data

For decades the things producing data were deterministic: forms, sensors, other programs. They could corrupt data, but they did not invent it. AI agents break that assumption. An agent that writes a report, fills a field, or makes a decision is a data producer whose output is fluent, plausible, and — crucially — not guaranteed to correspond to anything real. A confident sentence and a hallucinated one look identical on the page.

So the classic question returns, pointed at a new source: does an agent’s output faithfully represent reality and the evidence it claims to rely on? And there are now two artefacts to check, not one — the result (the answer) and the trace (which sources it opened, which tools it called, which claims it made along the way). Integrity for agents is the degree to which those two cohere with each other and with the world.

Measuring it — without pretending it’s one number

It is tempting to collapse this to a single “is it hallucinating?” flag. That oversimplifies. Faithful output is a property with several distinct dimensions, and a serious integrity signal has to measure each before it aggregates anything:

Factual grounding. Decompose the output into atomic claims. For each claim, is there a source the agent actually consulted that supports it? A claim with no supporting evidence is ungrounded — regardless of whether it happens to be true.
Consistency. Does the output contradict itself, the sources it cites, or the known state of the system? Two figures that don’t reconcile, or an answer that conflicts with a record it just read, are integrity failures even when each part looks fine alone.
Attribution honesty. Are cited sources real and actually opened, and is each cited figure represented faithfully (not rounded, inverted, or pulled from the wrong record or time window)?
Traceability. Can every claim be linked back to the specific evidence behind it, so a human can audit the chain rather than trust a verdict?
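
To make the dimensions above less abstract, here is a minimal sketch of a claim-level check against a trace. Everything in it is an assumption for illustration: the Claim and Trace shapes, the source identifiers, and the rule that a claim counts as grounded when at least one of its cited sources was actually opened. It does not test whether the source supports the claim, which needs a comparison against the source's content.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """One atomic claim from the agent's output (hypothetical shape)."""
    text: str
    cited_sources: list[str] = field(default_factory=list)


@dataclass
class Trace:
    """What the agent actually did; here, only which sources it opened."""
    opened_sources: set[str]


def check_claim(claim: Claim, trace: Trace) -> dict:
    """Link a claim to the evidence behind it, or flag that there is none."""
    opened = [s for s in claim.cited_sources if s in trace.opened_sources]
    unopened = [s for s in claim.cited_sources if s not in trace.opened_sources]
    return {
        "claim": claim.text,
        "grounded": bool(opened),          # at least one cited source was really consulted
        "evidence": opened,                # the audit trail a human can follow
        "cited_but_not_opened": unopened,  # attribution failures
    }


# Made-up example data.
trace = Trace(opened_sources={"crm/accounts/123", "billing/invoices/2024-03"})
claims = [
    Claim("March revenue was 1204.50", ["billing/invoices/2024-03"]),
    Claim("Churn fell last quarter", ["analytics/churn-dashboard"]),
    Claim("The customer is satisfied"),
]
for claim in claims:
    print(check_claim(claim, trace))
```

The second claim cites a source the agent never opened, and the third cites nothing at all. Both come back ungrounded, whether or not they happen to be true.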

The measurement principles matter as much as the dimensions. Checks should be deterministic where possible (so the same output always scores the same), claim-level (granular, not a single document-wide guess), and evidence-linked (every flag points at the source it failed against). Only then is it meaningful to roll the dimensions up into one headline score — much like a credit score summarises many signals but keeps the line items underneath.
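
As a rough sketch of that roll-up, assume each dimension has already produced a list of pass/fail results, one per claim. The function name roll_up, the unweighted mean, and the sample results are illustrative assumptions; a real scorer would likely weight dimensions and might cap the headline at the weakest one.

```python
def roll_up(line_items: dict[str, list[bool]]) -> dict:
    """Summarise per-dimension pass rates into one headline score, keeping the detail."""
    per_dimension = {
        name: sum(results) / len(results)
        for name, results in line_items.items()
        if results  # a dimension with no checks is left out rather than scored
    }
    # Assumption: the headline is the unweighted mean of the dimension pass rates.
    headline = sum(per_dimension.values()) / len(per_dimension) if per_dimension else None
    return {"headline": headline, "dimensions": per_dimension, "line_items": line_items}


# Made-up results: one boolean per checked claim in each dimension.
report = roll_up({
    "grounding":    [True, True, False, True],
    "consistency":  [True, True],
    "attribution":  [True, False],
    "traceability": [True, True, True, True],
})
print(report["headline"])                   # 0.8125
print(report["dimensions"]["attribution"])  # 0.5
```

The headline alone hides that attribution is failing half the time, which is why the line items are returned alongside it.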

What this buys you is concrete: control (a gate that catches ungrounded output before it ships), safety (the confident-but-unsupported failure mode is exactly what does damage in healthcare, finance, and logistics), and compliance (an auditable trail of what was checked, against what).

How this compares to today’s tooling

The closest existing work lives in the RAG-evaluation and hallucination-detection space. Those tools are good, and they share DNA with integrity scoring — but most measure one dimension (faithfulness of a text answer to retrieved context) rather than the integrity of an agent’s full output and its trace.

Tool	Metric	What it checks	Method	Scope
Ragas	Faithfulness	Share of answer claims inferable from retrieved context	LLM-as-judge (claim decomposition)	RAG text answers
TruLens	Groundedness (RAG Triad)	Answer supported by retrieved context	LLM feedback functions	RAG text answers
Vectara	HHEM / Factual Consistency Score	Response factually consistent with source	Fine-tuned classifier (T5)	RAG text answers
Patronus AI	Lynx	Hallucination / faithfulness detection	Fine-tuned judge model	LLM & RAG outputs
DeepEval	Faithfulness, Hallucination	Output contradicts provided context	LLM-as-judge	RAG & agent outputs
Xplore	Data integrity	Every claim vs the agent’s declared evidence and system state; consistency, attribution, and traceability — across actions, not just text	Hybrid: typed deterministic checks where ground truth exists, LLM-as-judge for open claims — claim-level, evidence-linked	Full agent output and trace

Two differences stand out. First, scope: faithfulness metrics judge a generated paragraph against a retrieved passage; integrity judges what an agent did — the tools it called, the records it changed, the citations it declared — against reality.

Second, method. Most tools rely on an LLM judging another LLM — one probabilistic system grading another. Integrity scoring is layered instead. Where a claim has a checkable ground truth — a number, an ID, a field that must match a record — it is verified with typed, deterministic checks: the value is parsed to its expected type, compared exactly (with tolerances where they make sense), and the result is reproducible with no opinion involved. Only for genuinely open-ended claims, where no hard reference exists, does it fall back to an LLM judge — and even then anchored to the specific evidence the agent declared. You get the rigour of deterministic verification wherever it is possible, and the reach of an LLM only where it is unavoidable.
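
The layering described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Xplore's implementation: the claim dictionary shape, the field names, the sample record, and the judge callable are all hypothetical, and the judge is passed in as a plain function so the sketch makes no assumptions about any LLM API.

```python
from decimal import Decimal, InvalidOperation
from typing import Callable, Optional

# Hypothetical judge signature: (claim text, declared evidence) -> supported?
Judge = Callable[[str, str], bool]


def check_number(claimed: str, reference: str, tolerance: str = "0") -> bool:
    """Parse both values as decimals and compare within an absolute tolerance."""
    try:
        return abs(Decimal(claimed) - Decimal(reference)) <= Decimal(tolerance)
    except InvalidOperation:
        return False  # a value that does not parse as a number fails the typed check


def check_id(claimed: str, reference: str) -> bool:
    """Identifiers must match exactly, ignoring surrounding whitespace."""
    return claimed.strip() == reference.strip()


def verify(claim: dict, record: dict, judge: Optional[Judge] = None) -> dict:
    """Use a deterministic check when a reference exists; otherwise defer to a judge."""
    kind = claim["kind"]
    if kind in ("number", "id"):
        reference = record.get(claim["field"])
        if reference is None:
            passed = False  # the claim points at a field the record does not have
        elif kind == "number":
            passed = check_number(claim["value"], reference, claim.get("tolerance", "0"))
        else:
            passed = check_id(claim["value"], reference)
        return {"method": "deterministic", "passed": passed, "checked_against": claim["field"]}
    if judge is None:
        return {"method": "unverified", "passed": None, "checked_against": claim.get("evidence")}
    verdict = judge(claim["text"], claim["evidence"])
    return {"method": "llm_judge", "passed": verdict, "checked_against": claim["evidence"]}


# Made-up system record and claims.
record = {"invoice_total": "1204.50", "customer_id": "C-1042"}
print(verify({"kind": "number", "field": "invoice_total", "value": "1204.5"}, record))
print(verify({"kind": "number", "field": "invoice_total", "value": "1200", "tolerance": "5"}, record))
print(verify({"kind": "id", "field": "customer_id", "value": "C-1024"}, record))
```

Using Decimal rather than float keeps the exact comparison exact: "1204.5" and "1204.50" compare equal, with no binary rounding involved. Every result also names what it was checked against, so a failed check points at a specific field or piece of evidence.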

Where we are

Xplore runs integrity scoring in production, and we apply the same principle in our Agent 007 benchmark: an ungrounded answer is treated as a hallucination, scored and penalised, even when it lands on the right answer by luck.

If your agents feed real decisions, this is the guarantee your stack is missing. Get in touch — we’ll show you the line items, not just the score.