The hypothesis

Hypothesis 1 (H1): applying a post-generation gate to LLM answers reduces the rate of false-allow errors: cases where the model produces an incorrect or unsupported answer and the system lets it through anyway.

The gate checks two things before releasing an answer:

  1. Citation validity: every document ID cited in the answer must appear in the retrieved evidence set.
  2. Lexical grounding: the answer must share sufficient word overlap with the retrieved content, above a configurable threshold.

If the answer fails either check, the system falls back instead of returning the answer.
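Concretely, the two checks can be sketched as follows. This is a minimal illustration, not the production code: the function names, the fallback message, and the 0.6 default threshold are ours.

```python
import re

# Illustrative fallback message, not the one used in the experiment.
FALLBACK = "Sorry, I can't answer that from the available documentation."

def lexical_overlap(answer: str, evidence: str) -> float:
    """Fraction of the answer's unique words that also appear in the evidence."""
    answer_words = set(re.findall(r"[a-z0-9]+", answer.lower()))
    evidence_words = set(re.findall(r"[a-z0-9]+", evidence.lower()))
    if not answer_words:
        return 0.0
    return len(answer_words & evidence_words) / len(answer_words)

def gate(answer: str, cited_ids: list[str], retrieved: dict[str, str],
         overlap_threshold: float = 0.6) -> str:
    """Release the answer only if both checks pass; otherwise fall back."""
    # Check 1: every cited document ID must appear in the retrieved set.
    if any(doc_id not in retrieved for doc_id in cited_ids):
        return FALLBACK
    # Check 2: lexical grounding against the concatenated retrieved content.
    evidence = " ".join(retrieved.values())
    if lexical_overlap(answer, evidence) < overlap_threshold:
        return FALLBACK
    return answer
```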


Setup

Corpus: structured B2B SaaS documentation stored in Sanity, 50 labeled queries total (34 expected to allow, 16 expected to trigger fallback).

Models tested: llama-3.1-8b-instant via Groq, gpt-4o-mini and o3-mini via OpenAI.

Conditions:

  • Condition A: retrieval + LLM, no gate.
  • Condition B: retrieval + LLM + gate (standard profile).

Metrics:

  • False-allow rate: share of queries that should have triggered fallback but were allowed through.
  • False-fallback rate: share of queries that should have been allowed but triggered fallback instead.

Each query was run under both conditions for each model. Output files were saved per model so no run overwrote a previous one.
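The two rates reduce to simple counting over the labeled runs. A sketch, with illustrative field names:

```python
def error_rates(runs: list[dict]) -> tuple[float, float]:
    """Compute (false-allow, false-fallback) rates from labeled runs.

    Each run is a dict with two booleans (field names are illustrative):
      expected_fallback - the query's label (True = should fall back)
      fell_back         - what the system actually did
    """
    should_fallback = [r for r in runs if r["expected_fallback"]]
    should_allow = [r for r in runs if not r["expected_fallback"]]
    false_allow = sum(not r["fell_back"] for r in should_fallback) / len(should_fallback)
    false_fallback = sum(r["fell_back"] for r in should_allow) / len(should_allow)
    return false_allow, false_fallback
```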


Results

Model                 False-allow (A)  False-allow (B)  False-fallback (A)  False-fallback (B)
llama-3.1-8b-instant  6.3%             6.3%             14.7%               38.2%
gpt-4o-mini           0.0%             0.0%             14.7%               17.6%
o3-mini               18.8%            18.8%            14.7%               14.7%

H1 was not supported for any model.

The gate had zero effect on the false-allow rate for all three models. For llama-3.1-8b-instant, it substantially increased the false-fallback rate: it blocked correct answers at a much higher rate without catching any of the wrong ones.


Why the gate did not work

The gate’s two signals are structurally weak against the actual failure modes we measured.

Citation validity is easy to satisfy. A model that constructs a fluent answer with references to real document IDs in the retrieved set will pass this check even if the factual content diverges from those documents. The check confirms presence, not accuracy.

Lexical grounding fails in both directions. It over-blocks correct answers that paraphrase rather than quote the source. It under-blocks hallucinations that happen to use domain-consistent terminology from the corpus.
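A toy illustration of both directions, using the same kind of word-overlap measure as above. The evidence and answer strings are invented for the example:

```python
import re

def overlap(answer: str, evidence: str) -> float:
    """Fraction of the answer's unique words found in the evidence."""
    a = set(re.findall(r"[a-z0-9]+", answer.lower()))
    e = set(re.findall(r"[a-z0-9]+", evidence.lower()))
    return len(a & e) / len(a) if a else 0.0

evidence = "Workspace admins can rotate the API key from the settings page."

# Correct paraphrase: same fact, different vocabulary -> low overlap, over-blocked.
paraphrase = "Administrators are able to regenerate credentials via the dashboard."

# Hallucination reusing corpus terminology -> high overlap, under-blocked.
hallucination = "The API key can rotate automatically from the settings page."
```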

For gpt-4o-mini, the false-allow rate was already zero without the gate. The gate’s only effect was a small increase in false-fallbacks: it blocked a few correct paraphrases.

For o3-mini, the gate was completely transparent. The model’s answers, correct or not, consistently scored high on both signals. That result deserves a separate post.


What we would do differently

The root problem is that lexical overlap measures surface similarity, not semantic accuracy.

A replacement signal that would address the failure mode: check whether the answer is entailed by the retrieved content rather than whether it shares words with it. This can be done with a cross-encoder or a small NLI model, at the cost of an additional inference step.
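A sketch of that check, with the entailment scorer left pluggable. In practice `score` would wrap a cross-encoder NLI model; the name, signature, and 0.8 default threshold here are illustrative:

```python
from typing import Callable

def entailment_gate(answer: str, chunks: list[str],
                    score: Callable[[str, str], float],
                    threshold: float = 0.8) -> bool:
    """Allow the answer if at least one retrieved chunk entails it.

    score(premise, hypothesis) should return an entailment probability;
    a real implementation would call a cross-encoder or small NLI model,
    costing one extra inference per (chunk, answer) pair.
    """
    return any(score(chunk, answer) >= threshold for chunk in chunks)
```

The gate stays a boolean decision; only the signal behind it changes from word overlap to entailment.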

A simpler intermediate step: use embedding similarity between the answer and each retrieved chunk, with a threshold on the maximum score rather than lexical overlap. This at least measures meaning rather than vocabulary.
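That intermediate step can be sketched as follows, with the embedding model left as a pluggable function (a real system would plug in a sentence-embedding model; the names are illustrative):

```python
import math
from typing import Callable

def max_similarity(answer: str, chunks: list[str],
                   embed: Callable[[str], list[float]]) -> float:
    """Maximum cosine similarity between the answer and any retrieved chunk.

    The gate would compare this score against a threshold, replacing
    the lexical-overlap signal.
    """
    def cosine(u: list[float], v: list[float]) -> float:
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    answer_vec = embed(answer)
    return max((cosine(answer_vec, embed(c)) for c in chunks), default=0.0)
```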

The citation check is still worth keeping. It is cheap and catches the specific failure mode of a model citing documents outside the retrieved set entirely. It just cannot carry the full weight of a grounding decision on its own.