Why we revisited the experiment
Our earlier gate experiment tested lexical overlap and citation checks as post-generation controls.
The conclusion was directionally right: lexical grounding is a weak confidence signal. But the newer retrieval benchmark gives a clearer, stronger picture:
- semantic retrieval improves in-domain relevance significantly
- semantic retrieval without a confidence policy can over-retrieve on out-of-domain prompts
- lexical retrieval remains useful as a selective fallback, not as the primary relevance engine
So this rewrite focuses on end-to-end system behavior, not only answer gating.
Experimental design (updated)
We moved from ad-hoc prompts to a versioned, labeled query seed and reproducible study scripts.
Method in brief:
- versioned labeled query seed
- controlled A/B retrieval runs (semantic vs lexical)
- post-hoc policy ablation for confidence gating
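To make the bucketed, versioned seed concrete, here is a minimal sketch of what one entry could look like. The `SeedQuery` type, the bucket names, and the document ids are illustrative assumptions, not the actual seed format used in the study:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeedQuery:
    text: str
    bucket: str               # e.g. "keyword" | "paraphrase" | "semantic-stress" | "negative"
    relevant_ids: frozenset   # expected relevant document ids; empty for negative controls

# Hypothetical seed entries; ids and phrasing are illustrative only.
SEED_V2 = [
    SeedQuery("hero section schema", "keyword", frozenset({"doc-12"})),
    SeedQuery("how do I fetch the landing page banner content", "paraphrase", frozenset({"doc-12"})),
    SeedQuery("best pizza in Oslo", "negative", frozenset()),  # out-of-domain control
]
```

Keeping the seed as versioned data (rather than ad-hoc prompts) is what makes A/B runs and later re-runs comparable.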
Key additions compared to earlier runs:
- explicit negative controls (out-of-domain prompts)
- bucketed query sets (keyword, paraphrase, semantic stress)
- retrieval-level IR metrics (Precision@k, Recall@k, MRR, nDCG)
- robustness metrics (false-positive rate on negatives)
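For reference, the retrieval-level metrics above can be computed per query roughly as follows (binary relevance assumed; this is a generic sketch, not the study's exact scoring code):

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document; 0.0 if none was retrieved.
    MRR is the mean of this value over all queries."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG@k: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc_id in enumerate(retrieved[:k]) if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging these over the positive buckets gives the quality picture; the negative bucket is scored separately via acceptance rate (false positives), since negatives have no relevant documents by construction.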
What we learned
1. Semantic wins on in-domain quality
Across positive queries, semantic retrieval outperformed lexical retrieval on relevance metrics. This was expected, and it is now quantified more rigorously.
2. Semantic can fail hard on robustness
Without gating, semantic retrieval accepted too many out-of-domain queries as if they were in-scope. That is the core production risk.
3. Lexical is not enough, but still useful
Lexical retrieval has lower recall overall, but it is naturally more conservative and can reduce bad acceptance in ambiguous or out-of-domain situations. Used as a fallback, it increases system resilience.
Policy study: what gate works best here
We ran a policy ablation on top of semantic outputs:
- baseline: always accept semantic results (no gate)
- score thresholds on the semantic similarity score
- lexical agreement only
- hybrid: score threshold with lexical-agreement fallback
A mixed policy (score threshold plus lexical agreement fallback) produced the best quality/robustness tradeoff in this corpus.
That does not mean the exact threshold is universal. It means confidence policies must be tuned on your own corpus and evaluated continuously.
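The hybrid policy can be sketched as follows. The thresholds and the token-overlap measure are illustrative placeholders that must be tuned per corpus, as noted above; real systems would use the retriever's actual score and a proper lexical matcher:

```python
def lexical_overlap(query, passage):
    """Jaccard overlap between query and passage token sets,
    used here as a simple lexical-agreement proxy."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q | p) if q | p else 0.0

def hybrid_gate(semantic_score, query, passage,
                score_threshold=0.75, overlap_threshold=0.2):
    """Accept the semantic result if its score clears the threshold;
    otherwise fall back to lexical agreement between query and passage.
    Thresholds are illustrative and must be tuned on your own corpus."""
    if semantic_score >= score_threshold:
        return True
    return lexical_overlap(query, passage) >= overlap_threshold
```

The point of the fallback branch is exactly the robustness finding above: a low-scoring semantic hit with zero lexical agreement (typical for out-of-domain prompts) gets rejected instead of silently accepted.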
Practical recommendation
If you run Sanity retrieval in production:
- Keep semantic retrieval as primary.
- Add a confidence gate before accepting semantic context.
- Keep lexical fallback in hybrid mode.
- Maintain a versioned benchmark seed with negative controls.
- Re-run the full experiment after content or retrieval changes.
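The re-run step can be automated as a regression-style robustness check. This is a minimal sketch under assumed names; the false-positive budget is a hypothetical value you would set for your own corpus:

```python
def false_positive_rate(gate_decisions):
    """Share of negative-control queries the gate wrongly accepted.
    gate_decisions: iterable of (bucket, accepted) pairs."""
    negatives = [accepted for bucket, accepted in gate_decisions if bucket == "negative"]
    if not negatives:
        return 0.0
    return sum(negatives) / len(negatives)

FPR_BUDGET = 0.05  # illustrative budget; tune to your own risk tolerance

def check_robustness(gate_decisions):
    """Fail the benchmark run if out-of-domain acceptance exceeds the budget."""
    return false_positive_rate(gate_decisions) <= FPR_BUDGET
```

Running this after every content or retrieval change catches the failure mode from the policy study (semantic over-acceptance on out-of-domain prompts) before it reaches production.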
This is the difference between a good demo and a reliable retrieval system.