Why we revisited the experiment
Our earlier gate experiment tested lexical overlap and citation checks as post-generation controls.
The conclusion was directionally right: lexical grounding is a weak confidence signal. But the newer retrieval benchmark gives a clearer, stronger picture:
- semantic retrieval improves in-domain relevance significantly
- semantic retrieval without a confidence policy can over-retrieve on out-of-domain prompts
- lexical retrieval remains useful as a selective fallback, not as the primary relevance engine
So this rewrite focuses on end-to-end system behavior, not only answer gating.
Experimental design (updated)
We moved from ad-hoc prompts to a versioned, labeled query seed and reproducible study scripts.
Method in brief:
- versioned labeled query seed
- controlled A/B retrieval runs (semantic vs lexical)
- post-hoc policy ablation for confidence gating
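To make the bucketed, versioned seed concrete, here is a minimal sketch of what one entry could look like. The `SeedQuery` type, the bucket names, and the document ids are illustrative assumptions, not the actual seed format used in the study:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeedQuery:
    text: str
    bucket: str               # e.g. "keyword" | "paraphrase" | "semantic-stress" | "negative"
    relevant_ids: frozenset   # expected relevant document ids; empty for negative controls

# Hypothetical seed entries; ids and phrasing are illustrative only.
SEED_V2 = [
    SeedQuery("hero section schema", "keyword", frozenset({"doc-12"})),
    SeedQuery("how do I fetch the landing page banner content", "paraphrase", frozenset({"doc-12"})),
    SeedQuery("best pizza in Oslo", "negative", frozenset()),  # out-of-domain control
]
```

Keeping the seed as versioned data (rather than ad-hoc prompts) is what makes A/B runs and later re-runs comparable.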
Key additions compared to earlier runs:
- explicit negative controls (out-of-domain prompts)
- bucketed query sets (keyword, paraphrase, semantic stress)
- retrieval-level IR metrics (Precision@k, Recall@k, MRR, nDCG)
- robustness metrics (false-positive rate on negatives)
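For reference, the retrieval-level metrics above can be computed per query roughly as follows (binary relevance assumed; this is a generic sketch, not the study's exact scoring code):

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document; 0.0 if none was retrieved.
    MRR is the mean of this value over all queries."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG@k: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc_id in enumerate(retrieved[:k]) if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging these over the positive buckets gives the quality picture; the negative bucket is scored separately via acceptance rate (false positives), since negatives have no relevant documents by construction.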
What we learned
1. Semantic wins on in-domain quality
Across positive queries, semantic retrieval outperformed lexical retrieval on relevance metrics. This was expected, and it is now quantified more rigorously.
2. Semantic can fail hard on robustness
Without gating, semantic retrieval accepted too many out-of-domain queries as if they were in-scope. That is the core production risk.
3. Lexical is not enough, but still useful
Lexical retrieval has lower recall overall, but it is naturally more conservative and can reduce bad acceptance in ambiguous or out-of-domain situations. Used as a fallback, it increases system resilience.
Policy study: what gate works best here
We ran a policy ablation on top of semantic outputs:
- baseline: always accept semantic results (no gate)
- score thresholds on the semantic similarity score
- lexical agreement only
- hybrid: score threshold with lexical-agreement fallback
A mixed policy (score threshold plus lexical agreement fallback) produced the best quality/robustness tradeoff in this corpus.
That does not mean the exact threshold is universal. It means confidence policies must be tuned on your own corpus and evaluated continuously.
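The hybrid policy can be sketched as follows. The thresholds and the token-overlap measure are illustrative placeholders that must be tuned per corpus, as noted above; real systems would use the retriever's actual score and a proper lexical matcher:

```python
def lexical_overlap(query, passage):
    """Jaccard overlap between query and passage token sets,
    used here as a simple lexical-agreement proxy."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q | p) if q | p else 0.0

def hybrid_gate(semantic_score, query, passage,
                score_threshold=0.75, overlap_threshold=0.2):
    """Accept the semantic result if its score clears the threshold;
    otherwise fall back to lexical agreement between query and passage.
    Thresholds are illustrative and must be tuned on your own corpus."""
    if semantic_score >= score_threshold:
        return True
    return lexical_overlap(query, passage) >= overlap_threshold
```

The point of the fallback branch is exactly the robustness finding above: a low-scoring semantic hit with zero lexical agreement (typical for out-of-domain prompts) gets rejected instead of silently accepted.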
Practical recommendation
If you run Sanity retrieval in production:
- Keep semantic retrieval as primary.
- Add a confidence gate before accepting semantic context.
- Keep lexical fallback in hybrid mode.
- Maintain a versioned benchmark seed with negative controls.
- Re-run the full experiment after content or retrieval changes.
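The re-run step can be automated as a regression-style robustness check. This is a minimal sketch under assumed names; the false-positive budget is a hypothetical value you would set for your own corpus:

```python
def false_positive_rate(gate_decisions):
    """Share of negative-control queries the gate wrongly accepted.
    gate_decisions: iterable of (bucket, accepted) pairs."""
    negatives = [accepted for bucket, accepted in gate_decisions if bucket == "negative"]
    if not negatives:
        return 0.0
    return sum(negatives) / len(negatives)

FPR_BUDGET = 0.05  # illustrative budget; tune to your own risk tolerance

def check_robustness(gate_decisions):
    """Fail the benchmark run if out-of-domain acceptance exceeds the budget."""
    return false_positive_rate(gate_decisions) <= FPR_BUDGET
```

Running this after every content or retrieval change catches the failure mode from the policy study (semantic over-acceptance on out-of-domain prompts) before it reaches production.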
This is the difference between a good demo and a reliable retrieval system.