Knowing When We Know the RAG System Works

An Unexamined RAG Is Not Worth Interrogating

(Apologies to Socrates)

Most RAG systems in production today have not been seriously evaluated. Not because their builders are careless, but because we don’t actually have good tools for the job. The evaluation paradigms we inherited from information retrieval don’t cleanly transfer, and the LLM-as-a-judge methods filling the gap have known measurement problems.

At the Schloss Dagstuhl – Leibniz-Zentrum für Informatik (LZI) Seminar on RAG, Niklas Deckers, Maik Fröbe, Wojciech Kusa, Mark Sanderson, and I (Laura Dietz) worked through two questions that turn out to be much harder than they sound:

When a RAG system works, how do we know?
And when it fails, how do we find out?

The Cranfield methodology, with its per-document relevance judgments, stable collections, and benchmarks reused for years, held up through the scaling shift in IR. It does not hold up for RAG. Two different systems can answer the same query with responses that are equally good but phrased differently, ordered differently, and sourced differently. Per-document judgments don’t capture that, and per-response judgments are not reusable.
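To make the reusability gap concrete, here is a minimal sketch in Python. The query IDs, document IDs, judgments, and response texts are all made up for illustration, and `precision_at_k` and `lookup_response` are hypothetical helpers, not part of any existing toolkit. The point is only that judgments keyed by document ID can score any new ranking, while judgments keyed by a generated response stop applying as soon as the wording changes.

```python
# Hypothetical Cranfield-style judgments: (query_id, doc_id) -> relevance grade.
qrels = {("q1", "d3"): 1, ("q1", "d7"): 1, ("q1", "d9"): 0}


def precision_at_k(query_id, ranking, k):
    """Fraction of the top-k documents judged relevant.

    Unjudged documents are counted as non-relevant, a common simplification.
    """
    top = ranking[:k]
    return sum(qrels.get((query_id, doc_id), 0) > 0 for doc_id in top) / k


# Two different retrieval systems are scored from the same stored judgments.
system_a = ["d3", "d9", "d7"]
system_b = ["d7", "d3", "d1"]
score_a = precision_at_k("q1", system_a, 2)
score_b = precision_at_k("q1", system_b, 2)

# Hypothetical per-response judgments, keyed by the exact response text.
response_judgments = {("q1", "Paris is the capital of France."): 1}


def lookup_response(query_id, response):
    """Return the stored judgment for a response, or None if it was never judged."""
    return response_judgments.get((query_id, response))


judged = lookup_response("q1", "Paris is the capital of France.")
# An equally good answer with different wording has no stored judgment.
rephrased = lookup_response("q1", "France's capital is Paris.")
```

Here both rankings can be scored without any new annotation, whereas the rephrased response comes back unjudged even though a human would rate it the same.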

Online evaluation has its own asymmetry. IR could learn at scale from implicit click feedback. RAG produces fewer clicks, so that signal weakens. What RAG gains in exchange is explicit personalization: users state what they want in natural language, which IR never had cleanly.

LLM-as-a-judge is the default workaround, and it has three problems worth naming. Narcissism: judges prefer outputs that resemble their own generations. Circularity: the evaluator is often too close in training to the system under test, so the two tend to share blind spots. Memorization: benchmarks leak into training data, so a high score may reflect recall rather than capability. None of these is hypothetical, and none has a clean fix.
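As a rough illustration of how the first of these problems might be probed, the sketch below compares how much a judge over-scores outputs from its own model family relative to human ratings of the same outputs. This is an assumption-laden sketch, not a method from the report: the record layout, the family labels, and the function name `self_preference_gap` are all hypothetical, and the scores are invented.

```python
from statistics import mean


def self_preference_gap(records, judge_family):
    """Estimate how much a judge favors outputs from its own model family.

    records: iterable of (generator_family, judge_score, human_score) tuples,
             all scored by the same judge on the same scale (assumed layout).
    Returns the mean (judge - human) difference on own-family outputs minus
    the same difference on all other outputs. Positive values suggest the
    judge inflates its own family beyond what human raters support.
    """
    own, other = [], []
    for generator_family, judge_score, human_score in records:
        bucket = own if generator_family == judge_family else other
        bucket.append(judge_score - human_score)
    if not own or not other:
        raise ValueError("need outputs from the judge's family and from at least one other")
    return mean(own) - mean(other)


# Invented scores on a 1-5 scale, for illustration only.
records = [
    ("family_a", 5, 4),
    ("family_a", 4, 4),
    ("family_b", 3, 3),
    ("family_b", 4, 4),
]
gap = self_preference_gap(records, judge_family="family_a")
```

Subtracting the human score is a design choice that keeps a genuinely stronger system from being mistaken for a favored one. It does nothing about circularity or memorization, which need different probes.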

Our proposal is narrower than a new evaluation framework. We want open-source tooling for exploratory examination of RAG systems, the kind that helps a researcher quickly see where and how a system breaks. Failure-mode discovery first, better metrics second, once we know what we’re measuring.

The full writeup is in Section 4.5 of the Dagstuhl 25391 report.

Related Publications: Kusa, Wojciech, Niklas Deckers, Maik Fröbe, Laura Dietz, Birte Platow, and Mark Sanderson. 2026. “Talmud-IR: A Talmud-Inspired Interface for Discussing RAG Response Quality.” In Proceedings of the 48th European Conference on Information Retrieval (ECIR 2026). https://www.cs.unh.edu/~dietz/papers/kusa2026talmud.pdf.
Demo: Nugget-based user interface for manually critiquing RAG summaries.