When the LLM Judge Is Also a System Component

An eval built on an LLM judge stops being meaningful once systems learn to imitate the judge.

An LLM judge can play two roles in a retrieval pipeline, and the two are now being conflated in a way that undermines how we measure progress.

In the first role, the judge is an evaluator: it scores each system’s responses and produces a leaderboard of systems. In the second role, the same mechanism becomes a system component: a reranker, a relevance feature, a filter for training data, or a source of self-labeled training signal. RankZephyr, RankVicuna, and nugget-based generation pipelines all embed a signal that coincides with the signal an LLM evaluator would use.

When one model both helps build the systems and grades them, the leaderboard measures something other than what we intend it to.

What is not an issue

Using an LLM judge as a component is a genuinely effective way to improve retrieval and generation systems, and the improvement is real: it holds up when humans, not the LLM, provide the relevance judgments afterward. Reranking real challenge submissions with an LLM judge improves quality under human assessment. The same holds for multi-criterion judges and for generation systems that borrow from nugget-based evaluation. Optimizing toward what the judge approves of tends to produce outputs that humans also prefer.

The technique is not the problem. The problem is what happens to evaluation once the technique is adopted across the field.

The co-adaptation spiral (the issue)

The mechanism is a feedback loop.

A meta-evaluation confirms that the LLM judge agrees with human assessors on the systems available at the time. The judge is certified.

System builders then discover that embedding LLM-judge signals improves their systems, and they adopt the practice widely.

The population of systems on the leaderboard shifts. It now consists of systems that share the judge’s signal. The judge begins rewarding how much of its own signal a system has absorbed, rather than the relevance of the system’s output as a human would assess it.

The leaderboard ranks systems by how closely they echo the judge, not by how well they serve people. Systems that human assessors would rank lower can rise to the top, and both industry and academia promote them.

Because the leaderboard still looks healthy, the drift goes unnoticed, and it feeds the next round of adoption.
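The loop can be made concrete with a toy model. The sketch below is illustrative only: the System class, the additive echo_bonus, and every number in it are assumptions invented for this example, not measurements. It assumes the judge tracks human quality but adds a bonus proportional to how much of the judge's signal a system embeds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    human_quality: float  # hypothetical score human assessors would give
    judge_overlap: float  # 0..1, share of the judge's signal embedded in the system


def human_score(s: System) -> float:
    return s.human_quality


def judge_score(s: System, echo_bonus: float = 0.3) -> float:
    # Toy assumption: the judge follows human quality, plus a bonus for echoing the judge.
    return s.human_quality + echo_bonus * s.judge_overlap


def leaderboard(systems, key):
    return [s.name for s in sorted(systems, key=key, reverse=True)]


# Certification time: no system embeds the judge, so both rankings agree.
before = [System("A", 0.70, 0.0), System("B", 0.60, 0.0), System("C", 0.50, 0.0)]

# After adoption: C embeds the judge. It really does improve under human
# assessment (0.50 to 0.55), but the judge now rewards the echo as well.
after = [System("A", 0.70, 0.0), System("B", 0.60, 0.0), System("C+judge", 0.55, 1.0)]

for label, systems in (("before", before), ("after", after)):
    print(label, "judge:", leaderboard(systems, judge_score))
    print(label, "human:", leaderboard(systems, human_score))
```

In this toy setup the judge-based leaderboard puts the judge-embedding system first after adoption, while the human ranking still places it last. The certification obtained on the earlier population says nothing about that flip.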

A tongue-in-cheek version makes the circularity obvious. Suppose we define the relevant documents to be the top 20 results returned by BM25, a classic keyword-ranking method. BM25 then obtains a perfect evaluation score. This does not imply that BM25 is the perfect ranking system. We have simply let the object of measurement define the measuring stick.
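The BM25 thought experiment is easy to reproduce. The sketch below uses a compact Okapi BM25 implementation with common default parameters (k1 = 1.2, b = 0.75), a five-document toy corpus invented for this example, and a cutoff of 2 instead of 20 to keep it small. The "other" ranker is simply BM25 reversed, standing in for any system that disagrees with BM25.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank(scores):
    """Document ids ordered by descending score (ties keep corpus order)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def precision_at_k(ranking, relevant, k):
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k


# Toy corpus and query, invented for illustration.
docs = [
    "llm judge scores system responses".split(),
    "bm25 is a classic keyword ranking method".split(),
    "an llm judge can rerank retrieved documents".split(),
    "human assessors provide relevance judgments".split(),
    "leaderboard of retrieval systems".split(),
]
query = "llm judge".split()
k = 2

bm25_ranking = rank(bm25_scores(query, docs))

# The circular step: "relevant" is defined as whatever BM25 puts in its top k.
circular_qrels = set(bm25_ranking[:k])

# A stand-in for any ranker that disagrees with BM25.
other_ranking = list(reversed(bm25_ranking))

print("BM25  P@k:", precision_at_k(bm25_ranking, circular_qrels, k))
print("other P@k:", precision_at_k(other_ranking, circular_qrels, k))
```

BM25 scores a perfect 1.0 by construction, and the other ranker scores 0.0 no matter how good its results would look to a human. The numbers describe agreement with BM25, not retrieval quality.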

Why the usual safeguards do not help

The standard defenses, such as ensembles of judges, hidden labels, prompt variation, and judges from different model families, each address a specific failure channel. But they share a fatal property: anything that can be placed in the evaluator can also be placed in the system. A system can train against an ensemble. Judges from different families still share pretraining corpora. Prompt randomization changes surface wording, not the signal the model recognizes. Every evaluator-side safeguard can be internalized and optimized against, which removes it as a safeguard.

One thing cannot be embedded as a reusable system component: human judgment. It remains external to the optimization loop. This is not a claim about human specialness; it is an operational fact about where the signal lives.

What a trustworthy pipeline requires

The conclusion is not that LLM-as-a-Judge should be abandoned. It is useful, both inside systems and as a cost-saving evaluator within a verified context. The error is treating a one-time certification as permanent.

A trustworthy pipeline re-establishes its conditions in every cycle: a renewed meta-evaluation against fresh human judgments on the current submissions, evaluation artifacts such as human-built nugget banks that an LLM cannot regenerate from the query alone, and probes aimed at the next likely wave of adoption. Infrastructure for this already exists, including TREC Auto-Judge, human-in-the-loop tooling, and statistical methods for isolating self- and family-bias.
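A per-cycle re-check of the judge against fresh human judgments can be sketched as a rank-correlation test. The example below is one possible shape for it, not a prescribed procedure: Kendall's tau is a common choice for comparing leaderboards but is an assumption here, as are the function names, the min_tau threshold of 0.8, and all scores, which are made up for illustration.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two score lists over the same systems.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def recheck_judge(judge_scores, human_scores, min_tau=0.8):
    """Return (still_valid, tau) for one cycle's current submissions.

    min_tau is a hypothetical threshold; a real pipeline would choose its own.
    """
    if judge_scores.keys() != human_scores.keys():
        raise ValueError("judge and human scores must cover the same systems")
    systems = sorted(judge_scores)
    tau = kendall_tau(
        [judge_scores[s] for s in systems],
        [human_scores[s] for s in systems],
    )
    return tau >= min_tau, tau


# Made-up scores for two cycles of the same leaderboard.
cycle_1 = recheck_judge(
    {"sys_a": 0.70, "sys_b": 0.60, "sys_c": 0.50},
    {"sys_a": 0.72, "sys_b": 0.58, "sys_c": 0.49},
)
cycle_2 = recheck_judge(
    {"sys_a": 0.70, "sys_b": 0.60, "sys_c": 0.85},  # sys_c now embeds the judge
    {"sys_a": 0.70, "sys_b": 0.60, "sys_c": 0.55},
)
print("cycle 1:", cycle_1)
print("cycle 2:", cycle_2)
```

The point of the design is that the check runs on the current submissions in every cycle. A judge that passed in cycle 1 fails in cycle 2 once the population has shifted, and only the fresh human scores reveal it.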

A leaderboard score from an LLM judge is valid only for the context that produced it: a particular cycle, a particular human reference set, and a particular class of systems for which independence was verified to hold. Outside that context, the score is an extrapolation, not a measurement.

The property that makes LLM judges effective inside our systems, namely that systems can learn to satisfy them, is exactly the property that makes them unreliable as judges over those systems. The constructive path is not to reject LLM-as-a-Judge, but to anchor every leaderboard cycle to fresh human judgment on current submissions.

Paper

Preprint: Download the preprint (PDF)

Accepted at the International Workshop on Vulnerabilities in Generative Systems for Information Retrieval (VulGen’26) at SIGIR.