Which LLM-as-a-Judge Should I (Not) Adopt?

If you are building on an LLM or an agent, at some point you need to know whether the output is any good, and the convenient option is to have another LLM score it. This post explains why that often does not work [1], and what to do instead.

Prompt-based Judges and the Circularity Issue

Be skeptical whenever a single prompt is used to evaluate LLM or agent output. In most setups, that judging prompt closely mirrors the prompt that generated the output in the first place. We call this circularity, and our work [2] shows it renders the judge ineffective.

You cannot prompt your way out of this! It is not a matter of a better rubric or a cleverer system message. It is a limitation of the LLMs we have today.

Circularity is also the biggest footgun in most eval and observability tooling: these tools lean on single-prompt judges, and that is exactly the part that breaks.

The failure is silent. A circular judge produces clean, confident, high numbers, because the grader shares the blind spots of the thing being graded. It is a bit like a 5th grader grading their own paper. The LLM is rubber-stamping its own system. You may think it is better than nothing, but I would argue it gives you a false sense of security.

The numbers on the dashboard will look high, even champagne-bottle-popping amazing! But they do not indicate quality. If you act on wrong numbers, you may deploy inferior AI systems or lead your customers to make bad decisions.

What Helps: Genuine Human Oversight

The more genuine human signals you incorporate into the evaluation, the more trustworthy the numbers become.

A practical and well-established way to add that signal is nugget-based LLM judges [3,4,5,8]. Instead of asking a model “is this good?” and letting it do all the intellectual heavy lifting, you break quality down into request-specific, checkable pieces of information, i.e., the nuggets, that a human can define and verify. The LLM then handles the matching at scale. We show in the “Insider Knowledge” paper [6] that the nugget approach is an effective safeguard, provided that the nuggets are not AI-generated.

This idea of nugget-judges predates the current LLM wave [4,7], but the advent of LLMs made this approach practically applicable.

Three places to start if you want to build it:

Nugget-based LLM judges use a human-AI workflow in which:

  1. LLMs handle the low-level linguistic matching of nuggets to system answers. LLMs are very reliable here, whereas humans find this task boring and daunting.
  2. Human domain experts indicate which pieces of information are essential to include. This is an intellectually stimulating task, and one where LLMs tend to produce half-baked, biased, and hallucinated suggestions.
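
To make this division of labor concrete, here is a minimal Python sketch, not the implementation from the cited papers. Everything in it is an assumption made for illustration: the Nugget class, the "human:" author prefix used to mark provenance, the nugget_recall score (fraction of human-written nuggets found in the answer), and keyword_matcher, which is a toy stand-in for the LLM matching step so that the example runs without any model call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Nugget:
    """One checkable piece of information (hypothetical structure)."""
    text: str    # what a good answer should contain
    author: str  # provenance, e.g. "human:alice" or "llm:draft" (assumed convention)


# The matcher decides whether an answer contains a nugget.
# In a real setup this would be an LLM call; here it is any callable.
Matcher = Callable[[str, str], bool]


def keyword_matcher(nugget_text: str, answer: str) -> bool:
    """Toy stand-in for LLM matching: every nugget word occurs in the answer."""
    answer_words = set(answer.lower().split())
    return all(word in answer_words for word in nugget_text.lower().split())


def nugget_recall(nuggets: Sequence[Nugget], answer: str, matcher: Matcher) -> float:
    """Fraction of human-authored nuggets that the matcher finds in the answer.

    Nuggets without the assumed "human:" provenance prefix are ignored,
    so AI-generated nuggets cannot influence the score.
    """
    human_nuggets = [n for n in nuggets if n.author.startswith("human:")]
    if not human_nuggets:
        raise ValueError("no human-authored nuggets to score against")
    matched = sum(1 for n in human_nuggets if matcher(n.text, answer))
    return matched / len(human_nuggets)


# Usage: two human-written nuggets and one AI-drafted nugget that is skipped.
bank = [
    Nugget("paris capital", "human:alice"),
    Nugget("seine river", "human:alice"),
    Nugget("eiffel tower", "llm:draft"),
]
print(nugget_recall(bank, "Paris is the capital of France", keyword_matcher))  # 0.5
```

The design choice worth noting is that the matcher is passed in as a parameter while the nugget bank carries provenance: the automated part can be swapped freely, but the score is only ever computed against what a human has defined.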

The Open Question: Enabling Human Oversight

Nowadays, the difficult part is not matching nuggets against system output. It is getting humans to oversee nugget curation in a way that makes them genuinely accountable. We don’t need humans who just give an “LGTM” thumbs-up [1] without being critical. At the same time, we don’t want the task to be boring, obnoxious, or so expensive that it is abandoned. This is the same design problem that appears in accountable oversight of automated evaluations, LLM traces, and agent behavior.

We need to support humans when building nugget banks and other evaluation artifacts, without blindsiding them and without allowing them to rubber-stamp the LLM/agent output. One promising direction is to let humans take the first step, then use AI to generalize their contribution and reduce repetitive work.

How to design these human-AI workflows is best answered case by case, in real deployments, rather than on a whiteboard. I am actively looking for industry collaborations on exactly this: concrete deployments of GenAI and agents where evaluation has to be trustworthy, scalable, and accountable. If this is a problem you are grappling with, I would be happy to compare notes and see where that gets us.

References

More at my publications page under the two series on LLM-as-a-Judge failure modes and on using LLMs for evaluation.