Appendix for:
David Sander, Laura Dietz.
EXAM: How to Evaluate Retrieve-and-Generate Systems for Users Who Do Not (Yet) Know What They Want.
DESIRES. 2021.
Our long-term goal is to develop systems that are forthcoming with information. To be effective, such systems should be allowed to combine retrieval with language generation. To alleviate the challenges such systems pose for today’s IR evaluation paradigms, we propose EXAM, an evaluation paradigm that uses held-out exam questions and an automated question-answering system to evaluate how well generated responses can answer follow-up questions. The systems under evaluation do not know the exam questions in advance.
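The idea can be illustrated with a minimal sketch, assuming the EXAM score of an article is the fraction of held-out multiple-choice questions that a question-answering system answers correctly when it sees only that article. The names ExamQuestion, keyword_qa, and exam_score are hypothetical, and keyword_qa is a deliberately crude stand-in (it picks the option mentioned most often in the article), not the question-answering system used in the paper. The toy articles are invented for illustration; the two questions are taken from the examples in this appendix.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ExamQuestion:
    text: str
    options: List[str]
    correct: int  # index of the correct option in `options`


def keyword_qa(article: str, question: ExamQuestion) -> Optional[int]:
    """Toy stand-in for a QA system (an assumption, not the paper's system).

    Picks the option whose text occurs most often in the article and
    abstains (returns None) if no option is mentioned at all.
    """
    lowered = article.lower()
    counts = [lowered.count(option.lower()) for option in question.options]
    best = max(counts)
    if best == 0:
        return None
    return counts.index(best)


def exam_score(
    article: str,
    questions: List[ExamQuestion],
    qa: Callable[[str, ExamQuestion], Optional[int]] = keyword_qa,
) -> float:
    """Fraction of exam questions answered correctly using only the article."""
    if not questions:
        return 0.0
    n_correct = sum(1 for q in questions if qa(article, q) == q.correct)
    return n_correct / len(questions)


questions = [
    ExamQuestion(
        "Darwin observed that the environment on different Galapagos Islands "
        "was correlated with the shell shape of _ ?",
        ["snails", "fossils", "tortoises", "none of the above"],
        2,
    ),
    ExamQuestion(
        "The book 'On the Origin of Species' was published in _ ?",
        ["1801", "1830", "1859", "1901"],
        2,
    ),
]

# Invented toy articles: the first never mentions a publication year.
article_a = "Darwin noticed that giant tortoises on each island had differently shaped shells."
article_b = article_a + " He published On the Origin of Species in 1859."

score_a = exam_score(article_a, questions)  # 0.5: only the tortoise question is answerable
score_b = exam_score(article_b, questions)  # 1.0: both questions are answerable
```

Because the questions are held out, an article scores well only if it happens to contain the information a reader might ask about next, which is the property the paradigm is meant to reward. Any real use would replace keyword_qa with a trained question-answering model.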
Figure: Evaluating articles (left) through exam questions.
query: “Darwin’s Theory of Evolution”
Participating systems were asked to cover the following sub-topics: the theory, the voyage of Darwin’s vessel, giant tortoises, the finches, plant and animal breeding, and influences of other scientists.
We compare the two strongest systems, rerank2-bert and dangnt-nlp, with the lowest-ranked system, uvabottomup2, and the gold article. Although the articles of the top two systems emphasize different content, both read as very convincing to a human and cover all sub-topics with little redundancy. The gold article covers all sub-topics, but in less depth. The lowest-ranked system, however, misses some of the sub-topics and exhibits a high degree of redundancy.
We provide the full text of the articles generated by the three participant systems below, along with the gold article:
Below are three example TQA questions used for EXAM evaluation; (**) marks the correct answer:
Darwin observed that the environment on different Galapagos Islands was correlated with the shell shape of _ ?
a: snails / b: fossils / c: tortoises (**) / d: none of the above
The book “On the Origin of Species” was published in _ ?
a: 1801 / b: 1830 / c: 1859 (**) / d: 1901
Which statement about the Galapagos Islands is true?