EXAM: How to Evaluate Retrieve-and-Generate Systems for Users Who Do Not (Yet) Know What They Want

Appendix for:

David Sander, Laura Dietz.
EXAM: How to Evaluate Retrieve-and-Generate Systems for Users Who Do Not (Yet) Know What They Want.
DESIRES. 2021.

Paper

Talk Slides Part 1

Abstract

Our long-term goal is to develop systems that are forthcoming with information. To be effective, such systems should be allowed to combine retrieval with language generation. To alleviate challenges such systems pose for today’s IR evaluation paradigms, we propose EXAM, an evaluation paradigm that uses held-out exam questions and an automated question-answering system to evaluate how well generated responses can answer follow-up questions—without knowing the exam questions in advance.

Evaluating articles (left) through exam questions.

Data Set

Textbook Question Answering (TQA) dataset of questions and gold articles (provided by AI2).
TREC Complex Answer Retrieval (Y3), an information retrieval challenge in which systems reuse Wikipedia paragraphs to generate articles for the titles of the gold articles.

Example Articles

query: “Darwin’s Theory of Evolution”

Participating systems were asked to cover the following sub-topics: the theory, the voyage of Darwin’s vessel, giant tortoises, the finches, plant and animal breeding, and influences of other scientists.

We compare the two strongest systems, rerank2-bert and dangnt-nlp, with the lowest-ranked system, uvabottomup2, and the gold article. While the articles of the top two systems emphasize different content, the subjective quality of both is very convincing to a human: each covers all sub-topics with little redundancy. The gold article covers all sub-topics, but in less depth. The lowest-ranked system, however, misses some of the sub-topics and exhibits a high degree of redundancy.

We provide the full text of the articles generated by the three participant systems below, along with the gold article:

Example Questions

Below are three example TQA questions used in the EXAM evaluation; (**) marks the correct answer:

Darwin observed that the environment on different Galapagos Islands was correlated with the shell shape of _ ?
a: snails / b: fossils / c: tortoises (**) / d: none of the above
The book “On the Origin of Species” was published in _ ?
a: 1801 / b: 1830 / c: 1859 (**) / d: 1901
Which statement about the Galapagos Islands is true?
a: There are a total of sixteen Galapagos Islands (**)
b: The Galapagos Islands are located in the Atlantic Ocean
c: The Galapagos Islands were the last stop on Darwin’s voyage
d: The Galapagos Islands are inhabited only by giant tortoises
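
To make the evaluation idea concrete, the following is a minimal sketch of how an article could be scored against held-out questions like the three above. It is an illustration under stated assumptions, not the implementation used in the paper: the `Question` class, `toy_answer`, and `exam_score` are hypothetical names, the score is assumed to be the fraction of questions answered correctly, and `toy_answer` is a crude string-matching placeholder for the automated question-answering system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence


@dataclass
class Question:
    text: str
    choices: Dict[str, str]  # choice label -> choice text
    correct: str             # label of the correct choice


# The three example questions from this appendix.
QUESTIONS = [
    Question(
        "Darwin observed that the environment on different Galapagos Islands "
        "was correlated with the shell shape of _ ?",
        {"a": "snails", "b": "fossils", "c": "tortoises", "d": "none of the above"},
        "c",
    ),
    Question(
        'The book "On the Origin of Species" was published in _ ?',
        {"a": "1801", "b": "1830", "c": "1859", "d": "1901"},
        "c",
    ),
    Question(
        "Which statement about the Galapagos Islands is true?",
        {
            "a": "There are a total of sixteen Galapagos Islands",
            "b": "The Galapagos Islands are located in the Atlantic Ocean",
            "c": "The Galapagos Islands were the last stop on Darwin's voyage",
            "d": "The Galapagos Islands are inhabited only by giant tortoises",
        },
        "a",
    ),
]


def toy_answer(question: Question, article: str) -> Optional[str]:
    """Hypothetical stand-in for the QA system (an assumption, not the real one).

    Returns the label of the only choice whose text appears verbatim in the
    article, and abstains (None) if zero or several choices appear.
    """
    text = article.lower()
    hits = [label for label, choice in question.choices.items()
            if choice.lower() in text]
    return hits[0] if len(hits) == 1 else None


def exam_score(
    article: str,
    questions: Sequence[Question],
    answer: Callable[[Question, str], Optional[str]] = toy_answer,
) -> float:
    """Fraction of exam questions answered correctly using only the article."""
    if not questions:
        raise ValueError("need at least one exam question")
    correct = sum(answer(q, article) == q.correct for q in questions)
    return correct / len(questions)


# Two made-up article snippets, for illustration only.
good = ("Darwin noticed that giant tortoises from different islands had "
        "differently shaped shells. He published On the Origin of Species in 1859.")
weak = "Darwin sailed around the world on a ship."

print(exam_score(good, QUESTIONS))  # 2 of 3 questions answered correctly
print(exam_score(weak, QUESTIONS))  # 0.0
```

The questions never reach the system that generates the article; they are only consulted at scoring time. Passing a different `answer` function swaps in another question-answering component without changing the scoring logic.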