Online Appendix for Paper “ENT Rank”

ENT Rank: Retrieving Entities for Topical Information Needs through Entity-Neighbor-Text Relations.

Related work has demonstrated the helpfulness of utilizing information about entities in text retrieval; here we explore the converse: Utilizing information about text in entity retrieval. We model the relevance of Entity-Neighbor-Text (ENT) relations to derive a learning-to-rank-entities model.

We focus on the task of retrieving (multiple) relevant entities in response to a topical information need such as “Zika fever”. The ENT Rank model is designed to exploit semi-structured knowledge resources such as Wikipedia for entity retrieval. The ENT Rank model combines (1) established entity-relevance features with (2) information from neighboring entities (co-mentioned or mentioned-on-page) through (3) relevance scores of textual contexts obtained with traditional retrieval models such as BM25 and RM3.
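To make the combination concrete, here is a minimal, hypothetical sketch of an ENT-Rank-style scoring loop: each (entity, neighbor, text) relation contributes to an entity's score through the retrieval score of its text context and a set of learned feature weights. The feature names, weights, and data layout are illustrative only and do not reflect the actual implementation.

```python
from collections import defaultdict

def score_entities(ent_relations, weights):
    """Aggregate evidence from (entity, neighbor, text) relations.

    ent_relations: list of dicts with keys 'entity', 'neighbor',
                   'context_score' (e.g., BM25/RM3 score of the text context),
                   and 'features' (dict of feature name -> value).
    weights:       learned feature weights from a learning-to-rank model.
    """
    scores = defaultdict(float)
    for rel in ent_relations:
        # Each relation's features are scaled by the relevance score of its text context.
        contribution = sum(weights.get(f, 0.0) * v for f, v in rel["features"].items())
        scores[rel["entity"]] += rel["context_score"] * contribution
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy example (feature names are made up for illustration)
relations = [
    {"entity": "Zika_virus", "neighbor": "Aedes_aegypti",
     "context_score": 12.3, "features": {"co_mention": 1.0, "entity_bm25": 8.1}},
    {"entity": "Dengue_fever", "neighbor": "Aedes_aegypti",
     "context_score": 9.7, "features": {"co_mention": 1.0, "entity_bm25": 5.4}},
]
print(score_entities(relations, {"co_mention": 0.3, "entity_bm25": 0.1}))
```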

Cite as:

Laura Dietz. 2019. 
ENT Rank: Retrieving Entities for Topical Information Needs through Entity-Neighbor-Text Relations.
In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’19), 
July 21–25, 2019, Paris, France. 
ACM, New York, NY, USA, 10 pages. 
https://doi.org/10.1145/3331184.3331257

Download the paper

Watch the SIGIR 2019 talk in the ACM Digital Library.

Download Benchmark TREC CAR Entity ranking

We evaluate ENT Rank on the Entity Retrieval Task of TREC CAR (offered in Y2). (Note that the Entity Retrieval Task is different from the Passage Retrieval Task, which is the main task of Complex Answer Retrieval.)

Official page of TREC CAR benchmarks:

From the 2.3 release, download the

Download Benchmark DBpediaV2-entity-CAR

A sophisticated benchmark for entity retrieval with many baseline systems is DBpedia V2 entity. (Thanks to Krisztian Balog and Faegheh Hasibi for making it available!)

Since text features cannot be computed on the original DBpedia v2-entity benchmark, we projected the dataset onto TREC CAR’s entity ids (and onto the knowledge base in “all but benchmark”).

We determined matches between the DBpedia dump from 2015 (used in DBpedia v2) and the TREC CAR entity ids by (1) exact page name matches and (2) matches in one of the page’s redirects.
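Below is a minimal sketch of this alignment step, assuming hypothetical lookup tables that map TREC CAR page names and redirect names to entity ids; the function and variable names are illustrative, not part of the released tooling.

```python
def align_entity(dbpedia_name, car_ids_by_name, car_ids_by_redirect):
    """Map a DBpedia-2015 page name to a TREC CAR entity id.

    car_ids_by_name:     exact page name -> CAR entity id
    car_ids_by_redirect: redirect page name -> CAR entity id
    Returns None if no match is found (e.g., the page is held out or deleted).
    """
    # (1) exact page name match
    if dbpedia_name in car_ids_by_name:
        return car_ids_by_name[dbpedia_name]
    # (2) match against one of the page's redirects
    if dbpedia_name in car_ids_by_redirect:
        return car_ids_by_redirect[dbpedia_name]
    return None
```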

A few entities could not be aligned, either because their pages are held out of the TREC CAR collection (they serve as CAR queries) or because the pages were deleted.

Code, Installation, Reproduction

Download Resulting Runs and Models

Download result runs and pre-trained models.

The archive contains a directory for each dataset (benchmarkY1train, benchmarkY2test), as well as one for page-level experiments (which ignore the headings and produce a ranking for the whole page/query) and one for section-level experiments (which produce a ranking for each heading). graex11 is our internal codename for ENT Rank.

Run files are indicated by suffixes.

Model files are the *.json files.

For details, see the information on output files in “How to run ENT Rank”.

Download edge contexts

The ENT Rank candidate graph is built from paragraphs, pages, or sections. The archive contains the edge contexts for each type, along with a *.toc file for faster lookups.
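As a rough illustration of how a *.toc file speeds up access, the sketch below assumes a toc that records a byte offset and length for each edge context id; the actual *.toc layout is produced by the TREC CAR tooling and may differ.

```python
def load_toc(toc_path):
    # Hypothetical toc format: one "id<TAB>offset<TAB>length" entry per line.
    toc = {}
    with open(toc_path) as f:
        for line in f:
            context_id, offset, length = line.rstrip("\n").split("\t")
            toc[context_id] = (int(offset), int(length))
    return toc

def read_edge_context(archive_path, toc, context_id):
    # Seek directly to one stored edge context instead of scanning the whole archive.
    offset, length = toc[context_id]
    with open(archive_path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```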

Download Input Runs

ENT Rank takes sets of rankings over passages, entities, and aspects as input; the candidate graph and features are constructed from the top 1000 entries of each ranking.
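As a small illustration, the sketch below cuts an input run file (standard TREC run format: query id, "Q0", document/entity id, rank, score, run name) down to the top 1000 entries per query; the file names are placeholders.

```python
from collections import defaultdict

def truncate_run(in_path, out_path, depth=1000):
    # Keep only the top `depth` entries per query; assumes the run file is
    # already sorted by rank within each query, as is standard for TREC runs.
    kept = defaultdict(int)
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            query_id, _q0, doc_id, rank, score, run_name = line.split()
            if kept[query_id] < depth:
                kept[query_id] += 1
                fout.write(line)

# truncate_run("input-passage.run", "input-passage.top1000.run")
```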

Input runs are available from the trec-car website:


Acknowledgement

This material is based upon work supported by the National Science Foundation under Grant No. 1846017.

Disclaimer

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.