A Large Test Collection for Entity Aspect Linking

Authors: Jordan Ramsdell and Laura Dietz.

The test collection and all associated data are released under a Creative Commons Attribution-ShareAlike 4.0 International License.

Entity-aspect-linking-2020 Collection

The collection of 1 million EAL instances is provided in the following (disjoint) partitions.

nanni-test
(18289 EAL instances / 162 target entities) Entity aspects associated with target entities used in Nanni’s dataset of 201 examples.
overly-frequent
(429160 EAL instances / 1000 target entities) Entity aspect links associated with the 1000 most frequent entities, where frequency is measured as the number of EAL instances for this target entity.
test
(4967 EAL instances / 1000 target entities) Entity aspect links for 1000 random target entities. (Excluding target entities in nanni-test and overly-frequent.)
validation
(4313 EAL instances / 1000 target entities) Entity aspect links for an additional 1000 random target entities.
train-small
(5498 EAL instances / 1000 target entities) Entity aspect links for an additional 1000 random target entities.
train-remaining
(544892 EAL instances / 106392 target entities) All remaining entity aspect links.

We only provide EAL instances from section hyperlinks that meet our quality criteria.

We provide the EAL collection as gzipped JSONL files. Each line contains one EAL instance as a JSON object. The JSON format is documented here.

Instead of unzipping the files, we recommend opening the jsonl.gz files with a gzip stream.
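Reading the files as a gzip stream might look like this in Python (a minimal sketch; `read_eal_instances` is an illustrative helper, and the fields of each JSON object are documented separately):

```python
import gzip
import json

def read_eal_instances(path):
    """Stream EAL instances from a gzipped JSONL file without unpacking it.

    Each line of the file holds one EAL instance as a JSON object.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because the function is a generator, partitions such as train-remaining (over 500k instances) can be processed without loading the whole file into memory.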

Baselines

Features

We provide each feature in the form of a TREC run file, where the EAL-instance ID serves as the query and the aspect ID as the document. In this work we generate one feature from the score field of each run file.

These feature run files are located in baselines/features-paragraph and baselines/features-sentence.
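Assuming the standard six-column TREC run format (query, `Q0`, document, rank, score, run name), combining several feature run files into per-(instance, aspect) feature vectors could be sketched as follows. `load_feature_runs` is an illustrative helper, not part of the release:

```python
from collections import defaultdict

def load_feature_runs(run_paths):
    """Combine several TREC run files into feature vectors.

    Each run file contributes one feature: for a line
    `instance_id Q0 aspect_id rank score run_name`, the score becomes
    that file's feature value for the (instance_id, aspect_id) pair.
    Pairs missing from a run file default to 0.0 for that feature.
    """
    features = defaultdict(lambda: [0.0] * len(run_paths))
    for i, path in enumerate(run_paths):
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.split()
                if len(parts) < 6:
                    continue  # skip malformed lines
                instance_id, _, aspect_id, _, score = parts[:5]
                features[(instance_id, aspect_id)][i] = float(score)
    return dict(features)
```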

Resulting Run-files (with learning to rank)

Features are combined with learning to rank, training on the train-small subset.

Trained models are located in baselines/experiment-*/trained-models/. The file format is $train--$featureset--$model.model.

The resulting run files are located in baselines/experiment-*/runs-paragraph and baselines/experiment-*/runs-sentence. The file format is $train--$test--$model.run.
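A small helper for splitting these `$train--$test--$model.run` filenames into their components might look like this (`parse_run_filename` is a hypothetical convenience, not part of the release):

```python
def parse_run_filename(filename):
    """Split a run filename of the form $train--$test--$model.run
    (or $train--$featureset--$model.model) into its components."""
    stem = filename.rsplit(".", 1)[0]   # drop the .run / .model suffix
    train, middle, model = stem.split("--")
    return {"train": train, "test": middle, "model": model}
```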

We include results of two list-wise learning-to-rank toolkits:

ranklib
List-wise learning-to-rank toolkit RankLib, using coordinate ascent to optimize for mean-average precision. Z-score normalization is enabled. We use 20 restarts per fold with 20 iterations each.
rank-lips
List-wise learning-to-rank toolkit rank-lips v1.2 with mini-batched training, using coordinate ascent to optimize for mean-average precision. Mini-batches of 1000 instances are kept for five iterations. To avoid local optima, five restarts are used per fold. Training is iterated until the optimization criterion change is less than 10%, disregarding the first five iterations. Unless otherwise noted, Z-score normalization is deactivated.
rank-lips (Z-score)
Like rank-lips but with Z-score normalization.
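Z-score normalization, as used by the ranklib and rank-lips (Z-score) configurations above, rescales each feature to zero mean and unit variance. A minimal sketch (illustrative only, not the toolkits' actual implementation):

```python
import math

def z_score_normalize(feature_values):
    """Z-score normalize one feature column: subtract the mean and
    divide by the (population) standard deviation.

    If the standard deviation is zero, the feature carries no
    information and all values are mapped to 0.0.
    """
    n = len(feature_values)
    mean = sum(feature_values) / n
    variance = sum((v - mean) ** 2 for v in feature_values) / n
    std = math.sqrt(variance)
    if std == 0.0:
        return [0.0] * n
    return [(v - mean) / std for v in feature_values]
```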

Evaluation with Trec_eval

The quality of the resulting run files is evaluated with trec_eval -c -q -m all_trec, including query-by-query results to compute standard errors (and other analyses).

The evaluation files are located in baselines/experiment-*/eval-paragraph and baselines/experiment-*/eval-sentence. The file format is $train--$test--$model.eval.

Corpus Statistics

We provide the fielded corpus statistics used for our BM25 and TF-IDF models.

These were created from 200k random Wiki-2020 pages, tokenized and lemmatized with Stanford's CoreNLP version 3.9.2 (the same tokenizer used by Nanni et al.).

The first two lines of the corpus statistics contain the following meta information:

The corpus statistics are located in baselines/corpus_stats.csv
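As an illustration of how such corpus statistics feed into retrieval scoring, here is a minimal BM25 sketch. The function and its inputs (per-term document frequencies, collection size, average document length) are assumptions for illustration, not the exact model used in the baselines:

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score a document against a query with BM25 (Okapi variant),
    using corpus-level statistics: `doc_freq` maps a term to the number
    of documents containing it, `num_docs` is the collection size, and
    `avg_doc_len` is the average document length in tokens.
    """
    doc_len = len(doc_terms)
    tf = {}
    for term in doc_terms:
        tf[term] = tf.get(term, 0) + 1
    score = 0.0
    for term in query_terms:
        df = doc_freq.get(term, 0)
        if df == 0 or term not in tf:
            continue  # term contributes nothing
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
        norm = tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm
    return score
```

TF-IDF follows the same pattern with a simpler term weight; both only need the fielded statistics provided in corpus_stats.csv.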

Nanni’s 201

To facilitate comparison, we offer a re-released version of Nanni's 201 EAL benchmark using our jsonl.gz format. Information not available in the original nanni-201 dataset is left empty (e.g., entity offsets).

Corpus statistics for Nanni's 201 test set are created from all EAL instances.

The dataset can be used in one of the following experimental setups.

Nanni-201-CV
5-fold cross-validation on nanni-201
Small/Nanni-201
trained on train-small; then tested on nanni-201
Large/Nanni-201
trained on train-small and train-remaining; then tested on nanni-201
Remaining/Nanni-201
trained on train-remaining; then tested on nanni-201

License

Entity-aspect-linking-2020 by Jordan Ramsdell, Laura Dietz is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at http://trec-car.cs.unh.edu/datareleases/v2.4-release.html, work at www.wikipedia.org, and on a work at https://federiconanni.com/entity-aspect-linking/.