NSF CAREER: Utilizing Fine-grained Knowledge Annotations in Text Understanding and Retrieval



All members of our information society are constantly in need of quick access to knowledge for work, education, and personal interests. Often the necessary knowledge cannot be found on a single web page or article but deserves a longer, query-specific complex answer. Examples are questions about causes of political events, about nutritional benefits of chocolate, or about the economic viability of wind energy. However, today’s search engines merely provide a list of multiple sources and leave users on their own to synthesize sources into knowledge. This project develops novel algorithms that distill relevant key concepts, text passages, and relational information to provide users with a single summary of comprehensive information. The summary is structured into different sections, each covering a different facet of a complex topic. The focus of this project is to identify the relevant facts and connections that will enable users to form their own opinions and make strategic decisions. Embedded in a self-directed-learning environment, it allows users to learn about new topics at their own pace. Integrated educational activities will make use of the software to inspire interest in STEM among middle school students and undergraduates from other disciplines.

Today’s algorithms fall short of identifying relevant resources for complex topics, as these require knowledge about the world. Utilizing knowledge graphs for information retrieval has led to several important advances, such as the Entity Query Feature Expansion model (EQFE). This project takes a novel holistic approach towards developing representations of knowledge in a knowledge graph, corresponding text annotation algorithms, and retrieval algorithms that work hand-in-hand. The project focuses on three thrusts: (1) Entity aspect linking, which determines the topical context of entity mentions to facilitate a high-precision paragraph ranking. (2) Utilizing relation extractions for open domain information retrieval, where the presence of many non-relevant relations is explicitly addressed. (3) Selecting the query-specific subgraph of the knowledge graph that is suitable to identify relevant entities through long-range dependencies while avoiding concept drift. All three thrusts lead to fine-grained machine understanding of relevant connections and aspects of entities, aligned through supporting passages that provide provenance for their relevance. The impact of all three thrusts extends beyond information retrieval: many applications that build on entity links will gain topical precision from entity aspect links. Most technology that utilizes relations as-is will be improved when relational information is considered in context. Any method that extracts information from the knowledge graph structure will perform better when spurious edges are eliminated. Overall, this research effort will lay the foundation for identifying query-relevant complex information in large natural collections.

Project Goals

Thrust 1: Entity aspect linking

Fine-grained annotations for determining the topical context of entity mentions to facilitate a high-precision paragraph ranking.

Thrust 2: Utilizing relation extractions for open domain information retrieval

Framework for utilizing open/schema-based relation extraction for ranking, where the presence of many non-relevant relations is explicitly addressed.

Thrust 3: Query-specific subgraph of the knowledge graph

Methods for selecting a query-relevant knowledge subgraph (pre-existing KG or extracted KG).

Current Results

Year 3

  1. Data release of large-scale aspect link annotations for all Wikipedia content, following the release of the aspect catalog in 2020.

  2. Achievement on Entity Aspect linking: Utilizing entity aspects in retrieval models, yielding 56% (originally promised: 10%) improvement over our ENT-Rank approach (cf. year 1 report). Earlier, we demonstrated that ENT-Rank is stronger than the reference baselines listed in the original proposal.

Thrust 1, Improving Entity Ranking with Entity Aspects

For the task of entity ranking, we develop an approach that uses entity aspect links to indicate which aspect of an entity is being referred to in the context of the query. This yields more specific relevance indicators for entities. On the TREC CAR and DBpedia V2 datasets, we demonstrate that aspect-based features (MAP 0.42) are more informative than traditional entity ranking features (MAP 0.37) and outperform several strong baselines, including a BERT-based re-ranking method. We obtain this performance boost because previously missing relevant entities are identified by aspect retrieval and aspect link pseudo-relevance feedback (PRF) features when combined with learning to rank (L2R). Hence, relevant entities are promoted to the top of the ranking. We further find positive effects of carefully choosing the candidate set of passages for a query: significant performance improvements are obtained when replacing a BM25 candidate set (MAP 0.41) with a candidate set derived from entity-support passages (MAP 0.48). When integrating our approach into ENT-Rank (cf. year 1 report, Thrust 3), we not only outperform it (MAP 0.53 vs 0.32) but also slightly improve on our own method (MAP 0.50).
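For readers unfamiliar with the metric, the MAP values above can be read with the following minimal sketch of mean average precision. It is an illustration only, not the evaluation code used in the project; the function names and the toy entity identifiers are assumptions made for this example.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant items."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant items that never appear in the ranking contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of the per-query average precision over all judged queries."""
    if not qrels:
        return 0.0
    total = sum(
        average_precision(rankings.get(query, []), relevant)
        for query, relevant in qrels.items()
    )
    return total / len(qrels)


# Hypothetical toy data: two queries with made-up entity identifiers.
toy_rankings = {
    "q1": ["e1", "e2", "e3", "e4"],
    "q2": ["e5", "e6", "e7"],
}
toy_qrels = {
    "q1": {"e1", "e3"},
    "q2": {"e6", "e9"},
}
toy_map = mean_average_precision(toy_rankings, toy_qrels)
```

In this sketch, a relevant entity that is missing from the ranking lowers the score, which is why identifying previously missing relevant entities can raise MAP noticeably.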

The results are published at SIGIR 2021.

Thrust 1, Neural Entity Representations via Entity Aspects

Entity ranking systems often use the Wikipedia page of the entity as the entity’s description. However, the majority of the information about the entity on its Wikipedia page is non-relevant in the context of the query. Hence, in this work, we present BERT Entity Representations (BERT-ER), query-specific entity embeddings obtained from query-specific entity descriptions using BERT. We find that combining the BERT embeddings of entity aspects yields a 37% improvement over BERT embeddings of the lead paragraph (the most commonly used method). When combined with entity support passages, we further obtain a 100% improvement. In an overall system, we obtain a 35% improvement over ENT-Rank (cf. Year 1, Thrust 3). A manuscript about this work is under submission.

Thrust 2: Relevant Relations for Relevant Entities

  1. Publication of the study on predicting relevant relations, evaluated via the surrogate task of entity ranking, (mentioned in the year 2 report) at the Text2Story workshop at ECIR 2021 (an A-ranked conference).

  2. Given the lack of applicability of pre-trained OpenIE extractors with Coref, we conducted a preliminary study on the NYT dataset on improving relation extraction, comparing a BERT-based neural relation extractor (F1 0.48) with a pre-trained OpenIE extractor (F1 0.367). Based on this result, we revisit Thrust 2 with a neural network architecture, training relation extraction end-to-end.

  3. Study on the TextGraphs 2020 dataset on ranking explanations for question answering, using end-to-end relation extraction and explanation ranking. The study revealed that the dataset is too small to train the neural network end-to-end. Further error analysis suggests that the dataset is focused on the word overlap of individual entities (not their relations).

Thrust 3, Relevant Information Ordering with Neural Graphs

Modern search interfaces need to sequence information into a meaningful order. Hence, merely ranking information by relevance is not enough. We study the task of ordering pieces of information into a relevant and coherent response to the search query. Pointer networks are a natural machine learning approach for ordering tasks. However, current approaches focus on text embeddings, without placing much emphasis on the relevance of information. We demonstrate that incorporating entity relevance information and entity-based relations between text passages leads to significant improvements in ordering information so that it is relevant and coherent for a given search query. The approach extends neural Pointer-Generator-Networks and achieves an 11% error reduction over the Sentence-State LSTM Pointer Network (Zhang, 2018). The manuscript about this work is under submission.

Year 2

Thrust 1, Entity Aspect Linking with Entity Salience

We find that entity salience, when used in a supervised setting with the features from Nanni et al.’s entity aspect linking reference method, can improve performance by 11% in terms of P@1. We find that salience helps to boost performance for queries that are difficult for the reference method by leaning on information that is complementary to the “bag-of-features” approach used in the reference method.

Specifically, we improve the reference method (0.65 P@1) by including entity salience indicators (0.72 P@1), a 12% improvement.

Thrust 1, Query-specific Entity Relatedness for Entity Aspect Linking

The reference method depends on the exact matching of entities in the aspect content and entity context. The issue here is that many entities are closely related but not identical. Hence, we develop a soft matching approach that matches entities in context and content using entity relatedness. We study both existing state-of-the-art entity relatedness methods (like Milne-Witten), and develop a new query-context-specific entity relatedness measure based on pseudo-relevance feedback.
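The difference between exact and soft matching can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation: the relatedness table is a hypothetical lookup with made-up values (in practice it would come from a measure such as Milne-Witten or the query-context-specific measure), and the max-then-sum aggregation is one of several plausible choices.

```python
from typing import Dict, FrozenSet, Iterable


def exact_match_score(context: Iterable[str], content: Iterable[str]) -> float:
    """Number of entities shared by the mention context and the aspect content."""
    return float(len(set(context) & set(content)))


def soft_match_score(
    context: Iterable[str],
    content: Iterable[str],
    relatedness: Dict[FrozenSet[str], float],
) -> float:
    """Sum, over context entities, of the best relatedness to any content entity."""
    content_entities = set(content)
    score = 0.0
    for ctx_entity in set(context):
        best = 0.0
        for asp_entity in content_entities:
            if ctx_entity == asp_entity:
                value = 1.0  # identical entities count as a full match
            else:
                value = relatedness.get(frozenset((ctx_entity, asp_entity)), 0.0)
            best = max(best, value)
        score += best
    return score


# Hypothetical relatedness values and entity names, for illustration only.
toy_relatedness = {frozenset(("Cocoa_bean", "Chocolate")): 0.8}
toy_context = ["Cocoa_bean", "Flavonoid"]
toy_aspect_content = ["Chocolate", "Flavonoid"]

exact = exact_match_score(toy_context, toy_aspect_content)
soft = soft_match_score(toy_context, toy_aspect_content, toy_relatedness)
```

In the toy data, exact matching only credits the shared entity, while soft matching also credits the closely related pair, which is the effect the soft matching approach is after.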

We find that our context-specific entity relatedness leads to performance improvements in terms of P@1 (0.74 vs 0.64 P@1), a 16% improvement.
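The percentage follows from the two P@1 values as a relative gain over the reference method. A minimal sketch of the arithmetic, with function names chosen for this example:

```python
from typing import List


def precision_at_1(top_is_correct: List[bool]) -> float:
    """Fraction of instances whose top-ranked aspect is the correct one."""
    return sum(top_is_correct) / len(top_is_correct)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


# P@1 values quoted in the text above.
gain = relative_improvement(baseline=0.64, improved=0.74)
gain_percent = round(100 * gain)
```

Here `gain` is about 0.156, which rounds to the reported 16%.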

Thrust 1, Fine-tuning BERT’s “Next Sentence” Method for Entity Aspect Linking

We explored the potential of the Bidirectional Encoder Representations from Transformers (BERT) model for entity aspect linking. Because the input is constrained to flat text of up to 512 sentence-piece tokens, we evaluated different combinations of context, mention, and aspect, including the concatenation of context, mention, and aspect. All aspects were ranked by a score learned through BERT, with the top-ranked aspect selected as the prediction. The best-performing transformer model was the one that used the concatenation of context, mention, and aspect across all 512 available tokens.
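To make the 512-token constraint concrete, the sketch below packs context, mention, and aspect into a single fixed-length input. Everything beyond the 512-token limit is an assumption for illustration: whitespace splitting stands in for sentence-piece tokenization, the segment layout with `[CLS]` and `[SEP]` markers follows the usual BERT sentence-pair convention, and the even budget split is a hypothetical truncation policy rather than the one used in the project.

```python
from typing import List

MAX_TOKENS = 512  # input limit stated in the text


def build_input(context: str, mention: str, aspect: str,
                max_tokens: int = MAX_TOKENS) -> List[str]:
    """Pack context, mention, and aspect into one sequence of at most max_tokens."""
    # Whitespace splitting is a stand-in for sentence-piece tokenization.
    ctx, men, asp = context.split(), mention.split(), aspect.split()
    budget = max_tokens - 3  # room for [CLS] and two [SEP] markers
    men = men[:budget]
    remaining = budget - len(men)
    # Assumed policy: split the remaining budget evenly, then give any
    # unused share of one segment to the other.
    ctx_share = min(len(ctx), remaining // 2)
    asp_share = min(len(asp), remaining - ctx_share)
    ctx_share = min(len(ctx), remaining - asp_share)
    return ["[CLS]"] + ctx[:ctx_share] + men + ["[SEP]"] + asp[:asp_share] + ["[SEP]"]


# Hypothetical inputs that are far longer than the limit.
long_input = build_input("c " * 1000, "Paris", "a " * 1000)
short_input = build_input("a b", "m", "x y")
```

With long inputs the sequence fills exactly the available budget, which mirrors the observation that the best model used all 512 tokens.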

This BERT-based implementation improved on the reference method by 20% (0.77 vs 0.64 P@1). Even in incorrect cases, the correct candidate is narrowly missed, as reflected in 0.86 MAP and 0.8963 NDCG@20.

Thrust 2, Finding relevant relations through corpus vs knowledge graphs.

Most literature about exploiting relations focuses on knowledge graphs, where relational information is disembodied from the text that mentions these relations. Our hypothesis is that the context of relations contains strong relevance indicators, which are essential for modelling relevance beyond the mere expression of the relation. For the task of entity ranking, we compared Wikipedia page retrieval and knowledge graph coupling baselines against paragraph retrieval that weights entities by relevance-weighted co-occurrences.

Although the approaches are simple, the resulting 50% improvement demonstrates the untapped potential of using relevant text for determining which relations are relevant for an information need.


Progress and Negative Results (not listed under Significant Results below):

4d) Joint aspect linking: In Section 3.2.1 we suggested a joint linking method that uses contextual entity-aspect links to refine the prediction accuracy with an aspect-to-aspect similarity measure. Initial research using the reference method for producing initial aspect links did not yield any improvements. The error analysis suggests that, on the one hand, the aspect-to-aspect similarity metrics were too naive, while on the other hand, errors in the initial (individual) linking negatively impact the results. We paused the activity to obtain a stronger individual linker, and will resume the activity shortly.

4e) Aspect types: Section 3.2.1 suggested exploring the benefit of aspect types (these define classes of aspects shared by multiple entities, e.g. the section names independently of the Wikipedia page). We analyze the difficulty of predicting relevant aspect types for a paragraph. Using bag-of-words features, we compared an unsupervised retrieval-based method (BM25) with a supervised gradient-boosted decision tree classifier on paragraphs in four medium-sized Wikipedia categories (e.g. “Diseases and Disorders”). We confirmed that supervised classifiers obtain much better performance (MRR 0.22 vs 0.05). However, having to train individual classifiers for thousands of aspect types does not scale to all aspects across Wikipedia. The next steps are to explore training an aspect embedding and a supervised metric learning approach.

Thrust 2:

Study on predicting relevant relations, evaluated via the surrogate task of entity ranking.

  1. The first part of the study compared:
    1. entity ranking through Wikipedia page retrieval (baseline)
    2. bibliographic coupling and co-coupling on the knowledge graph, with and without relevance weighting (baseline)
    3. paragraph retrieval to weight entities by relevance-weighted co-occurrences, a poor man’s relation extractor (50% improvement, suggesting the potential merit of this thrust)

  2. The second part uses our state-of-the-art method ENT Rank (published at SIGIR 19, developed as part of year 1 of this grant). The focus is to yield significant improvements over ENT Rank by incorporating additional indicators of relevant relations (evaluation through ablation study). This activity is ongoing, but the following sub-activities have been completed:
    1. Processing of passages with an OpenIE relation extractor (Stanford CoreNLP), entity linkers (WAT/TagMe and Spotlight, in addition to the entity links provided in the dataset), and entity coreference resolution (Stanford CoreNLP). An initial analysis demonstrated that this additional processing is necessary to obtain sufficient coverage of relations with target entities.
    2. Using naive relation indicators did not provide significant improvements.
    3. An error analysis of the overlap of candidate sets after entity linking and relation extraction (suggesting potential).
    4. Moreover, we are producing additional relation extraction annotations from ReVerb (Fader et al., 2011) and MinIE (Gashteovski et al., 2017).
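The relevance-weighted co-occurrence idea from the first part of the study can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes each retrieved passage comes with a retrieval score and a list of linked entities, and it assumes entities are ranked by the summed weight of their co-occurrences. The passages, scores, and entity names are made up.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

# (retrieval score, entities linked in the passage); values are made up.
Passage = Tuple[float, List[str]]


def cooccurrence_weights(passages: List[Passage]) -> Dict[Tuple[str, str], float]:
    """Sum passage scores over every unordered entity pair that co-occurs."""
    weights: Dict[Tuple[str, str], float] = defaultdict(float)
    for score, entities in passages:
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += score
    return dict(weights)


def rank_entities(passages: List[Passage]) -> List[Tuple[str, float]]:
    """Rank entities by the total weight of the co-occurrences they take part in."""
    totals: Dict[str, float] = defaultdict(float)
    for (a, b), weight in cooccurrence_weights(passages).items():
        totals[a] += weight
        totals[b] += weight
    # Sort by descending weight, breaking ties alphabetically.
    return sorted(totals.items(), key=lambda pair: (-pair[1], pair[0]))


toy_passages: List[Passage] = [
    (2.0, ["Wind_power", "Wind_turbine", "Subsidy"]),
    (1.0, ["Wind_power", "Subsidy"]),
    (0.5, ["Wind_turbine", "Bird"]),
]
toy_ranking = rank_entities(toy_passages)
```

Entities that only co-occur with others in low-scoring passages end up at the bottom of the ranking, so the passage retrieval score acts as a crude relevance filter on the implied relations.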

Progress and Negative Results:

2c) Relevant Relations for Entity Ranking

Our goal here is to improve on our earlier results from ENT Rank (reported in year 1) for entity ranking. ENT Rank uses entity, neighbor, and text indicators; we aim to improve it by incorporating information from relation extraction. However, extracted relations can only be used for the entity ranking task if their arguments are annotated with entity links (or are coreferent to entity mentions with annotated entity links).

This analysis shows that with ideal relation extraction features, the rank position of 66% of relevant entities can be improved, hence a MAP of 0.6 is potentially possible (compared to ENT Rank: 0.28).

Thrust 3:

Release of software “ENT-Rank-Lips”, a highly efficient re-implementation of the ENT Rank prototype (reported in year 1). The reimplementation allows the use of the machine learning component of ENT Rank. Whereas ENT Rank was focused on learning to rank entities using entity, neighbor, and text features, this implementation can produce rankings over any data type (e.g. passages, items, users, relations) and use any type of association data (not just entity mentions in passages, but also extracted relations, etc.). This software is used in our experiments for Thrust 1 and Thrust 2.

Year 1

We demonstrated that the best approach for fine-grained entity annotations is an ELMo-based Bi-LSTM model with CRF (Magnusson & Dietz, 2019a, 2019b). This result was achieved on a difficult dataset for toponym linking in biomedical publications. Improvements were demonstrated with

  1. ELMo-based Bi-LSTM model with CRF (F1 0.910)
  2. over traditional state-of-the-art entity-linking method TagMe (F1 0.544)
  3. over convolutional neural networks as well as multi-layer perceptrons (F1 0.861)
  4. over an approach that additionally uses domain-specific word embeddings and dictionaries such as gazetteers (F1 0.909)

We provide a large-scale benchmark for one million target entities, whose entity links are to be refined according to their catalog of aspects (released under CC-SA). We re-implement a strong baseline and offer professionally released open-source code and feature sets to facilitate rapid research on how to extend this baseline. We demonstrate a reproduction of the prediction quality on the original (small) dataset, and offer baseline evaluation numbers for the large dataset as well.

Original Dataset (200 instances, 5-fold CV):

  1. As Reported: P@1 0.667±0.034 (Sentence context)
  2. Reproduction with RankLib: 0.632±0.034
  3. Reproduction with rank-lips: 0.622±0.034

Our Large-scale dataset (5498 training instances / 4967 test instances):

  1. Reproduction with RankLib: P@1 0.623±0.015 (Sentence context)
  2. Reproduction with rank-lips: 0.627±0.015

We improved the state of the art on entity ranking with our novel ENT-Rank framework for integrating entity, neighbor, and text relation features with learning to rank. Best performance was demonstrated on two benchmarks: TREC CAR entity (Dietz et al., 2018), in comparison to eleven baseline systems, and DBpedia-entity-v2 (Hasibi et al., 2017), in comparison to twelve baseline systems. This positive result is crucial for the feasibility of the project, since the proposed research activities were planned to be implemented within the ENT-Rank framework. Furthermore, the entity ranking task is identical to the “node retrieval” task, as laid out in the proposal Section 3.3.3.

TREC CAR entity (section-level):

  1. ENT-Rank obtains NDCG@100 manual 0.592 / automatic 0.428
  2. Entity features with learning to rank (best TREC CAR submission) NDCG@100 manual 0.514 / automatic 0.316

…omitted 10 weaker baselines…

DBpedia entity v2:

  1. ENT-Rank obtains NDCG@100 0.711
  2. coordinate ascent trained BM25F obtains NDCG@100 0.680
  3. Fielded Sequential Dependence model (Zhiltsov et al, 2015) obtains NDCG@100 0.663

…omitted 10 weaker baselines…
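The NDCG@100 numbers above can be read with the following minimal sketch of the metric. It is an illustration, not the evaluation tool used for the reported results: it assumes linear gain and a log2 rank discount, and evaluation tools differ in such details. The function names are chosen for this example.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain of the first k gains (log2 discount)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranking: List[str], grades: Dict[str, float], k: int) -> float:
    """DCG of the ranking divided by the DCG of an ideal ordering."""
    gains = [grades.get(item, 0.0) for item in ranking]
    ideal = sorted(grades.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical graded judgments for one query.
toy_grades = {"a": 2.0, "b": 1.0}
perfect = ndcg_at_k(["a", "b", "c"], toy_grades, k=100)
swapped = ndcg_at_k(["b", "a", "c"], toy_grades, k=100)
```

A ranking that places the highest-graded entities first scores 1.0, and swapping them lowers the score because gains at lower ranks are discounted.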

We demonstrated that humans agree with our fully-automatic evaluation approach for passage ranking and entity ranking for producing comprehensive answers. This result is important as it allows us to efficiently evaluate our methods on large-scale benchmarks.

Our study demonstrates that the agreement between human assessors and the fully automated approach is

  1. on passage ranking: Kendall’s tau 0.93 / Spearman’s Rank coefficient 0.98
  2. entity ranking on Wikipedia topics: 0.74 / 0.90
  3. entity ranking on text book chapters: 0.74 / 0.89
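Agreement of this kind is typically measured by comparing how two evaluation methods order the same set of systems. The sketch below computes Kendall’s tau and Spearman’s rank correlation in plain Python; it assumes no tied scores, and the system scores are made-up values rather than data from the study.

```python
from itertools import combinations
from typing import List


def kendall_tau(a: List[float], b: List[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / pairs


def spearman_rho(a: List[float], b: List[float]) -> float:
    """Spearman's rank correlation for score lists without ties."""
    def ranks(values: List[float]) -> List[int]:
        order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
        result = [0] * len(values)
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result

    n = len(a)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical scores of four systems under manual and automatic evaluation.
manual_scores = [0.60, 0.55, 0.40, 0.30]
automatic_scores = [0.45, 0.47, 0.30, 0.20]
tau = kendall_tau(manual_scores, automatic_scores)
rho = spearman_rho(manual_scores, automatic_scores)
```

In the toy data the two evaluations disagree only on the order of the top two systems, so both coefficients stay high without reaching 1.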



All publications are accompanied by poster presentations or oral talks.


We provide a large-scale dataset for training and evaluation, as well as baseline methods, for entity aspect linking (EAL), a fine-grained variation on entity linking that discerns which particular aspect of an entity is mentioned in the context.

Official Resource Page

The release of the datasets used in the ENT Rank paper, together with derived feature files and ranking predictions, allows the research findings to be reproduced. We use the ‘Nix’ build system to guarantee that code compilation can be reproduced under any Linux operating system for many years into the future.

Release of the entity aspect linking dataset with over 1 million examples, split into train/val/test, with reference methods and feature sets for reproducibility. Data released under the Creative Commons Attribution-ShareAlike license.

Official Resource Page, Conference Talk

Full Conference Paper on the entity aspect linking dataset and the reproduction of Nanni et al’s reference methods. Ramsdell, Jordan, and Laura Dietz. “A Large Test Collection for Entity Aspect Linking.” Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2020.

The release of entity aspect annotations for all of Wikipedia, together with reference implementations and feature sets, enables other researchers to develop new models that use these features in novel ways and to add new features with an ablation study. The data relates to the aspect catalog from 2020. The dataset is released under open data and open source licenses.

Data release of large-scale aspect link annotations for all Wikipedia content.


The release of the ENT Rank framework code enables other researchers to develop new methods using the entity-neighbor-text features, to devise models that use these features in novel ways, and to provide new relevance indicators as features to the model.

Release of software ENT-Rank-Lips, a high-performance implementation of ENT-Rank under BSD license. While originally developed for entity ranking, ENT-Rank-Lips is easily customizable to apply the same algorithm to rank any modality (e.g. entity or passages) with any other external data (e.g. passages and knowledge graph relations) through associations in the form of (Data1, association, Data2), e.g. (Passage, Entity Link, Entity) or (Entity1, Relation, Entity2).

Broader Impacts

The educational activities are designed to broaden participation in STEM with an interdisciplinary undergraduate course on Knowledge Search Engines. Teaching middle school students how search engines use external knowledge (including private data) will lead to better digital citizenship and will spark interest in STEM.

By creating key technology for a deep understanding of text and knowledge, this work leads to search engines that can give a comprehensive response. Such applications are valuable for a modern society that needs to learn quickly, and they also support self-directed learning in the classroom. Furthermore, this work lays the foundations for conversational agents that are capable of discussing topics in science and society with their users.

Because of the COVID-19 pandemic, all educational and outreach activities are delayed.


This material is based upon work supported by the National Science Foundation under Grant No. 1846017.


Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Point of contact: Laura Dietz

This page was last updated December 2019.