CS 753/853 Information Retrieval and Web Search

(Coordinator: Laura Dietz)

Catalog description

This course covers basic and advanced algorithms and techniques for text-based information retrieval and Web search. The course focuses on index building, query processing, and document ranking. We will further touch on text classification and clustering as well as crawling and link-based algorithms. Prereq: Data structures (CS515).

Overview

This course covers basic and advanced algorithms and techniques for Web search engines as well as text-based information retrieval in general.

After this course you will be able to develop your own web search engine or customize existing retrieval frameworks such as Apache Lucene. Every week we will carefully examine a different component of a web search engine system.

The course focuses on index building, query processing, and document ranking. We will further touch on text-based machine learning methods, such as classification and clustering, as well as crawling and link-based algorithms such as Google’s PageRank.

The course will cover several algorithms and data structures with application to web search, thereby building on CS 515 “Data Structures”. Both theoretical analyses of run-time performance as well as hands-on programming assignments and a class project are part of the course.

Information retrieval methods are an essential component in any text-based data analytics system, ranging from text mining and machine learning, to natural language processing and knowledge management applications.

Attributes

  • This course is one of the CS electives.

Requirements

Data Structures (CS 515) or permission of instructor.

Ability to independently write programs in either Java, Python, or other language of your choice.

Outcomes

Models, Abstractions, and Algorithms

  • Understanding vector space models and probabilistic models for text.

  • Being comfortable with the terminology of data science, i.e., defining a task, a method, and an evaluation framework.

  • Understanding, applying, and extending algorithms for:

  • full-text indexes, posting-lists, binary trees
  • random walks on graphs (such as PageRank)
  • processing of queries in natural language
  • machine learning algorithms for classification and ranking

  • Determining runtime and space complexity of algorithms in "big-O" notation and giving arguments for efficiency.

Programming and Software Engineering

  • Implementing algorithms in a programming language of their choice (independently).

  • Implementation of a multi-component system that uses third-party libraries, including data import and export.

  • Implementing an evaluation framework, including evaluation measures such as Precision, Recall, and NDCG.

  • Distributing code, including instructions for compilation and usage of the program.

Applications

  • Being able to build a complete Web search system, including Web crawling.

  • Being able to customize algorithm for a domain-specific application, such as cross-language retrieval, search in social media, and correlating text and numerical/categorical data.

Self-learning, Community, and Teamwork

  • Being able to teach oneself algorithms, metrics, and concepts from books and research publications.

  • Being able to find mistakes in one's homework submissions.

  • Learning in partner and group discussions.

  • Implement a class project in small teams.

  • Peer-reviewing class project proposals and discussions across teams.

Evaluation

Five homework assignments (graded pass/fail/bonus), one final class project (50%), two exams (50%).

Topics

This course covers everything one needs to know to build an end-to-end information retrieval, such as a Web search engine. This includes:

  • Design of algorithms and data structures, including theoretical analysis of runtime complexity, with a particular focus on look-up data structures such as inverted indexes, postings lists, and trees.

  • Introductory machine learning algorithms and probabilistic models, with a focus on vector space models, multinomial distributions and Bayes rule, classification, clustering, and discriminative learning-to-rank algorithms.

  • Natural language processing of input documents and user queries, including spelling correction.

  • Random walk algorithms on graphs, such as Google's PageRank.

  • Integration of abstract models into one end-to-end retrieval system that is capable of crawling Web pages for indexing for efficient document retrieval with a user-provided keyword query.

Textbook

Required:

C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008 (available at http://nlp.stanford.edu/IR-book).

B. Croft, D. Metzler, T. Strohman, Search Engines: Information Retrieval in Practice, Addison-Wesley, 2009 (available at http://ciir.cs.umass.edu/irbook/.

C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM and Morgan & Claypool Publishers, 2016. (obtain through http://www.morganclaypoolpublishers.com/catalog_Orig/product_info.php?products_id=944).