This course counts towards your implementation-intensive requirements.
This means that your grade will be split as follows:
As a running example of a data science application, we will focus on a shared task called Complex Answer Retrieval (CAR), hosted by the Text Retrieval Conference (TREC).
The purpose of complex answer retrieval is to automatically compose Wikipedia articles by understanding the meaning of text and drawing connections between meanings through the use of machine learning, graph analysis and natural language processing tools.
Heads up: there will be programming homework in the first week.
You need to be comfortable writing large-scale programs on your own in order to take this course. If you can’t program, you need to learn programming before taking this course.
This course covers basic and advanced algorithms and techniques for data science with knowledge graph and text data.
During this course you will learn about a wide range of algorithms for graph processing, text processing, and information retrieval, with a focus on knowledge graphs such as Wikipedia, DBpedia, Freebase, and Yago, and on text from knowledge articles such as Wikipedia and the World Wide Web.
You will be selecting some of these methods to solve the task given by TREC CAR. You will be implementing some of these algorithms yourself or you will be using implementations of those algorithms in your own code to produce a fully-automatic prototype for complex answer retrieval. You will use your prototype to make a submission to the shared task (competing with researchers world-wide). Forming teams of up to three people is highly encouraged.
Before the submission, you will implement an evaluation framework for assessing which of these approaches works best. Evaluation data will be provided by the TREC CAR organizers, but you will need to develop a test framework that can evaluate not just your own methods but also those of competing teams. This includes statistical analysis of the experimental evaluation, which is the bread and butter of all data-centric research and a skill in high demand in industry.
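To give a rough idea of what such an evaluation framework involves, here is a minimal Python sketch. It assumes average precision as the retrieval measure and a paired t statistic for comparing two systems; both are illustrative choices, not the official TREC CAR setup. The function names, the toy relevance data, and the two runs are hypothetical.

```python
import math
from statistics import mean, stdev


def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def per_query_ap(run, qrels):
    """Map every query in qrels to the AP of the run's ranking for it.

    A query that is missing from the run scores 0.
    """
    return {q: average_precision(run.get(q, []), rel) for q, rel in qrels.items()}


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic over the queries that both systems were scored on."""
    queries = sorted(set(scores_a) & set(scores_b))
    diffs = [scores_a[q] - scores_b[q] for q in queries]
    if len(diffs) < 2:
        raise ValueError("need at least two shared queries")
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all per-query differences are identical")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # Hypothetical toy data: relevant paragraph ids per query, and two runs.
    qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}, "q3": {"d1"}}
    run_a = {"q1": ["d1", "d2", "d3"], "q2": ["d2", "d1"], "q3": ["d2", "d1"]}
    run_b = {"q1": ["d2", "d1", "d3"], "q2": ["d1", "d2"], "q3": ["d1", "d2"]}

    ap_a = per_query_ap(run_a, qrels)
    ap_b = per_query_ap(run_b, qrels)
    print("MAP of run A:", mean(ap_a.values()))
    print("MAP of run B:", mean(ap_b.values()))
    print("paired t statistic:", paired_t_statistic(ap_a, ap_b))
```

Scoring each query separately, rather than reporting only the mean, is what makes a paired significance test possible. The t statistic still has to be compared against a t distribution with n - 1 degrees of freedom to obtain a p-value.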
We will be using tools for software development in a team, as well as publication and distribution of software artifacts in a research setting.
During the course we will be discussing introductory and advanced research papers on various topics of natural language processing, knowledge graph inference, semantic web, and information retrieval. These include entity linking and relation extraction, graph walk algorithms, graph clustering, text-based similarity measures, information retrieval models, text clustering methods and topic models as well as other machine learning methods.
We discuss different methods and how they make use of data and training signals, how they integrate with each other and how they contribute to an approach for the example application of TREC CAR. We discuss how to obtain required training signals automatically from data or through manual annotations by human judges.
Prerequisites: CS 853 Topics/Information Retrieval or permission of instructor. Knowledge of data structures and basic algorithms (such as CS 515). Ability to independently write programs in a language of your choice.
Class and submission schedule - subject to change.
Topics of choice are proposed by students.
Implementation-level issues are discussed during prototype clinic sessions. All students are expected to be present during these sessions and make fruitful contributions.
First class: Jan 23
No class on:
- Feb 13, Feb 15
- March 13, March 15 (spring break)
Hackathon classes (5:10 - 8:00 pm), bring computers and food!
- Feb 1 and Feb 5
- March 20 and March 22
Final project presentations on: May 1, May 3
Your grade will be based on the quality of the implemented prototype (70%) and class participation (30%). You need to obtain a passing grade in both to pass this course.
The prototype will be implemented in teams of up to three people. The project will be implemented in a programming language of your choice. The projects need to be presented in class and will be graded based on:
- Performance on the given task
- Correctness of the implemented methods
- Code quality, legibility, documentation, and use of software-development tools (version control, dependency management, documentation)
- Organization of the team and team spirit
- Understandability of the final report
Research methods will be studied in the format of a journal club. Every week, all students read three assigned papers. Reading notes are submitted before 8 am on the day of the class. Each paper will be presented by one student. The participation grade will be based on:
- Quality of reading notes
- Quality of presentation
- Activity in the discussion (in class as well as on Piazza)
Excellent contributions will be rewarded with an upgrade of the final grade.
Late homework and project report submissions will generally not be accepted. Any activity missed due to a medical or family emergency requires supporting documentation.
Students are expected to:
The instructor is expected to:
Note that it is not sufficient to just be present in class and submit reading notes. If you are stuck, please see the instructor.
The instructor is strongly committed to upholding the standards of academic integrity. These standards, at the minimum, require that students never present the work of others as their own. Any dishonest behavior, once discovered, will be penalized according to the University’s Student Code of Conduct.
The lecture is not based on a book. The following books are recommended for further study.
C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008 (available at http://nlp.stanford.edu/IR-book).
C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining, ACM and Morgan & Claypool Publishers, 2016 (available through http://www.morganclaypoolpublishers.com/catalog_Orig/product_info.php?products_id=944).