Agenda: Introduction to Data Science

1. What is Data Science?

2. What do I need to know to be a Data Scientist?

3. Task, Evaluation, System, Methods and how to read papers

4. Task for prototype: TREC CAR (Complex Answer Retrieval track at the Text Retrieval Conference)

 

What is Data Science?

Definition from:

- Wikipedia: wiki-data-science.pdf

- Berkeley: berkeley-what-is-datascience.pdf

- NYU: nyu-what-is-datascience.pdf

- Several people on Quora: quora-what-is-data-science.pdf

 

Related terms (from Wikipedia)

- Data Mining: wiki-data-mining.pdf

- Data Journalism: wiki-data-journalism.pdf

 

What do I need to know to be a Data Scientist?

Many online courses focus on programming in Python and R, on particular machine learning toolkits, on statistical methods, and on visualization.

Someone came up with a roadmap on different topics associated with Data Science:

Figure: RoadToDataScientist1.png

 

It is impossible to discuss all these topics within a single course. In this course, the emphasis is on data science methods for textual data and knowledge graph data: the orange branch in the map, and beyond. Our journey through this road map will also include fundamentals (blue), machine learning (yellow), and toolboxes (brown). By implementing your prototype you will automatically learn about topics in data munging (pink), data ingestion (green), and programming (yellowish green). Topics of quantitative evaluation (statistics, light blue) and presentation (visualization, red) will be used to assess the performance of your prototype.

 

Task, Evaluation, System, Methods and how to read papers

slides-week1-task-evaluation-system-methods-papers.pdf

 

Task for Prototype: TREC CAR

See the website of the Complex Answer Retrieval track that is hosted at TREC this year for a detailed task and data description.

 









Everyone must read both mandatory papers and a third one from the list below.

 

TREC CAR Task (20 minutes)

We go through the first half of the TREC CAR task presentation given at the planning session at TREC in November 2016 (trec-car-planning.svg).

 

Agenda (60 minutes): Graph Walks

 

1. 10 minute introduction to the topic

2. Discussion of reading notes

3. Questions and "not understood" parts

4. Paper discussion (Section-by-section)

5. Final research paper deconstruct

 

Introduction: The presenter should give a 10-minute introduction to the topic. Roughly: What is it about? What are the critical definitions? How is work in this area typically evaluated?

 

Reading notes: The presenter will talk about her/his submitted reading notes, and other members of the audience are asked to talk about their reading notes as well.

 

Questions: At this point, the presenter and the audience should list any questions and any parts that were not understood. (You had better ask the question before I ask you.)

 

Paper discussion: A section-by-section paper discussion follows. The presenter facilitates this discussion, but everyone is expected to contribute. We walk through some of the papers, section by section, and recap the most important points. This is another opportunity for the presenter and the audience to ask questions and point out connections to other papers.

 

Research paper deconstruct: One outcome of this discussion is a better "research paper deconstruct" (cf. my last lecture). The reading notes, which are due before class, are already one attempt at a paper deconstruct, but a second attempt is often better than the first.

 

 

Mandatory Reading Assignments

Everyone must read (and summarize) these:

Haveliwala, Taher H. "Topic-sensitive pagerank." In Proceedings of the 11th international conference on World Wide Web, pp. 517-526. ACM, 2002.

http://ilpubs.stanford.edu:8090/573/1/2002-6.pdf

Navigli, Roberto, and Mirella Lapata. "Graph Connectivity Measures for Unsupervised Word Sense Disambiguation." In IJCAI, pp. 1683-1688. 2007.

http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-272.pdf

 

Farahat, Ayman, et al. "Authority rankings from HITS, PageRank, and SALSA: Existence, uniqueness, and effect of initialization." SIAM Journal on Scientific Computing 27.4 (2006): 1181-1201.

https://s3.amazonaws.com/academia.edu.documents/44429954/Authority_Rankings_from_HITS_PageRank_an20160405-30697-jbtv0b.pdf?AWSAccessKeyId=AKIAIWOWYYGZ2Y53UL3A&Expires=1516668455&Signature=%2BvGT2mn0qR%2FvGj3kG5OnP6cgYOM%3D&response-content-disposition=inline%3B%20filename%3DAuthority_Rankings_from_HITS_PageRank_an.pdf
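As a warm-up for these papers, here is a minimal sketch of PageRank computed by power iteration, with an optional teleport (personalization) vector, which is roughly the idea that topic-sensitive PageRank builds on. The function name `pagerank`, the toy graph, the damping factor of 0.85, the fixed iteration count, and the choice to redistribute the mass of dangling nodes according to the teleport vector are all assumptions made for illustration; they are not taken from the papers.

```python
def pagerank(graph, teleport=None, damping=0.85, iterations=100):
    """Power iteration on a graph given as {node: [successor, ...]}.

    teleport maps node -> weight of jumping there; uniform if None.
    """
    nodes = sorted(set(graph) | {v for succ in graph.values() for v in succ})
    if teleport is None:
        teleport = {n: 1.0 for n in nodes}
    total = sum(teleport.values())
    jump = {n: teleport.get(n, 0.0) / total for n in nodes}

    rank = dict(jump)  # start the walk from the teleport distribution
    for _ in range(iterations):
        # Mass on nodes without out-links (assumption: it follows the teleport vector).
        dangling = sum(rank[n] for n in nodes if not graph.get(n))
        new_rank = {n: (1.0 - damping + damping * dangling) * jump[n] for n in nodes}
        for n in nodes:
            successors = graph.get(n) or []
            for v in successors:
                # Each node splits its rank evenly among its successors.
                new_rank[v] += damping * rank[n] / len(successors)
        rank = new_rank
    return rank


# Hypothetical toy graph: "c" collects links from every other node.
toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
uniform = pagerank(toy)                       # ordinary PageRank
biased = pagerank(toy, teleport={"d": 1.0})   # random jumps always land on "d"
```

Comparing `uniform` and `biased` shows how changing only the teleport vector shifts rank toward the nodes near the preferred start node, while the link structure stays the same.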

Further reading: everyone must read (and summarize) one paper from the following list:

 

Hulpus, Ioana, Conor Hayes, Marcel Karnstedt, and Derek Greene. "Unsupervised graph-based topic labelling using dbpedia." In Proceedings of the sixth ACM international conference on Web search and data mining, pp. 465-474. ACM, 2013.

https://aran.library.nuigalway.ie/xmlui/bitstream/handle/10379/4528/wsdm_noCopyright.pdf?sequence=1

 

Chakrabarti, Soumen. "Dynamic personalized pagerank in entity-relation graphs." In Proceedings of the 16th international conference on World Wide Web, pp. 571-580. ACM, 2007.

https://www.cse.iitb.ac.in/~soumen/doc/www2007/www324-chakrabarti.pdf

 

Yeh, Eric, Daniel Ramage, Christopher D. Manning, Eneko Agirre, and Aitor Soroa. "WikiWalk: random walks on Wikipedia for semantic relatedness." In Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, pp. 41-49. Association for Computational Linguistics, 2009.

http://www.anthology.aclweb.org/W/W09/W09-32.pdf#page=53

 

Baluja, Shumeet, Rohan Seth, D. Sivakumar, Yushi Jing, Jay Yagnik, Shankar Kumar, Deepak Ravichandran, and Mohamed Aly. "Video suggestion and discovery for youtube: taking random walks through the view graph." In Proceedings of the 17th international conference on World Wide Web, pp. 895-904. ACM, 2008.

http://www.esprockets.com/papers/adsorption-yt.pdf

 

Agirre, Eneko, Oier López de Lacalle, and Aitor Soroa. "Random Walks for Knowledge-Based Word Sense Disambiguation." Computational Linguistics 40, no. 1, 2014, pp 57-84.

http://anthology.aclweb.org/J/J14/J14-1003.pdf

 

 

 

 

Background Reading (optional, introductory)

Book "Text Data Management and Analysis" Chapter 10.3

Page, Lawrence, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Stanford InfoLab, 1999.

http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf

 

Mihalcea, Rada, and Dragomir Radev. Graph-based natural language processing and information retrieval. Cambridge University Press, 2011. ISBN:0521896134 9780521896139 

https://books.google.com/books?hl=en&lr=lang_en|lang_fr|lang_de&id=kByP4X9c5AQC&oi=fnd&pg=PA1&ots=_CIj4i4l5R&sig=iAhxKREc9v7qixKAr_gbK-SueEo#v=onepage&q&f=false

Eppstein, David. "Finding the k shortest paths." SIAM Journal on computing 28, no. 2 (1998): 652-673.

http://www.ics.uci.edu/~eppstein/pubs/Epp-SJC-98.pdf

 

 

Advanced Reading (Continue here if this was too easy)

Backstrom, Lars, and Jure Leskovec. "Supervised random walks: predicting and recommending links in social networks." In Proceedings of the fourth ACM international conference on Web search and data mining, pp. 635-644. ACM, 2011.

http://www-cs-faculty.stanford.edu/people/jure/pubs/linkpred-wsdm11.pdf

Bahmani, Bahman, Kaushik Chakrabarti, and Dong Xin. "Fast personalized pagerank on mapreduce." In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pp. 973-984. ACM, 2011.

https://www.kth.se/social/files/5526e966f276542e80c58ce5/mod113-bahmani.pdf

Talukdar, P. P., Reisinger, J., Paşca, M., Ravichandran, D., Bhagat, R., & Pereira, F. (2008, October). Weakly-supervised acquisition of labeled class instances using graph random walks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 582-590). Association for Computational Linguistics.

http://www.anthology.aclweb.org/D/D08/D08-1.pdf#page=612






Agenda: Text Clustering

1. Evaluation measures (slide deck: dskgt-eval.pdf)

2. Text Clustering: Introduction and discussion

3. Discussion of Graph Walk papers.

 

 

Mandatory Reading Assignments

Everyone must read (and summarize) these:

 

Chapter 14 in Text Data Management and Analysis

 

Make yourself familiar with scikit-learn's clustering package

 

Further reading: everyone must read (and summarize) one paper from the following list:

 

Navigli, Roberto, and Giuseppe Crisafulli. "Inducing word senses to improve web search result clustering." In Proceedings of the 2010 conference on empirical methods in natural language processing, pp. 116-126. Association for Computational Linguistics, 2010.

http://clair.eecs.umich.edu/aan/paper.php?paper_id=D10-1012#pdf

Rosenberg, Andrew, and Julia Hirschberg. "V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure." In EMNLP-CoNLL, vol. 7, pp. 410-420. 2007.

http://clair.eecs.umich.edu/aan/paper.php?paper_id=D07-1043#pdf

McCreadie, Richard, Craig Macdonald, Iadh Ounis, Miles Osborne, and Sasa Petrovic. "Scalable distributed event detection for twitter." In Big Data, 2013 IEEE International Conference on, pp. 543-549. IEEE, 2013.

http://eprints.gla.ac.uk/89118/7/89118.pdf

 

Haghighi, Aria, and Dan Klein. "Simple coreference resolution with rich syntactic and semantic features." In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3-Volume 3, pp. 1152-1161. Association for Computational Linguistics, 2009.

http://lexitron.nectec.or.th/public/ACL-IJCNLP-2009_Singapore/EMNLP/pdf/EMNLP120.pdf

 

McCallum, Andrew, Kamal Nigam, and Lyle H. Ungar. "Efficient clustering of high-dimensional data sets with application to reference matching." In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 169-178. ACM, 2000.

ftp://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/kdd/p169-mccallum.pdf

 

Baker, L. Douglas, and Andrew Kachites McCallum. "Distributional clustering of words for text classification." In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 96-103. ACM, 1998.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.591.6845&rep=rep1&type=pdf
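One of the papers in the list above (Rosenberg and Hirschberg) defines V-measure, an external cluster evaluation measure that combines homogeneity and completeness. The following is a rough sketch of that measure in plain Python, meant only to make the definition concrete while reading. The function names, the natural-log base, and the convention of returning 1.0 for a term whose entropy is zero are assumptions of this sketch.

```python
import math
from collections import Counter


def _entropy(counts, n):
    """Entropy of a labeling given the size of each group."""
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)


def _conditional_entropy(joint, given_counts, n):
    """H(X | Y) from joint counts {(x, y): count} and the group sizes of Y."""
    return -sum(c / n * math.log(c / given_counts[y]) for (_, y), c in joint.items())


def v_measure(gold, predicted, beta=1.0):
    """Weighted harmonic mean of homogeneity and completeness."""
    n = len(gold)
    classes = Counter(gold)            # gold class sizes
    clusters = Counter(predicted)      # predicted cluster sizes
    class_cluster = Counter(zip(gold, predicted))
    cluster_class = Counter(zip(predicted, gold))

    h_classes = _entropy(classes.values(), n)
    h_clusters = _entropy(clusters.values(), n)
    # Homogeneity: each cluster should contain members of a single class.
    homogeneity = 1.0 if h_classes == 0 else (
        1.0 - _conditional_entropy(class_cluster, clusters, n) / h_classes)
    # Completeness: each class should end up in a single cluster.
    completeness = 1.0 if h_clusters == 0 else (
        1.0 - _conditional_entropy(cluster_class, classes, n) / h_clusters)

    if homogeneity + completeness == 0:
        return 0.0
    return (1.0 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)


gold = ["a", "a", "b", "b"]
perfect = v_measure(gold, [1, 1, 0, 0])   # same grouping, different cluster ids
lumped = v_measure(gold, [0, 0, 0, 0])    # everything in one cluster
partial = v_measure(gold, [0, 0, 0, 1])   # one class split across clusters
```

Note that the measure depends only on how items are grouped, not on the cluster ids themselves, which is why it can be used to compare a clustering against gold classes.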

 

 

 

Background Reading (optional, introductory)

Berkhin, Pavel. "A survey of clustering data mining techniques." In Grouping multidimensional data, pp. 25-71. Springer Berlin Heidelberg, 2006.

https://pdfs.semanticscholar.org/26f1/78dbb00630ce19cccb9840ea12dbe31801be.pdf

 

Advanced Reading (Continue here if this was too easy)

Basu, Sugato, Mikhail Bilenko, and Raymond J. Mooney. "A probabilistic framework for semi-supervised clustering." In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 59-68. ACM, 2004.

http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/Kyle%20McGivney%20A%20probabilistic%20framework%20for%20semi-supervised%20clustering.pdf

Bekkerman, Ron, and Koby Crammer. "One-class clustering in the text domain." In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 41-50. Association for Computational Linguistics, 2008.

http://management.haifa.ac.il/images/info_people/ron_bekkerman_files/emnlp08.pdf




Agenda: Machine Learning methods and toolkits

  1. Scribe + Discussion presenters

  2. Next week: Prof Marinov (Talk Monday 5pm and discussion Tuesday morning - prepare questions)

  3. Code submission - Issues?

  4. Machine Learning & toolkits

    1. What is a joint probability distribution?

    2. Overfitting

    3. Cross validation

    4. Point estimation

    5. Curse of dimensionality

    6. Bias and Variance

    7. Starting simple and working your way up in model complexity

  5. Preparation for Code Submission 2
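To make the agenda items on overfitting and cross validation concrete, here is a small sketch in plain Python. It fits a 1-nearest-neighbor predictor, which memorizes its training data, and compares its error on the training data with its k-fold cross-validated error. The toy data, the function names, the choice of 1-nearest-neighbor, and the use of contiguous (unshuffled) folds are all assumptions made for illustration; in practice you would usually shuffle the data first or use a toolkit routine.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def nearest_neighbor_predict(train_x, train_y, x):
    """1-nearest-neighbor: return the target of the closest training point."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[best]


def mean_squared_error(xs, ys, train_x, train_y):
    """Error on (xs, ys) of a predictor that was fit on (train_x, train_y)."""
    errors = [(nearest_neighbor_predict(train_x, train_y, x) - y) ** 2
              for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


def cross_validated_error(xs, ys, k):
    """Average held-out error over k folds."""
    fold_errors = []
    for train, test in k_fold_indices(len(xs), k):
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        test_x = [xs[i] for i in test]
        test_y = [ys[i] for i in test]
        fold_errors.append(mean_squared_error(test_x, test_y, train_x, train_y))
    return sum(fold_errors) / len(fold_errors)


# Hypothetical toy data: a linear trend plus alternating +1 / -1 "noise".
xs = [float(i) for i in range(12)]
ys = [2.0 * x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(xs)]

training_error = mean_squared_error(xs, ys, xs, ys)   # evaluated on seen data
cv_error = cross_validated_error(xs, ys, k=4)          # evaluated on held-out data
```

The training error here looks perfect because every point is its own nearest neighbor, while the cross-validated error exposes how poorly the memorized noise generalizes. That gap is the symptom of overfitting that cross validation is meant to reveal.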

 

Topics:

- Mongo DB and data munging/massaging/wrangling, Hadoop & Map-reduce

- Question Answering (Agichtein)

- Intro NLP, Russell and Norvig Chapter 18/19 "Intro to NLP" (maybe with entity linking)

- Summarization (Barzilay & Sauper)

- Visualization .... of what exactly?

 

 

 

 

Mandatory Reading

Book: I. Goodfellow, Y. Bengio, A. Courville, "Deep Learning", MIT Press, 2016, ISBN 9780262035613

Chapter 5.1, 5.2, and 5.3

 

Familiarize yourself with scikit-learn

 

Further Reading / Videos

Martin Zinkevich. Rules of Machine Learning: Best Practices for ML Engineering (from Google). http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf

 

A Visual Introduction to Machine Learning. http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

 

Talk: Nathan Taggart. Machine Learning with Ponies (also uses Python). https://www.youtube.com/watch?v=xeAB10QgDW8



Agenda: Information retrieval methods and toolkits

  1. Scribe + Presenters

  2. Reading notes: Please include a detailed discussion of how it relates to the TREC CAR Prototype.

  3. Information Retrieval Paper discussion

  4. Prototype planning

 

Mandatory Reading

Book: "Text Data Management and Analysis", Zhai & Massung, 2016.

Chapter 6 through 6.3.1, and 6.4

 

PLUS: Two out of the following

 

Metzler, Donald, and W. Bruce Croft. "A Markov random field model for term dependencies." In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 472-479. ACM, 2005. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.1097&rep=rep1&type=pdf

Raiber, Fiana, and Oren Kurland. "Ranking document clusters using markov random fields." In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pp. 333-342. ACM, 2013. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.474.781&rep=rep1&type=pdf

Fang, Hui, Tao Tao, and ChengXiang Zhai. "A formal study of information retrieval heuristics." In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 49-56. ACM, 2004.

http://sifaka.cs.uiuc.edu/taotao/publications/sigir04.pdf

Lavrenko, Victor, and W. Bruce Croft. "Relevance based language models." In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 120-127. ACM, 2001. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.193.3687&rep=rep1&type=pdf

Toolkits

 

Familiarize yourself with one of the following toolkits:

- Lucene

- Terrier

- Galago (Secret Galago Docs)



Agenda: Entity Linking and Word Sense Disambiguation

  1. Scribe + Paper presenters

  2. Discussion Entity Linking

  3. Tools: TagMe + AIDA

  4. Implementation plan for next code submission

 

 

 

Mandatory Reading

Shen, Wei, Jianyong Wang, and Jiawei Han. "Entity linking with a knowledge base: Issues, techniques, and solutions." IEEE Transactions on Knowledge and Data Engineering 27, no. 2 (2015): 443-460.

http://www.gntsuntechnologies.com/Projects/2015_java_ieee/10.pdf

Ferragina, P. and Scaiella, U., 2010, October. Tagme: on-the-fly annotation of short text fragments (by wikipedia entities). In Proceedings of the 19th ACM international conference on Information and knowledge management (pp. 1625-1628). ACM.

http://www.di.unipi.it/~ferragin/cikm2010.pdf


Read two out of these

 

Ratinov, Lev, Dan Roth, Doug Downey, and Mike Anderson. "Local and global algorithms for disambiguation to wikipedia." In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 1375-1384. Association for Computational Linguistics, 2011.

http://cogcomp.cs.illinois.edu/papers/ChengRo13.pdf

Mihalcea, R. and Csomai, A., 2007, November. Wikify!: linking documents to encyclopedic knowledge. In Proceedings of the sixteenth ACM conference on information and knowledge management (pp. 233-242). ACM.

http://digital.library.unt.edu/ark:/67531/metadc31001/m2/1/high_res_d/Mihalcea-2007-Wikify-Linking_Documents_to_Encyclopedic.pdf

Hasibi, F., Balog, K. and Bratsberg, S.E., 2016, March. On the reproducibility of the TAGME entity linking system. In European Conference on Information Retrieval (pp. 436-449). Springer International Publishing.

Yaghoobzadeh, Y. and Schütze, H., 2015. Corpus-level fine-grained entity typing using contextual information. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 715–725).

https://www.aclweb.org/anthology/D/D15/D15-1083.pdf

 

Wang, H., Zheng, J., Ma, X., Fox, P. and Ji, H., 2015. Language and Domain Independent Entity Linking with Quantified Collective Validation. In EMNLP (pp. 695-704).

http://www.aclweb.org/website/old_anthology/D/D15/D15-1081.pdf

Liu, Xiaohua, Yitong Li, Haocheng Wu, Ming Zhou, Furu Wei, and Yi Lu. "Entity Linking for Tweets." In ACL (1), pp. 1304-1311. 2013.

http://www.aclweb.org/old_anthology/P/P13/P13-1128.pdf

Background Reading

 

Edgar Meij, Krisztian Balog and Dann Odijk. 2014. Entity Linking and Retrieval. Tutorial at WSDM2014, SIGIR2013, YSS2013 and WWW2013.

http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/

 

Roth, Dan, Heng Ji, Ming-Wei Chang, and Taylor Cassidy. "Wikification and Beyond: The Challenges of Entity and Concept Grounding." In ACL (Tutorial Abstracts), p. 7. 2014.

https://pdfs.semanticscholar.org/bedb/08faf3336a9e931f3ed6a36fc4a86abb535c.pdf

 

 

Toolkits

Familiarize yourself with the TagMe and/or AIDA entity linkers.

 



Agenda: Information retrieval with entity links

  1. Scribe + Presenters

  2. Order of next topics?

  3. Paper Discussion

  4. How does this relate to TREC CAR?

 

Mandatory Reading

Dalton, Jeffrey, Laura Dietz, and James Allan. "Entity query feature expansion using knowledge base links." In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pp. 365-374. ACM, 2014.

https://www.researchgate.net/profile/Laura_Dietz2/publication/266658685_Entity_query_feature_expansion_using_knowledge_base_links/links/54413ad60cf2a6a049a56765.pdf

Plus Two of the Following

Brandão, Wladmir C., Rodrygo LT Santos, Nivio Ziviani, Edleno S. Moura, and Altigran S. Silva. "Learning to expand queries using entities." Journal of the Association for Information Science and Technology 65, no. 9 (2014): 1870-1883.

http://www.academia.edu/download/42582297/Learning_to_Expand_Queries_Using_Entitie20160211-13077-1gnypcp.pdf

Blanco, Roi, Giuseppe Ottaviano, and Edgar Meij. "Fast and space-efficient entity linking for queries." In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pp. 179-188. ACM, 2015.

https://pdfs.semanticscholar.org/51b5/cecc3881f1608ce53c4229682f55e1787fa6.pdf

Liu, Xitong, and Hui Fang. "Latent entity space: a novel retrieval approach for entity-bearing queries." Information Retrieval Journal 18, no. 6 (2015): 473-503.

http://xtliu.com/pub/inrj15-les.pdf

Raviv, Hadas, David Carmel, and Oren Kurland. "A ranking framework for entity oriented search using Markov random fields." In Proceedings of the 1st Joint International Workshop on Entity-Oriented and Semantic Search, p. 1. ACM, 2012.

http://sme.technion.ac.il/~kurland/entityMRF.pdf

Hasibi, Faegheh, Krisztian Balog, and Svein Erik Bratsberg. "Entity linking in queries: Tasks and evaluation." In Proceedings of the 2015 International Conference on The Theory of Information Retrieval, pp. 171-180. ACM, 2015.

http://krisztianbalog.com/files/ictir2015-erd.pdf


Zhiltsov, Nikita, Alexander Kotov, and Fedor Nikolaev. "Fielded sequential dependence model for ad-hoc entity retrieval in the web of data." In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 253-262. ACM, 2015.

http://ai2-s2-pdfs.s3.amazonaws.com/858d/800c51a29f67b94f369b8ee79668741ca8cc.pdf

Further Reading

Dietz, Laura, Alexander Kotov, and Edgar Meij. "Utilizing Knowledge Graphs in Text-centric Information Retrieval." In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 815-816. ACM, 2017.

https://www.researchgate.net/profile/Alexander_Kotov3/publication/313264939_Utilizing_Knowledge_Graphs_in_Text-centricInformation_Retrieval/links/5899ac5e4585158bf6f85637/Utilizing-Knowledge-Graphs-in-Text-centricInformation-Retrieval.pdf?origin=publication_detail&ev=pub_int_prw_xdl&msrp=Ijt7dnPBiAK7kfsXhwhaoJHcfIhBe7qGAkb-gjlmv6HLvfoOFy_xwOLM_oWPpWG-eLgTCBhenMj81wDCEh-948Nz33HDAxszNwoRJOE5Z4Q.t3goFySwyKFyDl6Vba7XjFkftLeEIOUaIteodxrmXQMYP_u5gy5hTcAenGa4DnaQHIKczcqEbz5UyUsSqg7DXA.I_MwldH0GlqAV9F9m0CAg1RDdDV9Zm5dCMJHhBI1qgFXl36AJFWgkS6xdUCfqF0y9VViykinis_doQClWXQJqg (Links to an external site.)

Slides are available online: http://github.com/laura-dietz/tutorial-utilizing-kg



Agenda: Graph Clustering

  1. Scribe

  2. Late homework submissions

  3. Paper discussion

  4. Prepare for Thursday: First batch of evaluation results.

 

 

 

Read two papers of your choice.

You can choose between a very long but complete and easy-to-follow introductory read and graph clustering works from the database, NLP, and machine learning communities.

 

 

Introductory Reading

Schaeffer, Satu Elisa. "Graph clustering." Computer science review 1, no. 1 (2007): 27-64.

http://leonidzhukov.net/hse/2015/socialnetworks/papers/GraphClustering_Schaeffer07.pdf

 

Intermediate Reading

Zhou, Yang, Hong Cheng, and Jeffrey Xu Yu. "Graph clustering based on structural/attribute similarities." Proceedings of the VLDB Endowment 2, no. 1 (2009): 718-729.
http://www1.se.cuhk.edu.hk/~hcheng/summer2010/paper/vldb09-175.pdf

 

 

Biemann, Chris. "Chinese whispers: an efficient graph clustering algorithm and its application to natural language processing problems." In Proceedings of the first workshop on graph based methods for natural language processing, pp. 73-80. Association for Computational Linguistics, 2006.
https://www.lt.informatik.tu-darmstadt.de/fileadmin/user_upload/Group_LangTech/publications/pre-langtech/Biemann_CW_TextGraph06.pdf

 

 

Flake, Gary William, Robert E. Tarjan, and Kostas Tsioutsiouliklis. "Graph clustering and minimum cut trees." Internet Mathematics 1, no. 4 (2004): 385-408.

http://projecteuclid.org/download/pdf_1/euclid.im/1109191029

 

 

Advanced Reading

Kulis, Brian, Sugato Basu, Inderjit Dhillon, and Raymond Mooney. "Semi-supervised graph clustering: a kernel approach." Machine learning 74, no. 1 (2009): 1-22.

http://machinelearning.wustl.edu/mlpapers/paper_files/icml2005_KulisBDM05.pdf

 



Agenda: Relation Extraction

  1. Scribe

  2. Prototype 1 submission + Plan for Prototype 2 (submit by Wednesday)

  3. Discussion: Relation Extraction

 

Reading

 

 

Please read:

1x Schema-based Relation Extraction

1x Open Relation Extraction

2x additional papers of your choice.

 

 

General

 

Bach, Nguyen, and Sameer Badaskar. "A review of relation extraction." Literature review for Language and Statistics II (2007). http://orb.essex.ac.uk/CE/CE807/Readings/A-survey-on-Relation-Extraction.pdf

Pantel, Patrick, and Marco Pennacchiotti. "Espresso: Leveraging generic patterns for automatically harvesting semantic relations." In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pp. 113-120. Association for Computational Linguistics, 2006. http://www.anthology.aclweb.org/P/P06/P06-1.pdf#page=153

Lin, D. and Pantel, P., 2001. Discovery of inference rules for question-answering. Natural Language Engineering, 7(04), pp.343-360. http://courses.cs.washington.edu/courses/cse573/08au/papers/pantel.pdf

Schema-based Relation Extraction


Mintz, Mike, Steven Bills, Rion Snow, and Dan Jurafsky. "Distant supervision for relation extraction without labeled data." In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pp. 1003-1011. Association for Computational Linguistics, 2009. https://www.aclweb.org/anthology/P/P09/P09-1113.pdf

Riedel, Sebastian, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. "Relation extraction with matrix factorization and universal schemas." (2013). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.378.5619&rep=rep1&type=pdf#page=112

Open Relation Extraction

Etzioni, Oren, Michele Banko, Stephen Soderland, and Daniel S. Weld. "Open information extraction from the web." Communications of the ACM 51, no. 12 (2008): 68-74. http://www.cs.washington.edu/research/projects/aiweb/media/papers/tmpcLeDnr.pdf

Del Corro, Luciano, and Rainer Gemulla. "Clausie: clause-based open information extraction." In Proceedings of the 22nd international conference on World Wide Web, pp. 355-366. ACM, 2013. http://www2013.wwwconference.org/proceedings/p355.pdf



Agenda: MongoDB / Map Reduce / Data Wrangling

  1. Scribe

  2. Discussion: How to use Data Wrangling techniques to solve TREC CAR?

  3. Paper presentation by Bahram

  4. Discussion Prototype 2 Implementation Plans.

 

Mandatory Reading

You can keep your reading notes brief.

 

 

 



Agenda: Information Retrieval with Relations

  1. Scribe

  2. Group discussion: what in these papers can be used in TREC CAR?

  3. Paper Presentation. Presenter: Colin

  4. Questions regarding code submission Prototype 2

 

Upcoming events:

 

 

 

 

 Mandatory Reading

 

Voskarides, Nikos, Edgar Meij, Manos Tsagkias, Maarten de Rijke, and Wouter Weerkamp. "Learning to Explain Entity Relationships in Knowledge Graphs." In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pp. 564–574, Beijing, China, July 26-31, 2015.

http://anthology.aclweb.org/P/P15/P15-1055.pdf

Schuhmacher, Michael, Benjamin Roth, Simone Paolo Ponzetto, and Laura Dietz. "Finding relevant relations in relevant documents." In European Conference on Information Retrieval, pp. 654-660. Springer International Publishing, 2016.

https://ub-madoc.bib.uni-mannheim.de/41295/1/schuhmacher16a.pdf

Reinanda, Ridho, Edgar Meij, and Maarten de Rijke. "Mining, ranking and recommending entity aspects." In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 263-272. ACM, 2015. https://pdfs.semanticscholar.org/e667/c31119b6e56ea73cfeda8752bc5031025fd2.pdf



Agenda: Topic Models

Mandatory Reading

Blei, David M. "Probabilistic topic models." Communications of the ACM 55, no. 4 (2012): 77-84. http://www.cs.princeton.edu/~blei/papers/Blei2011.pdf

Kataria, Saurabh S., Krishnan S. Kumar, Rajeev R. Rastogi, Prithviraj Sen, and Srinivasan H. Sengamedu. "Entity disambiguation with hierarchical topic models." In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1037-1045. ACM, 2011.

https://pdfs.semanticscholar.org/4824/837f551235398dd8300984cb29629aaa3c90.pdf

Chang, Jonathan, Jordan L. Boyd-Graber, Sean Gerrish, Chong Wang, and David M. Blei. "Reading tea leaves: How humans interpret topic models." In NIPS, vol. 31, pp. 1-9. 2009. https://papers.nips.cc/paper/3700-reading-tea-leaves-how-humans-interpret-topic-models.pdf

Background info on Gibbs Sampling for Topic models

Heinrich, Gregor. "Parameter estimation for text analysis." University of Leipzig, Tech. Rep (2008). http://faculty.cs.byu.edu/~ringger/CS601R/papers/Heinrich-GibbsLDA.pdf

Introductory Reading

Chapter 17 of book "Text Data Management and Analysis", Zhai & Massung, 2016.

Also see appendix A of the same book.



Agenda: Network Topic Models

Read Two

 

Li, Wei, and Andrew McCallum. "Pachinko allocation: DAG-structured mixture models of topic correlations." In Proceedings of the 23rd international conference on Machine learning, pp. 577-584. ACM, 2006. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.86.8142&rep=rep1&type=pdf

Dietz, Laura, Steffen Bickel, and Tobias Scheffer. "Unsupervised prediction of citation influences." In Proceedings of the 24th international conference on Machine learning, pp. 233-240. ACM, 2007. http://machinelearning.wustl.edu/mlpapers/paper_files/icml2007_DietzBS07.pdf

Dietz, Laura, Ben Gamari, John Guiver, Edward Snelson, and Ralf Herbrich. "De-Layering Social Networks by Shared Tastes of Friendships." In ICWSM. 2012. http://ciir.cs.umass.edu/~dietz/delayer/dietz-cameraready.pdf

Rosen-Zvi, Michal, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. "The author-topic model for authors and documents." In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pp. 487-494. AUAI Press, 2004. https://arxiv.org/pdf/1207.4169

Newman, David, Chaitanya Chemudugunta, and Padhraic Smyth. "Statistical entity-topic models." In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 680-686. ACM, 2006. http://datalab.ics.uci.edu/papers/rtpp331_newman.pdf

Balasubramanyan, Ramnath, and William W. Cohen. "Block-LDA: Jointly modeling entity-annotated text and entity-entity links." In Proceedings of the 2011 SIAM International Conference on Data Mining, pp. 450-461. Society for Industrial and Applied Mathematics, 2011. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.673.2347&rep=rep1&type=pdf

Chang, Jonathan, Jordan Boyd-Graber, and David M. Blei. "Connections between the lines: augmenting social networks with text." In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 169-178. ACM, 2009. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.684.3500&rep=rep1&type=pdf

Introductory Reading

Chapter 19 of book "Text Data Management and Analysis", Zhai & Massung, 2016.

 




 

Agenda: Intro to NLP



Presenter: Reazul

 

I will be joining the discussion online.

 

Mandatory Reading

Chapter 3 in the book of Zhai and Massung. "Text Data Management and Analysis".

Bird, S., 2006, July. NLTK: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive presentation sessions (pp. 69-72). Association for Computational Linguistics. http://www.aclweb.org/anthology/P06-4#page=79

Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. "The Stanford CoreNLP natural language processing toolkit." In ACL (System Demonstrations), pp. 55-60. 2014. http://www.aclweb.org/website/old_anthology/P/P14/P14-5.pdf#page=67

Toolkits



 






Agenda: Crowd Sourcing Training Data



Mandatory Reading

Alonso, Omar, Daniel E. Rose, and Benjamin Stewart. "Crowdsourcing for relevance evaluation." In ACM SIGIR Forum, vol. 42, no. 2, pp. 9-15. ACM, 2008. http://www.cs.northwestern.edu/~pardo/courses/mmml/papers/collaborative_filtering/crowdsourcing_for_relevance_evaluation_SIGIR08.pdf

 

Kazai, Gabriella, and Natasa Milic-Frayling. "On the evaluation of the quality of relevance assessments collected through crowdsourcing." In SIGIR 2009 Workshop on the Future of IR Evaluation, p. 21. 2009. https://pdfs.semanticscholar.org/d631/31633e630d7d14d3d18d6ad0caf456c86cf7.pdf

Azzopardi, Leif, Maarten De Rijke, and Krisztian Balog. "Building simulated queries for known-item topics: an analysis using six european languages." In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 455-462. ACM, 2007. http://eprints.gla.ac.uk/3864/1/azzopardi3864.pdf

Savenkov, Denis, Scott Weitzner, and Eugene Agichtein. "Crowdsourcing for (almost) real-time question answering." In Workshop on Human-Computer Question Answering, NAACL. 2016. http://www.aclweb.org/anthology/W/W16/W16-0102.pdf

Toolkits

Amazon Mechanical Turk https://www.mturk.com/mturk/welcome

Crowdflower https://www.crowdflower.com/



Agenda: Summarization



Mandatory Reading

Nenkova, Ani, and Kathleen McKeown. "A survey of text summarization techniques." In Mining text data, pp. 43-76. Springer US, 2012. https://pdfs.semanticscholar.org/8d7f/6dc8b0b9101580cc96f1f303d1eba3d590af.pdf

Blanco, R. and Lioma, C., 2012. Graph-based term weighting for information retrieval. Information retrieval, 15(1), pp.54-92. http://www.diku.dk/~c.lioma/publications/irj2012.pdf

Optional Reading

Ouyang, You, Wenjie Li, Sujian Li, and Qin Lu. "Applying regression models to query-focused multi-document summarization." Information Processing & Management 47, no. 2 (2011): 227-237. https://www.researchgate.net/profile/Qin_Lu3/publication/220229610_Applying_regression_models_to_query-focused_multi-document_summarization/links/00b7d52f33e9ceb4f8000000.pdf

 

 

Introductory Reading

Chapter 16 of book "Text Data Management and Analysis", Zhai & Massung, 2016.

 

 

Bryan, feel free to suggest another paper.



Agenda: Question Answering

 

Mandatory Reading

Allam, Ali Mohamed Nabil, and Mohamed Hassan Haggag. "The question answering systems: A survey." International Journal of Research and Reviews in Information Sciences (IJRRIS) 2, no. 3 (2012). http://www.aliallam.net/upload/598575/documents/ECFF549932079694.pdf

 

Plus One of These

Savenkov, Denis, and Eugene Agichtein. "When a knowledge base is not enough: Question answering over knowledge bases with external text data." In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 235-244. ACM, 2016. https://pdfs.semanticscholar.org/ee8e/a5af5cb957a912331c3fb0fd6f169ad79630.pdf

Gondek, D. C., Adam Lally, Aditya Kalyanpur, J. William Murdock, Pablo Ariel Duboué, Lei Zhang, Yue Pan, Z. M. Qiu, and Chris Welty. "A framework for merging and ranking of answers in DeepQA." IBM Journal of Research and Development 56, no. 3.4 (2012): 14-1. https://pdfs.semanticscholar.org/c094/4b6759e2e1a4026ef43936ee00c0ddb3d79a.pdf

 

Fan, James, Aditya Kalyanpur, David C. Gondek, and David A. Ferrucci. "Automatic knowledge extraction from documents." IBM Journal of Research and Development 56, no. 3.4 (2012): 5-1. http://brenocon.com/watson_special_issue/05%20automatic%20knowledge%20extration.pdf

Oh, J.H., Torisawa, K., Hashimoto, C., Iida, R., Tanaka, M. and Kloetzer, J., 2016, February. A semi-supervised learning approach to why-question answering. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (pp. 3022-3029). AAAI Press. http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/download/12208/12056

Tsur, Gilad, Yuval Pinter, Idan Szpektor, and David Carmel. "Identifying web queries with question intent." In Proceedings of the 25th International Conference on World Wide Web, pp. 783-793. International World Wide Web Conferences Steering Committee, 2016. http://www.cc.gatech.edu/~ypinter3/papers/2016_prefex-www-proc.pdf