WEB SEARCH and TEXT MINING
Spring 2011 undergraduate and graduate course
CSE 6240 (30251)

T-Square: http://t-square.gatech.edu.

Basic and advanced methods for Web information retrieval and text mining: indexing and crawling, IR models, link and click data, social search, text classification and clustering.

Instructor: Alexander Gray, College of Computing, Georgia Tech
TA: Hong Yu, hyu8 at gatech dot edu
Instructor office hours: Immediately after each lecture.
TA office hours: TBA

Where and when
MWF 10:05am - 10:55am, 215 Instructional Center. First class: Wednesday 1/19/11.

Book
Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze ("IIR"). It is available for FREE online here.

Syllabus
Date Topic Reading Assignments
Mon 1/17 Holiday (MLK)
Wed 1/19 Course overview: topics and syllabus; course logistics Lecture 0 Honor Policy
Fri 1/21 Boolean retrieval IIR Chap 1, Lecture 1
Mon 1/24 The term vocabulary and postings lists IIR Chap 2, Lecture 2
Wed 1/26 Dictionaries and tolerant retrieval IIR Chap 3, Lecture 3
Fri 1/28 Project presentations: 1. Leber: Skip Lists: A Probabilistic Alternative to Balanced Trees. 2. Kaufmann: Suggesting Friends Using the Implicit Social Graph. 3. Labbate: Context-Aware Recommender Systems.
Mon 1/31 Index construction IIR Chap 4, Lecture 4
Wed 2/2 Index compression IIR Chap 5, Lecture 5
Mon 2/7 Scoring, term weighting, and the vector space model IIR Chap 6, Lecture 6
Wed 2/9 Computing scores in a complete search system IIR Chap 7, Lecture 7
Fri 2/11 Project presentations: 1. Lee: Parallelized Boosting with Map-Reduce. 2. Ghosal: Modeling Anchor Text and Classifying Queries to Enhance Web Document Retrieval. 3. Joshi: A Regression Framework for Learning Ranking Functions Using Relative Relevance Judgments.
Mon 2/14 Evaluation in information retrieval IIR Chap 8, Lecture 8
Wed 2/16 Relevance feedback and query expansion IIR Chap 9, Lecture 9
Fri 2/18 Project presentations: 1. Spidlen: Map-Reduce: Simplified Data Processing on Large Clusters. 2. Chitnis: Authoritative Sources in a Hyperlinked Environment. 3. Bhagya: Spelling Correction as an Iterative Process that Exploits the Collective Knowledge of Web Users.
Fri 2/25 Project presentations: 1. Rangadurai and Ramesh: Map-Reduce for Machine Learning on Multicore. 2. Aggarwal: Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search. 3. Ramar: A Regression Framework for Learning Ranking Functions Using Relative Relevance Judgments.
Mon 2/28 XML retrieval IIR Chap 10, Lecture 10
Wed 3/2 Probabilistic information retrieval IIR Chap 11, Lecture 11
Wed 3/9 The Watson question-answering system AI Magazine article
Fri 3/11 Project presentations: 1. Perry: An Exploration of Proximity Measures in Information Retrieval. 2. Weng: Exploring Web Scale Language Models for Search Query Processing.
Mon 3/14 Language models for information retrieval IIR Chap 12, Lecture 12
Wed 3/16 Text classification and naive Bayes IIR Chap 13, Lecture 13 1.1, 1.9, 2.14, 3.3, 3.8 (due 3/30 5pm)
Fri 3/18 Project presentations: 1. Raman: Why Inverse Document Frequency? 2. Ramanujam: Exploring Reductions for Long Web Queries. 3. Kumar and Madhusudana: Towards Recency Ranking in Web Search and Improving Recency Ranking Using Twitter Data.
Mon 3/28 Vector space classification IIR Chap 14, Lecture 14
Wed 3/30 Support vector machines and learning to rank IIR Chap 15, Lecture 15 4.3, 4.6, 5.5, 5.10, 6.18, 6.19 (due 4/6 5pm)
Fri 4/1 Project presentations: 1. Koenig: How Good is a Span of Terms? Exploiting Proximity to Improve Web Retrieval. 2. Kumar and Achar: Answering Questions with an N-gram based Passage Retrieval Engine. 3. Hajarnis and Medhekar: Semi-supervised Document Classification with a Mislabeling Error Model.
Mon 4/4 Flat clustering IIR Chap 16, Lecture 16
Wed 4/6 Hierarchical clustering IIR Chap 17, Lecture 17 7.6, 8.9, 9.5, 9.6 (due 4/13 5pm)
Fri 4/8 Project presentations: 1. Stephens: Twitter Mood Predicts the Stock Market. 2. Lee and Kim: A Comparison of Event Models for Naive Bayes Text Classification. 3. Feng: Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression.
Mon 4/11 Matrix decompositions and latent semantic indexing IIR Chap 18, Lecture 18
Wed 4/13 Web search basics IIR Chap 19, Lecture 19 10.2, 11.2, 12.9 (due 4/20 5pm)
Fri 4/15 Project presentations: 1. Som: Diversifying Search Results. 2. Asokan: Optimizing Search Engines using Clickthrough Data.
Mon 4/18 Web crawling and indexes IIR Chap 20, Lecture 20
Wed 4/20 Link analysis IIR Chap 21, Lecture 21 13.4, 13.10, 14.6, 15.2 (due 4/27 5pm)
Fri 4/22 Project presentations: 1. Galvin: Unsupervised Query Segmentation Using Generative Language Models and Wikipedia. 2. Subbanarasimhia: Jigsaw: Supporting Investigative Analysis Through Interactive Visualization.
Mon 4/25 Recommender Systems: The Art and Science of Matching Items to Users, Guest speaker: Deepak Agarwal, Yahoo! Research Take-home final: 16.3, 17.2, 18.2, 18.4, 19.1, 19.10, 20.1, 21.10, 21.12 (due Wed 5/4 5pm)
Fri 4/29 Project presentations: 1. Tyler: Mining Generalized Association Rules on Biomedical Literature. 2. Ravu: TagiCoFi: Tag Informed Collaborative Filtering.

What's this course about?
We all have experience using search engines: typing a query into the search box and browsing the results to narrow them down to the documents we need. In fact, Web search is one of the most popular online activities, second only to email use, and it drives the business of some of the most successful Web companies such as Google and Yahoo! Commercial search engines are complicated engineering systems, and one purpose of this course is to take you behind the scenes to explore the scientific and algorithmic advances that enable them. Beyond Web search, we will also explore other text mining methods, such as document clustering and classification, information extraction, and other ways to process and analyze free-text data, along with their use in various scientific and business applications. We will start with very basic notions in information retrieval at a fairly slow pace, explain some of the fundamental algorithms, and eventually touch upon the research frontiers.

How is the course graded?
1. Assignments: 50%. There will be weekly assignments consisting of selected exercises from the book. Each week's assignment is due the following Monday before 5pm.

2. Project: 50%. Choose a paper from the recent literature or from a provided list, understand it, and if possible, show experiments with it on some text data. Then give a presentation explaining the paper, showing any experiments, and discussing any ideas for extensions.

Should you take this course?
What is the intended audience of this course? Anyone who wants to understand the computer science (algorithms, data structures, and machine learning) behind commercial web search engines and text mining applications.

How hard is this course? This course is open to both undergraduate and graduate students, so its difficulty is pitched somewhere between the two levels.

What background is needed for this course? There are no formal prerequisites for this course. The prerequisites for doing well in this course are fairly modest: basic exposure to computing and algorithms, and basic knowledge of calculus, linear algebra and probability. Machine learning background is not needed, as we will cover the relevant machine learning topics. The course will also have some programming assignments, and your project will likely require programming.

Class policies
Please let the instructor know as soon as possible if you have any special needs during the semester, and appropriate arrangements will be made. Each student must read and abide by the Georgia Tech Academic Honor Code. See the first assignment for the policy in this course on plagiarism, which will not be tolerated.