Web Search and Text Mining
CS 8803 WST: Web Search and Text Mining
Fall, 2007
Lecture: 3 hours, TR 3:05pm - 4:25pm, Howey (Physics) L3
Office hours: T 1:30pm - 2:30pm.
Instructor:
Hongyuan Zha, Office: 1314 KACB, Phone: 404-385-1491
Course Description
We all have experience using search engines:
typing a query into the search box and browsing
the results to narrow them down to the documents we need.
In fact, Web search is one of the most popular online activities,
second only to email use, and it drives the business of
some of the most successful Web companies such as
Google and Yahoo! Commercial search engines are
complicated engineering systems, and one of the
purposes of this course is to take you behind the
scenes to explore the enabling scientific and
algorithmic advances. Besides issues in Web search,
we will also explore other text mining methods
such as document clustering and classification,
information extraction, and many other ways to
process and analyze free-text data and their use
in bioscience and health information technology.
We will start with very basic notions in information
retrieval at a fairly slow pace, explain some of the
fundamental algorithms, and eventually touch upon
the research frontiers. The prerequisites for this
course are modest: basic exposure to computing
and algorithms, and basic knowledge of calculus,
linear algebra, and statistics. The course will
also have some programming assignments.
List of Topics
- Introductory IR: inverted indices, query processing, tf-idf weighting, scoring, precision and recall, DCG
- IR models: vector space models, probabilistic models, and language modeling methods
- Machine learning methods for designing ranking functions
- Link analysis: PageRank and HITS algorithms
- Implicit relevance feedback using user click data
- Social search, community discovery, exploring user tags
- Search results summarization and clustering
- Query rewriting, spelling correction
- Question answering
- Document classification and clustering, applications in emergency department (ED) chief complaints and bio-surveillance
- Information extraction, applications in electronic medical records management
- Spam filtering methods
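
As a rough illustration of the evaluation measures named in the first topic (precision, recall, and DCG), here is a minimal Python sketch. It is not course material: the function names `precision_recall` and `dcg`, the sample document ids, and the relevance grades are all hypothetical, and the DCG variant shown (linear gain with a log2 rank discount) is only one of several formulations in use.

```python
import math


def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query.

    retrieved: document ids returned by the system (hypothetical ids).
    relevant:  document ids judged relevant for the query.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def dcg(relevances, k=None):
    """Discounted cumulative gain of a ranked list of graded relevance labels.

    Assumes linear gain: the label at 1-based rank i contributes
    rel / log2(i + 1). Other variants use 2**rel - 1 as the gain.
    """
    if k is not None:
        relevances = relevances[:k]
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))


if __name__ == "__main__":
    p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
    print(f"precision={p:.2f} recall={r:.2f}")
    print(f"DCG={dcg([3, 2, 0, 1]):.3f}")
```

Precision and recall treat the result list as an unordered set, while DCG rewards placing highly relevant documents near the top of the ranking.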
Class Policies
- Please let me know as soon as possible if you have any special needs during the semester.
- Each student must read and abide by the Georgia Tech Academic Honor Code.
Grading
- Homework and Projects: 50%
- Participation and presentation: 50%