Introduction to Data Mining and Analysis
Computational techniques for the analysis of large, complex datasets, covering fundamental concepts as well as modern methods of data mining and analysis.

Fall 2010 undergraduate course
CS 4245 (89658); cross-listed as ISYE 4245

Instructor: Alexander Gray, College of Computing, Georgia Tech
Grader: Sujit Garikipati


Syllabus (tentative)
Date | Topic | Subtopics | Reading | Slides | Assignments
Tue 8/24 | Data Mining and Analysis Overview I | Examples of ML; tasks of ML; course logistics | Chapter 1 | Lecture 1
Thu 8/26 | Data Mining and Analysis Overview II | Parametric vs. nonparametric; parts of ML; generalization and over/under-fitting; cross-validation | Chapter 2, 7.10, 13.3 | Lecture 2
Tue 8/31 | Basic Concepts of Probability and Statistics I | Distributions; manipulating probabilities; statistics | none (lecture only) | Lecture 3 | HW1 out
Thu 9/2 | Basic Concepts of Probability and Statistics II | Parametric density estimation; maximum likelihood; generative classification | none (lecture only) | Lecture 4
Tue 9/7 | Basic Concepts of Probability and Statistics III | Density estimation task; mixture of Gaussians; optimization; EM algorithm | Sections 6.8, 8.5, 10.10 intro, 10.10.1, 12.2.3 | Lecture 5 | HW1 due; HW2 out
Thu 9/9 | Basic Concepts of Probability and Statistics IV | Nonparametric estimation; histogram; kernel density estimation; bias-variance tradeoff | Sections 2.9, 6.1, 6.2 | Lecture 6
Tue 9/14 | Supervised Learning I | Kernel discriminant analysis; kernel regression; temporalization | Sections 6.3, 6.6, 6.8 | Lecture 7 | HW2 due; HW3 out
Thu 9/16 | Supervised Learning II | Linear regression; ridge regression and lasso; regularization; neural networks | Sections 3.1, 3.4, 11.1-11.8, 11.9, 4.5 | Lecture 8
Tue 9/21 | Supervised Learning III | Logistic regression; linear support vector machine | Sections 4.1, 4.2, 4.3, 4.4, 12.1, 12.2 | Lecture 9 | HW3 due; take-home midterm out
Thu 9/23 | Supervised Learning IV | Kernel trick; kernels; kernelized support vector machine | Sections 12.3.1-12.3.4, 5.8 | Lecture 10
Tue 9/28 | Supervised Learning V | Nearest-neighbor; decision trees | Sections 9.1, 9.2 | Lecture 11 | Take-home midterm due; HW4 out
Thu 9/30 | Above Learning I | Bootstrap; bagging | Sections 8.2, 8.7 | Lecture 12
Tue 10/5 | Above Learning II | Random forests; stacking; boosting | Sections 15.1-15.4, 8.8, 10.1-10.6, 10.11 | Lecture 13 | HW4 due; HW5 out
Thu 10/7 | Above Learning III | Feature selection; cross-validation with feature selection | Sections 3.3, 3.6, 10.13, 7.10 | Lecture 14
Tue 10/12 | Unsupervised Learning I | Clustering; k-means; how to choose k | Sections 14.1, 14.3.1-14.3.8, 14.3.10-14.3.11 | Lecture 15 | HW5 due; HW6 out
Thu 10/14 | Unsupervised Learning II | Constrained clustering; hierarchical clustering; mean-shift; biclustering | Section 14.3.12 | Lecture 16
Tue 10/19 | Fall recess
Thu 10/21 | Unsupervised Learning III | Association rules | Section 14.2 | Lecture 17 | HW6 due; HW7 out
Tue 10/26 | Unsupervised Learning IV | Dimension reduction; principal component analysis | Section 14.5.1 | Lecture 18
Thu 10/28 | Unsupervised Learning V | Independent component analysis; multidimensional scaling; manifold learning | Sections 14.7-14.9 | Lecture 19 | HW7 due; HW8 out
Tue 11/2 | Practical Issues and Validation I | Asymptotic distributions; statistical inequalities; confidence bands | none (lecture only) | Lecture 20
Thu 11/4 | Practical Issues and Validation II | Computation: fast sums and searches; multidimensional trees | none (lecture only) | Lecture 21 | HW8 due; HW9 out
Tue 11/9 | Practical Issues and Validation III | Computation: unconstrained optimization; constrained optimization | none (lecture only) | Lecture 22
Tue 11/16 | Practical Issues and Validation IV | Comparing learners; hypothesis testing | none (lecture only) | Lecture 23 | HW9 due; HW10 out
Tue 11/23 | Practical Issues and Validation V | Data issues: types of data (structured, non-vector, compressed); outliers and robustness; corrupted, noisy, expensive, and heterogeneous data | none (lecture only) | Lecture 24 | HW10 due
Thu 11/25 | Holiday
Tue 11/30 | Practical Issues and Validation VI | Visualizing and presenting data | none (lecture only) | Lecture 25
Thu 12/2 | Practical Issues and Validation VII | The entire data analysis process; styles of methods; when to use which methods; things I didn't teach you | none (lecture only) | Lecture 26
Tue 12/7 | Review Session
Thu 12/16 | Final Exam | 2:50-5:40pm
What's this course about?
Data is now everywhere, and advanced data analysis methods (variously called "machine learning", "data mining", and "pattern recognition") are in use everywhere as well. This course provides an undergraduate-level introduction to the techniques underlying important modern applications: the algorithms that determine which ads Google shows you when you search, the movie recommendation system used by Netflix, the automatic ZIP code recognition behind 97% of all mail delivery, the automatic speech recognition you encounter when you call a company's help line, the stock market prediction methods used in automatic trading, next-generation medical diagnostics that can detect cancer before it is too late to treat, and the methodology astrophysicists use to uncover the origins and nature of the universe. The course will be peppered with these and other examples, some drawn from the instructor's own research.

Should you take this course?
Yes, probably! The intended audience for this course is anyone who would like to be able to competently apply data analysis methods to real-world problems, which in my opinion requires a minimal but rigorous mathematical understanding of the underpinnings of the methods. These mathematical basics will also serve as a good foundation for more advanced courses in this area.

Where and when
TuTh 4:35pm-5:55pm, Klaus 2447. First class: Tuesday 8/24/10. My office hours: grab me right after a lecture.

Book
The Elements of Statistical Learning, 2nd edition by Hastie, Tibshirani, and Friedman ("ESLII") -- it is FREE ONLINE! (http://www-stat.stanford.edu/~tibs/ElemStatLearn) However, I recommend buying a physical copy (from one of the campus bookstores or online), as it will be a great long-term resource for you.

How is the course graded?
(tentative) Assignments (every two weeks): 60%. Midterm: 20%. Final: 20%. The midterm will occur roughly halfway through the term. The final will cover only the material after the midterm. The assignments will consist of exercises from the book as well as performing computational experiments (running programs on data).
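
To give a purely illustrative sense of what such a computational experiment might look like, here is a minimal sketch in Python (the assignments let you use any language), in the spirit of the over/under-fitting material from the first lectures. The data, models, and error measure below are my own illustrative choices, not an actual assignment:

    # Illustrative sketch only, not an actual assignment: fit polynomials of
    # increasing degree to noisy synthetic data and compare training error
    # (which keeps falling) with held-out test error (which eventually
    # rises, i.e., over-fitting).
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 1, 60))
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 60)   # noisy sine curve

    x_tr, y_tr = x[::2], y[::2]                          # training half
    x_te, y_te = x[1::2], y[1::2]                        # held-out half

    for degree in (1, 3, 9, 15):
        coeffs = np.polyfit(x_tr, y_tr, degree)          # least-squares fit
        mse_tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
        mse_te = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
        print("degree %2d: train MSE %.3f, test MSE %.3f"
              % (degree, mse_tr, mse_te))

Running a small study like this and explaining the resulting numbers is roughly the flavor of the experimental parts of the homeworks.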

Course prerequisites
The course prerequisites are multivariable calculus (Math 2401 or Math 2411 or Math 2605; minimum grade of D) and the basics of programming (CS 1332 or CS 1372; minimum grade of D). If you are not very comfortable with calculus, you will be lost. Light programming (using a language of your choice) will be required; if you cannot write real programs, you won't be able to do a significant portion of the assignments. The ability to produce plots using some software of your choice (such as Matlab, though it need not be Matlab) is also required. If you have all the needed background but didn't take the exact courses listed as prerequisites, talk to me and I will admit you to the course.
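
To calibrate the level of programming and plotting expected, here is a minimal sketch using Python with matplotlib (one possible free alternative to Matlab; both the tool and the example are my illustrative choices, not course requirements), which plots a histogram density estimate of a synthetic sample against the true density:

    # Illustrative sketch of the expected plotting level: histogram of a
    # synthetic sample versus the true standard normal density.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    sample = rng.normal(0, 1, 500)                       # synthetic data

    plt.hist(sample, bins=30, density=True, alpha=0.6, label="histogram")
    grid = np.linspace(-4, 4, 200)
    plt.plot(grid, np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi),
             label="true N(0,1) density")
    plt.xlabel("x")
    plt.ylabel("density")
    plt.legend()
    plt.savefig("histogram.png")                         # or plt.show()

If you can write and understand a short program like this in some language, your programming background is sufficient.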