Computational Data Analysis:
FOUNDATIONS OF MACHINE LEARNING & DATA MINING

Fall 2008 graduate course
CSE 6740 A (88977), ISYE 6740 A (89574), and CSE 6740 Q (89194, Distance Learning)

Instructor: Alexander Gray, College of Computing, Georgia Tech
TA: Nishant Mehta, niche at cc dot gatech dot edu
Syllabus (tentative)
Topic Subtopics Reading
What is Machine Learning? (Overview) Examples of ML; the parts of ML, tasks of ML, "Machine learning" (ML) vs. "computational statistics" vs. "data mining" vs. "pattern recognition", course overview, main themes Lecture 1
Basic concepts of ML, illustrated by 10 ML methods                                                          
How Do I Learn a Simple Model? (Probability and Inference) Probability, random variables, distributions; estimation, convergence and asymptotics, confidence intervals, probability bounds Lecture 2, Lecture 3
Statistics fundamentals
[AOS 1.1-1.7, 2.1-2.10, 3.1-3.5, 5.3-5.4. PRML 1.2.1, 1.2.2, 1.2.4]
Convergence [AOS 5.1-5.2]
Probability bounds [AOS 4.1]
Estimation, confidence [AOS 6.3]
How Do I Learn a Mixture of Gaussians? (Parametric Estimation) Likelihood, mixture of Gaussians (MoG), the EM algorithm for MoG (i); generalization Lecture 4
Parametric estimation [AOS 9.1]
MLE [AOS 9.3-9.5. ESL 8.2]
EM [ESL 8.5. PRML 9.2-9.3 (up to and including 9.3.1), 9.4]
How Do I Learn Any Density?
(Model Selection and Nonparametric Estimation)
Model selection, cross-validation, parametric vs. nonparametric estimation, Sobolev and other spaces; L2 error, kernel density estimation (KDE) (ii), optimal kernels, KDE theory Lecture 5, Lecture 6
Model selection, CV
[AOS 13.6, 22.8 (just CV part). ESL 7.1-7.5, 7.7-7.8]
Nonparametric estimation [AOS 6.2]
Loss, risk, MISE [AOS 12.1, 20.1. PRML 3.2]
Density estimation
[AOS 20.2-20.3. ESL 6.6.1. PRML 2.5.1]
Problem Set 1 - Due October 3 at 5PM FASTlib Lecture, FASTlib Tutorial
Kernel PCA example code
How Do I Predict a Continuous Variable? (Regression) Conditional density estimation; linear regression (iii), regularization, ridge regression and LASSO (iv); Nadaraya-Watson and local linear regression (v) Lecture 6, Lecture 7
Linear regression
[AOS 13.1-13.3. ESL 2.3.1, 3.1-3.3. PRML 3.1.1, 3.1.2]
Regularization [ESL 3.4. PRML 3.1.4]
Nadaraya-Watson [AOS 20.4. PRML 6.3.1]
Local linear regression [ESL 6.1]
How Do I Predict a Discrete Variable? (Classification) Bayes classifier, naive Bayes (vi), generative vs. discriminative; perceptron (vii), weight decay, linear support vector machine (SVM) (viii); nearest-neighbor classifier (ix) and theory; decision tree (x) Lecture 7, Lecture 8
Decision theory [AOS 22.1-22.2. ESL 2.3-2.5]
Linear discriminative [AOS 22.3-22.5. ESL 4.1-4.5, 11.3. PRML 4.1]
Generative [AOS 22.6. ESL 6.6, 6.8]
Decision trees
[AOS 22.7. ESL 9.2. PRML 14.4]
SVM [AOS 22.9. ESL 12.1-12.3. PRML 7.1]
kNN [ESL 13.3]
What Loss Function Should I Use? (Maximum Likelihood and Bayesian Inference) Maximum likelihood estimation theory; Bayesian estimation, Bayesianism vs. frequentism Lecture 9, Lecture 10
ML estimation theory
[AOS 9.5, 9.7, 9.8, 9.10. ESL 8.2.2]
Bayesian inference
[AOS 11.1-11.7, 11.9. ESL 8.3. PRML 1.2.3]
What Loss Function Should I Use? (Estimation Theory) Robustness, L2 estimation, L2 MoG; decision theory Lecture 12
Statistical Decision Theory
[AOS 12.1-12.6. ESL 2.4]
Robustness [ESL 10.6]
What Model (Parameters) Should I Use? (Learning Theory and Generalization) Vapnik-Chervonenkis theory; the bootstrap; ensemble methods: bagging, stacking, boosting Lecture 13
VC dimension
[AOS - 22.8 (just Prob. Ineq.). ESL - 7.9]
EDF [AOS 7.1-7.2].
Bootstrap [AOS 8.1-8.3. ESL 7.11. ESL 8.2]
Bagging [ESL 8.7]
Model averaging and stacking [ESL 8.7-8.8]
How Can I Learn Fancier (Nonlinear) Models? (Kernelization) Kernelization, regularization theory, reproducing kernel Hilbert spaces, nonlinear SVM Lecture 14
Kernels [AOS 22.10. PRML 6.1-6.2]
Kernelizing SVM [ESL 12.3]
Regularization theory and RKHS [ESL 5.8].
How Can I Learn Fancier (Compositional) Models? (Trees, Networks, and Graphical Models) Neural networks Lecture 15
Neural Networks
[PRML 5.1-5.3. ESL 11.3-11.5]
How Can I Learn Fancier (Compositional) Models? (Trees, Networks, and Graphical Models) Graphical models Lecture 16
Graphical models
[PRML 8.1-8.4.2. AOS 17.1-17.6, 18.1-18.4]
Further common ML problems and solutions
How Do I Reduce/Relate The Data Points? (Association and Clustering) Association rules and market basket analysis; parametric and nonparametric clustering Lecture 17
Association rules [ESL 14.2]
Clustering [ESL 14.3]
How Do I Reduce/Relate the Features? (Dimension Reduction) Principal components analysis, Independent component analysis, multidimensional scaling, manifold learning Lecture 18
Subset selection [ESL 3.4.1]
PCA,ICA, and MDS
[ESL 14.5-14.7. PRML 12.1, 12.4.1]
IsoMap and LLE [ PRML 12.4.3]
Kernel PCA [PRML 12.3]
Problem Set 2 - Due November 10 at 5PM PS 2, Datasets
How Do I Treat Temporal Data? (Time Series Analysis) Stationarity, seasonality, ARMA, Kalman filters and state-space models Lecture 19
No reading from class texts
How Do I Treat Temporal Data? (Hidden Markov Models) Hidden Markov Models: models, inference, learning Lecture 20
Rabiner's HMM Tutorial
Hidden Markov models [PRML 13.2]
How Do I Learn Actions From Data? (Reinforcement Learning) Markov decision processes, policy and value iteration, Q-learning, prioritized sweeping Lecture 21
No reading from class texts
General computational frameworks for ML
How Do I Optimize The Parameters? (Unconstrained Optimization) Unconstrained vs. constrained/convex optimization, derivative-free methods, first- and second-order methods, sum-of-squares methods Lecture 22
No reading from class texts
How Do I Optimize Convex/Linear Functions? (Unconstrained Linear Optimization) Convexity, computational linear algebra as optimization, matrix inversion for regression, singular value decomposition (SVD) for dimensionality reduction, stochastic gradient descent Lecture 23
No reading from class texts
How Do I Optimize With Constraints? (Constrained Optimization) Lagrange multipliers, the KKT conditions, interior point method, SMO algorithm for SVM's Lecture 24
No reading from class texts
How Do I Evaluate High-Dimensional Integrals? (Sampling) Monte Carlo integration, importance sampling, Markov Chain Monte Carlo Lecture 25
No reading from class texts
How Do I Evaluate Deeply-Nested Sums? (Graphical Model Inference) Elimination, belief propagation / sum-product algorithm, junction-tree algorithm, variational methods Lecture 26
No reading from class texts
How Do I Evaluate Large Sums And Searches? Generalized N-body problems (GNPs), hierarchical data structures, nearest-neighbor search, fast multipole methods Lecture 27
No reading from class texts
Real-world application of ML
How Do I Apply All This In The Real World? Exploratory data analysis and information visualization; modeling choices; prior knowledge and assumptions; non-ideal data; evaluation and interpretation; where the research problems in ML are
Where and when
MWF 10:05am - 10:55am, 2447 Klaus. First class: Wednesday 8/20/08. My office hours: grab me right after a lecture. Course mailing list: http://www2.isye.gatech.edu/mailman/listinfo/isye6740a.

Books
Required: All of Statistics by Wasserman ("AOS"), The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman ("ESL"), Pattern Recognition and Machine Learning by Bishop ("PRML"). (Sorry -- these are all good books, but the scope of the course requires all three. Good books are worth their weight in gold - you shouldn't be cheap about buying technical books.)

What's this course about?
This course covers the fundamental theoretical concepts needed to answer the practical questions which quickly arise in real data analysis and is a PhD-level course which prepares students to do research in machine learning. Machine learning, or pattern recognition, or computational statistics, or data mining, is a huge field with thousands of methods and mountains of theoretical ideas - in fact statistics, the mathematics of data analysis, is by far the largest area of mathematics. I will attempt to organize the many ideas in ways that reveal the true underlying repeating themes and separate issues which are truly separate. This course will pull together many of the ideas I feel are most helpful in practice based on my experience, a number of which lie outside the current culture of "machine learning" and are not described in any existing machine learning course, to my knowledge.

The course is designed to answer the most fundamental questions about machine learning. Each lecture treats one main question, and several cross-cutting questions will be answered along the way: How can we conceptually organize the zoo of available methods, not to mention the several different fields which all seem to be about data? What are the most important methods to know about, and why? How can we sometimes answer the question 'is this method better than this one' using asymptotic theory? How can we sometimes answer the question 'is this method better than this one' for a specific dataset of interest? What can we say about the errors our method will make on future data? What's the 'right' objective function? What does it mean to be statistically rigorous? What are some ML people missing by not knowing much about statistical theory? Should I be a Bayesian? What is the source of certain 'religious' divides? What computer science ideas can make ML methods tractable on modern large or complex datasets? What are some questions that need answering in the field?

Guest lectures
To help give you a complete picture all the way to professional research practice, I may invite some esteemed machine learning researchers to give talks on their research during the class period, as part of the Yahoo-sponsored Georgia Tech machine learning seminar which I organize.

How is the course graded? (tentative)
1. Assignments (implementations/experiments): 75%. Beginning somewhere after the 8th lecture (once you have the necesssary background), you will start implementing machine learning methods in Matlab and/or C++, testing your implementations and those of other students in the class, and comparing them on reference datasets (text/images, stock market, astronomy). Each assignment will also have an extra-credit creative component which explores an open research question (i.e. a really good job on this part could result in a paper). All programming assignments will be done using the FASTlib C++ library which underlies MLPACK, a software collection for scalable machine learning which will be publicly released in December 2008. The best implementations from the class may result in additions to MLPACK, contributing to the field of machine learning as well as your CV.

2. Final exam on theory: 25%. This will cover the entire course and contain short-answer and multiple-choice questions. The course will not require that you prove theorems, though the questions may get at whether you understand the proofs of some of the key theorems in the class.

Should you take this course?
What is the intended audience of this course? Anyone who wants to competently apply statistical and machine learning methods to real-world datasets, or design new methods, can benefit from understanding the tools I will describe. For anyone who wants to do research in machine learning, these are the foundations you need to understand. It will satisfy both the CSE requirement and the Intelligent Systems requirement of the CoC PhD program, and is one of the core courses of the CSE PhD program.

How does this course relate to other course offerings? In general, it focuses on laying comprehensive mathematical foundations; it has the largest scope in terms of topical coverage, and is thus somewhat intensive. It could have been called "All of Machine Learning" and follows the same spirit as Wasserman's "All of Statistics". This course can be thought of as extending the CoC undergraduate/graduate Intro to Machine Learning course, although it is not strictly necessary to take that course first. It will be helpful though - machine learning is a big subject with many concepts, so seeing some of the same material twice will not hurt at all. That course is a gentler introduction covering a subset of the material in this course without being as mathematical, but also covers some AI-related topics not covered in this course, most notably reinforcement learning and game theory. If AI is your main interest in machine learning rather than data analysis, that course's perspective is going to be what you want. In ISYE there are several courses which also cover a subset of the methods covered in this course in more depth, from the point of view of the culture of statistics, including courses called computational statistics and data mining, and a brand new course on spatial statistics which will be of interest for many applications. More generally, this course gives a crash overview of fundamental statistical theory, which is covered over many courses in ISYE. In CoC there is a course called pattern recognition, which also covers a subset of the methods covered in this course in more depth, from the point of view of the culture of computer vision and electrical engineering. Again, seeing some of the same material more than once will not hurt you. There is a course on text mining -- this will be useful for applying machine learning to text data. This course, by contrast, is of a more general nature and does not discuss the special aspects of text data. In CoC there is a theoretical course on spectral methods, which includes discusson of some linear algebraic machine learning methods. In the Math department there is a class on Learning Theory, a sub-topic of machine learning which focuses on theoretical bounds, mainly for the classification problem. In this class I will briefly cover learning theory but will only hit the highlights which inform practice - the Math course is for you if you want to master proving those kinds of theorems (and you have a background in Real Analysis). All of the aforementioned courses are valuable, so I encourage checking them out.

How hard is this course? This course is designed for the PhD level. MS students and advanced undergraduate students (who generally have a heavy courseload) are warned of the time commitment required by this course, due to the mathematical nature of the material as well as the time it takes to read about and implement many machine learning methods in the experimental part of the course. Otherwise, if you have the time to put in and the true desire to gain expertise in this topic, students of any level and in any discipline are welcome. Auditing is fine with me but I don't recommend it, due to the pace of the technical concepts.

What background is needed for this course? I will assume very little specific prior knowledge, aside from basic math familiarity (calculus, matrices, probability) and general principles of computer science (algorithms, data structures); however since this is a PhD-level course I'll assume a certain level of general technical maturity (in other words don't expect the material to be trivial). Familiarity with machine learning will be helpful but is not necessary.