|
Computational Data Analysis:
FOUNDATIONS OF MACHINE LEARNING & DATA MINING
Fall 2009 graduate course
CSE 6740 A (88451), CSE 6740 Q (88651, Distance Learning), and ISYE 6740 A (89018)
Instructor:
Alexander Gray,
College of Computing, Georgia Tech
TA: Ravi Sastry, gmravi2003 at gatech dot edu
TA office hours: Fridays 5:30pm-7pm 1305 Klaus.
|
|
|
|
|
Syllabus (tentative)
| Date |
Topic |
Subtopics |
Reading and Assignments |
| Tue 8/18 |
What is Machine Learning? (Overview) |
Examples of ML; the parts of ML, tasks of ML,
"Machine learning" (ML) vs. "computational statistics" vs. "data mining"
vs. "pattern recognition", course overview, main themes |
Lecture 1
|
| Thu 8/20 |
What are Data? What is a Model?
(Probability, Estimation) |
Probability, random variables,
distributions; estimation, convergence and asymptotics |
Lecture 2
Statistics fundamentals [AOS 1.1-1.7, 2.1-2.10, 3.1-3.5, 5.3-5.4. PRML 1.2.1, 1.2.2, 1.2.4]
Convergence [AOS 5.1-5.2]
|
| |
|
14 ML concepts, illustrated by 22 ML methods |
|
| Thu 8/27 |
How Do I Learn a Density?
(Parametric Learning, Density Estimation, Generalization) |
Likelihood (a), mixture of Gaussians (MoG) (i), the EM
algorithm for MoG (b), generalization and cross-validation (c) |
Lecture 3
Parametric estimation [AOS 9.1]
MLE [AOS 9.3-9.5. ESL 8.2]
EM [ESL 8.5. PRML 9.2-9.3 (up to and including 9.3.1), 9.4]
Model selection, cross-validation [AOS 13.6, 22.8 (just CV part). ESL 7.1-7.5, 7.7-7.8]
|
| Tue 9/1 |
How Do I Learn Any Density? (Nonparametric Learning) |
Parametric vs. nonparametric estimation (d),
Sobolev and other spaces; error,
kernel density estimation (KDE) (ii), optimal kernels, KDE theory |
Lecture 4
Nonparametric estimation [AOS 6.2]
Loss, risk, MISE [AOS 12.1, 20.1. PRML 3.2]
Density estimation [AOS 20.2-20.3. ESL 6.6.1. PRML 2.5.1]
|
| Thu 9/3 |
How Do I Predict a Continuous Variable?
(Regression) |
Conditional density estimation and regression (e);
linear regression (iii), regularization (f),
ridge regression and LASSO (iv); sparse learning (g), Nadaraya-Watson and
local linear regression (v) |
Lecture 5
Linear regression [AOS 13.1-13.3. ESL 2.3.1, 3.1-3.3. PRML 3.1.1, 3.1.2]
Regularization [ESL 3.4. PRML 3.1.4]
Nadaraya-Watson [AOS 20.4. PRML 6.3.1]
Local linear regression [ESL 6.1]
|
| Tue 9/8 |
How Do I Predict a Discrete Variable? I
(Classification) |
Bayes classifier, naive Bayes (vi),
generative vs. discriminative (h); perceptron and neural network (vii), weight decay; logistic regression (viii) |
Lecture 6
Decision theory [AOS 22.1-22.2. ESL 2.3-2.5]
Linear discriminative [AOS 22.3-22.5. ESL 4.1-4.5, 11.3. PRML 4.1]
Generative [AOS 22.6. ESL 6.6, 6.8]
Neural Networks
[PRML 5.1-5.3. ESL 11.3-11.5]
|
| Thu 9/10 |
How Do I Do 'Industrial-Strength' Machine Learning? (FASTlib and MLPACK) |
Class project; FASTlib, MLPACK |
Lecture 7
FASTlib Tutorial ,
Kernel PCA example code
|
| Tue 9/15 |
How Do I Predict a Discrete Variable? II
(Classification) |
Margin maximization (i), linear support vector machine (SVM) (ix), SVM;
nearest-neighbor classifier (x) and theory; decision tree (xi) |
Lecture 8
Decision trees
[AOS 22.7. ESL 9.2. PRML 14.4]
SVM [AOS 22.9. ESL 12.1-12.3. PRML 7.1]
kNN [ESL 13.3]
|
| Thu 9/17 |
How Do I Reduce/Relate The Data Points?
(Association and Clustering) |
Association rules and market basket analysis (xii);
k-means (xiii), hierarchical clustering (xiv), mean-shift (xv)
|
Lecture 9
Association rules [ESL 14.2]
Clustering [ESL 14.3]
|
| Tue 9/22 |
How Do I Reduce/Relate the Features?
(Dimension Reduction) |
Principal components analysis (xvi), independence and non-Gaussianity (j), independent component analysis (xvii), manifolds (k),
isomap (xviii), local linear embedding (xix)
|
Lecture 10
Subset selection [ESL 3.4.1]
PCA,ICA, and MDS
[ESL 14.5-14.7. PRML 12.1, 12.4.1]
IsoMap and LLE
[ PRML 12.4.3]
Kernel PCA [PRML 12.3]
|
| Tue 9/30 |
How Do I Treat Temporal Data? I
(Time Series Analysis) |
Stationarity (l), seasonality (m),
ARMA, Kalman filters and state-space models (xx), particle filters (xxi)
|
Lecture 11
No reading from class texts
|
| Thu 10/1 |
How Do I Treat Temporal Data?
(Hidden Markov Models) |
Hidden Markov Models (xxii), dynamic programming (n)
|
Lecture 12
Rabiner's HMM Tutorial
Hidden Markov models [PRML 13.2]
|
| |
|
ML general frameworks and theory |
|
| Thu 10/8 |
How Can I Learn Fancier Models? I (Graphical Models) |
Graphical models
|
Lecture 13
Graphical models
[PRML 8.1-8.4.2. AOS 17.1-17.6, 18.1-18.4]
|
| Tue 10/20 |
How Can I Learn Fancier Models? II
(Kernelization) |
Kernelization, regularization theory,
reproducing kernel Hilbert spaces, nonlinear SVM
|
Lecture 14
Kernels [AOS 22.10. PRML 6.1-6.2]
Kernelizing SVM [ESL 12.3]
Regularization theory and RKHS [ESL 5.8]
|
| Thu 10/22 |
What Error Guarantees Can We Make?
(Learning Theory and Generalization) |
Inequalities, confidence bands, Vapnik-Chervonenkis theory |
Lecture 15
Probability bounds [AOS 4.1]
Estimation, confidence [AOS 6.3]
VC dimension
[AOS - 22.8 (just Prob. Ineq.). ESL - 7.9]
|
| Tue 10/27 |
What Loss Function Should I Use? I
(Maximum Likelihood and Robustness) |
Maximum likelihood estimation theory; Bayesian estimation, Bayesianism vs. frequentism
|
Lecture 16
ML estimation theory
[AOS 9.5, 9.7, 9.8, 9.10. ESL 8.2.2]
Bayesian inference
[AOS 11.1-11.7, 11.9. ESL 8.3. PRML 1.2.3]
|
| Thu 10/29 |
What Loss Function Should I Use? II
(Estimation Theory) |
robustness,
estimation, MoG; decision theory
|
Lecture 17
Robustness [ESL 10.6]
Statistical Decision Theory
[AOS 12.1-12.6. ESL 2.4]
|
| Tue 11/3 |
How Do I Ensure Generalization? (Model Selection) |
Resampling (cross-validation, bootstrap), model combination
|
Lecture 18
Bootstrap [AOS 8.1-8.3. ESL 7.11. ESL 8.2]
Bagging [ESL 8.7]
Model averaging and stacking [ESL 8.7-8.8]
|
| |
|
ML computational principles |
|
| Thu 11/5 |
How Do I Optimize The Parameters?
(General Optimization) |
Unconstrained vs. constrained/convex
optimization, derivative-free methods, first- and second-order methods,
sum-of-squares methods
|
Lecture 19
No reading from class texts
|
| Tue 11/10 |
How Do I Optimize Convex/Linear Functions?
(Convexity, Linear Algebra) |
EM algorithm; convexity; computational linear algebra as
optimization, matrix inversion for regression, singular value decomposition
(SVD) for dimensionality reduction |
Lecture 20
EM [ESL 8.5. PRML 9.2-9.3 (up to and including 9.3.1), 9.4]
|
| Thu 11/12 |
How Do I Optimize With Constraints?
(Constrained Optimization) |
Lagrange multipliers, the KKT
conditions, interior point method, SVM quadratic programming, SMO algorithm for SVM's, semidefinite programming |
Lecture 21
No reading from class texts
|
| Tue 11/17 |
How Do I Evaluate Deeply-Nested Sums?
(Graphical Model Inference) |
Elimination, belief propagation / sum-product
algorithm, junction-tree algorithm, variational methods
|
Lecture 22
No reading from class texts
|
| Thu 11/19 |
How Do I Evaluate Large Sums And Searches?
(Kernel Summation, Nearest Neighbor Search) |
Generalized N-body problems (GNPs),
hierarchical data structures, nearest-neighbor search, fast multipole methods
|
Lecture 23
No reading from class texts
|
| Tue 11/24 |
How Do I Evaluate High-Dimensional Integrals?
(Sampling) |
Monte Carlo integration, importance sampling,
Markov Chain Monte Carlo |
Lecture 24
No reading from class texts
|
| Tue 12/1 |
Project Presentations |
Presentations by the class |
Presentation Schedule on T-square
No reading from class texts
|
| Thu 12/3 |
Project Presentations |
Presentations by the class |
Presentation Schedule on T-square
No reading from class texts
|
|
|
Where and when
TuTh 1:35-2:55pm, 2447 Klaus.
First class: Monday 8/17/09.
My office hours: grab me right after a lecture.
Course mailing list:
http://www2.isye.gatech.edu/mailman/listinfo/isye6740a.
Books
Required: All of Statistics by Wasserman ("AOS"),
The Elements of Statistical
Learning by Hastie, Tibshirani, and Friedman ("ESL"),
Pattern Recognition and Machine Learning by Bishop ("PRML").
(Sorry -- these are all good books, but the scope of the course
requires all three. Good books are worth their weight in gold - you
shouldn't be cheap about buying technical books.)
What's this course about?
This course covers the fundamental theoretical concepts needed to
answer the practical questions which quickly arise in real data
analysis and is a PhD-level course which prepares students to do
research in machine learning. Machine learning, or pattern
recognition, or computational statistics, or data mining, is a huge
field with thousands of methods and mountains of theoretical ideas -
in fact statistics, the mathematics of data analysis, is by far the
largest area of mathematics. I will attempt to organize the many
ideas in ways that reveal the true underlying repeating themes and
separate issues which are truly separate. This course will pull
together many of the ideas I feel are most helpful in practice based
on my experience, some of which are not generally covered in standard
"machine learning" courses.
The course is designed to answer the most fundamental questions about
machine learning. Each lecture treats one main question, and several
cross-cutting questions will be answered along the way: How can we
conceptually organize the zoo of available methods, not to mention the
several different fields which all seem to be about data? What
are the most important methods to know about, and why? How can we
sometimes answer the question 'is this method better than this one'
using asymptotic theory? How can we sometimes answer the question 'is
this method better than this one' for a specific dataset of interest?
What can we say about the errors our method will make on future data?
What's the 'right' objective function? What does it mean to be
statistically rigorous? What are some ML people missing by not
knowing much about statistical theory? Should I be a Bayesian? What
is the source of certain 'religious' divides? What computer science
ideas can make ML methods tractable on modern large or complex
datasets? What are some questions that need answering in the
field?
Guest lectures
To help give you a complete picture all the way to professional research
practice, I may invite some esteemed machine learning researchers
to give talks on their research during the class period, as part of
the Yahoo-sponsored Georgia Tech machine learning seminar which I
organize.
How is the course graded? (tentative)
1. Assignments: 20%.
There will be assignments involving short answers, running existing code on provided datasets
(text/images, stock market, astronomy), mathematical derivations, and writing some code (see below) and pseudocode.
All programming assignments will be done using the FASTlib C++ library
which underlies MLPACK, a software collection for scalable machine
learning.
Each assignment will also have an
extra-credit creative component which explores an open research
question (i.e. a really good job on this part could result in a paper).
2. Project: 50%.
Once you have seen many examples of machine learning methods, you will choose one machine learning method that does not currently exist in MLPACK (based on a paper you find interesting, something you need for your thesis, etc), with my approval, and implement it using Fastlib. You'll demonstrate that it works on test cases by providing experimental results, and briefly present your work at the end of the class.
The best implementations from the class will result in additions
to MLPACK, contributing to the field of machine learning as well
as your CV.
3. Final exam on concepts/theory: 30%. This will cover the entire course
contain a large number of multiple-choice questions. The course will not
require that you prove theorems, though the questions may get at whether
you understand the proofs of some of the key theorems in the class.
Should you take this course?
What is the intended audience of this course? Anyone who wants to
competently apply statistical and machine learning methods to
real-world datasets, or design new methods, can benefit from
understanding the tools I will describe. For anyone who wants to do
research in machine learning, these are the foundations you need to
understand. It will satisfy both the CSE requirement and the Intelligent
Systems requirement of the CoC PhD program, and is one of the core
courses of the CSE PhD program.
How does this course relate to other course offerings? In general, it
focuses on laying comprehensive mathematical foundations; it has the
largest scope in terms of topical coverage, and is thus somewhat
intensive. It could have been called "All of Machine Learning" and
follows the same spirit as Wasserman's "All of Statistics". This
course can be thought of as extending the CoC undergraduate/graduate
Intro to Machine Learning course, although it is not strictly
necessary to take that course first. It will be helpful though -
machine learning is a big subject with many concepts, so seeing some
of the same material twice will not hurt at all. That course is a
gentler introduction covering a subset of the material in this course
without being as mathematical, but also covers some AI-related topics
not covered in this course, most notably reinforcement learning and
game theory. If AI is your main interest in machine learning rather
than data analysis, that course's perspective is going to be what you
want. In ISYE there are several courses which also cover a subset of
the methods covered in this course in more depth, from the point of
view of the culture of statistics, including courses called
computational statistics and data mining, and a brand new course on spatial
statistics which will be of interest for many applications. More
generally, this course gives a crash overview of fundamental
statistical theory, which is covered over many courses in ISYE. In
CoC there is a course called pattern recognition, which also covers a
subset of the methods covered in this course in more depth, from the
point of view of the culture of computer vision and electrical
engineering. Again, seeing some of the same material more than once
will not hurt you. There is a course on text mining -- this will be
useful for applying machine learning to text data. This course, by
contrast, is of a more general nature and does not discuss the special
aspects of text data. In CoC there is a theoretical course on
spectral methods, which includes discusson of some linear algebraic
machine learning methods. In the Math department there is a class on
Learning Theory, a sub-topic of machine learning which focuses on
theoretical bounds, mainly for the classification problem. In this
class I will briefly cover learning theory but will only hit the
highlights which inform practice - the Math course is for you if you
want to master proving those kinds of theorems (and you have a
background in Real Analysis). All of the aforementioned courses are
valuable, so I encourage checking them out.
How hard is this course? This course is designed for the PhD level.
MS students and advanced undergraduate students (who generally have a
heavy courseload) are warned of the time commitment required by this
course, due to the mathematical nature of the material as well as the
time it takes to read about and implement machine learning
methods in the experimental part of the course. Otherwise, if you
have the time to put in and the true desire to gain expertise in this
topic, students of any level and in any discipline are welcome.
Auditing is fine with me but I don't recommend it, due to the pace of
the technical concepts. There are no requirements for auditing, except attending lectures.
What background is needed for this course? I will assume very little
specific prior knowledge, aside from basic math familiarity (calculus,
matrices, probability) and general principles of computer science
(algorithms, data structures); however since this is a PhD-level
course I'll assume a certain level of general technical maturity (in
other words don't expect the material to be trivial). Familiarity
with machine learning will be helpful but is not necessary.
|
|
|