FOUNDATIONS OF
MACHINE LEARNING & DATA MINING

Spring 2006 graduate course
CS 8803-MDM

Alexander Gray
College of Computing, Georgia Tech
ABOUT THE FINAL EXAM:
  • Note that the final is 50% of the course grade, and the project is the other 50%. Auditing students do not need to take the final exam.
  • The final will be closed-book, closed-notes. Bring only your brain and something to write with.
  • There will be no mathematical derivations or computations to do. You will not need to have equations exactly memorized. You WILL be expected to know the names of things and terminology. The main thing I will test is whether you understand the concepts. Pay attention to how I have organized the material in the course - this will help you remember where things fit conceptually.
  • All of the course will be covered. The material will be restricted to what was in the lecture notes. However, if you didn't understand what was in the lecture notes you may have needed to read other sources or ask the instructor. There will be many many questions, all of which will be easy if you understood all the concepts in the class.
  • Here are some example questions: Q: What is the purpose of cross-validation? Q: Name one method for nonparametric regression. Q: What is the difference between classification and regression? Q: What's another name for classification? Q: The error of the nearest-neighbor rule is asymptotically no more than twice what? Q: How is the risk related to a loss function?
What to turn in:
  • 10-minute presentation, consisting of:
    • intro to problem
    • previous/related work
    • describe approach
    • results
    • future work, including publication plans, if any
  • tech report version of the above material, due Sunday 5/7, midnight
    • minimum 3 pages, LaTeX article format
  • final exam (Friday 5/5, 107 Instructional Center, 11:30-2:20)
Note (2/14): Hamming's lecture is here. Check it out. It could be the most important thing you read in this class.

Note (1/27): The project list is here. If you are doing the project, please send me your background info as listed on the last slide of the first lecture, which is downloadable from this webpage. Please be specific about courses you have taken in each area. Then sign up for a meeting with me on the swiki so we can discuss your project.
Where and when
TuTh 1:30-3pm, Instructional Center 107. First class: Tuesday 1/10/06.

Topic
This PhD-level course covers the fundamental concepts needed to answer the practical questions that quickly arise in real data analysis, and it prepares students to do research in machine learning. Machine learning, or pattern recognition, or computational statistics, or data mining, is a huge field with thousands of methods and mountains of theoretical ideas - in fact statistics, the mathematics of data analysis, is by far the largest area of mathematics. I will attempt to organize the many ideas in ways that reveal the true underlying repeating themes and separate issues which are truly separate. This course will pull together many of the ideas I feel are most helpful in practice based on my experience, a number of which lie outside the current culture of "machine learning" and are not described in any existing machine learning course, to my knowledge.

The course is organized around such questions as: How can we break down the zoo of available methods? How can we sometimes answer the question 'is this method better than this one' using asymptotic theory? How can we sometimes answer the question 'is this method better than this one' using finite datasets? What can we say about the errors our method will make on future data? What's so special about maximum likelihood? What's the 'right' objective function? What's the most reliable way to perform model selection? What does it mean to be statistically rigorous? What are ML people missing by not knowing much about statistics? Should I be a Bayesian? What is the source of certain 'religious' divides? What computer science ideas can make ML methods tractable on particularly large or complex datasets? What are some questions that need answering in the field?

Course structure
There will be two exams, a midterm exam worth 25% and a final exam worth 25%. 50% of the course grade will be based on a serious course-long project which will be decided upon and begun by the second or third week. You'll either propose one to me or select one from a rich list that I'll provide. Learning by doing is the only way to learn something non-trivial. The goal is for you to produce a publication-quality paper which you will give a 20-minute talk on at the end of the class and submit somewhere. I will give you as much one-on-one time with me as I can given the large number of students enrolled in the class. I will leave one entire day per week free for one-on-one appointments which you can make as you need them. You can also drop in during that day if you have a quick question for which email does not suffice. Every week you will email me a report on your progress on the project for that week. Warning: I am serious about the project being publication-quality, and my time spent meeting with you is valuable. Therefore, I will expect a fair amount of work from you every week. If this scares you, don't take this course. If you are taking more than one other course, you will probably not be able to meet these expectations.

The projects I have in mind consist of basic but open questions in the field that I feel should be investigated or low-hanging fruit that someone should do. If you want to propose a project, the main rule is that it must be of sufficient machine learning depth and something where my advice will help you. I'm happy for it to be something related to your thesis or research outside the class. For students with little or moderate previous exposure to machine learning, some examples of good projects include applications of ML to new problems and empirical comparisons of recent ML methods for a specific problem. It will be possible, for some projects, to work in teams. However, each student will email me their weekly progress report separately, cc'ing all the teammates. In the first week you will email me some details about your background and either some selections from my project list or a very brief description of your proposed project. In our first meeting, in the second or third week, we will discuss and preliminarily settle on your project and its goals. You'll do preliminary work on this, during which time we may decide to alter the project's goals or switch to another project depending on how the preliminary work went. Right after the midterm exam we'll finalize your project and its goals, and the project officially begins then.

Intended Audience
This course is for PhD students (in any discipline) only. Students of any level may audit. Anyone who wants to competently apply or design statistical and machine learning methods on real-world datasets can benefit from understanding the tools I will describe. For anyone who wants to do research in machine learning, this course is definitely for you. This course can be thought of as extending the graduate Intro to Machine Learning course, although it is not strictly necessary to take that course first (it will be very helpful though). In that course you'll learn about and implement several machine learning methods, which is highly recommended. In this course I'll focus on foundational ideas and only cover actual methods very briefly as special cases of more general frameworks and theoretical approaches. In the Math department there is a new class being taught on Learning Theory, a sub-topic of machine learning which focuses mainly on theoretical bounds for the classification problem. In my class I will briefly cover learning theory but will only hit the highlights which inform practice - if you want to really master proving those kinds of elegant theorems (and you have a background in Real Analysis) check out that course.

I will assume very little specific prior knowledge, aside from basic math familiarity (calculus, matrices, probability, random variables) and general principles of computer science (algorithms, data structures); however, since this is a PhD-level course I'll assume a certain level of general technical maturity (in other words, don't expect the material to be trivial). Familiarity with machine learning will be very helpful but is not strictly necessary. I anticipate and welcome participants with diverse backgrounds and needs, including audits. However, I think the project part of the class will be a unique and powerful apprenticeship-style learning opportunity, so I highly recommend taking on this commitment. I will try to make sure you get something out of the project, including a solid Georgia Tech technical report or a conference publication if you are willing to commit your time and energy. Again, if you are unwilling or unable to commit enough time and energy, I highly recommend NOT taking this course -- you can audit it or wait until a future offering of this course.
Scheduling meetings
You make appointments with me yourself on the schedule swiki. Please make appointments only when you feel that email communication or grabbing me right after class is not sufficient. You can make a 15-, 30-, or 45-minute appointment (by signing up for contiguous 15-minute segments) but please reserve only the amount of time you think you'll need. I expect you will only need to meet with me occasionally, not every week.

Books
The tests will be based on the material in the lectures. However, it will be very helpful to you to read the more extensive presentation of this material in books. Much of the first third of the course (statistical principles) can be found in All of Statistics by Wasserman. Much of the second third of this course (methods and models) can be found in The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. I recommend buying both of these books.

Syllabus
Lecture  Topic
Statistical principles
1 Machine learning overview, course logistics
2 Probability and estimation
3 Parametric estimation
4 Nonparametric estimation
5 Bayesian estimation
6 Selecting estimators and models I
7 Selecting estimators and models II
8 Selecting estimators and models III
Methods and models
9 Loss functions, density estimation
10 Regression
11 Classification I
12 Classification II
13 Model frameworks I: graphical models
14 Model frameworks II: kernel methods
15 Unsupervised learning I
16 Unsupervised learning II
17 Unsupervised learning III
18 Reinforcement learning I
19 Reinforcement learning II
Computational principles
20 Doing research, Parameter optimization I
21 Parameter optimization II
22 Parameter optimization III
23 Parameter optimization IV
24 Graphical model inference I
25 Graphical model inference II
26 Integration and sampling
27 Generalized N-body problems
Project presentations
28 Project presentations (Thu 4/20)
29 Project presentations (Tue 4/25)
30 Project presentations (Thu 4/27)
31 Final exam (Fri 5/5 11:30-2:20, Instructional Center 107)