FOUNDATIONS OF
MACHINE LEARNING & DATA MINING
Spring 2006 graduate course
CS 8803-MDM
Alexander Gray
College of Computing, Georgia Tech
ABOUT THE FINAL EXAM:
- Note that the final is 50% of the course grade, and the project is
the other 50%. Auditing students do not need to take the final exam.
- The final will be closed-book, closed-notes.
Bring only your brain and something to write with.
- There will be
no mathematical derivations or computations to do. You will not need
to have equations exactly memorized. You WILL be expected to know the
names of things and terminology. The main thing I will test is
whether you understand the concepts. Pay attention to how I have
organized the material in the course - this will help you remember
where things fit conceptually.
- All of the course will be covered.
The material will be restricted to what was in the lecture notes.
However, if you didn't understand what was in the lecture notes you
may have needed to read other sources or ask the instructor. There
will be many, many questions, all of which will be easy if you
understood all the concepts in the class.
- Here are some example questions:
Q: What is the purpose of cross-validation?
Q: Name one method for nonparametric regression.
Q: What is the difference between classification and regression?
Q: What's another name for classification?
Q: The error of the nearest-neighbor rule is asymptotically no more than twice what?
Q: How is the risk related to a loss function?
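As a rough illustration of the first example question above (this sketch is not from the lecture notes): cross-validation estimates the error a method will make on future data by repeatedly holding out part of the sample, and that estimate can then be used to choose between models. The toy data, the function names, and the use of a simple k-nearest-neighbor average as the regression method are all assumptions made for the example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train, x, k_neighbors):
    """Average the targets of the k_neighbors training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k_neighbors]
    return sum(y for _, y in nearest) / len(nearest)

def cv_error(data, k_neighbors, folds=5, seed=0):
    """Squared error on held-out points, averaged over the whole dataset."""
    total = 0.0
    for fold in k_fold_indices(len(data), folds, seed):
        held_out = set(fold)
        # Fit only on the points that are NOT in the current fold.
        train = [p for i, p in enumerate(data) if i not in held_out]
        for i in fold:
            x, y = data[i]
            total += (knn_predict(train, x, k_neighbors) - y) ** 2
    return total / len(data)

# Hypothetical toy data: a noisy quadratic curve.
rng = random.Random(1)
data = [(x / 10, (x / 10) ** 2 + rng.gauss(0, 0.1)) for x in range(-20, 21)]

# Estimate future error for several neighborhood sizes and keep the smallest.
scores = {k: cv_error(data, k) for k in (1, 3, 5, 9, 15)}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

Each point is predicted only by a model that never saw it, which is what makes the averaged error a usable stand-in for error on future data.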
What to turn in:
- 10-minute presentation, consisting of:
- intro to problem
- previous/related work
- describe approach
- results
- future work, including publication plans, if any
- tech report version of the above material, due Sunday 5/7, midnight
- minimum 3 pages, LaTeX article format
- final exam (Friday 5/5, 107 Instructional Center, 11:30-2:20)
Note (2/14): Hamming's lecture is here.
Check it out. It could be the most important thing you read in this class.
Note (1/27): The project list is here.
If you are doing the project,
please send me your background info as listed on the last
slide of the first lecture, which is downloadable from this webpage. Please
be specific about courses you have taken in each area. Then sign up for a
meeting with me on the swiki so we can discuss your project.
Where and when
TuTh 1:30-3pm, Instructional Center 107. First class: Tuesday 1/10/06.
Topic
This PhD-level course covers the fundamental concepts needed to answer
the practical questions that quickly arise in real data analysis, and
it prepares students to do research in
machine learning. Machine learning, or pattern recognition, or
computational statistics, or data mining, is a huge field with
thousands of methods and mountains of theoretical ideas - in fact
statistics, the mathematics of data analysis, is by far the largest
area of mathematics. I will attempt to organize the many ideas in
ways that reveal the true underlying repeating themes and separate
issues which are truly separate. This course will pull together many
of the ideas I feel are most helpful in practice based on my
experience, a number of which lie outside the current culture of "machine
learning" and are not described in any existing machine learning
course, to my knowledge.
The course is organized around such questions as: How can we break down the
zoo of available methods? How can we sometimes answer the question
'is this method better than this one' using asymptotic theory? How
can we sometimes answer the question 'is this method better than this
one' using finite datasets? What can we say about the errors our
method will make on future data? What's so special about maximum
likelihood? What's the 'right' objective function? What's the most
reliable way to perform model selection? What does it mean to be
statistically rigorous? What are ML people missing by not knowing
much about statistics? Should I be a Bayesian? What is the source of
certain 'religious' divides? What computer science ideas can make ML
methods tractable on particularly large or complex datasets? What are
some questions that need answering in the field?
Course structure
There will be two exams, a midterm exam worth 25% and a final exam
worth 25%. 50% of the course grade will be based on a serious
course-long project which will be decided upon and begun by the second
or third week. You'll either propose one to me or select one from a
rich list that I'll provide. Learning by doing is the only way to
learn something non-trivial. The goal is for you to produce a
publication-quality paper which you will give a 20-minute talk on at
the end of the class and submit somewhere. I will give you as much
one-on-one time with me as I can given the large number of students
enrolled in the class. I will leave one entire day per week free for
one-on-one appointments which you can make as you need them. You can
also drop in during that day if you have a quick question for which
email does not suffice. Every week you will email me a report on your
progress on the project for that week. Warning: I am serious about
the project being publication-quality, and my time spent meeting with
you is valuable. Therefore, I will expect a fair amount of work from
you every week. If this scares you, don't take this course. If
you are taking more than one other course, you will probably not be
able to meet these expectations.
The projects I have in mind consist of basic but open questions in the
field that I feel should be investigated or low-hanging fruit that
someone should do. If you want to propose a project, the main rule is
that it must be of sufficient machine learning depth and something where
my advice will help you. I'm happy for it to be something related to
your thesis or research outside the class. For students with little
or moderate previous exposure to machine learning, some examples of
good projects include applications of ML to new problems and empirical
comparisons of recent ML methods for a specific problem. It will be
possible, for some projects, to work in teams. However, each student
will email me their weekly progress report separately, cc'ing all the
teammates. In the first week you will email me some details about
your background and either some selections from my project list or a
very brief description of your proposed project. In our first
meeting, in the second or third week, we will discuss and preliminarily
settle on your project and its goals. You'll do preliminary work on
this, during which time we may decide to alter the project's goals or switch
to another project depending on how the preliminary work went.
Right after the midterm exam we'll finalize your project and its goals,
and the project officially begins then.
Intended Audience
This course is for PhD students (in any discipline) only. Students
of any level may audit. Anyone who wants to competently apply or
design statistical and machine learning methods on real-world datasets
can benefit from understanding the tools I will describe. For anyone
who wants to do research in machine learning, this course is
definitely for you. This course can be thought of as extending the
graduate Intro to Machine Learning course, although it is not strictly
necessary to take that course first (it will be very helpful though).
In that course you'll learn about and implement several machine
learning methods, which is highly recommended. In this course I'll
focus on foundational ideas and only cover actual methods very briefly
as special cases of more general frameworks and theoretical
approaches. In the Math department there is a new class being taught
on Learning Theory, a sub-topic of machine learning which focuses
mainly on theoretical bounds for the classification problem. In my
class I will briefly cover learning theory but will only hit the highlights
which inform practice - if you want to really master proving those
kinds of elegant theorems (and you have a background in Real Analysis)
check out that course.
I will assume very little specific prior knowledge, aside from basic
math familiarity (calculus, matrices, probability, random variables)
and general principles of computer science (algorithms, data
structures); however, since this is a PhD-level course I'll assume a
certain level of general technical maturity (in other words don't
expect the material to be trivial). Familiarity with machine learning
will be very helpful but is not strictly necessary. I anticipate and
welcome participants with diverse backgrounds and needs, including
audits. However, I think the project part of the class will be a
unique and powerful apprenticeship-style learning opportunity, so I
highly recommend taking on this commitment. I will try to make sure
you get something out of the project, including a solid Georgia Tech
technical report or a conference publication if you are willing to
commit your time and energy. Again, if you are unwilling or unable to
commit enough time and energy, I highly recommend NOT taking this
course -- you can audit it or wait until a future offering of this
course.
Scheduling meetings
You book appointments with me yourself, week by week, on the schedule
swiki. Please only make
appointments when you feel that email communication or grabbing me right after
class is not sufficient. You can make a 15-, 30-, or 45-minute appointment
(by signing up for contiguous 15-minute segments)
but please reserve only the amount of time you think you'll need. I expect you
will only need to meet with me occasionally, not every week.
Books
The tests will be based on the material in the lectures. However, it will
be very helpful to you to read the more extensive presentation of this
material in books. Much of the
first third of the course (statistical principles)
can be found in All of Statistics by Wasserman.
Much of the second third of this
course (methods and models) can be found in The Elements of Statistical
Learning by Hastie, Tibshirani, and Friedman. I recommend buying both
of these books.
Syllabus