CS x641
Machine Learning
Assignment #1
Supervised Learning

Numbers

Due: February 8, 2008 23:59:59 EST
Please submit via tsquare.

The assignment is worth 12% of your final grade.

Why?

The purpose of this project is to explore some techniques in supervised learning. It is important to realize that understanding an algorithm or technique requires understanding how it behaves under a variety of circumstances. As such, you will be asked to implement some simple learning algorithms, and to compare their performance.

You may program in any language that you wish (we do prefer java, matlab, Lisp or C++; let us know beforehand if you're going to use something else). In any case it is your responsibility to make sure that the code runs on the standard CoC linux boxes. Re-read that last sentence.

Read everything below carefully!

The Problems Given to You

You should implement five learning algorithms. They are for:

Each algorithm is described in detail in your textbook, the handouts, and all over the web. In fact, instead of implementing the algorithms yourself, you may use software packages that you find elsewhere; however, if you do so you should provide proper attribution. Also, you will note that you have to do some fiddling to get good results, graphs and such, so even if you use another's package, you may need to be able to modify it in various ways.

Decision Trees. For the decision tree, you should implement or steal a decision tree algorithm. Be sure to use some form of pruning. You are not required to use information gain (for example, there is something called the GINI index that is sometimes used) to split attributes, but you should describe whatever it is that you do use.

Neural Networks. For the neural network you should implement or steal your favorite kind of network and training algorithm. You may use networks of nodes with as many layers as you like and any activation function you see fit.

Boosting. Implement a boosted version of your decision trees. As before, you will want to use some form of pruning, but presumably because you're using boosting you can afford to be much more aggressive about your pruning.

Support Vector Machines. You should implement (for sufficently loose definitions of implement including "download") SVMs. This should be done in such a way that you can swap out kernel functions. I'd like to see at least two.

k-Nearest Neighbors. You should implement kNN. Use different values of k.

Testing. In addition to implementing the algorithms described above, you should design two interesting classification problems. For the purposes of this assignment, a "classification problem" is just a set of training examples and a set of test examples. I don't care where you get the data. You can download some, take some from your own research, or make some up on your own. Be careful about the data you choose, though. You'll have to explain why they are interesting, use them in later assignments, and come to really care about them.

What to Turn In

You must submit via tsquare a tar or zip file named yourgtaccount.{zip,tar,tar.gz} that contains a single folder or directory named yourgtaccount. That directory in turn contains:

  1. a file named README.txt containing instructions for running your code
  2. your code (see note below)
  3. a file named analysis.pdf containing your writeup
  4. any supporting files you need, such as your training and test sets (see note below).

Note below: if the data are way, way, too huge for submitting, see if you can arrange for an URL. This also goes for code, too. Mailing me all of Weka isn't necessary, for example, because I can get it myself; however, you should at least send me any files you found necessary to change. In any case, include all the information in README.txt

The file analysis.pdf should contain:

One example of decent analysis is available on the handouts page.

Grading Criteria

You are being graded on your analysis more than anything else. Roughly speaking, implementing everything and getting it to run is worth maybe 10% of the grade on this assignment. Of course analysis without proof of working code makes the analysis suspect.

The key thing is that your explanations should be both thorough and concise. Imagine you are writing a paper for the major conference in your field the year you will be graduating and you need to impress all those folks who will be deciding whether to interview you later. You don't want them to think you're shallow do you? Or that you're incapable of coming up with interesting classification problems, right? And you surely don't want them to think that you make up for a lack of content by blathering on about irrelevant aspects of your work? Of course not.

Finally, I'd like to point out that I am very particular about the format of the assignments. Follow the directions carefully. Failure to turn in files without the proper naming scheme, or anything else that makes the grader's life unduly hard is simply going to lead to an ignored assignment. I am remarkably inflexible about this. Also, there will be no late assignments accepted, so start now. Have fun. One day you'll look back on this and smile.