CS 4641 Project 1: Supervised Learning

Due: 2012-06-11

In this project, you will apply three algorithms to two data sets. The data and the algorithms are provided by Weka. Please answer the questions in the order in which they appear. Do not skip ahead to later steps to answer earlier questions that ask you to predict outcomes based on your analysis of the data and understanding of the algorithms.

Submit your report, titled <yourgtname>-project1.pdf, as an attachment to a submission to Project 1 on T-Square by midnight on the due date. T-Square will be set to accept repeat submissions.

The Data Sets

  1. Contact Lens data
    • $WEKA_HOME/data/contact-lenses.arff
  2. Iris data
    • $WEKA_HOME/data/iris.arff

The Algorithms

  1. K-Nearest Neighbors
    • classifiers/lazy/IBk
  2. Decision Trees
    • classifiers/trees/Id3
    • classifiers/trees/J48
  3. Neural Networks
    • classifiers/functions/MultilayerPerceptron

The Report

Copy and paste the text below into a text editor or use this LaTeX template: yourgtname-project1.tex. Submit only a PDF. Sending me an MS Word document is equivalent to sending me nothing.

Data Sets

What are the differences between the two data sets?

Which algorithm do you expect to perform best on the Contact Lens data? Why?

Which algorithm do you expect to perform best on the Iris data? Why?

KNN on the Contact Lens Data

Run KNN on each data set with 1, 3, 5, 7 and 9 neighbors. Report the results for each run in a confusion matrix and comparisons in a table or graph.
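
If you want to sanity-check a confusion matrix by hand, the following is a minimal Python sketch, not part of Weka. It assumes rows are actual classes and columns are predicted classes. The `actual` and `predicted` lists are made up for illustration and are not results from either data set.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Build a matrix where rows are actual classes and columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Hypothetical labels and predictions, invented for this example only.
labels = ["soft", "hard", "none"]
actual = ["soft", "soft", "hard", "none", "none", "none"]
predicted = ["soft", "none", "hard", "none", "none", "soft"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
print("accuracy:", accuracy(matrix))
```

Each diagonal entry counts correct predictions for one class; off-diagonal entries show which classes are confused with which.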

Here's a 3-column table in LaTeX. Text in the left column is left-aligned (l), text in the center column is center-aligned (c), and text in the right column is right-aligned (r). Columns are separated by & characters in the table rows, each row ends with \\, and horizontal lines are drawn with \hline.

    \begin{tabular}{|l|c|r|}
    Header & Header & Header \\
    \hline\hline
    A & B & C \\
    D & E & F \\
    \hline
    \end{tabular}

Which K gives the best results? Why?

Holding K constant, try different distance functions on each data set. Which distance function(s) work best for each data set? Why?
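
To see why the choice of distance function can change a prediction, here is a minimal Python sketch of k-nearest neighbors with two common distance functions, Euclidean and Manhattan. It is an illustration under simplifying assumptions (numeric attributes only, no normalization, simple majority vote), not Weka's IBk implementation. The training points and query are made up.

```python
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    """Sum of absolute per-attribute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train, query, k, distance):
    """train is a list of (features, label); return the majority label of the k nearest."""
    nearest = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy points chosen so the two distance functions disagree.
train = [((3.0, 0.0), "a"), ((2.0, 2.0), "b")]
query = (0.0, 0.0)

print("euclidean:", knn_predict(train, query, 1, euclidean))
print("manhattan:", knn_predict(train, query, 1, manhattan))
```

With K = 1, the point (2, 2) is nearer under Euclidean distance (about 2.83 versus 3), while (3, 0) is nearer under Manhattan distance (3 versus 4), so the predicted class flips.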

KNN on the Iris Data

Run KNN on each data set with 1, 3, 5, 7 and 9 neighbors. Report the results for each run in a confusion matrix and comparisons in a table or graph.

Which K gives the best results? Why?

Holding K constant, try different distance functions on each data set. Which distance function(s) work best for each data set? Why?

Decision Trees on the Contact Lens Data

Based on Weka's visualizations, which attribute do you expect to be chosen as the split attribute at the root node?
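
As a reminder of how the root split is chosen, ID3 selects the attribute with the highest information gain (J48, based on C4.5, uses the related gain ratio). The following is a minimal Python sketch of the information gain calculation for nominal attributes. The attribute names and rows are a made-up toy example, not taken from either project data set.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical toy data, invented for illustration only.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

gains = {a: information_gain(rows, a, "play") for a in ("outlook", "windy")}
best = max(gains, key=gains.get)
print(gains, "-> root split:", best)
```

In this toy example, splitting on `outlook` separates the classes perfectly, while `windy` leaves each subset as mixed as the whole set, so `outlook` would be the root.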

Run each decision tree on the data and report the results for each run in a confusion matrix and comparisons in a table or graph.

How do ID3 and J48 compare in terms of performance?

How does pruning affect test performance and generalization performance? What does that suggest about overfitting?

Decision Trees on the Iris Data

Based on Weka's visualizations, which attribute do you expect to be chosen as the split attribute at the root node?

Run each decision tree on the data and report the results for each run in a confusion matrix and comparisons in a table or graph.

Why can't you run ID3 on the Iris data?

How does pruning affect test performance and generalization performance? What does that suggest about overfitting?

Multilayer Perceptrons on the Contact Lens Data

Experiment with different network structures (e.g. extra hidden layers, extra units). Report the results in graphs that show training time (epochs) versus error rate or accuracy.

Which network structures result in the most overfitting?

Multilayer Perceptrons on the Iris Data

Experiment with different network structures (e.g. extra hidden layers, extra units). Report the results in graphs that show training time (epochs) versus error rate or accuracy.

Which network structures result in the most overfitting?

General

Which algorithm performed best on each data set, for particular definitions of "best"?

Was the (comparative) performance of the algorithms as you expected? Why?

Which data set had the best performance in general across all of the algorithms? Why?
