CS 4641 Project 2: Unsupervised and Supervised Learning

Due: 2012-07-16

In this project, you will apply several algorithms to two data sets. Please answer the questions in the order they appear. Do not skip ahead to later steps in order to answer earlier questions that ask you to predict outcomes based on your analysis of the data and your understanding of the algorithms.

Submit your report, titled <yourgtname>-project2.pdf, as an attachment to a submission to Project 2 on T-Square by the end of the day on the due date. T-Square will be set to accept repeat submissions.

The Data Sets

  1. The telecom churn data set on T-Square
  2. The Iris data from Weka

The Algorithms

  1. Clustering
    • clusterers/SimpleKMeans
    • clusterers/EM
  2. Decision Trees
    • classifiers/trees/J48
  3. PCA
    • attributeSelection/PrincipalComponents
  4. SVM
    • classifiers/functions/SMO
  5. Boosting
    • classifiers/meta/AdaBoostM1

The Report

Copy and paste the text below into a text editor or use this LaTeX template: yourgtname-project2.tex. Submit only a PDF. Sending me an MS Word document is equivalent to sending me nothing.

Support Vector Machines

Run SVMs on the Iris data set from Project 1 using polynomial and RBF kernels.

  • Which kernel works better? Why?
  • How did the SVMs compare to the classifiers from Project 1 in terms of training time and test performance?
  • Run support vector machines on the churn calibration data using polynomial and RBF kernels.
  • Which kernel works better? Why?
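The assignment uses Weka's SMO, but the kernel comparison can be sketched in scikit-learn as well. This is an illustrative sketch only: the 70/30 split, default kernel parameters, and use of `SVC` are my assumptions, not part of the assignment.

```python
# Sketch: compare polynomial and RBF kernels on the Iris data
# (scikit-learn stand-in for Weka's classifiers/functions/SMO).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for kernel in ("poly", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
print(scores)
```

For the churn data, the same loop applies once the ARFF/CSV file is loaded; in Weka you would instead set the kernel option on SMO (PolyKernel vs. RBFKernel).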

Run PCA to reduce the dimensionality of the churn calibration data and run SVMs on the reduced data as you did for the original data.

  • How many principal components did you pick? Why? How much of the variance in the churn data is described by the principal components you chose?
  • How did the SVMs perform on the reduced data compared to the original data? Why?
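One way to sketch the PCA-then-SVM pipeline (the assignment uses Weka's PrincipalComponents filter; here Iris stands in for the churn data, and the 95% variance threshold is an assumed choice for illustration):

```python
# Sketch: reduce dimensionality with PCA, then train an SVM on the
# reduced representation. n_components=0.95 keeps enough principal
# components to explain 95% of the variance.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=0.95),
                     SVC(kernel="rbf"))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Putting PCA inside the pipeline ensures the components are refit on each training fold rather than on the full data set.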

Reduce the dimensionality of the churn calibration data using subset selection.

  • Compare the process of using subset selection to PCA.
  • How did the performance on the reduced data set using subset selection compare to the PCA reduced data?
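Weka's attribute selection panel offers several subset-selection evaluators (e.g., CfsSubsetEval with a search method). As a rough stand-in, here is a univariate filter sketch in scikit-learn; the choice of `f_classif` and `k=2` is assumed for illustration:

```python
# Sketch: select a subset of the original attributes (rather than
# projecting onto new axes, as PCA does).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(f_classif, k=2).fit(X, y)
X_reduced = selector.transform(X)
print(X_reduced.shape)
```

Unlike PCA, the retained columns are original attributes, so the reduced data remains directly interpretable.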

Boosting

Run AdaBoost with decision stumps on the Iris data.

  • How did the boosted decision stumps compare to the J48 classifier from Project 1?
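As a scikit-learn sketch of the same experiment (Weka's AdaBoostM1 with DecisionStump): `AdaBoostClassifier`'s default base estimator is a depth-1 decision tree, i.e., a decision stump. The 50-round setting is an assumed default for illustration.

```python
# Sketch: boosted decision stumps on Iris, scored with 5-fold CV.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
boost = AdaBoostClassifier(n_estimators=50)  # default base = depth-1 tree
scores = cross_val_score(boost, X, y, cv=5)
print(scores.mean())
```

Comparing this mean accuracy (and training time) against your J48 results from Project 1 answers the question above.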

Dimensionality Reduction and Visualization

Use PCA to reduce the dimensionality of the churn calibration data to visualize the data set in two or three dimensions.

  • How much of the variance in the data is described by the first two or three principal components?
  • What does the visualization tell you about the data?
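The variance question can be read directly off the PCA fit. A sketch (Iris stands in for the churn data; in Weka, PrincipalComponents reports the cumulative variance in its output):

```python
# Sketch: fit a 2-component PCA and report the fraction of total
# variance the first two principal components explain. The projected
# points in X_2d can then be scatter-plotted for the visualization.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_.sum())
```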

Data Mining with Clustering and Decision Tree Learning

Use one of the techniques discussed in class to choose a k setting (number of clusters) for k-means and EM on the churn calibration data without the churn labels.

  • How did you choose k? Why?
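One standard technique is the elbow method: plot within-cluster sum of squares (inertia) against k and pick the point where the curve flattens. A minimal sketch, with Iris standing in for the unlabeled churn data:

```python
# Sketch: elbow method for choosing k. Inertia always decreases with k;
# the "elbow" is where the marginal improvement drops off.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 8)]
print(inertias)
```

For EM, the analogous curve uses a model-selection criterion such as BIC instead of inertia.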

Run k-means and EM on the data.

  • Are the clusters good? Why?
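"Good" can be argued quantitatively with an internal measure such as the silhouette coefficient. A sketch using scikit-learn's `KMeans` and `GaussianMixture` (the latter standing in for Weka's EM clusterer), again with Iris as the stand-in data and k=3 assumed:

```python
# Sketch: cluster with k-means and EM, then compare silhouette scores
# (closer to 1 means tighter, better-separated clusters).
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

X, _ = load_iris(return_X_y=True)
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
em_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)
print(silhouette_score(X, km_labels), silhouette_score(X, em_labels))
```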

Use the clusters as labels for the churn data, run the decision tree algorithm, extract rules from the resulting tree, and give descriptive names to the labels/clusters. You may use the original churn data or use dimensionality-reduced churn data.

  • Is there a reason to prefer dimensionality-reduced data over the original data?
  • Which dimensionality reduction technique did you use? Why?
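The cluster-to-rules step might be sketched as follows (scikit-learn in place of Weka's J48; the depth limit of 3 is an assumed choice to keep the rules short):

```python
# Sketch: treat cluster assignments as class labels, fit a shallow
# decision tree, and print its branches as human-readable rules.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, _ = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
tree = DecisionTreeClassifier(max_depth=3).fit(X, labels)
print(export_text(tree))  # if-then rules describing each cluster
```

The printed rules are what you would read off to invent descriptive cluster names.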

Now use the clusters as attributes for the churn data instances, add the churn labels, and run SVMs to predict churn as you did in the first part of this project.

  • How did the performance compare to previous methods?
  • What does this suggest about the clusters and the knowledge extracted from them?
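The clusters-as-attributes step can be sketched by one-hot encoding the cluster assignment and appending it to the feature matrix before training the SVM. Iris again stands in for the churn data; note that, strictly, the clustering should be refit on each training fold to avoid leakage, which this sketch omits for brevity.

```python
# Sketch: append one-hot cluster-membership columns to the original
# attributes, then train an SVM on the augmented data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
onehot = OneHotEncoder().fit_transform(clusters.reshape(-1, 1)).toarray()
X_aug = np.hstack([X, onehot])
scores = cross_val_score(SVC(kernel="rbf"), X_aug, y, cv=5)
print(scores.mean())
```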