CS 3600, Introduction to AI
Project #4
Supervised Learning

Details

Due: April 29 at 11:55 PM
Please submit via T-square.

Read all the way through this assignment before starting. It is different from previous assignments.

Why?

This project is intended to familiarize you with one of the standard approaches to classification problems, decision trees. You will code up decision tree learning and then apply it to several relatively simple problems. The hope is that, as you work on this project, you will come to understand the supervised learning process.

General Information

This project has two parts. In the first part, you will code up decision tree learning and test it on various data sets. In the second part, you will provide a writeup discussing the results.

This project will be done in Python. Your submission should consist of your code and your writeup. For the code, it should all be in project4.py (see below); follow the standard submission convention you used in projects 1-4. For the writeup, publish it as a doc, rtf, or pdf document and include it with your submission as a separate file.

Part I: Decision Tree Learning (60%)

A classification problem is a problem where you classify instances into classes based on their features. For example, given the features length_gt_5, has_teeth, big_and_scary, and flies, we classify into monster and not_monster.

length_gt_5=y, has_teeth=y, big_and_scary=y, flies=n --> monster
length_gt_5=n, has_teeth=n, big_and_scary=y, flies=y --> monster
length_gt_5=n, has_teeth=n, big_and_scary=n, flies=n --> not_monster
length_gt_5=n, has_teeth=n, big_and_scary=y, flies=n --> not_monster
...

Other possible classification problems would include "Should I date this person or not?" or "Is this a good investment?" or "Animal or not?" or "Is this the picture of a man or a woman?"

A classification problem consists of four variables:

You must implement decision tree learning with information gain, as described in detail in Russell & Novrig 18.3.

For testing purposes, we have provided two datasets, which are in the format described above. The datasets are found in project4.py. Both data sets are binary classification tasks with binary attributes, which means that you only have to implement binary classification for decision trees. Note that the format of the data determines the inputs to your decision tree algorithm, which should follow the general pattern of supervised learning. You are responsible for designing the data structures and functions required to implement decision tree learning on these data sets.

Additionally, we provide two more extra credit data sets that are binary classification tasks with numeric attributes. For your convenience, and because they have only two attributes, we have graphed data sets #3-4:

Data Set #3:

Data Set #4:

Your algorithm should work on data sets #1-2 for full credit. Note that data sets #3-4 are extra credit. You can ignore data sets #5-6. You will be responsible for coming up with a way to tackle this for extra credit (10% of this project's grade), but to receive this extra credit, your code must work for BOTH sets (3 and 4).

Please provide accuracy statistics for your algorithm on the data sets you run them on. WE WILL BE CHECKING THESE AND GRADING YOU BASED ON IT. An accuracy statistic is defined as the percentage of examples the algorithm classified correctly. You should provide accuracy statistics for BOTH the training and testing examples of ALL the data sets.

In terms of coding conventions, please follow the framework we provided in project4.py. Feel free to use any of the utility functions, but make sure your implementation works with our code or else we won't be able to run it and you will have points taken off. Follow the instructions in the comments.

Part II: Analysis (40%)

The analysis should be provided in 1-2 paragraphs. Please keep it short and concise. Here are the questions it should answer: