Project 4 / Scene Recognition with Bag of Words

Pipeline 1: Tiny Image Features and Nearest Neighbor Classifier

This was a fairly straightforward pipeline to implement. To create the features, I simply resized each image to a 16x16 resolution. For the nearest neighbor classifier, I used the single nearest neighbor (1-NN algorithm). The accuracy that I got was 19.1%, which was close to triple the performance of random chance, and within the parameters specified by the project guidelines. This whole pipeline runs in roughly 2 minutes. The confusion matrix is shown below. It looks like the algorithm tended to guess the "outdoor" images more than others.

Pipeline 2: Bag of Sift Features and Nearest Neighbor Classifier

In this section, the bag of words model was implemented. First, I implemented the vocabulary builder. This section took the sift features from the training images (using a step size parameter of 3, for better accuracy) and clustered them together to create the words. I stuck with the default 200 words for the vocabulary size. Getting the bags of words for each image was probably the most complicated piece of the project. First, my code finds the sift feature in a test image, it then determines which word that feature most closely resembles (similarly to how the nearest neighbor algorithm itself operates), and then creates a normalized histogram of word frequency. I used a step size of 16 here on my best run (smaller step sizes took far too long to run, while large step sizes were less accurate - for example, a step size of 100 led to an accuracy of less than .2), which produced an accuracy of .417. The run time was around 35 minutes.

Pipeline 3: Bag of Sift Features and 1-Vs.-All Linear SVM Classifier

The new piece of implementation for the third pipeline was the SVM classifier. Since the SVM was a binary classifier, there needed to be 15 of them - one for each category with a "yes" or "no" label. Through experimentation, I arrived at a lambda of .000001. This classifier increased performance to .476, again with a run time of around 35 minutes. The results are listed at the bottom of this page. This pipeline also performed better on the outdoor images. For example, it guessed forest correctly 83% of the time.

Conclusion

I was disappointed that I wasn't able to get my metrics (run-time and accuracy) within the parameters outlined in the project description. The run times were what held me back, as I wasn't able to experiment with parameters as much as I would've liked too (although I was able to do this to some extent). The accuracies, however, were directionally correct. I was surprised that the perfomance was even this high though, and I enjoyed this look at the foundations of combining computer vision and machine learning.

Scene classification results visualization

Accuracy (mean of diagonal of confusion matrix) is 0.476

Category name Accuracy Sample training images Sample true positives False positives with true label False negatives with wrong predicted label

Kitchen 0.150
Industrial
Industrial
LivingRoom
Mountain

Store 0.260
LivingRoom
Industrial
Industrial
Suburb

Bedroom 0.570
Kitchen
Kitchen
LivingRoom
LivingRoom

LivingRoom 0.260
Bedroom
Bedroom
Bedroom
Office

Office 0.640
Bedroom
LivingRoom
LivingRoom
LivingRoom

Industrial 0.260
Kitchen
TallBuilding
Bedroom
Highway

Suburb 0.550
Coast
LivingRoom
Store
LivingRoom

InsideCity 0.530
Store
Kitchen
Office
Kitchen

TallBuilding 0.560
Store
InsideCity
InsideCity
Store

Street 0.320
TallBuilding
Highway
Kitchen
Suburb

Highway 0.690
Mountain
OpenCountry
Coast
Mountain

OpenCountry 0.310
Coast
Suburb
Street
Highway

Coast 0.600
OpenCountry
OpenCountry
OpenCountry
Suburb

Mountain 0.610
OpenCountry
OpenCountry
OpenCountry
OpenCountry

Forest 0.830
OpenCountry
Street
TallBuilding
Mountain

Category name Accuracy Sample training images Sample true positives False positives with true label False negatives with wrong predicted label