Project 4 / Scene Recognition with Bag of Words

This project focused on the task of recognizing scenes using different combinations of image features and machine learning classifiers. The three feature/classifier combinations used for the project were:

  1. Tiny images and nearest neighbor classifier
  2. Bags of SIFT words and nearest neighbor classifier
  3. Bags of SIFT words and 15 1-vs-all support vector machines

Part 1

The first part of the project required implementing a rather ineffective image feature, the "tiny image". This consisted of resizing an image to 16x16 pixels, subtracting the mean so the pixel intensities average to zero, normalizing the result to unit length, and returning it as a 1x256 vector. To compare scene features for matching, I then implemented the nearest neighbor classifier, which assigns each test image feature the label of the training image feature it is closest to in terms of pairwise distance. This combination achieved an average accuracy of 22.4%, as shown by the confusion matrix and table below.
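Below is a minimal sketch of these two pieces, assuming grayscale images stored as NumPy arrays; the function names and the use of scikit-image/SciPy are illustrative choices, not the exact code used in the project.

    import numpy as np
    from scipy.spatial.distance import cdist
    from skimage.transform import resize

    def tiny_image_feature(image, size=16):
        """Resize to size x size, zero-center the intensities, and unit-normalize."""
        vec = resize(image, (size, size)).flatten().astype(np.float64)
        vec -= vec.mean()                           # make the average intensity zero
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 0 else vec      # 1x256 unit-length vector

    def nearest_neighbor_classify(train_feats, train_labels, test_feats):
        """Label each test feature with the label of its closest training feature."""
        dists = cdist(test_feats, train_feats)      # pairwise Euclidean distances
        nearest = np.argmin(dists, axis=1)
        return [train_labels[i] for i in nearest]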


Accuracy (mean of diagonal of confusion matrix) is 0.224

Per-category accuracy (the full results table also shows sample training images, true positives, false positives with their true labels, and false negatives with their predicted labels for each category):

Category       Accuracy
Kitchen        0.080
Store          0.020
Bedroom        0.180
LivingRoom     0.100
Office         0.160
Industrial     0.120
Suburb         0.370
InsideCity     0.060
TallBuilding   0.230
Street         0.420
Highway        0.560
OpenCountry    0.350
Coast          0.400
Mountain       0.180
Forest         0.130

Part 2

The next part of the project involved using a more sophisticated and improved image feature, the bag of words. For images, a "visual word" is a representative SIFT descriptor: the center of a cluster of similar SIFT descriptors sampled from the training images. The first step was therefore building a vocabulary of visual words that new SIFT descriptors could be compared against. This was done with the unsupervised learning method k-means clustering, with k = 200, which produced 200 cluster centers to serve as the vocabulary. Constructing a bag of SIFTs for an image then amounted to building a 200-bin histogram, where each bin counts the number of SIFT descriptors sampled from the image whose closest visual word is the one corresponding to that bin. The histogram was then normalized and used as the feature representing its image. These bags of SIFTs were paired with the nearest neighbor classifier to see whether they yielded better results than the tiny images. The result was 50.3% accuracy with a step size of 10 for sampling SIFT descriptors both while building the vocabulary and while building the bags. Using a step size of 7 during bag building yielded a higher accuracy of 52.1%, but the process took longer (a little under 12 minutes). The confusion matrix and table below show the results of the 52.1% run.
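The sketch below outlines the vocabulary-building and bag-of-SIFTs steps. The dense SIFT sampling here uses OpenCV keypoints on a regular grid, and the clustering uses scikit-learn's KMeans; these are stand-ins for whatever SIFT/k-means implementation the project actually used, and all names and parameters are illustrative.

    import numpy as np
    import cv2
    from scipy.spatial.distance import cdist
    from sklearn.cluster import KMeans

    def dense_sift_descriptors(image, step):
        """Compute SIFT descriptors on a regular grid (dense sampling)."""
        keypoints = [cv2.KeyPoint(float(x), float(y), float(step))
                     for y in range(step, image.shape[0] - step, step)
                     for x in range(step, image.shape[1] - step, step)]
        _, descriptors = cv2.SIFT_create().compute(image, keypoints)
        return descriptors                                # (N, 128) array

    def build_vocabulary(train_images, vocab_size=200, step=10):
        """Cluster densely sampled SIFT descriptors; the k = 200 centers are the visual words."""
        descriptors = np.vstack([dense_sift_descriptors(img, step) for img in train_images])
        kmeans = KMeans(n_clusters=vocab_size, n_init=10).fit(descriptors)
        return kmeans.cluster_centers_                    # (200, 128) vocabulary

    def bag_of_sifts(image, vocab, step=10):
        """200-bin histogram of nearest visual words, normalized to unit length."""
        descriptors = dense_sift_descriptors(image, step)
        nearest_word = np.argmin(cdist(descriptors, vocab), axis=1)
        hist = np.bincount(nearest_word, minlength=len(vocab)).astype(np.float64)
        return hist / np.linalg.norm(hist)                # normalized image feature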


Accuracy (mean of diagonal of confusion matrix) is 0.521

Per-category accuracy (the full results table also shows sample training images, true positives, false positives with their true labels, and false negatives with their predicted labels for each category):

Category       Accuracy
Kitchen        0.390
Store          0.490
Bedroom        0.280
LivingRoom     0.350
Office         0.730
Industrial     0.270
Suburb         0.890
InsideCity     0.290
TallBuilding   0.370
Street         0.550
Highway        0.770
OpenCountry    0.450
Coast          0.520
Mountain       0.550
Forest         0.910

Part 3

The final task in this assignment was to further improve scene recognition performance by implementing a more capable classifier: a group of 15 support vector machines. A support vector machine is essentially a binary classifier; a linear SVM scores an input with the function w·x + b, where x is the feature vector, w is a learned weight vector, and b is a bias. A 1-vs-all SVM was trained for each of the 15 categories using the bags of SIFTs from the training images. The bag of SIFTs of each test image was then scored by every SVM, and the test image was assigned the label of the SVM that yielded the highest score. The highest accuracy, 67.0%, was obtained with an SVM regularization factor of 0.0001 and a step size of 7 to allow denser SIFT sampling. However, this run was slow at close to 11 minutes. Changing the step size to 10 while keeping the same regularization factor cut the runtime to about 6 minutes at a lower but still respectable accuracy of 63.9%. Keeping the step size at 10 but decreasing the regularization factor to 0.00001 was not helpful, yielding an accuracy of 59.3%. Here are the results for the best run, which yielded 67%:
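A minimal sketch of the 1-vs-all scheme is shown below, using scikit-learn's LinearSVC as a stand-in for the SVM solver actually used; the regularization constant C here is illustrative and does not map one-to-one onto the lambda values quoted above.

    import numpy as np
    from sklearn.svm import LinearSVC

    def svm_classify(train_feats, train_labels, test_feats, categories, C=1.0):
        """Train one binary SVM per category and label each test image by the highest score."""
        scores = np.zeros((len(test_feats), len(categories)))
        for j, category in enumerate(categories):
            binary = np.array([1 if label == category else -1 for label in train_labels])
            svm = LinearSVC(C=C).fit(train_feats, binary)
            scores[:, j] = svm.decision_function(test_feats)   # w . x + b for each test image
        return [categories[j] for j in np.argmax(scores, axis=1)]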

Scene classification results visualization


Accuracy (mean of diagonal of confusion matrix) is 0.670

Per-category accuracy (the full results table also shows sample training images, true positives, false positives with their true labels, and false negatives with their predicted labels for each category):

Category       Accuracy
Kitchen        0.580
Store          0.530
Bedroom        0.390
LivingRoom     0.300
Office         0.870
Industrial     0.540
Suburb         0.930
InsideCity     0.590
TallBuilding   0.750
Street         0.690
Highway        0.770
OpenCountry    0.490
Coast          0.830
Mountain       0.850
Forest         0.940