Project 4 / Scene Recognition with Bag of Words

Implementation

Tiny Images

For the tiny image descriptor I load each image, resize it to 16 by 16, flatten the resized image into a 256-dimensional descriptor, and normalize it.
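The original implementation is likely MATLAB; as a rough Python sketch of the same idea (nearest-neighbor downsampling and zero-mean, unit-norm normalization are assumptions, since the exact resize and normalization used are not specified), the descriptor could look like:

```python
import numpy as np

def tiny_image(img, size=16):
    """Hypothetical sketch: downsample a grayscale image to size x size
    by nearest-neighbor sampling, flatten to a size*size vector,
    then make it zero mean and unit norm."""
    h, w = img.shape
    rows = np.arange(size) * h // size          # sampled row indices
    cols = np.arange(size) * w // size          # sampled column indices
    small = img[np.ix_(rows, cols)].astype(np.float64)
    feat = small.ravel()                        # 16*16 = 256 dimensions
    feat -= feat.mean()                         # zero mean
    norm = np.linalg.norm(feat)
    return feat / norm if norm > 0 else feat

desc = tiny_image(np.random.rand(240, 320))
print(desc.shape)  # (256,)
```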

Vocab/Bag of Sifts

To build the vocabulary, SIFT features are extracted from each training image and clustered with k-means; the cluster centers become the vocabulary. To build the bag-of-SIFT features, SIFT features are once again extracted from each image. Each feature is assigned to the vocabulary center at minimum distance, and a histogram is built from the number of times each cluster in the vocabulary was used. This histogram becomes the new image feature.
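A minimal Python sketch of these two steps, with a toy k-means standing in for whatever clustering routine was actually used (e.g. vl_kmeans) and random vectors standing in for real SIFT descriptors:

```python
import numpy as np

def build_vocab(features, k, iters=20, seed=0):
    """Toy k-means: cluster (N, D) descriptors into k centers,
    which serve as the visual-word vocabulary."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)].astype(np.float64)
    for _ in range(iters):
        # distance from every descriptor to every center
        d = np.linalg.norm(features[:, None] - centers[None], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            pts = features[assign == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def bag_of_words(features, vocab):
    """Assign each descriptor to its nearest vocab center and
    return a normalized histogram of word counts."""
    d = np.linalg.norm(features[:, None] - vocab[None], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(vocab)).astype(np.float64)
    return hist / hist.sum()
```

In practice the per-image histograms are stacked into a (vocab size) x N feature matrix.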

Nearest Neighbor

For nearest neighbor I find the minimum distance between each test feature and the training features. The image category is then selected by looking up the nearest neighbor's entry in train_labels.
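This step can be sketched in a few lines of Python (Euclidean distance is assumed, as the distance metric is not stated):

```python
import numpy as np

def nearest_neighbor_classify(train_feats, train_labels, test_feats):
    """1-NN: for each test feature, find the training feature at
    minimum Euclidean distance and copy that neighbor's label."""
    d = np.linalg.norm(test_feats[:, None] - train_feats[None], axis=2)
    return [train_labels[i] for i in d.argmin(axis=1)]
```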

Support Vector Machine

strcmp is used to find which labels match each category, and the result is used to construct the binary labels needed to train a one-vs-all SVM. Each SVM is trained on the training features with the binary labels we previously created. Training the SVM gives us a weight vector W and an offset b. The formula W * test_feature + b is used to calculate the confidence for each match, and the category of the most confident SVM is selected.
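The label construction and the decision rule can be sketched as follows (Python standing in for MATLAB, the category list is an illustrative subset, and W and b are assumed to come from an already-trained linear SVM such as vl_svmtrain):

```python
import numpy as np

categories = ['Kitchen', 'Office', 'Forest']   # illustrative subset

def one_vs_all_labels(train_labels, category):
    """Mirror of the strcmp step: +1 where the training label matches
    the category being trained, -1 everywhere else."""
    return np.where(np.array(train_labels) == category, 1.0, -1.0)

def classify(W, b, test_feats):
    """Confidence of each one-vs-all SVM is w . x + b; the predicted
    category is the SVM with the maximum confidence."""
    conf = test_feats @ W.T + b          # (num_test, num_categories)
    return [categories[i] for i in conf.argmax(axis=1)]
```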

Combining Gist and Sift

To combine the GIST and SIFT features I concatenated the GIST feature of each image with the SIFT histogram of that same image. This gives a 604 by N feature matrix.
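The concatenation itself is a simple stack of the two per-image feature matrices; the dimensions below are assumptions chosen only so the result matches the 604-dimensional combined feature described above:

```python
import numpy as np

num_images = 5
sift_feats = np.random.rand(100, num_images)   # assumed vocab size of 100
gist_feats = np.random.rand(504, num_images)   # assumed GIST dimensionality

# Stack the per-image GIST and SIFT columns into one combined feature matrix.
combined = np.vstack([gist_feats, sift_feats])
print(combined.shape)  # (604, 5)
```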

Results

For tiny images the accuracy was 20%. For bag of SIFTs with nearest neighbor the accuracy was 48.6%. Bag of SIFTs with an SVM gave 62.5% accuracy. The GIST-SIFT combination feature performed marginally better with 64.7% accuracy. Running different cluster numbers had an interesting effect:

  1. 10 clusters = 40.7%
  2. 20 clusters = 46.5%
  3. 50 clusters = 54.5%
  4. 100 clusters = 58.3%
  5. 200 clusters = 56.8%
  6. 400 clusters = 50.1%
  7. 1000 clusters = 43.3%

The step size for all of these vocabulary tests was 15, with 10 for the bag of SIFTs. Interestingly, my accuracy seemed to decrease when using over 100 clusters.

Scene classification results visualization for Best Sift-SVM


Accuracy (mean of diagonal of confusion matrix) is 0.625

(The original visualization showed sample training images and sample true positives for each category; only the per-category accuracy and the labels of the misclassified examples survive, tabulated below.)

Category      Accuracy   False positives (true label)   False negatives (predicted label)
Kitchen       0.550      Office, Office                 Store, Store
Store         0.390      InsideCity, Bedroom            LivingRoom, Highway
Bedroom       0.380      Kitchen, LivingRoom            LivingRoom, Kitchen
LivingRoom    0.340      Office, InsideCity             Store, Industrial
Office        0.850      LivingRoom, Bedroom            Kitchen, Kitchen
Industrial    0.450      Bedroom, Store                 Street, Office
Suburb        0.900      OpenCountry, OpenCountry       OpenCountry, InsideCity
InsideCity    0.570      Kitchen, Street                LivingRoom, LivingRoom
TallBuilding  0.700      InsideCity, Store              Store, Mountain
Street        0.540      TallBuilding, OpenCountry      Store, Highway
Highway       0.770      Mountain, Industrial           Coast, InsideCity
OpenCountry   0.490      Mountain, Highway              Highway, Mountain
Coast         0.750      Highway, OpenCountry           OpenCountry, Mountain
Mountain      0.770      OpenCountry, Street            Suburb, OpenCountry
Forest        0.930      Mountain, OpenCountry          OpenCountry, Mountain

There are several free parameters in the implementation. Variables such as the step size for the SIFT descriptor and the lambda value in the SVM can have drastic effects on the accuracy achieved. The best SIFT-SVM shown above used a step of 10 for the vocabulary and a step of 5 for the bag of SIFTs. The lambda value was 0.000001.