Project 4 / Scene Recognition with Bag of Words

Implementation

Tiny Images

For the tiny image descriptor I load each image, resize it to 16 by 16, flatten the resized image into a 256-dimensional descriptor, and normalize it.
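The original implementation is likely MATLAB; as a rough Python sketch of the same idea (nearest-neighbor downsampling and zero-mean, unit-norm normalization are assumptions, since the exact resize and normalization used are not specified), the descriptor could look like:

```python
import numpy as np

def tiny_image(img, size=16):
    """Hypothetical sketch: downsample a grayscale image to size x size
    by nearest-neighbor sampling, flatten to a size*size vector,
    then make it zero mean and unit norm."""
    h, w = img.shape
    rows = np.arange(size) * h // size          # sampled row indices
    cols = np.arange(size) * w // size          # sampled column indices
    small = img[np.ix_(rows, cols)].astype(np.float64)
    feat = small.ravel()                        # 16*16 = 256 dimensions
    feat -= feat.mean()                         # zero mean
    norm = np.linalg.norm(feat)
    return feat / norm if norm > 0 else feat

desc = tiny_image(np.random.rand(240, 320))
print(desc.shape)  # (256,)
```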

Vocab/Bag of Sifts

To build the vocabulary, SIFT features are extracted from each training image and clustered with k-means; the cluster centers become the vocabulary. To build the bag-of-SIFT features, SIFT features are once again extracted from each image. Each feature is assigned to the vocabulary center at minimum distance, and a histogram is built from the number of times each cluster in the vocabulary was used. This histogram becomes the new image feature.
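A minimal Python sketch of these two steps, with a toy k-means standing in for whatever clustering routine was actually used (e.g. vl_kmeans) and random vectors standing in for real SIFT descriptors:

```python
import numpy as np

def build_vocab(features, k, iters=20, seed=0):
    """Toy k-means: cluster (N, D) descriptors into k centers,
    which serve as the visual-word vocabulary."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)].astype(np.float64)
    for _ in range(iters):
        # distance from every descriptor to every center
        d = np.linalg.norm(features[:, None] - centers[None], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            pts = features[assign == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def bag_of_words(features, vocab):
    """Assign each descriptor to its nearest vocab center and
    return a normalized histogram of word counts."""
    d = np.linalg.norm(features[:, None] - vocab[None], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(vocab)).astype(np.float64)
    return hist / hist.sum()
```

In practice the per-image histograms are stacked into a (vocab size) x N feature matrix.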

Nearest Neighbor

For nearest neighbor I find the minimum distance between each test feature and the training features. The image category is then selected by looking up the nearest neighbor's entry in train_labels.
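This step can be sketched in a few lines of Python (Euclidean distance is assumed, as the distance metric is not stated):

```python
import numpy as np

def nearest_neighbor_classify(train_feats, train_labels, test_feats):
    """1-NN: for each test feature, find the training feature at
    minimum Euclidean distance and copy that neighbor's label."""
    d = np.linalg.norm(test_feats[:, None] - train_feats[None], axis=2)
    return [train_labels[i] for i in d.argmin(axis=1)]
```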

Support Vector Machine

strcmp is used to find which labels match each category, and the result is used to construct the binary labels needed to train a one-vs-all SVM. Each SVM is trained on the training features with the binary labels we previously created. Training the SVM gives us a weight vector W and an offset b. The formula W * test_feature + b is used to calculate the confidence for each match, and the category of the most confident SVM is selected.
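The label construction and the decision rule can be sketched as follows (Python standing in for MATLAB, the category list is an illustrative subset, and W and b are assumed to come from an already-trained linear SVM such as vl_svmtrain):

```python
import numpy as np

categories = ['Kitchen', 'Office', 'Forest']   # illustrative subset

def one_vs_all_labels(train_labels, category):
    """Mirror of the strcmp step: +1 where the training label matches
    the category being trained, -1 everywhere else."""
    return np.where(np.array(train_labels) == category, 1.0, -1.0)

def classify(W, b, test_feats):
    """Confidence of each one-vs-all SVM is w . x + b; the predicted
    category is the SVM with the maximum confidence."""
    conf = test_feats @ W.T + b          # (num_test, num_categories)
    return [categories[i] for i in conf.argmax(axis=1)]
```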

Combining Gist and Sift

To combine the GIST and SIFT features I concatenated the GIST feature of each image with the SIFT histogram of that same image. This gives a 604 by N feature matrix.
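The concatenation itself is a simple stack of the two per-image feature matrices; the dimensions below are assumptions chosen only so the result matches the 604-dimensional combined feature described above:

```python
import numpy as np

num_images = 5
sift_feats = np.random.rand(100, num_images)   # assumed vocab size of 100
gist_feats = np.random.rand(504, num_images)   # assumed GIST dimensionality

# Stack the per-image GIST and SIFT columns into one combined feature matrix.
combined = np.vstack([gist_feats, sift_feats])
print(combined.shape)  # (604, 5)
```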

Results

For tiny images the accuracy was 20%. For bag of SIFTs with nearest neighbor the accuracy was 48.6%. Bag of SIFTs with an SVM gave 62.5% accuracy. The GIST-SIFT combination feature performed marginally better with 64.7% accuracy. Running different cluster numbers had an interesting effect:

  1. 10 clusters = 40.7%
  2. 20 clusters = 46.5%
  3. 50 clusters = 54.5%
  4. 100 clusters = 58.3%
  5. 200 clusters = 56.8%
  6. 400 clusters = 50.1%
  7. 1000 clusters = 43.3%

The step size for all of these vocabulary tests was 15, with 10 for the bag of SIFTs. Interestingly, my accuracy seemed to decrease when using over 100 clusters.

Scene classification results visualization for Best Sift-SVM


Accuracy (mean of diagonal of confusion matrix) is 0.625

(The original visualization showed sample training images and sample true positives for each category; only the per-category accuracy and the labels of the misclassified examples survive, tabulated below.)

Category      Accuracy   False positives (true label)   False negatives (predicted label)
Kitchen       0.550      Office, Office                 Store, Store
Store         0.390      InsideCity, Bedroom            LivingRoom, Highway
Bedroom       0.380      Kitchen, LivingRoom            LivingRoom, Kitchen
LivingRoom    0.340      Office, InsideCity             Store, Industrial
Office        0.850      LivingRoom, Bedroom            Kitchen, Kitchen
Industrial    0.450      Bedroom, Store                 Street, Office
Suburb        0.900      OpenCountry, OpenCountry       OpenCountry, InsideCity
InsideCity    0.570      Kitchen, Street                LivingRoom, LivingRoom
TallBuilding  0.700      InsideCity, Store              Store, Mountain
Street        0.540      TallBuilding, OpenCountry      Store, Highway
Highway       0.770      Mountain, Industrial           Coast, InsideCity
OpenCountry   0.490      Mountain, Highway              Highway, Mountain
Coast         0.750      Highway, OpenCountry           OpenCountry, Mountain
Mountain      0.770      OpenCountry, Street            Suburb, OpenCountry
Forest        0.930      Mountain, OpenCountry          OpenCountry, Mountain

There are several free parameters in the implementation. Variables such as the step size for the SIFT descriptor and the lambda value in the SVM can have drastic effects on the accuracy achieved. The best SIFT-SVM shown above used a step of 10 for the vocabulary and a step of 5 for the bag of SIFTs. The lambda value was 0.000001.