Project 4 / Scene Recognition with Bag of Words

This project consisted of writing 2 feature detectors and 2 classifiers. I then used different combinations of these feature detectors and classifiers to recognize scenes.

Feature Detectors

The first feature detector I wrote was get_tiny_images. In get_tiny_images I iterated over the images and resized each one to 16x16. I then reshaped and normalized each resized image so that the features remain comparable across images.
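The tiny image idea can be sketched in a few lines of Python. This is only an illustration, not the code used for the results in this report: the function name tiny_image_feature is hypothetical, block averaging stands in for a proper image resize, and the choice of zero mean and unit length as the normalization is an assumption, since the text above only says the images were normalized.

```python
import math


def tiny_image_feature(image, size=16):
    """Turn a grayscale image (list of rows of numbers) into a tiny-image feature.

    Assumes the image is at least size x size pixels.
    """
    height, width = len(image), len(image[0])
    pixels = []
    for r in range(size):
        # Source rows that map onto output row r (always at least one row).
        r0 = r * height // size
        r1 = max((r + 1) * height // size, r0 + 1)
        for c in range(size):
            # Source columns that map onto output column c.
            c0 = c * width // size
            c1 = max((c + 1) * width // size, c0 + 1)
            block = [image[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            pixels.append(sum(block) / len(block))

    # Assumed normalization: subtract the mean, then scale to unit length.
    mean = sum(pixels) / len(pixels)
    centered = [p - mean for p in pixels]
    norm = math.sqrt(sum(v * v for v in centered))
    if norm == 0:
        return centered  # constant image: nothing to scale
    return [v / norm for v in centered]
```

Each image becomes a vector of 256 values, so two images of different sizes and brightness can still be compared directly.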

The second feature detector I wrote was get_bags_of_sifts. For bags of SIFTs I first had to construct a vocabulary. I built the vocabulary by iterating over the images and finding the SIFT features of each image with vlfeat’s vl_dsift. I used a step of 8 and a size of 4, as these values provided the best initial results. I then ran vlfeat’s vl_kmeans on all of the features obtained from all of the images. K-means clustered the features and returned a centroid feature for each cluster. These centroids became my vocabulary words, against which test features are compared.

With the vocabulary built, I was able to continue with get_bags_of_sifts. I used vlfeat’s vl_kdtreebuild to build a KD-tree from the vocab. I then iterated over the images and found the SIFT features for each image using vl_dsift again, with the same parameters as in build vocabulary. I queried the KD-tree with these features and built a histogram showing how often each cluster was used in each image. Finally, I normalized the histogram so that image size did not affect the categorization.
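The two stages described above (clustering descriptors into a vocabulary, then counting word occurrences per image) can be sketched in plain Python. This is a simplified illustration under several assumptions, not the vlfeat-based code used for the reported numbers: descriptors are taken as already extracted lists of numbers, a basic Lloyd's k-means with a deterministic start replaces vl_kmeans, a brute-force nearest-centroid search replaces the KD-tree query, and the histogram is normalized to sum to 1 (the text does not say which norm was used). All function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(point, centers):
    """Index of the center closest to point (brute force, no KD-tree)."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def build_vocabulary(descriptors, vocab_size, iterations=10):
    """Basic Lloyd's k-means; returns vocab_size centroids ("visual words").

    Assumption: centers start at the first vocab_size descriptors.
    """
    centers = [list(d) for d in descriptors[:vocab_size]]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for d in descriptors:
            clusters[nearest_index(d, centers)].append(d)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def bag_of_words_histogram(descriptors, vocabulary):
    """Count how often each visual word is the nearest one, then normalize."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_index(d, vocabulary)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    # Dividing by the total makes images with more descriptors comparable.
    return [c / total for c in counts]
```

Because the histogram is divided by the number of descriptors, a large image that produces many descriptors and a small image that produces few yield features on the same scale.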

Classifiers

The first classifier I wrote was nearest_neighbors_classify. For nearest neighbors I calculated the distance between each test image feature and every train image feature, and for each test feature found the train feature at the minimum distance. I then assigned the test feature the label of that train feature, taken from the train labels.
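A minimal Python sketch of this 1-nearest-neighbor rule follows. It is an illustration rather than the original implementation; in particular, the use of Euclidean distance is an assumption, since the text does not name the distance metric.

```python
def squared_distance(a, b):
    """Squared Euclidean distance (assumed metric) between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_classify(train_features, train_labels, test_features):
    """Give each test feature the label of its closest train feature."""
    predictions = []
    for test in test_features:
        best = min(range(len(train_features)),
                   key=lambda i: squared_distance(test, train_features[i]))
        predictions.append(train_labels[best])
    return predictions
```

Comparing squared distances gives the same nearest neighbor as comparing true distances, so the square root can be skipped.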

The second classifier I wrote was svm_classify, a support vector machine. Here I iterated over the categories, and for each category I created a 1 vs. all labeling in which the images in the focused category were denoted by 1 and the images not in the focused category were denoted by -1. I then used this labeling to train an SVM on the train image features with vlfeat’s vl_svmtrain. The SVM takes a regularization parameter lambda. I chose a lambda value of 0.0001 because the values 0.001 and 0.00001 both resulted in lower accuracies. I then iterated over the test image features and computed a confidence for each category and image from the w and b values obtained from the SVM (W*X + B). Finally, for each test image I took the category with the highest confidence as the predicted category.
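The labeling and prediction steps of the 1 vs. all scheme can be sketched in Python as below. This is an illustration only: training is deliberately left out (in this project it was done by vlfeat), the function names are hypothetical, and the weights in the usage example are made up by hand rather than learned.

```python
def one_vs_all_labels(train_labels, category):
    """+1 for images in the focused category, -1 for all other images."""
    return [1 if label == category else -1 for label in train_labels]


def svm_predict(models, test_features):
    """Pick, for each test feature, the category with the highest W*X + B.

    models maps each category name to a (w, b) pair from its own
    1 vs. all linear SVM.
    """
    predictions = []
    for x in test_features:
        scores = {
            category: sum(wi * xi for wi, xi in zip(w, x)) + b
            for category, (w, b) in models.items()
        }
        predictions.append(max(scores, key=scores.get))
    return predictions


# Hand-made (not learned) weights, purely to show the call shape.
example_models = {
    "Forest": ([1.0, 0.0], -0.5),
    "Coast": ([0.0, 1.0], -0.5),
}
example_predictions = svm_predict(example_models, [[1.0, 0.0], [0.0, 1.0]])
```

Taking the maximum confidence means an image is always assigned some category, even when every individual SVM scores it below zero.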

Accuracies

Tiny Images and Nearest Neighbor

20.1%

Bag of SIFT and Nearest Neighbor

45.3%

Bag of SIFT and 1 vs. All Linear SVM

59.6%

Extra Credit - Experimenting with different vocabulary sizes

While doing the required parts of the assignment I became very interested in how much the accuracies could change with just a new vocabulary. For example, rerunning bag of SIFT and the SVM with a different vocabulary could change the accuracy by +/-0.05%. I decided to learn more about the impact of the vocabulary by doing this extra credit. I ran both the SVM and the nearest neighbors algorithm with varying vocabulary sizes used in bag of SIFT. Run time increased greatly as the vocabulary size grew. As expected, the Bag of SIFT and Nearest Neighbor accuracies are consistently below the Bag of SIFT and SVM accuracies. However, it is interesting to note that they follow a very similar arc, with a steady increase between vocabulary sizes of 10 and 100 and a decrease after. This is most likely due to an increase in words that do not help with categorization. For example, with a small vocabulary, features that are green and of a certain texture and features that are brown and of a certain texture might both be classified as "trees", whereas with more words they might be broken down into leaves and bark, which are easier to hallucinate in images without trees.

Scene classification results visualization

(Based on highest non-extra credit run)


Accuracy (mean of diagonal of confusion matrix) is 0.596
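For reference, the "mean of diagonal of confusion matrix" figure can be computed as in the Python sketch below. It assumes each row of the confusion matrix is normalized by the number of test images in that true category, so the diagonal holds the per-category accuracies; the function names are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, categories):
    """Row-normalized confusion matrix: rows are true, columns predicted."""
    index = {category: i for i, category in enumerate(categories)}
    counts = [[0] * len(categories) for _ in categories]
    for truth, guess in zip(true_labels, predicted_labels):
        counts[index[truth]][index[guess]] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        # Each row sums to 1 unless the category has no test images.
        matrix.append([value / total if total else 0.0 for value in row])
    return matrix


def mean_diagonal(matrix):
    """Average of the per-category accuracies on the diagonal."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)
```

With equal numbers of test images per category, this average is the same as the overall fraction of correctly classified images.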

(The sample training images and sample true positives were shown as thumbnails and are not reproduced here.)

Category name | Accuracy | False positives with true label | False negatives with wrong predicted label
Kitchen       | 0.360    | LivingRoom, Coast               | Bedroom, Store
Store         | 0.530    | Street, Street                  | Forest, Bedroom
Bedroom       | 0.420    | LivingRoom, LivingRoom          | LivingRoom, Street
LivingRoom    | 0.230    | Bedroom, Bedroom                | Industrial, Mountain
Office        | 0.770    | InsideCity, Kitchen             | LivingRoom, Kitchen
Industrial    | 0.220    | LivingRoom, Kitchen             | TallBuilding, LivingRoom
Suburb        | 0.910    | OpenCountry, Store              | Street, Mountain
InsideCity    | 0.520    | TallBuilding, TallBuilding      | Store, TallBuilding
TallBuilding  | 0.670    | Street, Mountain                | Mountain, Industrial
Street        | 0.590    | Bedroom, Industrial             | Suburb, InsideCity
Highway       | 0.790    | TallBuilding, OpenCountry       | OpenCountry, LivingRoom
OpenCountry   | 0.480    | Coast, Mountain                 | TallBuilding, Coast
Coast         | 0.730    | OpenCountry, Highway            | InsideCity, OpenCountry
Mountain      | 0.790    | Suburb, Industrial              | Coast, TallBuilding
Forest        | 0.930    | Store, OpenCountry              | Mountain, Street