Project 4 / Scene Recognition with Bag of Words

Overview

The goal of this analysis, and of the algorithms implemented for it, is to get a feel for different ways to recognize the content of an image and assign it to an appropriate category. Essentially, it is a classification problem that I tackle using three different approaches: tiny-image features coupled with a 1-nearest-neighbor classifier, bags of quantized local (SIFT) features coupled with a 1-nearest-neighbor classifier, and bags of features coupled with a support vector machine. There are many parameters to tweak in order to achieve the best results, and tuning them makes up a noticeable part of this analysis.

Tiny images + 1NN

To start off, I implemented tiny image features. It is a fairly simple algorithm that resizes an image to a fixed-size smaller version of itself and stores it as a feature vector that can be used either for training or testing. One improvement that noticeably increased the performance of this algorithm for me (by at least a couple of percentage points) is normalizing the resulting feature vector, which makes sense: it should minimize the effect of things such as the overall brightness of the scene and let the classifier focus on other properties.
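For reference, a minimal sketch of such a tiny-image extractor in MATLAB is shown below. The 16x16 output size, the zero-mean/unit-length normalization, and the exact function signature are assumptions for illustration rather than the definitive implementation.

```matlab
% Hypothetical tiny-image feature extractor (sketch; 16x16 output assumed).
% image_paths: cell array of grayscale image file paths.
function image_feats = get_tiny_images(image_paths)
    dim = 16;                                    % side length of the tiny image
    image_feats = zeros(length(image_paths), dim * dim);
    for i = 1:length(image_paths)
        img  = im2double(imread(image_paths{i}));
        tiny = imresize(img, [dim dim]);         % fixed-size downsampling
        feat = tiny(:)';                         % flatten into a row vector
        feat = feat - mean(feat);                % zero mean to reduce brightness effects
        image_feats(i, :) = feat / norm(feat);   % unit length
    end
end
```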

Then came time to implement the 1-nearest-neighbor classifier. The idea behind it is fairly intuitive: for a given image in the testing set, find the training image whose feature is closest to it (using the L2 distance metric) and assign that training image's category to the test image. In the end, the results were fairly pleasing, with an accuracy of 22.5%, as opposed to roughly 7% if categories were assigned at random.
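A sketch of that classifier, assuming VLFeat's vl_alldist2 for the pairwise distances and labels stored as a cell array of strings, might look like this:

```matlab
% Hypothetical 1-nearest-neighbor classifier (sketch).
% train_feats, test_feats: one row per image; train_labels: cell array of strings.
function predicted_categories = nearest_neighbor_classify(train_feats, train_labels, test_feats)
    % vl_alldist2 expects one column per sample, hence the transposes; by default
    % it returns squared L2 distances, which preserves the nearest neighbor.
    D = vl_alldist2(train_feats', test_feats');    % num_train x num_test
    [~, nearest] = min(D, [], 1);                  % index of the closest training image
    predicted_categories = train_labels(nearest);  % copy over its category label
end
```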

Bag of features + 1NN

Bag of SIFT features was a bit trickier to implement because, unlike tiny images, it has an actual training step: vocabulary building. The idea behind vocabulary building is to cluster SIFT features extracted from all of the training images at once and keep the resulting cluster centers as the visual words to be used later. After the vocabulary is created, in get_bag_of_sifts, I extract SIFT features from every image and build a histogram for each picture by finding which cluster center from the previous step each SIFT feature is closest to and incrementing the histogram dimension representing that cluster. Once that is done, I normalize the histogram of each image individually. For the classification step, the same 1NN methodology is used as above.
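Both stages could look roughly like the sketch below, assuming the VLFeat MATLAB interface (vl_dsift, vl_kmeans, vl_alldist2). The step sizes shown are the ones I settled on in the next paragraph; the other details (grayscale single-precision input, simple descriptor concatenation) are illustrative assumptions.

```matlab
% Hypothetical vocabulary building (sketch).
function vocab = build_vocabulary(image_paths, vocab_size)
    all_sift = [];
    for i = 1:length(image_paths)
        img = single(imread(image_paths{i}));            % vl_dsift needs single grayscale
        [~, feats] = vl_dsift(img, 'step', 40, 'fast');  % 128 x M descriptors
        all_sift = [all_sift, single(feats)];
    end
    vocab = vl_kmeans(all_sift, vocab_size);             % 128 x vocab_size cluster centers
end

% Hypothetical bag-of-SIFT histogram construction (sketch).
function image_feats = get_bag_of_sifts(image_paths, vocab)
    vocab_size  = size(vocab, 2);
    image_feats = zeros(length(image_paths), vocab_size);
    for i = 1:length(image_paths)
        img = single(imread(image_paths{i}));
        [~, feats] = vl_dsift(img, 'step', 8, 'fast');
        D = vl_alldist2(vocab, single(feats));           % vocab_size x M distances
        [~, assignments] = min(D, [], 1);                % nearest visual word per descriptor
        h = histc(assignments, 1:vocab_size);            % histogram over the vocabulary
        image_feats(i, :) = h / norm(h);                 % per-image normalization
    end
end
```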

This algorithm also had a noticeable number of degrees of freedom in terms of variables I could tweak in order to achieve the best results. In the end, I found that a step size of 8 for get_bag_of_sifts produced the most reasonable performance, while a step size of 40 worked beautifully for build_vocabulary. Even though a lower step value usually means better accuracy, lowering it any further didn't produce noticeable improvements, but it did produce very noticeable runtime increases. As for the vocabulary size, the best value turned out to be 400, though I will analyze the vocabulary size tuning later on. Using the values above, I managed to get my accuracy as high as 54.9%.

Bag of features + SVM

The main algorithm here is the same as in the previous step (vocabulary + bag of features); however, the classification step itself is very different. Instead of just picking the closest neighbor, a linear support vector machine is trained for every single classification category (in our case, type of scene), each test histogram is scored against every category using the confidence produced by the margin formula W*x + B, and, in the end, the category that produces the highest confidence is picked as the answer.
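A sketch of this one-vs-all setup with VLFeat's vl_svmtrain is shown below; the function signature, the way labels are binarized, and passing lambda in as an argument are assumptions made for illustration.

```matlab
% Hypothetical one-vs-all linear SVM classifier (sketch).
% train_feats, test_feats: one row per image; train_labels, categories: cell arrays of strings.
function predicted_categories = svm_classify(train_feats, train_labels, test_feats, categories, lambda)
    num_categories = length(categories);
    confidences = zeros(num_categories, size(test_feats, 1));
    for c = 1:num_categories
        % +1 for images of this category, -1 for everything else
        binary_labels = 2 * double(strcmp(categories{c}, train_labels)) - 1;
        [W, B] = vl_svmtrain(train_feats', binary_labels(:)', lambda);
        confidences(c, :) = W' * test_feats' + B;        % margin W*x + B per test image
    end
    [~, best] = max(confidences, [], 1);                 % most confident category wins
    predicted_categories = categories(best);
end
```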

The crucial free variable in this step was the lambda used by the vl_svmtrain() function as the regularizer of the linear classifier. I started by setting it to 10 and then lowered it by a factor of 10 at each trial until the improvement started to drop off, then increased it back by slightly smaller amounts in order to find an optimized "hill peak" of performance, sort of like a manually performed simulated annealing. In the end, I stopped at a lambda value of 0.000076, which netted me an accuracy of 67.8%.
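The coarse part of that search could be written as a small loop like the one below; the lambda grid, the accuracy computation, and the reuse of the hypothetical svm_classify sketch from above are illustrative assumptions, and a finer sweep around the best coarse value would follow the same pattern.

```matlab
% Hypothetical coarse lambda sweep (sketch).
lambdas = 10 .^ (1:-1:-6);                % 10, 1, 0.1, ..., 1e-6
best_acc = 0;
best_lambda = lambdas(1);
for lambda = lambdas
    predictions = svm_classify(train_feats, train_labels, test_feats, categories, lambda);
    acc = mean(strcmp(predictions(:), test_labels(:)));  % fraction of correct predictions
    fprintf('lambda = %g -> accuracy = %.3f\n', lambda, acc);
    if acc > best_acc
        best_acc = acc;
        best_lambda = lambda;
    end
end
```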

Overall, the bag of features + SVM combination produced the best results out of all the combinations tested above, which is why it deserves a more detailed presentation of its results than just a single number and a confusion matrix; that presentation can be observed below.

Per-category results of the bag of features + SVM run. The original table also showed, for every category, sample training images, sample true positives, false positive examples (captioned with their true label), and false negative examples (captioned with the wrongly predicted label); only the accuracies and those captions are reproduced here.

Category       Accuracy   False positive true labels / false negative predicted labels
Kitchen        0.630      Office, Industrial, Office, Office
Store          0.520      LivingRoom, InsideCity, LivingRoom, LivingRoom
Bedroom        0.420      LivingRoom, Kitchen, LivingRoom, LivingRoom
LivingRoom     0.350      Bedroom, Kitchen, Kitchen, Bedroom
Office         0.830      Kitchen, Industrial, Bedroom, Kitchen
Industrial     0.550      Store, Store, Kitchen, TallBuilding
Suburb         0.970      Industrial, Highway, InsideCity, Store
InsideCity     0.560      Highway, LivingRoom, Street, TallBuilding
TallBuilding   0.820      OpenCountry, Store, Kitchen, Industrial
Street         0.650      Forest, OpenCountry, InsideCity, Mountain
Highway        0.820      Coast, OpenCountry, OpenCountry, Street
OpenCountry    0.440      Mountain, Mountain, Coast, Highway
Coast          0.780      Mountain, TallBuilding, OpenCountry, OpenCountry
Mountain       0.890      LivingRoom, Forest, OpenCountry, OpenCountry
Forest         0.940      OpenCountry, OpenCountry, OpenCountry, Industrial

Vocabulary size variance

I decided to perform extensive tuning of the vocabulary size in order to come up with the optimal size that I used in all of my trials above, and I arrived at 400. Below, you can see graphs, as well as a table that records the results of my trials. It is important to note that after crossing the threshold of 500 the code started running noticeably slower, while at 1000 it became unbearably slow.