Project 4 / Scene Recognition with Bag of Words

This project has five components: Tiny Image Representation, Nearest Neighbor Classifier, Build Vocabulary, Bags of Sifts, and SVM Linear Classifier.

Tiny Image Representation

For the tiny image representation, I went through all the images in the image_path array and resized each image to 16x16. I then made each feature zero mean by subtracting the mean of each row of the image_feat matrix from its values, and normalized the result to unit length. There were no major design decisions for this function, as it has no free parameters.
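
A minimal sketch of this step is below. The names image_paths and image_feats follow the starter-code convention; the loop structure and intermediate variables are my own, and I assume the scene images are grayscale so each tiny image flattens to 256 values.

    % Sketch of the tiny image representation (image_paths is assumed
    % to be a cell array of file names, per the starter code).
    image_feats = zeros(length(image_paths), 16*16);
    for i = 1:length(image_paths)
        img = im2single(imread(image_paths{i}));   % grayscale scene image
        tiny = imresize(img, [16 16]);             % shrink to 16x16
        feat = tiny(:)';                           % flatten to a 1x256 row
        feat = feat - mean(feat);                  % zero mean
        image_feats(i, :) = feat / norm(feat);     % unit length
    end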

Nearest Neighbor Classifier

The nearest neighbor classifier takes the pairwise distances between the training and testing data points, then, for each test point, builds a histogram comparing the unique labels from the training instances against the labels of the top k nearest neighbors and assigns the most frequent one. The free parameter in this algorithm is k, the number of nearest neighbors taken into consideration. I tested values from k=5 to k=23 and arrived at k=20. At k=20, my accuracy for Tiny Image + Nearest Neighbor Classifier was around 22.2%.
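
A sketch of the voting step with k=20 follows, assuming features are stored one per row and train_labels is a cell array of strings (starter-code conventions); VLFeat's vl_alldist2 expects one feature per column, hence the transposes.

    % Sketch of k-NN classification with k = 20.
    k = 20;
    D = vl_alldist2(test_image_feats', train_image_feats');  % test-by-train distances
    [~, idx] = sort(D, 2);                         % nearest neighbors first
    predicted_categories = cell(size(D, 1), 1);
    for i = 1:size(D, 1)
        neighbors = train_labels(idx(i, 1:k));     % labels of the k closest points
        [labels, ~, map] = unique(neighbors);      % histogram over unique labels
        [~, best] = max(histc(map, 1:numel(labels)));
        predicted_categories{i} = labels{best};    % majority vote
    end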

Build Vocabulary

For this function, I iterated through each image and computed its SIFT features using a step size of 10. I tried step sizes of 4, 5, and 10, and 10 gave me the best results; moreover, 4 and 5 were too slow, as building the vocabulary alone took around five minutes with those values. I initialized an empty descriptors matrix with 128 rows and appended the SIFT features obtained from vl_dsift to it as columns. Initially I tried fixing the number of features per image at 100000 divided by the number of images, but the accuracy was dampened as a result. Finally, I called vl_kmeans to cluster the descriptors into vocab_size centers, which form the vocabulary.
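
A sketch of this step under the same assumptions as above; note that vl_dsift wants a single-precision grayscale image, and vl_kmeans returns one 128-dimensional cluster center per column.

    % Sketch of vocabulary construction with a dense-SIFT step size of 10.
    descriptors = [];
    for i = 1:length(image_paths)
        img = im2single(imread(image_paths{i}));
        [~, feats] = vl_dsift(img, 'Step', 10);     % 128 x M descriptors per image
        descriptors = [descriptors, single(feats)]; % append as columns
    end
    vocab = vl_kmeans(descriptors, vocab_size);     % 128 x vocab_size visual words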

Bags of Sifts

For bags of sifts, I iterated through the images and obtained each one's SIFT features. I then computed the pairwise distances between each SIFT descriptor and the vocabulary, took the indices of the minimum distances across the rows (the nearest visual word for each descriptor), and constructed a histogram using vocab_size as the number of bins. Finally, I normalized the histogram. I used a step size of 10 here as well, as it gave me the best results.
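
The per-image histogram, sketched below, assumes vocab holds one 128-dimensional visual word per column as in the vocabulary sketch above.

    % Sketch of the bag-of-SIFTs histogram for one image i.
    img = im2single(imread(image_paths{i}));
    [~, feats] = vl_dsift(img, 'Step', 10);          % dense SIFT, step size 10
    D = vl_alldist2(single(feats), vocab);           % descriptor-to-word distances
    [~, nearest] = min(D, [], 2);                    % closest word per descriptor
    hist_i = histc(nearest, 1:vocab_size)';          % one bin per visual word
    image_feats(i, :) = hist_i / norm(hist_i);       % normalized histogram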

SVM Linear Classifier

The SVM linear classifier iterates through the possible categories, which form the set of unique train_labels. In each iteration, it compares the current category with the training labels; matching instances are assigned a value of 1 and the rest -1, since this is binary (one-vs-all) classification. To obtain the thetas and intercepts I called vl_svmtrain, which also involved setting the LAMBDA variable. I tested LAMBDA values from 1 down to 0.000001 and settled on 0.0001. The higher the LAMBDA value, the stronger the regularization, which penalizes large weights to guard against overfitting. After getting the Ws and Bs, I calculated each classifier's confidence as theta*x + b and got the best hypothesis by calling max.
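
A sketch of the one-vs-all training and prediction with LAMBDA = 0.0001 follows; categories is assumed to be the unique set of train_labels, features are stored one per row as elsewhere in this writeup, and bsxfun adds each classifier's bias to its row of scores.

    % Sketch of one-vs-all linear SVM training and prediction.
    LAMBDA = 0.0001;
    X = train_image_feats';                          % vl_svmtrain wants d x N
    Ws = zeros(size(X, 1), length(categories));
    Bs = zeros(1, length(categories));
    for c = 1:length(categories)
        y = 2 * strcmp(categories{c}, train_labels) - 1;   % +1 for this class, -1 otherwise
        [Ws(:, c), Bs(c)] = vl_svmtrain(X, y', LAMBDA);
    end
    confidences = bsxfun(@plus, Ws' * test_image_feats', Bs');  % theta*x + b per class
    [~, best] = max(confidences, [], 1);             % most confident classifier wins
    predicted_categories = categories(best);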

Results for each combination

Name                               Accuracy
Tiny Images + Nearest Neighbor     22.2%
Bags of Sifts + Nearest Neighbor   48.3%
Bags of Sifts + Linear SVM         58.2%

Best Performing Recognition Setup

Confusion Matrix and Table of Classifier Results

Extra Credit

This section explores the effect of sampling features at different levels of the Gaussian pyramid. The following table summarizes the results.

impyramid direction parameter   Level   Accuracy with Linear SVM
Reduce                          1       37.1%
Reduce                          2       14.6%
Expand                          1       57.2%

The overall effect of the direction parameter seems to be that reducing the image lowers accuracy: a smaller image yields fewer dense SIFT features, so the classifier has less information with which to compare two images.
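
For reference, a sketch of sampling at a coarser pyramid level: the image is reduced once with MATLAB's impyramid before dense SIFT extraction, so fewer descriptors survive (the variable names again follow the earlier sketches).

    % Sketch of extracting dense SIFT one level down the Gaussian pyramid.
    img = im2single(imread(image_paths{i}));
    img = impyramid(img, 'reduce');                  % one level down ('expand' goes up)
    [~, feats] = vl_dsift(img, 'Step', 10);          % dense SIFT as before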