Project 4 / Scene Recognition with Bag of Words

This project explores scene recognition using both simple and advanced methods. The simple methods are tiny-image features with nearest-neighbor classification; the advanced methods are bags of quantized local features with linear classifiers learned by support vector machines. I implemented the following functions:

  1. get_tiny_images
  2. nearest_neighbor_classify
  3. build_vocabulary
  4. get_bags_of_sifts
  5. svm_classify

Getting Tiny Images

For each image path, I read the image and resized it to 16 x 16, then reshaped it into a 1 x 256 matrix. Finally, I normalized the feature by subtracting its mean (zero mean) and scaling it to unit length.
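The steps above can be sketched in NumPy (the actual project is MATLAB/VLFeat, so this is only an illustrative stand-in; the nearest-neighbor downsampling here is an assumption in place of a real resize routine):

```python
import numpy as np

def tiny_image_feature(img, size=16):
    """Downsample a grayscale image to size x size, flatten to 1 x size^2,
    then normalize to zero mean and unit length."""
    h, w = img.shape
    # crude nearest-neighbor downsampling (stand-in for a proper resize)
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    small = img[np.ix_(rows, cols)].astype(float)
    feat = small.reshape(1, size * size)
    feat = feat - feat.mean()              # zero mean
    norm = np.linalg.norm(feat)
    return feat / norm if norm > 0 else feat  # unit length
```

Normalizing this way makes the tiny-image descriptor invariant to overall brightness and contrast changes.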

Nearest Neighbor Classify

I used vl_alldist2 from vl_feat to compute the pairwise Euclidean distances between the test image features and the train image features. I then sorted the distances and kept the indices in sorted order. Finally, I obtained the predicted categories by taking the train labels corresponding to the nearest (smallest-distance) training examples.
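A minimal NumPy sketch of this 1-NN classifier (standing in for the vl_alldist2 + sort pipeline; not the original MATLAB code):

```python
import numpy as np

def nearest_neighbor_classify(train_feats, train_labels, test_feats):
    """Predict each test feature's label as the label of its closest
    training feature under squared Euclidean distance."""
    # pairwise squared distances, shape (num_test, num_train)
    d2 = ((test_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)            # index of the closest train example
    return [train_labels[i] for i in nearest]
```

Squared distance suffices here because argmin is unchanged by the monotonic square root.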

Building the Vocabularies

For each image path, I read the image and extracted dense SIFT features with a step size of 10. I then concatenated the SIFT features from all images and ran k-means to minimize the sum of squared Euclidean distances between the descriptors and their nearest cluster centers. The number of clusters equals the vocab size, and the resulting cluster centers form the vocabulary.
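The clustering step can be sketched as plain Lloyd's k-means in NumPy (a stand-in for vl_feat's k-means; the first-k initialization here is a simplifying assumption for determinism, whereas random or k-means++ initialization is typical):

```python
import numpy as np

def build_vocabulary(descriptors, vocab_size, iters=10):
    """Cluster stacked SIFT descriptors into vocab_size visual words
    with Lloyd's k-means; returns the cluster centers (the vocabulary)."""
    centers = descriptors[:vocab_size].astype(float).copy()
    for _ in range(iters):
        # squared Euclidean distance of every descriptor to every center
        d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        for k in range(vocab_size):
            members = descriptors[assign == k]
            if len(members):
                centers[k] = members.mean(axis=0)  # move center to cluster mean
    return centers
```

Each iteration alternates assignment (nearest center) and update (cluster mean), which monotonically decreases the sum of squared distances being minimized.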

Get Bags of SIFT

The first two lines, which load the vocabulary and get the vocab size, were provided. For each image path, I read the image and extracted SIFT features as before, this time with a step size of 15, which keeps the runtime under 10 minutes while giving the best accuracy (above 60%). I then computed the pairwise distances between the SIFT features and the vocabulary and sorted them; for each descriptor, I added weighted votes to the histogram bins of its five nearest vocabulary words. Finally, I normalized the histogram for each image and placed it in the corresponding row of the image features.
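A NumPy sketch of the soft-assignment histogram (the original leaves the exact weighting unspecified, so the rank-based 1/r weights below are an assumption for illustration):

```python
import numpy as np

def bag_of_words_histogram(descriptors, vocab, k=5):
    """Vote each descriptor into its k nearest vocabulary words,
    weighting closer words more, then L2-normalize the histogram."""
    k = min(k, len(vocab))
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    order = d2.argsort(axis=1)[:, :k]       # k nearest word indices per descriptor
    weights = 1.0 / np.arange(1, k + 1)     # assumed rank-based weighting
    hist = np.zeros(len(vocab))
    for row in order:
        hist[row] += weights                # weighted vote into k bins
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist
```

Soft assignment to several nearby words makes the histogram less sensitive to descriptors that fall near a boundary between two visual words.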

Support Vector Machine Classify

I implemented a function that trains one linear SVM per category. First, I extract the unique category list from the observed training labels. For each category, I find the indices of the training labels that match it and build a 1 x N label array (N = number of training labels), marking matching indices as 1 and all others as -1. I then use vl_feat's vl_svmtrain to learn the weights W and bias B of the separating linear function. Evaluating W * x + B for every category and taking the category with the maximum score gives the prediction for each test image.
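The one-vs-all scheme can be sketched in NumPy as follows. The batch subgradient trainer here is a rough stand-in for vl_svmtrain (which uses a different solver), and the lam, lr, and epochs values are illustrative assumptions:

```python
import numpy as np

def train_binary_svm(X, y, lam=1e-3, lr=0.1, epochs=200):
    """Train a binary linear SVM (y in {-1, +1}) by batch subgradient
    descent on the regularized hinge loss; returns weights w and bias b."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                              # hinge-loss violations
        gw = lam * w - (y[viol, None] * X[viol]).sum(0) / n
        gb = -y[viol].sum() / n
        w -= lr * gw
        b -= lr * gb
    return w, b

def svm_classify(train_X, train_labels, test_X):
    """One-vs-all: label one category +1 and the rest -1, train an SVM
    per category, then predict by the maximum score w * x + b."""
    cats = sorted(set(train_labels))
    labels = np.array(train_labels)
    scores = np.empty((len(cats), len(test_X)))
    for i, c in enumerate(cats):
        y = np.where(labels == c, 1.0, -1.0)            # 1 x N one-vs-all labels
        w, b = train_binary_svm(train_X, y)
        scores[i] = test_X @ w + b
    return [cats[i] for i in scores.argmax(axis=0)]
```

Note that prediction uses the raw scores rather than their signs: several binary classifiers may fire (or none), so the argmax over scores resolves ties consistently.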

Things I Have Done and Results

I spent some time tuning the step size for vl_dsift in the get_bags_of_sifts function while staying under the ten-minute time limit. Currently I use a step of 15, which takes approximately 9 minutes to run. My computer is not especially fast, so the time may vary from computer to computer, but I believe this will stay within the allowed limit. With this setting, running get_bags_of_sifts with svm_classify gives about .612 (about 61%) accuracy, measured as the mean of the diagonal of the confusion matrix. Using get_tiny_images with nearest_neighbor_classify gives around .224 (about 22%), and get_bags_of_sifts with nearest_neighbor_classify gives about .470 (around 47%).
