Scene Recognition with Bag of Words

Goal

The objective of this project is to perform scene recognition using two image descriptors (tiny images and bag of SIFT) and two classifiers (nearest neighbor and support vector machine). The combinations implemented and examined were:

Tiny images representation and nearest neighbor classifier
Bag of SIFT representation and nearest neighbor classifier
Bag of SIFT representation and linear SVM classifier

Descriptors

Tiny Images Feature Descriptor

The tiny image descriptor shrinks each image to a small thumbnail and flattens the thumbnail into a vector. The vectors are then made zero mean and normalized to improve performance. This representation is very lossy, especially for an image's high-frequency content, and it is not very invariant to spatial or brightness shifts. As a result, classification accuracy is poor in general, as seen in the results below. This is intuitive: reducing the size of the image discards information.
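The steps above can be sketched as follows. This is a minimal illustration in plain Python rather than the MATLAB used in the project. The 16x16 thumbnail size, the block-averaging resize, and the function name tiny_image_feature are assumptions made for the example and are not taken from the project code.

```python
import math

def tiny_image_feature(image, size=16):
    """Build a tiny-image feature from a grayscale image given as a list of rows.

    Assumptions for this sketch: block averaging stands in for a proper
    image resize, and size=16 is an illustrative thumbnail size.
    """
    rows, cols = len(image), len(image[0])
    thumb = []
    for i in range(size):
        # Row range of the source block that maps to thumbnail row i.
        r0 = i * rows // size
        r1 = max((i + 1) * rows // size, r0 + 1)
        for j in range(size):
            c0 = j * cols // size
            c1 = max((j + 1) * cols // size, c0 + 1)
            block = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            thumb.append(sum(block) / len(block))

    # Make the flattened thumbnail zero mean.
    mean = sum(thumb) / len(thumb)
    centered = [v - mean for v in thumb]

    # Scale to unit length; a constant image is left as all zeros.
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered] if norm > 0 else centered
```

For example, a 4x4 image reduced with size=2 yields a 4-element vector whose entries sum to zero and whose length is one.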

Bag of SIFT

The bag of SIFT descriptor was implemented by first generating a vocabulary of visual words (build_vocab) that is then used to describe the training images. Different vocabulary sizes gave different accuracies, as seen in the Graduate Credit section. The vocab.mat file was generated by running vl_dsift on each training image to extract SIFT features and then clustering these features with vl_kmeans. Once the vocabulary was created, each training image was described by finding the clusters closest to its SIFT features and normalizing the resulting histogram of cluster counts. This descriptor performed better than tiny images with both the nearest neighbor and SVM classifiers.
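As a rough sketch of the histogram step only (feature extraction with vl_dsift and clustering with vl_kmeans are not reproduced here), the following plain Python assigns each descriptor to its nearest vocabulary entry and normalizes the counts. The function names and the choice of normalizing the histogram to sum to one are assumptions for illustration; the report does not say which normalization was used.

```python
def nearest_word(descriptor, vocab):
    """Return the index of the vocabulary center closest to the descriptor."""
    best_index, best_dist = 0, float("inf")
    for index, center in enumerate(vocab):
        # Squared Euclidean distance is enough to find the closest center.
        dist = sum((a - b) ** 2 for a, b in zip(descriptor, center))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def bag_of_words_histogram(descriptors, vocab):
    """Count how many descriptors fall in each cluster, then normalize.

    Assumption: the histogram is scaled to sum to one (L1 normalization).
    """
    counts = [0] * len(vocab)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocab)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(vocab)
```

With a two-word vocabulary [[0, 0], [10, 10]] and four descriptors, three of which lie near the origin, the resulting histogram is [0.75, 0.25]. Real SIFT descriptors are 128-dimensional, but the logic is the same.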

A vl_dsift step size of 4 gave good accuracy but took a long time to run. Step sizes of 8 or 10 gave reasonably good results while running in under 10 minutes. The 'fast' parameter of vl_dsift also increases speed without considerably affecting classification accuracy.

Classifiers

Nearest Neighbor Classifier

The nearest neighbor classifier works as follows. The Euclidean distance from each test image to every image in the training set is computed. The training image closest to the test image is selected as the match, and its category is assigned as the predicted category. In the case of k-nearest neighbors, the k closest training images are identified and their categories retrieved. The most frequently occurring category among the k selected is assigned as the predicted category.
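The procedure described above can be sketched in a few lines of plain Python. The function name knn_predict and the tie-breaking rule (a tied vote goes to the category seen first among the nearest neighbors) are assumptions for this example, since the report does not state how ties were handled.

```python
import math
from collections import Counter

def knn_predict(test_feature, train_features, train_labels, k=1):
    """Predict a category by majority vote among the k nearest training features."""
    # Pair each training label with its Euclidean distance to the test feature.
    distances = sorted(
        (math.dist(test_feature, feature), label)
        for feature, label in zip(train_features, train_labels)
    )
    # Labels of the k closest training images, nearest first.
    nearest_labels = [label for _, label in distances[:k]]
    # Counter keeps insertion order for equal counts, so a tied vote
    # resolves to the category that appears first among the neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]
```

Setting k=1 gives the plain nearest neighbor classifier; larger values of k make the prediction less sensitive to a single mislabeled or atypical training image.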

Support Vector Machine (SVM)

A linear SVM learns the parameters of a hyperplane that divides the feature space so that points on one side belong to one category and points on the other side belong to the other category. Because this is a binary decision, the SVM training function is called once per category, in a loop, to separate the images that belong to that category from those that do not.
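The per-category setup can be illustrated with the sketch below, which covers only the labelling and the final decision, not the SVM training itself. It assumes each category's trained classifier is available as a weight vector and bias (w, b), and that the predicted category is the one whose hyperplane gives the highest score. That decision rule is the usual one-vs-all convention and is an assumption here; the report does not spell it out. The function names are hypothetical.

```python
def one_vs_all_labels(train_labels, category):
    """Binary labels for one category: +1 if the image belongs to it, else -1."""
    return [1 if label == category else -1 for label in train_labels]

def predict_category(feature, classifiers):
    """Pick the category whose linear classifier scores the feature highest.

    classifiers maps each category name to a (w, b) pair, assumed to come
    from training one linear SVM per category on one_vs_all_labels.
    """
    def score(category):
        w, b = classifiers[category]
        # Signed score w . x + b, proportional to the distance from the hyperplane.
        return sum(wi * xi for wi, xi in zip(w, feature)) + b

    return max(classifiers, key=score)
```

For instance, with classifiers {"kitchen": ([1.0, -1.0], 0.0), "forest": ([-1.0, 1.0], 0.0)}, the feature [0.2, 0.8] scores -0.6 for kitchen and 0.6 for forest, so it is assigned to forest.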

Results

Tiny image + K Nearest Neighbor

Bag of SIFT + K Nearest Neighbor

Bag of SIFT + linear SVM

Graduate Credit

Several vocabulary sizes were tried and the accuracy was recorded for each. I used the sizes 10, 20, 50, 100, 200, 400 and 1000. The following accuracies were obtained:

Vocabulary size	Accuracy (%)
10	42.7
20	49.1
50	56.3
100	62.2
200	64.9
400	65.6
1000	66.3

Accuracy levels off at around 65%. Going from a vocabulary size of 200 to 1000 raises accuracy by only 1.4 percentage points (64.9 to 66.3), while the code takes much longer to run.

Ganesh Venkataraman gvenkataraman6

Project 4 / Scene Recognition with Bag of Words