The objective of this project is to perform scene recognition with two descriptors (tiny images and bag of SIFT) and two classifiers (nearest neighbor and support vector machine). The combinations implemented and examined were:
The tiny image descriptor was built by shrinking each image to a small thumbnail and then flattening the thumbnail into a vector. The tiny images were then made zero-mean and normalized to improve performance. The problem is that this method is very lossy, especially for an image's high-frequency content, and it is not invariant to spatial or brightness shifts. This makes classification accuracy poor in general, as seen from the results below. This is intuitive: reducing the size of the image in turn discards information.
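The tiny image steps described above can be sketched as follows. This is a minimal illustration in plain Python, not the project's code: the function name `tiny_image`, the thumbnail `size`, and the use of block averaging for shrinking are all assumptions, since the write-up does not state the thumbnail size or the resizing method.

```python
import math

def tiny_image(image, size=4):
    """Shrink a grayscale image (list of rows) to size x size, then
    flatten, subtract the mean, and scale to unit length.

    Assumes the image has at least `size` rows and `size` columns.
    Block averaging is an assumed stand-in for the resize step.
    """
    rows, cols = len(image), len(image[0])
    thumb = []
    for i in range(size):
        r0, r1 = i * rows // size, (i + 1) * rows // size
        for j in range(size):
            c0, c1 = j * cols // size, (j + 1) * cols // size
            block = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            thumb.append(sum(block) / len(block))
    # Zero mean: remove the average brightness of the thumbnail.
    mean = sum(thumb) / len(thumb)
    centered = [v - mean for v in thumb]
    # Unit length: divide by the Euclidean norm (skipped for a constant image).
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered] if norm > 0 else centered
```

An 8x8 image passed through `tiny_image(image, size=4)` yields a 16-element vector, which shows how much detail is discarded before classification even starts.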
The bag of SIFT descriptor was implemented by first generating a vocabulary (build_vocab) with which the training images are described. Different vocabulary sizes gave different accuracies, as seen in the extra credit section. The vocab.mat file was generated by running vl_dsift on each training image to extract SIFT features and then clustering these features with vl_kmeans. Once the vocabulary file was created, each training image was described by finding the clusters that best matched its SIFT features and normalizing the resulting per-cluster counts. This descriptor performed better than the tiny image descriptor with both the nearest neighbor and SVM classifiers. A step size of 4 in vl_dsift gave good accuracy but took a long time to run. Step sizes of 8 or 10 gave reasonably good results while running in under 10 minutes. The 'fast' parameter of vl_dsift also increases speed without considerably affecting classification accuracy.
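The step that turns an image's local features into a bag-of-words descriptor can be sketched as below. This is an illustrative plain-Python version, not the vl_dsift / vl_kmeans pipeline itself: the function name `bag_of_words_histogram` is hypothetical, the descriptors are toy 2-D points rather than 128-D SIFT vectors, and normalizing the counts to sum to 1 is an assumption, since the write-up does not say which normalization was used.

```python
def bag_of_words_histogram(descriptors, vocab):
    """Count how many descriptors fall nearest to each vocabulary entry,
    then normalize the counts so they sum to 1 (assumed normalization)."""
    counts = [0] * len(vocab)
    for d in descriptors:
        # Index of the cluster center with the smallest squared distance.
        nearest = min(
            range(len(vocab)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(d, vocab[k])),
        )
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(vocab)


# Toy example: a 2-word vocabulary and four local descriptors.
vocab = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [9.0, 10.0]]
histogram = bag_of_words_histogram(descriptors, vocab)
```

The length of the histogram equals the vocabulary size, which is why a larger vocabulary gives a finer-grained (and more expensive) description of each image.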
The NN classifier works as follows. The Euclidean distance between the features of each test image and those of every image in the training set is computed. The training image closest to the test image is selected as the match, and its category is assigned as the predicted category. In the case of k-nearest neighbors, the k closest training images are identified and their categories retrieved. The most frequently occurring category among the k selected is assigned as the predicted category.
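A minimal sketch of the k-nearest-neighbor vote described above, in plain Python. The function name `knn_predict` and the toy 2-D features are hypothetical; squared distances are used because they give the same ordering as Euclidean distances.

```python
from collections import Counter

def knn_predict(train_feats, train_labels, test_feat, k=1):
    """Return the most common label among the k training features
    closest to test_feat."""
    # Pair each squared distance with its label and sort ascending.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(feat, test_feat)), label)
        for feat, label in zip(train_feats, train_labels)
    )
    # Majority vote over the k nearest neighbors.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


train_feats = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_labels = ["kitchen"] * 3 + ["forest"] * 3
prediction = knn_predict(train_feats, train_labels, [0.2, 0.1], k=3)
```

With k=1 this reduces to the plain nearest-neighbor rule; larger k trades sensitivity to single noisy training images for a smoother decision.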
The SVM learns the parameters of a hyperplane that divides the points in the feature space so that points lying on one side belong to one category and points on the other side belong to the other. Because a single SVM separates only two classes, the SVM training function is called once for each category to separate images that belong to that category from those that do not.
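The one-classifier-per-category scheme can be sketched as below. This is a hedged illustration, not the solver used in the project: the write-up does not name the training routine, so a simple full-batch subgradient descent on the regularized hinge loss stands in for it. The function names, the hyperparameters `lam`, `lr`, and `epochs`, and the rule of predicting the category with the highest score are all assumptions.

```python
def train_linear_svm(feats, targets, lam=0.01, lr=0.1, epochs=200):
    """Fit w, b for a linear SVM; targets must be +1 or -1."""
    dim, n = len(feats[0]), len(feats)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(feats, targets):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # point is inside the margin or misclassified
                for i in range(dim):
                    grad_w[i] -= y * x[i] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def train_one_vs_all(feats, labels, **svm_args):
    """Train one binary SVM per category: that category vs. everything else."""
    models = {}
    for category in sorted(set(labels)):
        targets = [1 if label == category else -1 for label in labels]
        models[category] = train_linear_svm(feats, targets, **svm_args)
    return models

def predict_category(models, x):
    """Pick the category whose hyperplane gives the highest score (assumed rule)."""
    def score(category):
        w, b = models[category]
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(models, key=score)
```

For a 15-category problem, `train_one_vs_all` would produce 15 (w, b) pairs, one hyperplane per category.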
Vocabulary size | Accuracy (%) |
---|---|
10 | 42.7 |
20 | 49.1 |
50 | 56.3 |
100 | 62.2 |
200 | 64.9 |
400 | 65.6 |
1000 | 66.3 |