This project had the following tasks:
I resized each image to 16×16 pixels and classified it with a nearest-neighbor classifier. The accuracy was 19.1%.
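As a rough illustration of the tiny-image baseline, here is a minimal Python sketch. It is not the code used for the reported number: the function names are hypothetical, the downsampling simply picks pixels by index (the actual resizing method is not stated above), and images are assumed to be 2-D lists of grayscale values.

```python
def resize_nearest(image, size=16):
    """Downsample a 2-D list of grayscale values to size x size by index sampling.

    Assumption: plain pixel picking; the report does not say which
    interpolation was used.
    """
    rows, cols = len(image), len(image[0])
    return [
        [image[(r * rows) // size][(c * cols) // size] for c in range(size)]
        for r in range(size)
    ]


def flatten(image):
    """Turn a 2-D image into a flat feature vector of floats."""
    return [float(v) for row in image for v in row]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_label(query, train_features, train_labels):
    """Return the label of the training vector closest to the query."""
    best = min(
        range(len(train_features)),
        key=lambda i: squared_distance(query, train_features[i]),
    )
    return train_labels[best]
```

Each training image would be passed through `resize_nearest` and `flatten` once, and every test image is then labeled by `nearest_neighbor_label`.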
For this, I sampled SIFT features from the training set using vl_dsift with a step size of 40, then randomly selected 20 of them per image, giving a pool of 30,000 descriptors for building the vocabulary. I then ran k-means with K=200 on this pool; the resulting cluster centers form the final vocabulary. The accuracy was 54.3%.
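The vocabulary and histogram steps can be sketched in plain Python as follows. This is only a toy stand-in for the vl_dsift and k-means pipeline described above: the helper names are hypothetical, the descriptors are arbitrary short vectors rather than 128-dimensional SIFT features, and normalizing the histogram to sum to 1 is an assumption.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda k: squared_distance(point, centers[k]))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm; returns k cluster centers (the visual words)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest_index(p, centers)].append(p)
        for idx, group in enumerate(groups):
            if group:  # keep the old center if a cluster ends up empty
                centers[idx] = [sum(dim) / len(group) for dim in zip(*group)]
    return centers


def bag_of_words_histogram(descriptors, vocabulary):
    """Count the nearest visual word for each descriptor, then normalize."""
    counts = [0.0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_index(d, vocabulary)] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts
```

In the setting above, `kmeans` would be called once on the pooled descriptors with `k=200`, and `bag_of_words_histogram` once per image.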
Here, instead of the nearest-neighbor classifier, I trained SVMs for the 15 categories, and each test image is assigned to the category for which it attains the best score.
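The "best score wins" rule can be sketched as below. This is a minimal illustration under stated assumptions, not the training code behind the reported accuracies: it assumes one linear SVM per category trained against all other categories, and it uses a simple batch subgradient descent on the hinge loss as a stand-in for whatever solver was actually used. All names and hyperparameters (`lam`, `epochs`, `lr`) are hypothetical.

```python
def train_linear_svm(features, targets, lam=0.01, epochs=200, lr=0.1):
    """Batch subgradient descent on the L2-regularized hinge loss.

    targets must be +1.0 or -1.0. Returns (weights, bias).
    """
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]
        grad_b = 0.0
        for x, y in zip(features, targets):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:  # only margin violators contribute to the gradient
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def train_one_vs_all(features, labels):
    """Train one binary SVM per category (that category vs. the rest)."""
    models = {}
    for category in sorted(set(labels)):
        targets = [1.0 if lab == category else -1.0 for lab in labels]
        models[category] = train_linear_svm(features, targets)
    return models


def predict(models, x):
    """Assign x to the category whose SVM gives the highest score."""
    def score(category):
        w, b = models[category]
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(models, key=score)
```

Here `features` would be the per-image histograms and `labels` the category names; `predict` compares raw scores, so no probability calibration is involved.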
The following table reports my accuracies with different vocabulary sizes.
I performed cross-validation to check the accuracy over 20 runs, each with 100 randomly chosen training and test images. Here are the results:
|Mean Accuracy||Standard Deviation|
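A repeated random-split evaluation of this kind could be organized roughly as follows. This is a sketch under assumptions: `evaluate` is a hypothetical callback that trains on one subset and returns the accuracy on the other, the training and test subsets are drawn without overlap, and the sample standard deviation is reported. The write-up does not specify any of these details.

```python
import random
import statistics


def repeated_split_accuracy(samples, evaluate, runs=20, n_train=100, n_test=100, seed=0):
    """Mean and standard deviation of accuracy over repeated random splits.

    evaluate(train, test) is a hypothetical callback returning an accuracy
    in [0, 1] for one train/test split.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        # Draw train and test together so the two subsets are disjoint.
        chosen = rng.sample(samples, n_train + n_test)
        train, test = chosen[:n_train], chosen[n_train:]
        accuracies.append(evaluate(train, test))
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

Fixing `seed` makes the splits reproducible, which helps when comparing two pipelines on exactly the same 20 splits.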
I built a spatial pyramid with 2 levels, i.e. for each image:
Here are the results with different levels
I tried my hand at Fisher encoding as well. While building the vocabulary, instead of running k-means on the SIFT features, I ran vl_gmm on them to get the means and variances. Then, in bag_sift_features, instead of building histograms based on proximity to vocabulary words, I extracted the image features and computed their Fisher encoding using vl_fisher.
I got an accuracy of 70.7% with Fisher encoding.
I also tried kernel codebook encoding, which is a soft-assignment technique. Each SIFT feature makes a weighted contribution to all k bins of the histogram, rather than a single vote for its nearest word.
I got an accuracy of 69.7% with this.
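To make the soft assignment concrete, here is a minimal Python sketch. The Gaussian weighting, the `sigma` parameter, and the per-descriptor normalization are assumptions chosen for illustration; the exact kernel and settings behind the number above are not stated, and the function names are hypothetical.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kernel_codebook_histogram(descriptors, vocabulary, sigma=1.0):
    """Soft-assignment histogram: every descriptor votes for every word.

    Each vote is a Gaussian kernel of the descriptor-to-word distance,
    normalized so that one descriptor contributes a total weight of 1.
    """
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        weights = [
            math.exp(-squared_distance(d, c) / (2.0 * sigma ** 2))
            for c in vocabulary
        ]
        total = sum(weights)
        if total > 0.0:  # skip descriptors whose weights all underflow to zero
            for k, w in enumerate(weights):
                hist[k] += w / total
    n = len(descriptors)
    return [h / n for h in hist] if n else hist
```

A small `sigma` makes this behave almost like hard nearest-word counting, while a large `sigma` spreads each descriptor nearly evenly over all bins.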
I also tried a nonlinear SVM with an RBF kernel. The following are the accuracies after tuning the kernel parameter sigma: