Project 4 / Scene Recognition with Bag of Words

This project had the following tasks:

  1. Tiny images representation and nearest neighbor classifier.
  2. Bag of SIFT representation and nearest neighbor classifier.
  3. Bag of SIFT representation and linear SVM classifier.

Tiny Images and NN Classifier

I reduced each image to 16x16 pixels and classified it with a nearest neighbor classifier. The accuracy was 19.1%.
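The tiny-image pipeline can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code used for the reported numbers: it assumes grayscale images stored as nested lists, downsamples by block averaging (the original resizing method is not stated), and uses squared Euclidean distance. The function names are made up for this sketch.

```python
def tiny_image(image, size=16):
    """Shrink a 2-D grayscale image (list of rows) to size x size by
    averaging blocks of pixels, then flatten it into one feature vector."""
    rows, cols = len(image), len(image[0])
    feature = []
    for i in range(size):
        r0 = i * rows // size
        r1 = max((i + 1) * rows // size, r0 + 1)
        for j in range(size):
            c0 = j * cols // size
            c1 = max((j + 1) * cols // size, c0 + 1)
            block = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            feature.append(sum(block) / len(block))
    return feature


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_classify(train_features, train_labels, test_features):
    """Give each test feature the label of its closest training feature."""
    predictions = []
    for feat in test_features:
        best = min(range(len(train_features)),
                   key=lambda k: squared_distance(feat, train_features[k]))
        predictions.append(train_labels[best])
    return predictions
```

With 16x16 tiny images every feature vector has 256 entries, so the classifier compares raw, heavily blurred pixel values. That is consistent with the low accuracy of this baseline.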

Results in a table

Bag of SIFT with NN

For this, I extracted SIFT features from the training images using vl_dsift with a step size of 40 and randomly selected 20 of them per image, giving a pool of 30,000 descriptors. I then ran k-means on this pool with K=200; the resulting cluster centers are the final vocabulary. The accuracy was 54.3%.
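A rough sketch of the two steps, building the vocabulary and turning an image's descriptors into a histogram, is given below. It is plain Python written for illustration only: the k-means here is a basic Lloyd iteration with random initialization, and normalizing the histogram to sum to 1 is an assumption, since the report does not say how the histograms were normalized. Descriptor extraction (vl_dsift in the project) is not reproduced.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(point, centers):
    return min(range(len(centers)),
               key=lambda k: squared_distance(point, centers[k]))


def kmeans(points, k, iterations=20, seed=0):
    """Basic Lloyd's k-means; returns k cluster centers (the vocabulary)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_index(p, centers)].append(p)
        for idx, members in enumerate(clusters):
            if members:  # keep the old center if a cluster went empty
                dim = len(members[0])
                centers[idx] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    return centers


def bag_of_words_histogram(descriptors, vocabulary):
    """Count how many descriptors fall nearest to each visual word.
    The histogram is normalized to sum to 1 (an assumption of this sketch)."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_index(d, vocabulary)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

In the setup described above, `points` would be the pool of sampled SIFT descriptors and `k` would be 200.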

Results in a table

Bag of SIFT with SVM

Here, instead of the NN classifier, I trained one linear SVM for each of the 15 categories. Each test image is assigned to the category whose SVM gives it the highest score.
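The decision rule can be sketched as below. This only illustrates the scoring step and assumes each category's linear SVM has already been trained and is stored as a weight vector and a bias. The category names and numbers in the usage note are made up.

```python
def svm_score(weights, bias, feature):
    """Signed score of a linear SVM: w . x + b."""
    return sum(w * x for w, x in zip(weights, feature)) + bias


def predict_category(classifiers, feature):
    """classifiers maps a category name to its (weights, bias) pair.
    The image goes to the category whose SVM scores it highest."""
    return max(classifiers,
               key=lambda name: svm_score(*classifiers[name], feature))
```

For example, with `classifiers = {"kitchen": ([1.0, 0.0], -0.5), "forest": ([0.0, 1.0], -0.5)}`, the feature `[0.9, 0.1]` is assigned to "kitchen" because that classifier produces the larger score.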

The following table reports my accuracies with different vocabulary sizes.

Vocabulary Size    10      20      50      100     200     400     1000
Accuracy (%)       45.7    53.2    60.5    66.9    69.9    70.9    71.3

Cross Validation

I performed cross-validation to check the accuracy over 20 runs, each with 100 randomly chosen training and test images. Here are the results I got:

Mean Accuracy (%)    Standard Deviation
45.6                 5.21
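The repeated-runs procedure can be sketched like this. It is an illustration under assumptions: `evaluate` is a hypothetical callback that trains on one split, tests on the other, and returns an accuracy. The sketch reports the sample standard deviation, and the report does not say which variant was used.

```python
import random
import statistics


def random_split(items, n_train, n_test, rng):
    """Shuffle a copy of the items and cut disjoint train and test subsets."""
    shuffled = items[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]


def repeated_evaluation(items, evaluate, runs=20, n_train=100, n_test=100, seed=0):
    """Run `evaluate(train, test)` on fresh random splits and summarize."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        train, test = random_split(items, n_train, n_test, rng)
        accuracies.append(evaluate(train, test))
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

Because each run draws a new split, the standard deviation shows how much the accuracy depends on which images happen to be selected.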

Spatial Pyramid

I built a spatial pyramid up to level 2, i.e. for each image:

  1. Find SIFT features
  2. Build 1 histogram over the whole image at level 0 (no spatial information)
  3. Build 4 histograms over the 4 quadrants of the image at level 1 (some spatial information)
  4. Build 16 histograms over a 4x4 grid of 16 cells at level 2 (more spatial information)
  5. Concatenate these 21 histograms
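The steps above can be sketched as follows. This is a simplified illustration: it assumes each SIFT feature already carries an (x, y) position and the index of its nearest visual word, and it concatenates raw counts without the per-level weights that some spatial pyramid formulations apply. All names are illustrative.

```python
def spatial_pyramid(positions, word_ids, width, height, vocab_size, max_level=2):
    """Concatenate visual-word histograms over 1, 4, and 16 cells
    (levels 0, 1, 2), giving 21 * vocab_size entries for max_level=2."""
    pyramid = []
    for level in range(max_level + 1):
        cells = 2 ** level  # cells per side at this level
        hists = [[0] * vocab_size for _ in range(cells * cells)]
        for (x, y), word in zip(positions, word_ids):
            col = min(int(x * cells / width), cells - 1)
            row = min(int(y * cells / height), cells - 1)
            hists[row * cells + col][word] += 1
        for h in hists:
            pyramid.extend(h)
    return pyramid
```

With a vocabulary of 200 words, the concatenated descriptor in this sketch would have 21 * 200 = 4200 entries per image.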

Here are the results with different levels:

Level           0       1       2
Accuracy (%)    69.9    70.5    74.9

Fisher Encoding

I tried Fisher encoding as well. When building the vocabulary, instead of running k-means on the SIFT features, I ran vl_gmm on them to get the means, covariances, and priors of a Gaussian mixture model. Then in bag_sift_features, instead of building histograms based on proximity to vocabulary words, I extracted the image features and computed their Fisher encoding using vl_fisher.

I got an accuracy of 70.7% with Fisher encoding.
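To make the encoding concrete, here is a small plain-Python sketch of a Fisher vector for a Gaussian mixture with diagonal covariances. It follows the standard formulation (soft assignments, then normalized first-order and second-order deviations from each component) and leaves out any optional normalization. It is meant as an illustration of the idea, not as a reproduction of what vl_fisher does internally.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def posteriors(x, means, variances, priors):
    """Soft assignment of one descriptor to every mixture component."""
    logs = [math.log(p) + log_gaussian(x, m, v)
            for m, v, p in zip(means, variances, priors)]
    top = max(logs)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logs]
    total = sum(weights)
    return [w / total for w in weights]


def fisher_vector(descriptors, means, variances, priors):
    """Returns 2 * K * D values: mean deviations, then variance deviations."""
    n, k, dim = len(descriptors), len(means), len(means[0])
    u = [[0.0] * dim for _ in range(k)]
    v = [[0.0] * dim for _ in range(k)]
    for x in descriptors:
        q = posteriors(x, means, variances, priors)
        for c in range(k):
            for d in range(dim):
                z = (x[d] - means[c][d]) / math.sqrt(variances[c][d])
                u[c][d] += q[c] * z
                v[c][d] += q[c] * (z * z - 1.0)
    encoding = []
    for c in range(k):
        encoding.extend(val / (n * math.sqrt(priors[c])) for val in u[c])
    for c in range(k):
        encoding.extend(val / (n * math.sqrt(2 * priors[c])) for val in v[c])
    return encoding
```

Unlike a bag-of-words histogram with one count per word, the result has 2 * K * D entries for K components and D-dimensional descriptors, so it records how the descriptors deviate from each component rather than only how many landed near it.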

Kernel Codebook Encoding

I also tried kernel codebook encoding, which is a soft assignment technique. Each SIFT feature makes a weighted contribution to the K bins of the histogram instead of casting a single vote for its nearest word.

I got an accuracy of 69.7% with this.
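One common way to implement this kind of soft assignment is sketched below. The Gaussian kernel, the `sigma` parameter, and the normalization (each descriptor's weights sum to 1, then the histogram is averaged over descriptors) are assumptions of this sketch. The report does not state which kernel or bandwidth was used.

```python
import math


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kernel_codebook_histogram(descriptors, vocabulary, sigma=1.0):
    """Soft-assignment histogram: every descriptor spreads a total weight
    of 1 over all visual words, with closer words receiving more."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        sq = [squared_distance(d, c) for c in vocabulary]
        nearest = min(sq)  # shift exponents for stability; ratios are unchanged
        kernel = [math.exp(-(s - nearest) / (2 * sigma ** 2)) for s in sq]
        total = sum(kernel)
        for k, w in enumerate(kernel):
            hist[k] += w / total
    n = len(descriptors)
    return [h / n for h in hist] if n else hist
```

A descriptor lying exactly halfway between two words contributes equally to both bins, whereas hard assignment would have to pick one arbitrarily.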

Nonlinear SVM

I also tried a nonlinear SVM with an RBF kernel. The following are the accuracies after tuning the parameter sigma:

Sigma           10      100     1000    10000
Accuracy (%)    23.7    39.7    47.9    46.8
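For reference, the role of sigma can be illustrated with a small sketch of the RBF kernel. The parameterization exp(-d^2 / (2 * sigma^2)) used here is a common one but is an assumption. The SVM implementation behind the numbers above may define sigma differently.

```python
import math


def rbf_kernel(a, b, sigma):
    """Gaussian (RBF) similarity between two feature vectors."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq / (2 * sigma ** 2))


def kernel_matrix(features, sigma):
    """Pairwise kernel values, the input a kernel SVM solver works with."""
    return [[rbf_kernel(a, b, sigma) for b in features] for a in features]
```

Under this parameterization, a small sigma makes the similarity drop quickly with distance, while a very large sigma pushes all kernel values toward 1, so the best setting depends on the scale of the feature distances.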