Project 4 / Scene Recognition with Bag of Words

Project Aim

The aim of this project is to build a scene recognition system: a classifier that takes image features and assigns each image to one of several scene categories. Several enhancements are then applied and evaluated on the test data. The project combines two feature descriptors with two classifiers, giving three implementations for scene recognition.
The feature descriptors used are:

  1. Tiny Image
  2. Bag of SIFT
The different types of classifiers used:
  1. Nearest Neighbor Classifier
  2. Linear Support Vector Machine
The combinations used:
  1. Tiny images representation and nearest neighbor classifier
  2. Bag of SIFT representation and nearest neighbor classifier
  3. Bag of SIFT representation and linear SVM classifier

Tiny Images

This method is a very simplistic way of representing images. We shrink each image to very small dimensions, such as 16 x 16, and then make the resulting vectors zero mean. The representation is lossy: it discards high-frequency image content and is not invariant to spatial or brightness shifts in the original image. Once the representation is built, it is passed through the classifier to get the classification accuracy.
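A rough sketch of this representation (the 16 x 16 size matches the text; the block-averaging resize is my assumption, and any image-resize routine would do):

```python
import numpy as np

def tiny_image(img, size=16):
    """Shrink a grayscale image to size x size by block averaging,
    then make the flattened feature vector zero mean."""
    h, w = img.shape
    # crop so the image divides evenly into size x size blocks
    img = img[:h - h % size, :w - w % size]
    bh, bw = img.shape[0] // size, img.shape[1] // size
    small = img.reshape(size, bh, size, bw).mean(axis=(1, 3))
    feat = small.flatten().astype(np.float64)
    return feat - feat.mean()  # zero-center as described above
```

Some implementations also unit-normalize the feature, but the text only calls for zero mean, so that is all the sketch does.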

Bag of Words

  1. Vocab building

     I implemented this feature descriptor to represent every image by first building a vocabulary and then forming the representation from the training and testing data. It is called bag of SIFT because SIFT features are used. Varying the vocabulary size gave different accuracies (shown as a table in the extra credit section). The vocabulary is saved as vocab.mat.

  2. Bag of SIFT

     After the vocabulary is built using SIFT features, I build the bag-of-SIFT representation. SIFT features are computed for each image in the training set, and distances are measured from every descriptor to the cluster centers. Each descriptor is assigned to its nearest center, and the per-center counts fill the bins of a histogram. These histogram bins give the bag-of-SIFT representation for each image.
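The two steps above can be sketched as follows. The original uses VLFeat's SIFT and k-means in MATLAB; this is a minimal NumPy stand-in where `descriptors` is assumed to be an N x D array of SIFT descriptors already extracted from the images:

```python
import numpy as np

def build_vocab(descriptors, k, iters=20, seed=0):
    """Cluster descriptors into k visual words with plain k-means."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)].astype(np.float64)
    for _ in range(iters):
        # squared distance from every descriptor to every center
        d = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            pts = descriptors[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers

def bag_of_sift(image_descriptors, vocab):
    """Count how many of an image's descriptors fall nearest to each
    visual word; L1-normalize so image size does not matter."""
    d = ((image_descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(d.argmin(1), minlength=len(vocab)).astype(float)
    return hist / hist.sum()
```

The broadcast distance computation is quadratic in memory; a real implementation would use VLFeat's optimized routines, but the logic is the same.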

Nearest Neighbor

In the nearest neighbor classifier, I compute the distance between every test image's features and the corresponding features of each training image, and assign the test image to the category of its closest training example.
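A compact sketch of this 1-NN rule (a NumPy stand-in for the MATLAB implementation; the vectorized distance expansion is just an efficiency choice):

```python
import numpy as np

def nearest_neighbor_classify(train_feats, train_labels, test_feats):
    """Assign each test feature the label of its closest training
    feature under Euclidean distance (1-NN)."""
    # pairwise squared distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    d = (test_feats ** 2).sum(1)[:, None] \
        - 2 * test_feats @ train_feats.T \
        + (train_feats ** 2).sum(1)[None, :]
    return [train_labels[i] for i in d.argmin(1)]
```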

Support Vector Machine

The goal here is to implement a linear SVM (Support Vector Machine). The VLFeat SVM package returns the trained parameters, which are then used to classify the test images. The steps are:

  1. Train the SVM using the training features and labels
  2. The parameters returned by SVM training are stored per category
  3. For each test image, every per-category classifier is run, and the category with the highest confidence is assigned to the image
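The steps above amount to a one-vs-all scheme. The original relies on VLFeat's SVM trainer; the following is a hedged NumPy stand-in that trains each per-category SVM by subgradient descent on the regularized hinge loss (the learning rate, iteration count, and lambda defaults are my assumptions, not values from the report):

```python
import numpy as np

def train_one_vs_all_svms(feats, labels, lam=1e-4, lr=0.1, iters=500):
    """Train one linear SVM per category on the regularized hinge
    loss; returns {category: (w, b)}."""
    labels = np.array(labels)
    svms = {}
    for cat in sorted(set(labels)):
        y = np.where(labels == cat, 1.0, -1.0)  # this category vs. the rest
        w, b = np.zeros(feats.shape[1]), 0.0
        for _ in range(iters):
            viol = y * (feats @ w + b) < 1      # margin-violating examples
            w -= lr * (lam * w - (y[viol][:, None] * feats[viol]).sum(0) / len(feats))
            b -= lr * (-y[viol].sum() / len(feats))
        svms[cat] = (w, b)
    return svms

def svm_classify(svms, test_feats):
    """Run every per-category SVM and pick the highest-confidence
    category, matching step 3 above."""
    cats = list(svms)
    scores = np.stack([test_feats @ w + b for w, b in svms.values()], axis=1)
    return [cats[i] for i in scores.argmax(axis=1)]
```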


Initially, a vocabulary size of 200 is used to obtain all of the following results.

Tiny Image and Nearest Neighbor

I used 1-Nearest Neighbor with Tiny Images to obtain the following confusion matrix.

Accuracy = 23.8%. Classifier: Nearest Neighbor
Detailed Results Here

Bag of Sift and Nearest Neighbor

I used 1-Nearest Neighbor with the Bag of SIFT representation to obtain the following confusion matrix.

Accuracy = 51.1%. Classifier: Nearest Neighbor
Detailed Results Here

Bag of Sift and Linear SVM

I obtained the following confusion matrix.

Accuracy = 64.9%. Classifier: Linear SVM
Detailed Results Here

Extra/Graduate Credit

I implemented the following ideas for extra/graduate credit. All tests were done with the Bag of SIFT + Linear SVM implementation (unless otherwise stated):

  1. Vary the vocab size
  2. Vary the step size in order to find the optimal step size value
  3. Cross Validation
  4. Vary the lambda values

1. Vary the vocab size

I ran many experiments with increasing vocabulary sizes. Increasing the vocabulary size increased the accuracy of the system; the drawback was that larger vocabularies slowed the implementation, and it took more time to produce results. I repeated this for the following values:

Vocabulary size Accuracy (%)

2. Vary step size

For building the vocabulary, I tried varying the step size. With a larger step size the implementation ran slightly faster, while a smaller step size gave better accuracy, so there is a trade-off between time and accuracy. Reasonable values were between 8 and 10; I chose 10 for a slightly faster implementation. For the bag-of-SIFT representation, a step size of 4 was the best in terms of accuracy; larger step sizes caused a steep drop in accuracy.

3. Cross Validation

For cross validation, I repeatedly chose 100 random training images and a different set of testing images, trained the classifier on the training images, and examined the accuracy on the test set. The results obtained on different runs were:
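A minimal sketch of this evaluation loop (the split size of 100 matches the text; the number of runs and the `train_fn`/`predict_fn` interface, which stands in for whichever feature + classifier pair is being tested, are my assumptions):

```python
import numpy as np

def cross_validate(feats, labels, train_fn, predict_fn,
                   n_runs=5, n_train=100, seed=0):
    """Repeatedly draw a random train/test split, train, and record
    accuracy; report mean and standard deviation across the runs."""
    rng = np.random.default_rng(seed)
    labels = np.array(labels)
    accs = []
    for _ in range(n_runs):
        idx = rng.permutation(len(feats))
        tr, te = idx[:n_train], idx[n_train:]
        model = train_fn(feats[tr], labels[tr])
        preds = predict_fn(model, feats[te])
        accs.append(np.mean(preds == labels[te]))
    return float(np.mean(accs)), float(np.std(accs))
```

Reporting the spread across runs, not just a single accuracy, is the point of this experiment: it shows how sensitive the result is to the particular train/test split.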
Iteration     Accuracy

4. Vary lambda values

I varied the lambda values used to train the linear SVM. In general, accuracy increased as the lambda values decreased; however, as with step size, computation time increased (I am not sure whether this holds in general). Interestingly, there were a few outliers where decreasing lambda did not increase accuracy, but the overall trend held.