Project 4 / Scene Recognition with Bag of Words

This project is an introduction to image recognition. The recognition tasks are performed using different feature representations and different classifiers. Two feature descriptors are used: tiny images and bag of SIFT. The two classifiers are nearest neighbor and support vector machines.

The different combinations implemented and tested are as follows:

  1. Tiny images representation and nearest neighbor classifier
  2. Bag of SIFT representation and nearest neighbor classifier
  3. Bag of SIFT representation and linear SVM classifier


A brief description of each component:

  1. Tiny image representation - This feature is inspired by the tiny images representation: each image is resized to a very small fixed resolution and its pixels are used directly as the feature.
  2. Bag of SIFT - This feature descriptor relies on computing the SIFT features of an image (a 128xN matrix) and summarizing them as a histogram over a visual vocabulary.
  3. K nearest neighbor (KNN) - This classifier labels a test sample using its closest training samples under a distance metric; here only the single nearest neighbor is used.
  4. Support Vector Machine - A support vector machine is a discriminative classifier defined by a separating hyperplane. Given labelled training data, the algorithm outputs an optimal hyperplane that categorizes new examples.

Implementation Details

Tiny Images

For both the train and the test set, I iterate over all the images and resize each one to [16 16]. Each resized image is then reshaped to a 1x256 vector and stored as the image feature. The image features are normalised to zero mean and unit length.
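
The steps above can be sketched for a single image as follows. This is a minimal illustration rather than the project code: the random array is a stand-in for a grayscale image, the variable names are made up for the example, and imresize assumes the Image Processing Toolbox is available.

```matlab
% Tiny-image feature for one image (illustrative sketch).
img = rand(240, 320);                  % placeholder for a grayscale image
tiny = imresize(img, [16 16]);         % shrink to 16x16
feat = reshape(double(tiny), 1, []);   % flatten to a 1x256 row vector
feat = feat - mean(feat);              % zero mean
feat = feat / norm(feat);              % unit length
```

In the full pipeline the same steps would run inside a loop over the image paths, with each feat stored as one row of the feature matrix.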

Bag of sifts

In the bag of SIFT model, I construct a histogram by computing the distances between the SIFT features of an image and the vocabulary matrix. The vocabulary matrix was computed using k-means clustering. Each SIFT feature is assigned to the cluster center at minimum distance, and the counts per center are retained as a histogram. This histogram is normalised and becomes the new image feature. The vocabulary I constructed consists of 500 cluster centers. I initially tried 200 cluster centers, but I wasn't getting the desired accuracy, so I increased the number of clusters. Since I used the 'fast' parameter, the accuracy tends to be lower than what would be expected in an ideal scenario.
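
A minimal sketch of the histogram step for one image is shown below. It uses random placeholders for the descriptors and the vocabulary, computes squared Euclidean distances with base MATLAB functions, and assumes L2 normalisation; the project code may use a different distance routine or normalisation.

```matlab
% Bag-of-SIFT histogram for one image (illustrative sketch).
descs = rand(128, 300);   % placeholder for the 128xN SIFT descriptors of one image
vocab = rand(500, 128);   % placeholder for the Kx128 k-means cluster centers
K = size(vocab, 1);

% Squared Euclidean distance between every center (rows) and descriptor (columns)
d2 = bsxfun(@plus, sum(vocab.^2, 2), sum(descs.^2, 1)) - 2 * (vocab * descs);

% Index of the closest cluster center for each descriptor
[~, nearest] = min(d2, [], 1);

% Count how many descriptors fall on each center, then normalise (assumed L2)
h = accumarray(nearest(:), 1, [K 1])';
h = h / norm(h);
```

The result h is a 1xK row vector, so a vocabulary of 500 centers gives a 500-dimensional feature per image.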

Nearest Neighbor

For the image features (train and test) obtained from either the tiny image or the bag of SIFT method, a distance metric is used to compute the distances between the train image features and the test image features. For each test image, the classifier finds the closest train image and assigns that image's label to it.
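
The nearest neighbor rule can be illustrated on a toy example. The feature values and category names below are made up, and squared Euclidean distance is an assumption, since the distance metric is not fixed by the description above.

```matlab
% 1-nearest-neighbor classification on toy data (illustrative sketch).
train_feats  = [0 0; 10 10; 0 10];               % one training feature per row
train_labels = {'Kitchen'; 'Forest'; 'Street'};  % hypothetical labels
test_feats   = [1 1; 9 8];                       % one test feature per row

% Squared Euclidean distance between every test row and every train row
d2 = bsxfun(@plus, sum(test_feats.^2, 2), sum(train_feats.^2, 2)') ...
     - 2 * (test_feats * train_feats');

% For each test image, pick the closest train image and copy its label
[~, idx] = min(d2, [], 2);
predicted = train_labels(idx);
```

Here the first test point is closest to the 'Kitchen' sample and the second to the 'Forest' sample.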

Support Vector Machine

The method here is to train a classifier using vl_svmtrain for each unique string in the train_labels array, so that there is a single binary classifier for each image category. This is in accordance with the 1-vs-all classifier that we are trying to build. The weights w and bias b returned from vl_svmtrain are stored for each category. Each test image is then scored by every classifier as transpose(w)*x + b, where '*' is the inner product and x is the image feature vector. The category with the highest score is taken as the label for the test image.
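
The scoring step of the 1-vs-all scheme can be sketched as follows. The weights, biases, features and category names are made-up numbers for illustration; in the real pipeline each column of W and each entry of b would come from one vl_svmtrain call.

```matlab
% 1-vs-all prediction from per-category linear classifiers (illustrative sketch).
categories = {'Kitchen', 'Forest', 'Street'};  % hypothetical category names
W = [1 0 -1; 0 1 -1];    % D x C, one weight vector per category (made up)
b = [0 0 0.5];           % 1 x C, one bias per category (made up)
X = [2 0.1; 0.2 3];      % D x N, one test feature per column

% scores(c, n) = transpose(w_c) * x_n + b_c
scores = bsxfun(@plus, W' * X, b');

% Each test image gets the category whose classifier scores it highest
[~, best] = max(scores, [], 1);
predicted = categories(best);
```

With these numbers the first test feature is labelled 'Kitchen' and the second 'Forest'.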


Tiny Images and the Nearest Neighbor

Accuracy : 23.5%

Bag of sift and Nearest Neighbor

Accuracy : 48.5%

Bag of sift and SVM

As the values below show, decreasing the value of lambda increases the accuracy.

lambda = 0.1

Accuracy : 53.9%

lambda = 0.01

Accuracy : 54.4%

lambda = 0.001

Accuracy : 58.3%

Graduate Credit and Extra Credit

Vocabulary Size

I tested vocabulary sizes from 10 to around 700. I saw a reasonable improvement in accuracy up to around 500, so I chose that as my vocabulary size. Because of limited computational resources and long computation times, I chose to stop at 700. Since the accuracy no longer increases by much at that point, trying larger sizes would have been a waste of computation time and resources.

Below are the accuracy values I obtained for the different vocabulary sizes I tried:

Vocabulary Size    Accuracy (for SVM)
10                 ~20%
20                 ~25-28%
50                 ~38%
100                ~42%
200                ~50%
500                ~58%
700                ~59-60%

I computed all these values using the 'fast' parameter, so the reported accuracies may be lower than what would normally be expected.

Fisher encoding

This encoding serves a similar purpose: it summarizes a set of local feature descriptors, here obtained from SIFT, in a single vectorial statistic. It is analogous to a bag of visual words, which assigns local descriptors to elements of a visual dictionary obtained by vector quantization; here the dictionary is built with a Gaussian mixture model instead. I first built vocabularies of the means, covariances and priors using vl_gmm. These are stored as gmm_means.mat, gmm_covariances.mat and gmm_priors.mat respectively, and were computed with 500 centers. They are used to encode the SIFT features with vl_fisher(). The parameters to this function call are the SIFT_features computed for each image, the means matrix, the covariances matrix, the priors matrix and 'improved' (which specifies that I am using the improved version of vl_fisher). This encoding is then used as the image feature.

This is the code for building the vocabulary:

allfeatures = [];  % one SIFT descriptor per row, accumulated over all images
for i = 1:n
    img = imread(char(image_paths(i)));
    % dense SIFT; SIFT_features is 128xN for this image
    [location, SIFT_features] = vl_dsift(single(img), 'size', 16, 'Step', 10);
    allfeatures = [allfeatures; transpose(SIFT_features)];
end
allfeatures = transpose(allfeatures);  % back to 128 x (total number of descriptors)
% fit a Gaussian mixture model with vocab_size components
[means, covariances, priors] = vl_gmm(single(allfeatures), vocab_size);
% stored transposed; means and covariances are transposed back before encoding
means = means';
covariances = covariances';
priors = priors';

This is the code for the encoding. vl_fisher returns the encoded features of the image using the means, covariances and priors. This is one of the more sophisticated encoding techniques.

encoding = vl_fisher(double(SIFT_features), double(means'), double(covariances'), double(priors), 'improved');
encoding = encoding / norm(encoding);  % scale the encoding to unit length
image_feats(i,:) = encoding;           % one row of features per image

lambda = 0.001

Accuracy : 52.6%

Surprisingly, I obtained a better result of 62% accuracy without the 'improved' parameter in the Fisher encoding.
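
For context, my understanding from the VLFeat documentation is that 'improved' applies a signed square root to the Fisher vector followed by L2 normalisation. The sketch below reproduces only that post-processing on a made-up vector; it is not the vl_fisher implementation, and the input values are purely illustrative.

```matlab
% Post-processing assumed to be applied by the 'improved' option (illustrative sketch).
enc = [4; -9; 0; 1];                   % made-up raw encoding values
enc = sign(enc) .* sqrt(abs(enc));     % signed square root: [2; -3; 0; 1]
enc = enc / norm(enc);                 % L2 normalisation to unit length
```

The signed square root compresses large entries while keeping their sign, which changes the relative weight of the components seen by the classifier.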

Gaussian Pyramid

In a Gaussian pyramid, each subsequent image is smoothed using a Gaussian average (Gaussian blur) and scaled down. Each pixel then contains a local average that corresponds to a pixel neighborhood on the lower level of the pyramid. This technique is used especially in texture synthesis.

I implemented this using MATLAB's impyramid function. I construct a pyramid with 3 levels using the impyramid command with the 'reduce' parameter, which takes care of the blurring, subsampling and scaling down. I compute the SIFT features of the image at each of these levels and combine them to form the final set of features. The rest of the process is the same as computing the bag of SIFT for a normal image. I obtain an accuracy of 64.3%, which is an improvement over the previously obtained 58.3%. This change only affects the feature extraction part of the bag of SIFT pipeline.

Accuracy : 64.3%
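
The pyramid feature extraction can be sketched roughly as below. It assumes that the 3 levels include the original image, that VLFeat and the Image Processing Toolbox are on the path, and it uses a random placeholder image; the vl_dsift settings are only illustrative and the variable names are made up for the example.

```matlab
% SIFT descriptors pooled over a 3-level Gaussian pyramid (illustrative sketch).
img = rand(512, 512);          % placeholder for a grayscale image
num_levels = 3;                % assumption: the original image counts as level 1
all_descs = [];                % 128 x (total descriptors over all levels)
sizes = zeros(num_levels, 2);  % image size at each level, for inspection
level = img;
for k = 1:num_levels
    sizes(k, :) = size(level);
    % dense SIFT on the current level; descs is 128xN
    [~, descs] = vl_dsift(single(level), 'size', 16, 'step', 10);
    all_descs = [all_descs, descs];
    if k < num_levels
        level = impyramid(level, 'reduce');  % blur and halve each dimension
    end
end
```

The pooled all_descs matrix would then be quantized against the vocabulary in the same way as the descriptors of a single image.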


Gist Descriptor

I also added complementary gist features so that the classifier considers both descriptors together. The gist features are appended to the already calculated histogram of the SIFT features.

The parameters considered are as below:

param.orientationsPerScale = [2 2 2 2]; 
param.numberBlocks = 4;
param.fc_prefilt = 4;

Classifier    Accuracy
KNN           54.5%
SVM           60.8%
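
Appending the gist features to the SIFT histogram amounts to a concatenation, sketched below with random placeholders. The gist length of 128 is an assumption based on 8 filters (2 orientations at each of 4 scales) over a 4x4 grid of blocks, and the variable names are made up for the example.

```matlab
% Combining a bag-of-SIFT histogram with a gist descriptor (illustrative sketch).
sift_hist = rand(1, 500);                  % placeholder for the 1x500 SIFT histogram
sift_hist = sift_hist / norm(sift_hist);   % normalised as in the bag of SIFT step
gist_feat = rand(1, 128);                  % placeholder gist descriptor (assumed length)

combined = [sift_hist, gist_feat];         % final 1x628 feature for the classifier
```

Because the two parts can have different scales, how each is normalised before concatenation may affect the classifier, so that is worth checking against the actual feature values.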

Non-Linear SVM

I tried implementing the non-linear conjugate gradient SVM in the file svm_classify.m. It takes a while to compute (since it requires storing the whole matrix in memory), so I commented it out while running most of my naive SVM classification. The accuracy that I obtained by running the more sophisticated SVM was 60.2%. The kernel function is represented by the variable opt. To obtain the above accuracy I used an opt value of 0, which corresponds to non-linear conjugate gradient.

Note: Where not specified, the classifier used is the SVM.

Results - All results have been included in their individual folders inside the code folder:

  1. tinyimages_KNN - Results of tiny images and KNN
  2. bagOfSifts_KNN - Results of bag of sifts with KNN
  3. bagOfSifts_SVM_lambda0.001 - Results of bag of sifts with SVM with lambda = 0.001
  4. bagOfSifts_SVM_lambda0.01 - Results of bag of sifts with SVM with lambda = 0.01
  5. bagOfSifts_SVM_lambda0.1 - Results of bag of sifts with SVM with lambda = 0.1
  6. fisher_results - Results of Fisher encoding (bag of sifts where features are encoded using vl_fisher) with the vocabulary built using vl_gmm; classifier: SVM
  7. gaussianPyramid - Results of the Gaussian pyramid; classifier: SVM
  8. gist_knn - Results of the gist descriptor; classifier: KNN
  9. gist_svm - Results of the gist descriptor; classifier: SVM