Project 4 / Scene Recognition with Bag of Words

The goal of this project is to recognize 15 categories of scenes using different image representations and classifiers, and to compare the performance of each model. My methods and best results are listed in the table below, and I give a detailed analysis in the following sections:

Image Representation       | Classifier                   | My best accuracy
Tiny images representation | Nearest neighbor classifier  | 20.2%
Bag of SIFT representation | Nearest neighbor classifier  | 50.9%
Bag of SIFT representation | Linear SVM classifier        | 63.8%
Bag of SIFT representation | Non-linear SVM classifier    | 69.0%
GMM & Fisher encoding      | Non-linear SVM classifier    | 81.6%
GMM & Fisher encoding      | Linear SVM classifier        | 81.8%


Tiny image + Nearest neighbor

Tiny image is a very simple image representation: I resized each image to 16x16 and used the raw pixels as the feature vector. I implemented a 1-nearest-neighbor classifier: the feature of each test image is compared with every image in the training set, and the label of the nearest one is predicted.

My first attempt at using tiny images with the nearest neighbor classifier yields an accuracy of 19.1%. This is definitely an improvement over the roughly 7% chance level (1 of 15 categories). After normalizing each tiny image to zero mean and unit length, the accuracy improves to 20.2%.
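As a rough sketch of this step (the helper and variable names below are my own, assuming MATLAB's Image Processing Toolbox is available), the tiny-image feature can be computed like this:

```matlab
% Rough sketch of the tiny-image feature (helper and variable names are my own).
% Each image is resized to 16x16, flattened, then normalized to zero mean and unit length.
function image_feats = get_tiny_images(image_paths)
    dim = 16;
    image_feats = zeros(length(image_paths), dim * dim);
    for i = 1:length(image_paths)
        img = imread(image_paths{i});
        if size(img, 3) > 1, img = rgb2gray(img); end
        tiny = imresize(im2single(img), [dim dim]);
        feat = tiny(:)';                    % flatten to a 1 x 256 row vector
        feat = feat - mean(feat);           % zero mean
        feat = feat / (norm(feat) + eps);   % unit length
        image_feats(i, :) = feat;
    end
end
```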

Confusion matrix for tiny image representation (accuracy: 20.2%) detailed result visualization

Bag of SIFT + Nearest neighbor

I used SIFT descriptors to build a vocabulary of 'visual words' and constructed histograms representing how often each visual word appears in each image. First, to build the vocabulary, I extracted SIFT features from each training image using vl_dsift. Then, I built a vocabulary of 'visual words' by clustering the SIFT features with k-means. Finally, I represented each image as a histogram of visual-word frequencies.
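A minimal sketch of the vocabulary-building step, assuming VLFeat is on the path (variable names are mine; the step and bin sizes shown match the values reported in the parameter table below):

```matlab
% Sketch of the vocabulary-building step (variable names are mine; step and bin
% sizes match the values in the parameter table below).
function vocab = build_vocabulary(image_paths, vocab_size)
    all_features = [];
    for i = 1:length(image_paths)
        img = imread(image_paths{i});
        if size(img, 3) > 1, img = rgb2gray(img); end
        % Dense SIFT: 128 x N descriptor matrix per image
        [~, features] = vl_dsift(im2single(img), 'Step', 15, 'Size', 8, 'Fast');
        all_features = [all_features, single(features)];
    end
    % Cluster the pooled descriptors into vocab_size visual words with k-means
    vocab = vl_kmeans(all_features, vocab_size)';   % vocab_size x 128
end
```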

I improved the accuracy from 48% to 49.6% by normalizing the histogram (vocab size of 50). I further optimized the design by tuning the step size and bin size of the SIFT descriptors; a detailed summary is given in the next section. In this method, I used a step size of 15 to build the vocabulary and a step size of 5 to get the bags of SIFT. The final result with a vocab size of 200 is 50.9%.
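Similarly, a sketch of how each image can be turned into a normalized visual-word histogram (again assuming VLFeat; the nearest-word assignment via vl_alldist2 is my own choice, not necessarily the exact code used):

```matlab
% Sketch of get_bag_of_sifts: each image becomes a normalized histogram over the
% visual words (the nearest-word assignment via vl_alldist2 is my choice).
function image_feats = get_bag_of_sifts(image_paths, vocab)
    vocab_size = size(vocab, 1);
    image_feats = zeros(length(image_paths), vocab_size);
    for i = 1:length(image_paths)
        img = imread(image_paths{i});
        if size(img, 3) > 1, img = rgb2gray(img); end
        [~, features] = vl_dsift(im2single(img), 'Step', 5, 'Size', 8, 'Fast');
        % Assign each descriptor to its nearest visual word and accumulate counts
        dists = vl_alldist2(single(features), vocab');    % N x vocab_size distances
        [~, assignments] = min(dists, [], 2);
        counts = histc(assignments, 1:vocab_size);
        image_feats(i, :) = counts' / (norm(counts) + eps);  % L2-normalized histogram
    end
end
```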

Confusion matrix for Bag of SIFT + Nearest neighbor(accuracy: 50.9%) detailed result visualization

Bag of SIFT + Linear SVM

This time, I trained linear SVMs on the bag-of-SIFT features, which learn a weight vector (w) and bias (b) for each category (vl_svmtrain). Then I computed a confidence for each test image by plugging its feature vector into w*x+b; the predicted category of each image is the one with the highest confidence. Initially I used a relatively large lambda, but smaller lambdas proved to give better results. Finally, I set lambda to 0.000001, yielding an accuracy of 63.8%.
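A sketch of the one-vs-all training and prediction, assuming VLFeat's vl_svmtrain (variable names are mine; lambda matches the value reported in the parameter table below):

```matlab
% Sketch of the one-vs-all linear SVM step with vl_svmtrain (variable names are mine).
categories = unique(train_labels);
num_categories = length(categories);
lambda = 1e-6;
dim = size(train_image_feats, 2);
W = zeros(dim, num_categories);
B = zeros(1, num_categories);
for c = 1:num_categories
    % +1 for images of this category, -1 for everything else
    binary_labels = 2 * double(strcmp(categories{c}, train_labels)) - 1;
    [W(:, c), B(c)] = vl_svmtrain(train_image_feats', binary_labels, lambda);
end
% Confidence of each test image is w'*x + b; predict the category with the highest one
confidences = test_image_feats * W + repmat(B, size(test_image_feats, 1), 1);
[~, best] = max(confidences, [], 2);
predicted_categories = categories(best);
```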

Confusion matrix for Bag of SIFT + Linear SVM(accuracy: 63.8%) detailed result visualization

Extra Credit

Parameterization

The classification accuracy is greatly affected by the choice of parameters for extracting features, clustering visual words, and training the SVMs.
The parameters that worked best for me are listed in the following table. Since there is a trade-off between accuracy and computation time when classifying a large set of images, I chose parameters that yielded reasonable results without taking too long to compute (around 5 minutes in most cases).

Step             | Free Parameter  | Value
build_vocabulary | Vocabulary size | 200
build_vocabulary | SIFT step size  | 15
build_vocabulary | SIFT bin size   | 8
get_bag_of_sifts | SIFT step size  | 5
get_bag_of_sifts | SIFT bin size   | 8
svm_classify     | lambda          | 0.000001

Vocabulary Size

The figure above shows the impact of the vocabulary size. While too small and too large vocabulary sizes lead to inaccurate classifications, choosing a size between 200 and 300 yields the best results.

Bag of SIFT + Non-linear SVM

Now, it's time to use a more sophisticated kernel instead of just a linear SVM. I tried an RBF kernel using Olivier Chapelle's MATLAB implementation of primal_svm. With the same parameters and vocab size (200) as before, the RBF kernel improved the accuracy to 69% with sigma = 1, a reasonable improvement over the 63.8% of the linear SVM.
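For reference, the RBF (Gaussian) kernel that the non-linear SVM operates on can be formed as below; this is only a sketch of the kernel matrices (pdist2 assumes the Statistics Toolbox), and primal_svm's exact calling convention is deliberately omitted:

```matlab
% Sketch of the RBF (Gaussian) kernel matrices fed to the non-linear SVM;
% primal_svm's exact calling convention is omitted here.
sigma = 1;   % the value that worked best in my experiments
K_train = exp(-pdist2(train_image_feats, train_image_feats).^2 / (2 * sigma^2));  % N_train x N_train
K_test  = exp(-pdist2(test_image_feats,  train_image_feats).^2 / (2 * sigma^2));  % N_test  x N_train
```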

Confusion matrix for bag of SIFT + Non-linear SVM(accuracy: 69%) detailed result visualization

GMM & Fisher encoding + Non-linear SVM

Now it's time to optimize the design by addressing the quantization error. I used GMM and Fisher encoding, a more sophisticated feature encoding scheme analyzed in the comparative study of Chatfield et al. The Fisher vector representation of visual features is an extension of the bag of visual words: the previous method uses k-means to generate the feature vocabulary, whereas we now use a Gaussian Mixture Model (GMM) to do the same. So instead of constructing a codebook with hard assignments, we generate a probabilistic visual vocabulary. We can then compute the gradient of the log-likelihood with respect to the parameters of the model to represent an image or video. The Fisher vector is the concatenation of these partial derivatives and describes in which direction the parameters of the model should be modified to best fit the data. Therefore, this method gives better classification performance than the bag of visual words obtained with k-means.

First, I computed the statistics of the training data using vl_gmm, and then computed the Fisher vector signature of each image using vl_fisher. With the same RBF SVM and parameters as before, the accuracy with Fisher encoding is 74.4%, a great improvement over the 69% of the bag of words. The vocab size is still 50. Later, I increased the vocab size to 200 and got an accuracy of 77.6%.
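A sketch of this encoding step with VLFeat's vl_gmm and vl_fisher (variable and parameter names are mine):

```matlab
% Sketch of the GMM / Fisher encoding step with vl_gmm and vl_fisher
% (variable names are mine).
num_clusters = 200;                       % the "vocab size" mentioned above
% all_sift: 128 x M single matrix of SIFT descriptors sampled from the training set
[means, covariances, priors] = vl_gmm(all_sift, num_clusters);
% Fisher vector of one image from its 128 x N descriptor matrix `features`;
% the result is 2 * 128 * num_clusters dimensional
fv = vl_fisher(single(features), means, covariances, priors);
```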

Confusion matrix for fisher encoding + Non-linear SVM(accuracy: 77.6%) detailed result visualization

I also applied power normalization and L2 normalization to the resulting Fisher vectors, which boosts the result to 81.6%. Feel free to test the program by uncommenting %FEATURE = 'fisher encoding'; on line 21 in proj4.m.
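The two normalizations are simple to apply to each Fisher vector; a sketch:

```matlab
% Sketch of the two normalizations applied to each Fisher vector fv
fv = sign(fv) .* sqrt(abs(fv));   % power (signed square-root) normalization
fv = fv / (norm(fv) + eps);       % L2 normalization
```

If I remember correctly, vl_fisher can also apply these internally via its 'Improved' option.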

Confusion matrix for fisher encoding(normalized) + Non-linear SVM(accuracy: 81.6%) detailed result visualization

GMM & Fisher encoding + Linear SVM

Fisher encoding performs well even with simple linear classifiers: the best accuracy I got so far is 81.8%. If we have a good representation like Fisher encoding, a simple linear classifier works just fine, whereas a non-linear SVM tends to add unwanted complexity or starts to overfit the data. A significant benefit of linear classifiers is that they are very efficient to train and easy to evaluate.

Confusion matrix for fisher encoding + Linear SVM(accuracy: 81.8%) detailed result visualization