Project 4 / Scene Recognition with Bag of Words

This report contains analysis of various scene recognition techniques. In all the techniques implemented as part of the assignment, being conscious of the free parameters and understanding the impact of changing them was crucial. Most of the report highlights how accuracy varies with change of the free parameters.

  1. Tiny images
  2. Bag of words with nearest neighbours
  3. Bag of words with linear SVM classifier
  4. Extra credit - Kernel codebook encoding
  5. Extra credit - Impact of vocabulary size on performance
  6. Extra credit - Fisher encoding

Tiny images confusion matrix. Accuracy 0.209

Tiny images

Tiny images has been implemented as a baseline scene recognition. Tiny images are representative of thumbnails of an image where the image is resized 16 x 16 image. Creation of tiny images is a lossy operation because high frequencies are lost. The center square of the image is cropped out and resized to a 16x16 patch. Zero-mean and normalization only helps the accuracy to 0.209 with K of KNN to be 1

Bag of words with nearest neighbours

The main steps of Bag of words includes -

  1. Building vocabulary
  2. Encoding features

Building vocabulary

Bag of words is a methodology borrowed from text analysis - replacing text content with images and words with features. For this section, features used are 128 dimensional SIFT features. Bag of words uses K-means clustering to cluster SIFT features of multiple images. Vocabulary in the bag of words constitutes these cluster centers. Vocabulary is constructed using the training data set. If the training data is representative of the entire data set, we should have good cluster centers found in the vocabulary, for the test data to match with.

%Free parameters for tweaking
STEP_SIZE = 10 % Step size for vl_feat library. 
%This has been retained at 10 across all vocabulary sizes since 10 has given reasonable performance and builds the vocabulary in reasonable time. VOCAB_SIZE = 500 % Size of the vocabulary to be built.
% While this is not necessary to evaluate just the performance of Bag of words and knn, this has been used for analysis in extra credit section

Encoding features

Once the vocabulary has been built, we now need to make our images comparable in a classifier. To achieve this, images are represented as encoded features. Feature encoding is a simple process of finding the nearest cluster centroid in the vocabulary and creating a histogram by features voting for the cluster centroids. The histogram is normalized. Once both testing and training images have been encoded, they can now be classified.

%Free parameters for tweaking
STEP_SIZE = 5 % After multiple tweaks, 5 seems to yield reasonable results. This parameter has not been varied greatly.


The analysis for this part was done with varying vocabulary sizes ranging from 20 - 500. The best acccuracy found was 0.529 for a vocabulary size of 500. The same configuration without normalization was giving an accuracy of just 0.505. While tiny images worked best with K = 1 on the knn, bag of sift gives the best performance with k = 5 This shows that K is a very important free parameter that needs to be tuned to avoid overfitting and generalization. K also varies with the vocabulary size since the vocab size defines the density of clustering and the centroids. Other parameters that had to be tweaked were the vl_sift parameters - namely the step size and the bin size for SIFT algorithm. The step size while building the vocabulary was 10 and while generating the bag of sift features was 10. The bin sizes that were tested ranged from 8,16,32. Below given results are for bin size = 16

K- value accuracy

Bag of sift with knn results webpage

Best bag of sift with accuracy 0.529

Bag of words with linear SVM classifer

Linear SVM has produces far better results than the KNN classifer. Linear SVM as a one-vs-all classifer tries to learn the hyperplane parameters that separate one category from other categories. Apart from vocabulary size and SIFT free parameters that are applicable to all methods mentioned till now, one of the most important parameters for the linear SVM is the lambda. It has been tweaked at multiple levels in the order of 10, the best accuracy of


Results page can be viewed here

Most reliable SVM - without likely overfitting - 0.655

These results were obtained for lambda of 0.001 against a vocabulary size of 500. More detailed analysis of lambda tweaking below.


varying lambda definitely impacts the performance of the linear SVM, although it mostly stays above the KNN classifier. This is mostly because in the KNN, we only pick the nearest neighbour in any case whereas with varying lambda we get finer control over sitations where cluster centers might be really close and knn would have mis classified. Below is a table showing varying results based on lambda values. Too large a lambda generalizes the classifier. Too small a lambda tends to overfit. Please note - these values have been reported for a vocabulary size of 500

%Free parameters for tweaking
LAMBDA = 0.001 % Influences regularization and thereby overfitting or underfitting.
lambda - value accuracy

Extra credit - Kernel codebook encoding

A specific disadvantage of quantizing vectors is that if there are two clusters that are very close to each other, there is a strict assignment of two closeby features to these clusters where one cluster is strictly favored over the other. When these two features were actually meant to be assigned to the same cluster, this strict assignment reduces accuracy. Remediation - using soft assignment or weighted binning while creating the histogram in the bag of words model i.e. a feature votes for more than one cluster centroid. The weightage I have considered is a Gaussian window weighting for 3-5 clusters.

Code is implemented as part of get_bags_with_soft_assign.m

Code snippet

weightage = transpose(gausswin(K));
num_weights = ceil(size(weightage,2)/2):size(weightage,2);
weightage = weightage(1,num_weights);
for idx = 1:size(image_paths,1)
    if mod(idx,100) == 0
    image = im2single(imread(image_paths{idx}));
    [locations, SIFT_features] = vl_dsift(image, 'fast', 'step', 10);
    [idx_array, distances] = knnsearch(vocab, single(SIFT_features'), 'K', 3);
    for n=1:size(SIFT_features,2)
        for k=1:num_weights
            image_feats(idx, idx_array(n,k)) = weightage(k) + image_feats(idx, idx_array(n,k));



As reported previously, bag of sift without soft binning on a knn of k = 5 was giving an accuracy of only 0.529 for a vocabulary of 500. The same configuration with soft binning in place with a window of 5 gives 0.647 Previously missed out classification matches (due to strict assignment to nearest cluster center) will now not be missed out due to soft binning advantage

Extra credit - Impact of vocabulary size on performance

Vocabulary size becomes a free parameter because it defines the granularity of the clusters. A very high vocabulary size implies that plenty of cluster centers are present. I have experimented with various vocabulary sizes ranging from 20 - 5000. Since this dataset contains a small number of categories (15), vocabulary size does not create a huge impact beyond certain point. While switching from 200 - 500 is where the major performance increase is visible. Beyond 500, for other vocabulary sizes like 1000 and 5000, (with modifications to other free parameters like sift bin size, step size, k of knn) great performance improvements are not visible. Average accuracy for bag of words (calculated over 5 runs for smaller vocabulary sizes and 3 runs for larger sizes - takes very long to run since it was run without fast parameters) is tabulated below

Code is implemented as part of get_bags_with_soft_assign.m

Vocabulary Size Average Accuracy

Extra credit - Fisher encoding

Steps to perform Fisher encoding- Using a GMM, build a vocabulary. Use the GMM to encode training features and testing features. Use a linear SVM to classify the encoded training and testing features.

Code is implemented as part of build_gmm_vocab.m and get_fisher_vector.m

Code snippet for gmm vocabulary building

num_images = size(image_paths, 1);
% approximately we want to have 10k features for clustering all in all
features_per_image = ceil(10000/num_images);
data = [];
for idx=1:num_images
    image = im2single(imread(image_paths{idx}));
    [locations, SIFT_features] = vl_dsift(image, 'fast', 'step', 10);
    data = [data SIFT_features(:,1:features_per_image)];
% [centers, assignments] = vl_kmeans(single(clustering_data), vocab_size);
% vl lib transposes everything with respect to this code
% vocab = centers';

% data = rand(dimension,numFeatures) ;

[means, covariances, priors] = vl_gmm(single(data), vocab_size);

Code snippet for fisher vector encoding

image_feats = [];
data = [];
for idx = 1:size(image_paths,1)
    image = im2single(imread(image_paths{idx}));
    [locations, SIFT_features] = vl_dsift(image, 'fast', 'step', 5);
    fisher = vl_fisher(single(SIFT_features), means, covariances, priors);
    image_feats = [image_feats;fisher'];



Out of all the vocabulary building and feature encoding techniques explored as part of this assignment by me, fisher encoding produces the best results. Fisher encoding achieves 0.765 accuracy which is much higher than the 0.408 that is achieved with bag of sift encoded as histograms without soft binning . Even with soft binning this level of accuracy cannot be achieved on small vocabulary sizes like 200 or even large vocabulary sizes of 1000 - 5000. This is primarily because, in all other cases, there was some level of thresholding enforced with strict assignments. In case of GMM and fisher encoding, we are alyways trying to better learn / better fit based on the statistical calculations of the representative data. As long as our training set is representative of the real world data, fisher encoding gives very good results. Another huge advantage is that fisher encoding does not seem to vary on the vocabulary size because it is anyway accounting for the general behaviour of the data that is given in the training set.

Detailed results web page Fisher Results webpage

Accuracy of 0.765