Project 4 / Scene Recognition with Bag of Words

Note: Most of the analysis has been submitted as graphs and are present as a single graph at the bottom of the report.

For this project we're classifying scenes into one of 15 categories. We do this by extracting features from each image and then train a classifier over the extracted features. We're using two types of features - tiny images, which serves as a baseline and bag of sift features. We're also using two types of classifiers - nearest neighbors and support vector machines.

Feature Extraction

Tiny images

For the tiny image feature, I resized the images to 16X16 in dimension and flattened the matrix to a vector. Since these are just being used as a baseline, I haven't done any analysis on them. Using tiny images and nearest neighbors, my accuracy is 17.6 %.

Bag of SIFT

In the bag of sift features, we first build a vocabulary from the train images, and look for the vocabulary in the test images. For building the vocabulary, we extract SIFT features from each image, and cluster them. The cluster centers become our vocabulary. Here is the code to do so.

    features = [];
    % Extract SIFT features
    for i = 1:size(image_paths,1)
        img = imread(image_paths{i});
        [~, feat] = vl_dsift(im2single(img),'step', 10);
        features = [features feat];
    % Find cluster centers.
    [vocab, ~] = vl_kmeans(double(features), vocab_size);
    vocab = vocab';


I have found that taking a random sample of the features extracted and then finding the cluster center of this sample instead of using all the features generated improves my accuracy when using the SVM classifier. My accuracy jumped from 47% to 59% for SVM when I used a random sample instead of all features. The nearest neighbors classifier doesn't display this behaviour. The change in accuracy was around 0.3%, but this can be accounted for by the randomness in building the vocabulary.


Nearest Neighbor Classifier

I implemented k-Nearest Neighbor Classifier instead of 1 Nearest Neighbor.


I found that increasing k increased the accuracy. The graph of the accuracy for varying k is plotted in the end under the extra credit section. TODO: Plot accuracy vs k and accuracy vs step size. I also tried using approximate nearest neighbors using the vl_kdtree function. I found that it reduces the accuracy by about 20%.

Support Vector Machine

The SVM is an optimal soft margin classifier that tries to find the best line that can separate points belonging to two classes. It tries to find the line with the maximum separation from each class and least number of misclassifications. The trade of between the two optimization functions can be controlled by a parameter lambda in vl_svmtrain function. For large values of lambda, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of lambda will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points.


TODO:Analysis for changing lambda

Extra credit

Complementary Features

I combined sift features with gist features. Gist features are global features unlike sift which are local. Thus each image gives one gist feature.


There are multiple ways to combine different features. I've tried three different ways.

  1. Concatenating gist feature to combined sift feature extracted from image. In this technique, after finding the count of sift features for each word in the vocabulary, we normalize the resulting vector. We then extract a gist feature for the same image and then append it to the end of the previously obtained vector. This vector is then passed into the classifier. The advantage of this method is that we don't need to make any modifications to the classifier. I found that this didn't change my SVM accuracy by much though. The knn accuracy increased from 58.5% to 59.9% for k = 5.
  2. This method is for knn. In this method, I passed the sift features and the gist features separately into the knn classifier. I calculated the distances from test image features to train image features for each separately and added up the distances for the corresponding pairs. This increased my knn accuracy from 58.5% to 66.7%.
  3. This method is also suited for knn, but can be used for SVM too. I found out that using this method for SVM did not increase it's accuracy either. In this method, I found out the nearest neighbors using both features separately, and then ranked the neighbors in ascending order for both of them. I then averaged the ranks obtained from sift features and gist features and used this average rank to classify the images. This lead to a similar increase in accuracy as the previous method, giving about 66.4%.

Reducing the gist descriptor size reduced the classification accuracy. I got the best accuracy for a descriptor size of 512.

Soft assignment/Kernel codebook encoding

In the kernel codebook encoding, instead using histograms, we use kernel density estimation to assign the probability of a feature belonging to a word in the vocabulary. I have used a Gaussian kernel in accordance with the paper referred by Chatfield et al. After extracting the features, I calculate the distances from the vocabulary. I apply the kernel function on the distances. The code is as follows.

    % Brind D to between 0 and 1 so that their gaussian doesn't blow up
    D = D - min(D(:));
    D = D ./ max(D(:));
    % Apply Gaussian kernel
    kD = gauss_kernel(D, kcb);
    % Sum over different features
    image_sift_feats(i,:) = sum(kD, 1);


The only parameter that can be changed in the Gaussian kernel is sigma. TODO: Analysis with different sigma

More sophisticated feature encoding schemes

I used vl_gmm to learn the means, covariances and priors of the sift features extracted from images. I then used these means, covariances and priors to find the fisher encoding of new images.


Using fisher encoding improves svm accuracy. It reduces the knn accuracy if we use approximate nearest neighbors. The data was to big for me to calculate actual k nearest neighbors.

Naive Bayes Nearest Neighbor

Boiman, Schechtman and Irani argue that when quantizing the sift features, we're throwing away a lot of information. Since knn has no learning phase unlike svm, their performance falls drastically with the quantized features. They propose to use the entire set of sift features and replace image to image distances with image to class distance. I'm using a simplified version which finds the nearest neighbors among all classes instead of among each class. This technique is called Local Naive Bayes Nearest Neighbors technique. The code for it is as follows.

    % Extract train image features
    for i = 1:size(train_image_paths, 1)
        temp = {};
        img = imread(train_image_paths{i});
        [~, image_feats] = vl_dsift(im2single(img), 'step', 25, 'fast');
        class_image_feats = [class_image_feats image_feats];
        [temp{1:size(image_feats, 2)}] = deal(train_labels{i});
        class_image_labels = [class_image_labels; temp'];
    train_image_feats = single(class_image_feats);
    kdforest = vl_kdtreebuild(train_image_feats, 'NumTrees', 4);

    % Extract test image features
    predicted_categories = cell(size(test_image_paths, 1), 1);
    for i = 1:size(test_image_paths, 1)
        img = imread(test_image_paths{i});
        [~, image_feats] = vl_dsift(im2single(img), 'step', 15, 'fast');
        totals = ones(15, 1).*inf;
        for feat_ind = 1:size(image_feats, 2)
            di = image_feats(:, feat_ind);
            [p_ind, ~] = vl_kdtreequery(kdforest, train_image_feats, single(di), 'NumNeighbors', k, 'MAXNUMCOMPARISONS', 10);
            dist_B = norm(single(di) - train_image_feats(:, p_ind(end)), 2);
            for cat_ind = 1:size(categories, 1)
                label = categories{cat_ind};
                p_label_mask = strcmp(label, class_image_labels(p_ind));
                if sum(p_label_mask) == 0
                dist_C = bsxfun(@minus, single(di), train_image_feats(:, p_ind(p_label_mask)));
                dist_C = min(arrayfun(@(idx) norm(dist_C(:, idx)), 1:size(dist_C, 2)));
                label_ind = find(strcmp(label, categories));
                if totals(label_ind) == inf
                    totals(label_ind) = 0;
                totals(label_ind) = totals(label_ind) + dist_C - dist_B;
        [~, min_class_ind] = min(totals);
        predicted_categories{i} = categories(min_class_ind);


Since the number of features extracted from each image is too high, I initially tried clustering the features to a more manageable number and ran the NBNN algorithm on these features. But this too loses a lot of information and the accuracy I obtained was 33%. Next, I tried using the entire set of features and k nearest neighbors and it took ~ 5 min to classify each image. Finally I tried using approximate nearest neighbors from the vl_feat library. As previously mentioned, approximate nearest neighbors also reducues the accuracy very much and my accuracy was only 63%. I was able to get a better accuracy using knn with sift and gist descritpors (66.7%).


I did k fold cross-validation. I divided the train data into n_fold number of parts. In each iteration, I trained on n_fold - 1 parts and validated on the left out part. In addition I randomly choose 100 test images and found out the accuracy on each of these. I have reported the average validation accuracy and average test accuracy and the standard deviations.

Parameter tuning

For this part, I pass in a set of parameter values to my cross-validation function. I find out the average validation accuracy on each parameter, and choose the parameter with the best average validation accuracy. This is just a simple grid search. I find that having a high k ( number of cross validation folds) improves the stability of my parameter search. For small values of k (<=5), I get different best parameter values for different runs.