Project 4 / Scene Recognition with Bag of Words

Overview

In this project, we examine the task of scene recognition, starting with very simple methods -- tiny images and nearest neighbor classification -- and then moving on to more advanced methods -- bags of quantized local features and linear classifiers learned by support vector machines. The task is to classify scenes into one of 15 categories by training and testing on the 15 scene database (introduced in Lazebnik et al. 2006).

Example scenes from each category in the 15 scene dataset. Figure from Lazebnik et al. 2006.

Tiny Image with Nearest Neighbors

The tiny images method was proposed by Torralba, Fergus, and Freeman. In this method, all images are resized to 16x16. Each tiny image is then centered by subtracting its mean and normalized to unit length so that differences in overall intensity do not affect the classification step. The flattened 16x16 image becomes our 1x256 feature vector.
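
A minimal sketch of this feature construction (our reconstruction for illustration, not the project's exact code; it assumes grayscale input images and the Image Processing Toolbox for imresize) is given below:


% Sketch: tiny image features (reconstruction, not the original project code)
n_images = size(image_paths, 1);
image_feats = zeros(n_images, 256);
for i = 1:n_images
    img = im2double(imread(image_paths{i}));  % 15-scene images are grayscale
    tiny = imresize(img, [16 16]);            % resize to 16x16
    feat = tiny(:)';                          % flatten to a 1x256 vector
    feat = feat - mean(feat);                 % center around the mean
    feat = feat / norm(feat);                 % normalize to unit length
    image_feats(i, :) = feat;
end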

Here, we use the nearest neighbor algorithm for classification. During training, we simply compute and store the feature vectors of the training images. During testing, we compute the feature vector for each test image and find its closest neighbor among the training feature vectors using a distance measure (here, L2 distance). We assign the test image the class of the nearest training image. This extends naturally to k nearest neighbors, taking the mode of their labels as the predicted class.

The relevant code snippet is given below:


% Calculate full distance matrix of test images from every train image
dist_matrix = vl_alldist2(test_image_feats', train_image_feats');

% Get top k neighbors for each image
k = 1;
[~, idx] = sort(dist_matrix, 2);
top_k = train_labels(idx(:, 1:k));

predicted_categories = cell(size(test_image_feats, 1), 1);
for i = 1:size(top_k, 1)
    % Vote using majority label of nearest neighbors for each test image
    [unique_strings, ~, string_map] = unique(top_k(i, :));
    predicted_categories(i) = unique_strings(mode(string_map));
end

For all the following sections, we report our results as the average of 10 runs in which training and test images were randomly selected from the full set of images using cross-validation (discussed later).

Tiny Image with Nearest Neighbor Classifier: Accuracy (%)

Mean     Standard Deviation
22.9%    1.0%

Confusion matrix for k=1

Bag of SIFT with Nearest Neighbors

Bag of words models are a popular technique for image classification inspired by models used in natural language processing. The model ignores or downplays word arrangement (spatial information in the image) and classifies based on a histogram of the frequency of visual words. The visual word "vocabulary" is established by clustering a large corpus of local features.

Here, we represent an image as a bag of quantized SIFT features. Before we can represent our training and testing images as bag of feature histograms, we first need to establish a vocabulary of visual words. We form this vocabulary by sampling thousands of local features from our training set and then clustering them with kmeans. The number of kmeans clusters is the size of our vocabulary and the dimensionality of our features. This partitions the continuous, 128-dimensional SIFT feature space into k regions. For any new SIFT feature we observe, we can figure out which region it belongs to using the centroids of our original clusters. Those centroids are our visual word vocabulary.

The code to build our visual vocabulary is given below:


step = 8;
bin_size = 4;

if exist('sift_features.mat', 'file')
    load('sift_features.mat');
else
    fprintf('Generating SIFT features...\n');
    rows = size(image_paths, 1);
    sift_features = [];
    % Get SIFT features for images using parallel for loop
    parfor i = 1:rows
        disp(rows - i)
        img = im2single(imread(image_paths{i}));
        [~, img_sift_features] = vl_dsift(img, 'fast', 'Step', step, 'size', bin_size);
        sift_features = [sift_features img_sift_features];
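        % (growing sift_features by concatenation is a valid parfor reduction)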
    end
    save('sift_features.mat', 'sift_features');
end

% k-means to get vocabulary words
fprintf('Clustering to generate vocabulary...\n');
[centers, ~] = vl_kmeans(single(sift_features), vocab_size, 'MaxNumIterations', 250);
vocab = centers';

Now, we can represent our training and testing images as histograms of visual words. For each image, we densely sample many SIFT descriptors and simply count how many SIFT descriptors fall into each cluster in our visual word vocabulary. This is done by finding the nearest neighbor kmeans centroid for every SIFT feature. Thus, if we have a vocabulary of 50 visual words, and we detect 220 SIFT features in an image, our bag of SIFT representation will be a histogram of 50 dimensions where each bin counts how many times a SIFT descriptor was assigned to that cluster and sums to 220. The histogram should be normalized so that image size does not dramatically change the bag of feature magnitude.

The code to generate SIFT histograms is given below:


step = 8;
bin_size = 4;

n_Images = size(image_paths, 1);
image_feats = zeros(n_Images, vocab_size);

fprintf('\n Creating histogram features \n');
for i=1:n_Images
    if mod(i, 100) == 0
        disp(n_Images - i);
    end
    img = single(imread(char(image_paths(i))));
    % Densely sample SIFT features for this image
    [~, sift_features] = vl_dsift(img, 'fast', 'step', step, 'size', bin_size);
    % Find nearest word in vocabulary using L2 distance
    D = vl_alldist2(vocab', single(sift_features));
    [~, I] = min(D);
    % Get histogram counts
    image_feats(i, :) = histcounts(I, 1:vocab_size+1);
end
% Normalize across rows (L2)
image_feats = bsxfun(@rdivide, image_feats, sqrt(sum(image_feats.^2, 2)));

We now measure how well our bag of SIFT representation works when paired with a nearest neighbor classifier.

Bag of SIFT with Nearest Neighbor Classifier: Accuracy (%)

Mean     Standard Deviation
53.9%    2.2%

Confusion matrix for SIFT histogram feature representation and nearest neighbor classifier

Bag of SIFT with Linear SVM

Finally, we train 1-vs-all linear SVMs to operate in the bag of SIFT feature space. Linear classifiers are one of the simplest possible learning models. The feature space is partitioned by a learned hyperplane and test cases are categorized based on which side of that hyperplane they fall on. Despite this model being far less expressive than the nearest neighbor classifier, it will often perform better. For example, maybe in our bag of SIFT representation 40 of the 50 visual words are uninformative. They simply don't help us make a decision about whether an image is a 'forest' or a 'bedroom'. Perhaps they represent smooth patches, gradients, or step edges which occur in all types of scenes. The prediction from a nearest neighbor classifier will still be heavily influenced by these frequent visual words, whereas a linear classifier can learn that those dimensions of the feature vector are less relevant and thus downweight them when making a decision.

Here, we find linear decision boundaries with support vector machines. However, linear classifiers are inherently binary and we have a 15-way classification problem. To decide which of 15 categories a test case belongs to, we train 15 binary, 1-vs-all SVMs. 1-vs-all means that each classifier will be trained to recognize 'forest' vs 'non-forest', 'kitchen' vs 'non-kitchen', etc. All 15 classifiers will be evaluated on each test case and the classifier which is most confidently positive "wins". E.g. if the 'kitchen' classifier returns a score of -0.2 (where 0 is on the decision boundary), and the 'forest' classifier returns a score of -0.3, and all of the other classifiers are even more negative, the test case would be classified as a kitchen even though none of the classifiers put the test case on the positive side of the decision boundary. When learning a linear SVM, we have a free parameter 'lambda' which controls how strongly regularized the model is.

The code for the linear multi-class SVM is given below:


categories = unique(train_labels); 
num_categories = length(categories);

lambda = 0.00005;
[num_test, d] = size(test_image_feats);

svm_params = zeros(num_categories, d + 1);
X = train_image_feats';
% Train one-vs-all SVMs for all categories
for i = 1:num_categories
    % Create labels
    Y = double((strcmp(categories(i), train_labels) * 2) - 1);
    [W, B] = vl_svmtrain(X, Y, lambda);
    svm_params(i, :) = [W' B];
end

% Predict class with highest score
test_image_feats = [test_image_feats ones(num_test, 1)];
pred = svm_params * test_image_feats';

[~, idx] = max(pred, [], 1);
predicted_categories = categories(idx');

Bag of SIFT with linear SVM: Accuracy (%)

Mean     Standard Deviation
67.0%    1.4%

Confusion matrix for SIFT histogram feature representation and linear SVM classifier

Parameter Selection

This section dealt with multiple free parameters, some of which greatly impact performance. Here, we provide a summary of accuracy while varying these parameters.

Tiny Image with Nearest Neighbor Classifier: Varying k

k     Accuracy (%)
1     22.9
2     23.1
5     22.4
10    24.0
20    22.1
50    20.4

There is no significant pattern here, and accuracy does not seem to improve by increasing k. This is likely due to the noisy tiny image representation, which fails to separate the classes.

Bag of SIFT with Nearest Neighbor Classifier: Varying k, bin size and step size

k     Bin Size   Step Size   Accuracy (%)
1     2          8           50.2
5     2          8           51.3
10    2          8           48.9
1     4          8           53.9
5     4          8           53.5
10    4          8           51.7
1     2          16          49.7
5     2          16          49.3
10    2          16          48.7
1     4          16          53.5
5     4          16          52.7
10    4          16          52.5

Increasing bin size seemed to increase accuracy up to a point, after which accuracy went down. On the other hand, increasing step did not affect accuracy much while speeding up computation considerably. Thus, we chose a moderate bin size and a relatively high step.

Bag of SIFT with SVM: Varying lambda

lambda     Accuracy (%)
0.000001   60.5
0.000005   63.5
0.00001    64.1
0.00005    67.0
0.0001     66.4

Here, we kept the optimal step and bin size values from above and varied lambda, which affected accuracy significantly.

Optimal parameters

k        1
Size     4
Step     8
Lambda   0.00005

Graduate/Extra Credit

Feature representation: GIST Descriptors

The bag of words descriptor throws away all spatial information and classifies based only on the presence of certain words at particular frequencies in the image. The GIST descriptor, in contrast, also utilizes spatial information and should thus lead to a better representation and higher classification accuracy. The GIST descriptor proposes a set of perceptual dimensions (naturalness, openness, roughness, expansion, ruggedness) that represent the dominant spatial structure of a scene; these dimensions can be reliably estimated using spectral and coarsely localized information. We use the authors' publicly available code to compute the GIST descriptors for images. Here, we use GIST descriptors as complementary features by combining them with the SIFT descriptor histograms we found earlier.

The vocabulary is still computed from just the SIFT features, and the SIFT feature for an image is just the visual word histogram from above. We separately normalize the SIFT histogram and the GIST descriptor and concatenate them to get our feature representation. Finally, we normalize the concatenated representation once more.

The code for the GIST is given below:


clear param
param.imageSize = [256 256]; 
param.orientationsPerScale = [8 8 8 8];
param.numberBlocks = 4;
param.fc_prefilt = 4;

parfor i = 1:n_Images
    ...
    histg = histcounts(idx, hist_edges);
    histg = histg / sqrt(sum(histg.^2, 2));
    
    % GIST
    gist = LMgist(img, '', param);
    gist = gist / sqrt(sum(gist.^2, 2));
    
    image_feats(i, :) = [histg gist];
end
image_feats = bsxfun(@rdivide, image_feats, sqrt(sum(image_feats.^2, 2)));

GIST descriptors + Bag of SIFT with linear SVM: Accuracy (%)

Mean     Standard Deviation
77.1%    1.7%

Confusion matrix for GIST and SIFT histogram feature representation

Self-similarity measures took too long to compute (around 20 seconds per image) and thus were not used.

Feature quantization and bag of words: Kernel Codebook Encoding

Here, we use "soft assignment" to assign visual words to histogram bins. Each SIFT feature casts a distance-weighted vote into multiple bins. We weight the votes with a Gaussian of the scaled distance, which avoids irregularities such as an infinite weight on a vocabulary word. The feature representation is the sum of the votes of each SIFT feature in the image for each visual word in the dictionary. This effectively interpolates between visual words for features which do not clearly belong to a single cluster.

The code for kernel codebook encoding is given below:


...
sigma = 125; % For distance Gaussian matrix
for i=1:n_Images
    ...
    D = vl_alldist2(vocab', single(sift_features));
    % Convert squared distances to Gaussian weights
    D = exp(-1 * D / (2 * sigma * sigma));
    % Get weighted counts
    image_feats(i, :) = sum(D, 2);
end
% Normalize across rows
image_feats = bsxfun(@rdivide, image_feats, sqrt(sum(image_feats.^2, 2)));

Kernel codebook encoding of SIFT features with linear SVM: Accuracy (%)

Mean     Standard Deviation
63.1%    1.8%

Confusion matrix for kernel codebook encoding

Feature quantization and bag of words: Fisher Encoding

If a probability density function (in our case a Gaussian mixture model) is used to model the visual vocabulary, we can represent an image by the gradient of the log likelihood of its features with respect to the parameters of the model. The Fisher vector is the concatenation of these partial derivatives and describes in which direction the parameters of the model should be modified to best fit the data. This representation has the advantage of giving similar or even better classification performance than bag of visual words obtained with supervised visual vocabularies, while remaining class independent.
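
For reference, the standard per-Gaussian gradient formulas (following Perronnin et al.; stated here as a sketch in our notation, with operations taken per descriptor dimension) are

$$\mathcal{G}_{\mu_k} = \frac{1}{N\sqrt{\pi_k}} \sum_{n=1}^{N} \gamma_n(k)\,\frac{x_n - \mu_k}{\sigma_k}, \qquad \mathcal{G}_{\sigma_k} = \frac{1}{N\sqrt{2\pi_k}} \sum_{n=1}^{N} \gamma_n(k)\left[\frac{(x_n - \mu_k)^2}{\sigma_k^2} - 1\right],$$

where $\gamma_n(k)$ is the posterior (soft assignment) of descriptor $x_n$ to Gaussian $k$, and $\pi_k$, $\mu_k$, $\sigma_k$ are its mixture weight, mean, and standard deviation. The Fisher vector concatenates these gradients over all Gaussians, giving a $2 \times 128 \times K$ dimensional feature (here $K = 256$).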

Here, we learn a Gaussian mixture model of visual words using the set of training images. During testing, we compute the Fisher encoding, which is the concatenation of the normalized gradients with respect to each of the means and standard deviations. This method takes a modest amount of time to learn the GMM, but once that is done, computing the encoding is much faster than the hard vocabulary assignment used for bag of SIFT.

The code for Fisher encoding is given below:


...
gmm_clusters = 256;

...
% Learn GMM; assumes you have run build_vocabulary already
[means, covariance, priors] = vl_gmm(double(sift_features), gmm_clusters);
    
parfor i=1:n_Images
    ...
    % Get Fisher encoding
    encoding = vl_fisher(double(SIFT_features), means, covariance, priors);
    image_feats(i, :) = encoding;
end
% Normalize across rows
image_feats = bsxfun(@rdivide, image_feats, sqrt(sum(image_feats.^2, 2)));

Fisher encoding with linear SVM: Accuracy (%)

Mean     Standard Deviation
75.9%    1.6%

Confusion matrix for Fisher Encoding

Classifier: Kernel SVM

In addition to the linear SVM discussed above, we also tried SVMs with polynomial and radial basis function kernels to introduce non-linear decision boundaries. For this, we used Olivier Chapelle's MATLAB code to train primal SVMs using kernel matrices.
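
As a sketch of how these kernel matrices can be formed (our illustration, using the parameter values reported below; Chapelle's actual training interface is not shown here):


% Sketch: kernel matrices for the non-linear SVMs (our illustration, not
% necessarily the exact code passed to Chapelle's solver)
X = train_image_feats;                % N x d bag of SIFT histograms
sigma = 200;                          % RBF bandwidth (value used below)
D = vl_alldist2(X', X');              % N x N pairwise (squared) L2 distances
K_rbf = exp(-D / (2 * sigma^2));      % radial basis function kernel matrix

bias = 1;
expnt = 3;                            % polynomial kernel parameters (below)
K_poly = (X * X' + bias) .^ expnt;    % polynomial kernel matrix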

Bag of SIFT with SVM with radial basis function kernel: Accuracy (%)
sigma = 200

Mean     Standard Deviation
67.3%    1.7%

Confusion matrix for RBF kernel

Bag of SIFT with SVM with polynomial kernel: Accuracy (%)
bias = 1, exponent = 3

Mean     Standard Deviation
69.5%    1.7%

Confusion matrix for polynomial kernel

Spatial Pyramid: Feature representation

The spatial pyramid was proposed by Lazebnik et al. 2006 as a way to capture the spatial layout of an image, which is ignored by the bag of features approach. To build a spatial pyramid, we divide the image into sub-regions at multiple scales and compute a SIFT histogram in each region. This method thus complements the bag of SIFT method. We concatenate the existing bag of SIFT features with the features provided by the spatial pyramid scheme, since they represent complementary aspects of the image.

The code for Spatial Pyramid is given below:


...
image_feats = zeros(n_Images, vocab_size * 21);
hist_edges = 1:(vocab_size + 1);

parfor i = 1:n_Images
    ...
    [locations, sift_features] = vl_dsift(img, 'fast', 'step', step, 'size', bin_size);
    D = vl_alldist2(vocab', single(sift_features));
    [~, idx] = min(D);
    
    % Level 0 features (SIFT)
    pyramid_hist = histcounts(idx, hist_edges);

    % Spatial Pyramid
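    % (assumed: n is the image side length, set in the elided code above;
    % filter_locations is a helper that returns indices of descriptors whose
    % locations fall within the given row and column ranges)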
    % Level 1
    for r = 0:1
        for c = 0:1
            filt_idx = filter_locations(locations, [1 + r * (n / 2), (r + 1) * (n / 2)], [1 + c * (n / 2), (c + 1) * (n / 2)]);
            next_level = histcounts(idx(filt_idx), hist_edges);
            pyramid_hist = [pyramid_hist next_level];
        end
    end
    
    % Level 2
    for r = 0:3
        for c = 0:3
            filt_idx = filter_locations(locations, [1 + r * (n / 4), (r + 1) * (n / 4)], [1 + c * (n / 4), (c + 1) * (n / 4)]);
            next_level = histcounts(idx(filt_idx), hist_edges);
            pyramid_hist = [pyramid_hist next_level];
        end
    end
    
    image_feats(i, :) = pyramid_hist;
end
image_feats = bsxfun(@rdivide, image_feats, sqrt(sum(image_feats.^2, 2)));

Spatial pyramid + Bag of SIFT with linear SVM: Accuracy (%)

Mean     Standard Deviation
75.5%    1.6%

Confusion matrix for spatial pyramid

Experimental Design: Cross-validation

We implemented cross-validation to randomize training and test sets for better evaluation. We sampled 100 images each from the train and test folders for each category and shuffled them to get new train and test splits. The results of cross-validation are evident in that we report a mean and standard deviation instead of a single accuracy value in each of the previous sections. To calculate the mean and standard deviation, we ran each experiment for 10 iterations.

The code for cross-validation is given below:


...
% Randomly sample files for this category (not just the first 100 in lexicographic order)
images = dir( fullfile(data_path, 'train', categories{i}, '*.jpg'));
idx = datasample(1:length(images), num_train_per_cat, 'Replace', false);	
...
% Already have 1500 each of train and test images and labels
image_paths = [train_image_paths; test_image_paths];
labels = [train_labels; test_labels];
c = cvpartition(labels,'KFold',2);
train_idx = training(c, 1);
test_idx = test(c, 1);
train_image_paths = image_paths(train_idx);
test_image_paths = image_paths(test_idx);
train_labels = labels(train_idx);
test_labels = labels(test_idx);

Experimental Design: Varying Vocabulary Size

We can vary the vocabulary size by changing the number of k-means clusters while we are building the vocabulary. As expected, a larger vocabulary improves performance at the cost of additional computation. However, we notice a degradation for very high vocabulary sizes as a result of confusion between very similar words in the vocabulary. We finally decided to select a vocabulary size of 400 as it represented a good tradeoff in terms of performance vs. runtime.

Since vocabulary building takes a lot of time, we only present results for bag of SIFT with linear SVM classifier.

Bag of SIFT with linear SVM

Vocabulary Size   Accuracy (%)
10                50.6
50                60.7
200               65.6
400               67.0
1000              67.6
10000             65.2

Discussion

First, we present a summary of our results so far and some combinations of feature representations we have not yet discussed.

Summary of results: Accuracy (%)

Pipeline                                                          Mean     Standard Deviation
Tiny Image with Nearest Neighbor Classifier                       22.9%    1.0%
Bag of SIFT with Nearest Neighbor Classifier                      53.9%    2.2%
Bag of SIFT with linear SVM                                       67.0%    1.4%
Bag of SIFT with SVM with radial basis function kernel            67.3%    1.7%
Bag of SIFT with SVM with polynomial kernel                       69.5%    1.7%
Kernel codebook encoding of SIFT features with linear SVM         63.1%    1.8%
Fisher encoding with linear SVM                                   75.9%    1.6%
GIST descriptors + Bag of SIFT with linear SVM                    77.1%    1.7%
Spatial pyramid + Bag of SIFT with linear SVM                     75.5%    1.6%
SIFT + GIST + Spatial Pyramid with linear SVM                     80.2%    1.6%
SIFT + GIST + Spatial Pyramid + Fisher encoding with linear SVM   83.3%    1.7%

SIFT + GIST + Spatial Pyramid + Fisher encoding

Finally, we combined all of these complementary feature representations to get the best performance. This leads to a high-dimensional feature representation. To speed up computation, using PCA to reduce dimensionality at the cost of slightly lower accuracy seems like a good tradeoff.
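
A minimal sketch of such a reduction (our assumption -- the report does not show its PCA code; pca requires the Statistics and Machine Learning Toolbox, and d is a hypothetical target dimension) is:


% Sketch: PCA on the combined features (hypothetical, for illustration)
d = 512;                                   % hypothetical target dimensionality
[coeff, score] = pca(train_image_feats);   % principal components of training set
train_feats_reduced = score(:, 1:d);       % training features in PCA space
% project test features using the training mean and basis
mu = mean(train_image_feats, 1);
test_feats_reduced = bsxfun(@minus, test_image_feats, mu) * coeff(:, 1:d);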

SIFT + GIST + Spatial Pyramid + Fisher encoding with linear SVM: Accuracy (%)

Mean     Standard Deviation
83.3%    1.7%

Kernel SVMs do not do significantly better than a linear SVM; this could likely be improved by further tuning the kernel hyperparameters. Adding the spatial pyramid or GIST leads to significantly better performance. Fisher encoding is faster to compute than the bag of SIFT quantization and performs much better than bag of SIFT features. Finally, combining several sets of complementary features leads to the best performance. We present the detailed results for the best combination.

Confusion matrix for best combination

Per-category results for the best combination. The sample training images, true positives, false positives, and false negatives shown on the original page are omitted here; only the labels attached to the false positive and false negative examples are retained.

Category       Accuracy   True labels of false positives   Predicted labels of false negatives
Kitchen        0.680      LivingRoom, Store                Office, InsideCity
Store          0.720      LivingRoom, Kitchen              Kitchen, Kitchen
Bedroom        0.710      LivingRoom, LivingRoom           LivingRoom, Industrial
LivingRoom     0.680      Bedroom, Bedroom                 Kitchen, Bedroom
Office         0.950      Kitchen, Bedroom                 Kitchen, Bedroom
Industrial     0.820      Bedroom, Store                   Store, Bedroom
Suburb         1.000      Bedroom, Industrial              (none)
InsideCity     0.870      Street, Store                    Kitchen, TallBuilding
TallBuilding   0.910      Store, InsideCity                OpenCountry, InsideCity
Street         0.900      InsideCity, InsideCity           Industrial, InsideCity
Highway        0.870      Store, Store                     LivingRoom, Coast
OpenCountry    0.720      Coast, Coast                     Highway, Coast
Coast          0.850      TallBuilding, OpenCountry        OpenCountry, OpenCountry
Mountain       0.880      TallBuilding, Industrial         OpenCountry, Street
Forest         0.940      TallBuilding, OpenCountry        Mountain, TallBuilding