In this project, we examine the task of scene recognition starting with very simple methods -- tiny images and nearest neighbor classification -- and then move on to more advanced methods -- bags of quantized local features and linear classifiers learned by support vector machines. The task is to classify scenes into one of 15 categories by training and testing on the 15 scene database (introduced in Lazebnik et al. 2006.
Example scenes from of each category in the 15 scene dataset. Figure from Lazebnik et al. 2006.
The tiny images method was proposed by Torralba, Fergus, and Freeman. In this method, all images are resized to 16x16. The image is then centered around the mean and normalized to prevent differences in intensities to affect us in the classification step. The 16x16 image is now our 1x256-sized feature vector.
Here, we use the nearest neighbor algorithm for classification. During training, we just compute the feature vectors of training images. During testing, we compute the feature vector for each test image and find its closest neighbor from among the feature vectors of training images using a distance measure (here we use L2 distance). We assign the test image to the same class as that of the nearest train image. This can be extended to k nearest neighbors and taking the mode to find the class.
The relevant code snippet is given below:
% Calculate full distance matrix of test images from every train image
dist_matrix = vl_alldist2(test_image_feats', train_image_feats');
% Get top k neighbors for each image
k = 1;
[~, idx] = sort(dist_matrix, 2);
top_k = train_labels(idx(:, 1:k));
predicted_categories = cell(size(test_image_feats, 1), 1);
for i = 1:size(top_k, 1)
% Vote using majority label of nearest neighbors for each test image
[unique_strings, ~, string_map] = unique(top_k(i, :));
predicted_categories(i) = unique_strings(mode(string_map));
end
For all the following sections, we report our results as an average of 10 runs where training and test images were randomly selected from the full set of images using cross validation (discussed later).
Tiny Image with Nearest Neighbor Classifier: Accuracy(%)
Mean | Standard Deviation |
---|---|
22.9% | 1.0% |
Confusion matrix for k=1
Bag of words models are a popular technique for image classification inspired by models used in natural language processing. The model ignores or downplays word arrangement (spatial information in the image) and classifies based on a histogram of the frequency of visual words. The visual word "vocabulary" is established by clustering a large corpus of local features.
Here, we represent an image as a bag of quantized SIFT features. Before we can represent our training and testing images as bag of feature histograms, we first need to establish a vocabulary of visual words. We form this vocabulary by sampling thousands of local features from our training set and then clustering them with kmeans. The number of kmeans clusters is the size of our vocabulary and the size of our features. This partitions the continuous, 128 dimensional SIFT feature space into k regions. For any new SIFT feature we observe, we can figure out which region it belongs to using the centroids of our original clusters. Those centroids are our visual word vocabulary.
The code to build our visual vocabulary is given below:
step = 8;
bin_size = 4;
if exist('sift_features.mat', 'file')
load('sift_features.mat');
else
fprintf('Generating SIFT features...\n');
rows = size(image_paths, 1);
sift_features = [];
% Get SIFT features for images using parallel for loop
parfor i = 1:rows
disp(rows - i)
img = im2single(imread(image_paths{i}));
[~, img_sift_features] = vl_dsift(img, 'fast', 'Step', step, 'size', bin_size);
sift_features = [sift_features img_sift_features];
end
save('sift_features.mat', 'sift_features');
end
% k-means to get vocabulary words
fprintf('Clustering to generate vocabulary...\n');
[centers, ~] = vl_kmeans(single(sift_features), vocab_size, 'MaxNumIterations', 250);
vocab = centers';
Now, we can represent our training and testing images as histograms of visual words. For each image, we densely sample many SIFT descriptors and simply count how many SIFT descriptors fall into each cluster in our visual word vocabulary. This is done by finding the nearest neighbor kmeans centroid for every SIFT feature. Thus, if we have a vocabulary of 50 visual words, and we detect 220 SIFT features in an image, our bag of SIFT representation will be a histogram of 50 dimensions where each bin counts how many times a SIFT descriptor was assigned to that cluster and sums to 220. The histogram should be normalized so that image size does not dramatically change the bag of feature magnitude.
The code to generate SIFT histograms is given below:
step = 8;
bin_size = 4;
n_Images = size(image_paths, 1);
image_feats = zeros(n_Images, vocab_size);
fprintf('\n Creating histogram features \n');
for i=1:n_Images
if mod(i, 100) == 0
disp(n_Images - i);
end
img = single(imread(char(image_paths(i))));
% Get sift features and sample 1000 of them, if more than 1000
[~, sift_features] = vl_dsift(img, 'fast', 'step', step, 'size', bin_size);
% Find nearest word in vocabulary using L2 distance
D = vl_alldist2(vocab', single(sift_features));
[~, I] = min(D);
% Get histogram counts
image_feats(i, :) = histcounts(I, 1:vocab_size+1);
end
% Normalize across rows (L2)
image_feats = bsxfun(@rdivide, image_feats, sqrt(sum(image_feats.^2, 2)));
We now measure how well our bag of SIFT representation works when paired with a nearest neighbor classifier.
Bag of SIFT with Nearest Neighbor Classifier: Accuracy(%)
Mean | Standard Deviation |
---|---|
53.9% | 2.2% |
Confusion matrix for SIFT histogram feature representation and nearest neighbor classifier
Finally, we train 1-vs-all linear SVMS to operate in the bag of SIFT feature space. Linear classifiers are one of the simplest possible learning models. The feature space is partitioned by a learned hyperplane and test cases are categorized based on which side of that hyperplane they fall on. Despite this model being far less expressive than the nearest neighbor classifier, it will often perform better. For example, maybe in our bag of SIFT representation 40 of the 50 visual words are uninformative. They simply don't help us make a decision about whether an image is a 'forest' or a 'bedroom'. Perhaps they represent smooth patches, gradients, or step edges which occur in all types of scenes. The prediction from a nearest neighbor classifier will still be heavily influenced by these frequent visual words, whereas a linear classifier can learn that those dimensions of the feature vector are less relevant and thus downweight them when making a decision.
Here, we find linear decision boundaries with support vector machines. However, linear classifiers are inherently binary and we have a 15-way classification problem. To decide which of 15 categories a test case belongs to, we train 15 binary, 1-vs-all SVMs. 1-vs-all means that each classifier will be trained to recognize 'forest' vs 'non-forest', 'kitchen' vs 'non-kitchen', etc. All 15 classifiers will be evaluated on each test case and the classifier which is most confidently positive "wins". E.g. if the 'kitchen' classifier returns a score of -0.2 (where 0 is on the decision boundary), and the 'forest' classifier returns a score of -0.3, and all of the other classifiers are even more negative, the test case would be classified as a kitchen even though none of the classifiers put the test case on the positive side of the decision boundary. When learning a linear SVM, we have a free parameter 'lambda' which controls how strongly regularized the model is.
The code for the linear multi-class SVM is given below:
categories = unique(train_labels);
num_categories = length(categories);
lambda = 0.00005;
[num_test, d] = size(test_image_feats);
svm_params = zeros(num_categories, d + 1);
X = train_image_feats';
% Train one-vs-all SVMs for all categories
for i = 1:num_categories
% Create labels
Y = double((strcmp(categories(i), train_labels) * 2) - 1);
[W, B] = vl_svmtrain(X, Y, lambda);
svm_params(i, :) = [W' B];
end
% Predict class with highest score
test_image_feats = [test_image_feats ones(num_test, 1)];
pred = svm_params * test_image_feats';
[~, idx] = max(pred, [], 1);
predicted_categories = categories(idx');
Bag of SIFT with linear SVM: Accuracy(%)
Mean | Standard Deviation |
---|---|
67.0% | 1.4% |
Confusion matrix for SIFT histogram feature representation and linear SVM classifier
This section dealt with multiple free paramters, some of which greatly impact performance. Here, we provide a summary of accuracy while varying these parameters.
k | Accuracy (%) |
---|---|
1 | 22.9 |
2 | 23.1 |
5 | 22.4 |
10 | 24.0 |
20 | 22.1 |
50 | 20.4 |
There is no signficant pattern here and accuracy does not seem to improve by increasing k. This is likely due to the noisy representation which fails to separate classes.
k | Step Size | Bin Size | Accuracy (%) |
---|---|---|---|
1 | 2 | 8 | 50.2 |
5 | 2 | 8 | 51.3 |
10 | 2 | 8 | 48.9 |
1 | 4 | 8 | 53.9 |
5 | 4 | 8 | 53.5 |
10 | 4 | 8 | 51.7 |
1 | 2 | 16 | 49.7 |
5 | 2 | 16 | 49.3 |
10 | 2 | 16 | 48.7 |
1 | 4 | 16 | 53.5 |
5 | 4 | 16 | 52.7 |
10 | 4 | 16 | 52.5 |
Increasing size seemed to increase accuracy up to a point after which the accuracy went down. On the other hand, increasing step did not affect accuracy too much and increased speed by a lot. Thus, we chose to go for a moderate value of size and relatively high value of step.
lambda | Accuracy (%) |
---|---|
0.000001 | 60.5 |
0.000005 | 63.5 |
0.00001 | 64.1 |
0.00005 | 67.0 |
0.0001 | 66.4 |
Here, we opted to stick with the optimal step and size values from above and tried to select lambda which affected accuracy significantly.
k | 1 |
Size | 4 |
Step | 8 |
Lambda | 0.00005 |
The bag of words descriptor throws away all spatial information and just looks for the presence of certain words at particular frequencies in the image to perform classification. The GIST descriptor also utilizes spatial information and thus should lead to a better represenation and thus accuracy in classification. The GIST descriptor proposes a set of perceptual dimensions (naturalness, openness, roughness, expansion, ruggedness) that represent the dominant spatial structure of a scene. These dimensions can be reliably estimated using spectral and coarsely localized information. We use code from the above link to compute the gist descriptors for images. Here, we use GIST descriptors as complementary features by combining them with SIFT descriptor histograms we found earlier.
The vocabulary is still computed from just the SIFT features, and the SIFT feature for an image is just the visual word histogram from above. We separately normalize the SIFT histogram and GIST descriptors and concatenate them to get our feature representation. Finally, we normalize the feature represenation once more.
The code for the GIST is given below:
clear param
param.imageSize = [256 256];
param.orientationsPerScale = [8 8 8 8];
param.numberBlocks = 4;
param.fc_prefilt = 4;
parfor i = 1:n_Images
...
histg = histcounts(idx, hist_edges);
histg = histg / sqrt(sum(histg.^2, 2));
% GIST
gist = LMgist(img, '', param);
gist = gist / sqrt(sum(gist.^2, 2));
image_feats(i, :) = [histg gist];
end
image_feats = bsxfun(@rdivide, image_feats, sqrt(sum(image_feats.^2, 2)));
GIST descriptors + Bag of SIFT with linear SVM: Accuracy(%)
Mean | Standard Deviation |
---|---|
77.1% | 1.7% |
Confusion matrix for GIST and SIFT histogram feature representation
Self-similarity measures took too long to compute at around ~20 seconds per image and thus were not used.
Here, we use "soft assignment" to assign visual words to histogram bins. Each visual word will cast a distance-weighted vote to multiple bins. We use exponentiation with a scaled Gaussian distance matrix to avoid irregularities such as infinity weight to a vocabulary word. The feature representation is the sum of the votes of each visual work in the image to each visual word in the dictionary. This is effectively interpolating visual words which do not clearly belong to some cluster.
The code for kernel codebook encoding is given below:
...
sigma = 125; % For distance Gaussian matrix
for i=1:n_Images
...
D = vl_alldist2(vocab', single(sift_features));
% Get weighted distance matrix
D = exp(-1 * D / (2 * sigma * sigma));
% Get weighted counts
image_feats(i, :) = sum(D, 2);
end
% Normalize across rows
image_feats = bsxfun(@rdivide, image_feats, sqrt(sum(image_feats.^2, 2)));
Kernel codebook encoding of SIFT features with linear SVM: Accuracy(%)
Mean | Standard Deviation |
---|---|
63.1% | 1.8% |
Confusion matrix for kernel codebook encoding
If a probability density function (in our case a Gaussian Mixture Model) is used to model the visual vocabulary, we can compute the gradient of the log likelihood with respect to the parameters of the model to represent an image. The Fisher Vector is the concatenation of these partial derivatives and describes in which direction the parameters of the model should be modified to best fit the data. This representation has the advantage to give similar or even better classification performance than BOV obtained with supervised visual vocabularies, being at the same time class independent.
Here, we learn a Gaussian Mixture Model of visual words using the set of training images. During testing, we get the Fisher encoding which is the concatenation of the normalizeed gradient with respect to each of the means and standard deviations. This method takes a modest amount of time to learn the GMM but once that is done, getting the encoding is much faster than dense SIFT.
The code for Fisher encoding is given below:
...
gmm_clusters = 256;
...
% Learn GMM; assumes you have run build_vocabulary already
[means, covariance, priors] = vl_gmm(double(sift_features), gmm_clusters);
parfor i=1:n_Images
...
% Get Fisher encoding
encoding = vl_fisher(double(SIFT_features), means, covariance, priors);
image_feats(i, :) = encoding;
end
% Normalize across rows
image_feats = bsxfun(@rdivide, image_feats, sqrt(sum(image_feats.^2, 2)));
Fisher encoding with linear SVM: Accuracy(%)
Mean | Standard Deviation |
---|---|
75.9% | 1.6% |
Confusion matrix for Fisher Encoding
In addition to the linear SVM discussed above, we also tried SVMs with polynomial and radial basis function kernels to introduce non-linear decision boundaries. For this, we used Olivier Chapelle's MATLAB code to train primal SVMs using kernel matrices.
Bag of SIFT with SVM with radial basis function kernel: Accuracy(%)
sigma = 200
Mean | Standard Deviation |
---|---|
67.3% | 1.7% |
Confusion matrix for RBF kernel
Bag of SIFT with SVM with polynomial kernel: Accuracy(%)
bias = 1, exponent = 3
Mean | Standard Deviation |
---|---|
69.5% | 1.7% |
Confusion matrix for polynomial kernel
Spatial pyramid was proposed by Lazebnik et al 2006 as a way to capture the spatial features of an image which are ignored by the bag of features approach. To build a spatial pyramid, we divide the image into different sub-regions and compute the SIFT histograms in each region. This method thus complements bag of SIFT method. We concatenate the existing SIFT features with the features provided by the spatial pyramid scheme since both of them represent complementary ascepts of the image.
The code for Spatial Pyramid is given below:
...
image_feats = zeros(n_Images, vocab_size * 21);
hist_edges = 1:(vocab_size + 1);
parfor i = 1:n_Images
...
[locations, sift_features] = vl_dsift(img, 'fast', 'step', step, 'size', bin_size);
D = vl_alldist2(vocab', single(sift_features));
[~, idx] = min(D);
% Level 0 features (SIFT)
pyramid_hist = histcounts(idx, hist_edges);
% Spatial Pyramid
% Level 1
for r = 0:1
for c = 0:1
filt_idx = filter_locations(locations, [1 + r * (n / 2), (r + 1) * (n / 2)], [1 + c * (n / 2), (c + 1) * (n / 2)]);
next_level = histcounts(idx(filt_idx), hist_edges);
pyramid_hist = [pyramid_hist next_level];
end
end
% Level 2
for r = 0:3
for c = 0:3
filt_idx = filter_locations(locations, [1 + r * (n / 4), (r + 1) * (n / 4)], [1 + c * (n / 4), (c + 1) * (n / 4)]);
next_level = histcounts(idx(filt_idx), hist_edges);
pyramid_hist = [pyramid_hist next_level];
end
end
image_feats(i, :) = pyramid_hist;
end
image_feats = bsxfun(@rdivide, image_feats, sqrt(sum(image_feats.^2, 2)));
Spatial pyramid + Bag of SIFT with linear SVM: Accuracy(%)
Mean | Standard Deviation |
---|---|
75.5% | 1.6% |
Confusion matrix for spatial pyramid
We implemented cross-validation to randomize training and test sets for better evaluation. We sampled 100 images each from train and test folders for each category and shuffled them to get the new train and test splits. The results of cross-validation are evident in that we have mean and standard deviation instead of just a singular value for accuracy in each of the previous sections. For calculating the mean and standard deviation, we ran the experiments for 10 iterations each.
The code for cross-validation is given below:
...
% Randomly sample files including files beyond 100th in lexicographical order
images = dir( fullfile(data_path, 'train', categories{i}, '*.jpg'));
idx = datasample(1:length(images), num_train_per_cat, 'Replace', false);
...
% Already have 1500 each of train and test images and labels
image_paths = [train_image_paths; test_image_paths];
labels = [train_labels; test_labels];
c = cvpartition(labels,'KFold',2);
train_idx = training(c, 1);
test_idx = test(c, 1);
train_image_paths = image_paths(train_idx);
test_image_paths = image_paths(test_idx);
train_labels = labels(train_idx);
test_labels = labels(test_idx);
We can vary the vocabulary size by changing the number of k-means clusters while we are building the vocabulary. As expected, a larger vocabulary improves performance at the cost of additional computation. However, we notice a degradation for very high vocabulary sizes as a result of confusion between very similar words in the vocabulary. We finally decided to select a vocabulary size of 400 as it represented a good tradeoff in terms of performance vs. runtime.
Since vocabulary building takes a lot of time, we only present results for bag of SIFT with linear SVM classifier.
Bag of SIFT with linear SVM
Vocabulary Size | Accuracy(%) |
---|---|
10 | 50.6% |
50 | 60.7% |
200 | 65.6% |
400 | 67.0% |
1000 | 67.6% |
10000 | 65.2% |
First, we present a summary of our results so far and some combinations of feature representations we have not yet discussed.
Summary of results: Accuracy(%)
Vocabulary Size | Mean |
---|---|
10 | 50.6% |
50 | 60.7% |
200 | 65.6% |
400 | 67.0% |
1000 | 67.6% |
10000 | 65.2% |
Finally, we combined all these complementary feature representations to get the best performance. This leads to a high dimensional feature representation. To speed up computation, using PCA to reduce dimensionality at the cost of slightly lower accuracy seems like a good tradeoff.
SIFT + GIST + Spatial Pyramid + Fisher encoding with linear SVM: Accuracy(%)
Pipeline | Mean | Standard Deviation |
---|---|---|
Tiny Image with Nearest Neighbor Classifier | 22.9% | 1.0% |
Bag of SIFT with Nearest Neighbor Classifier | 53.9% | 2.2% |
Bag of SIFT with linear SVM | 67.0% | 1.4% |
Bag of SIFT with SVM with radial basis function kernel | 67.3% | 1.7% |
Bag of SIFT with SVM with polynomial kernel | 69.5% | 1.7% |
Kernel codebook encoding of SIFT features with linear SVM | 63.1% | 1.8% |
Fisher encoding with linear SVM | 75.9% | 1.6% |
GIST descriptors + Bag of SIFT with linear SVM | 77.1% | 1.7% |
Spatial pyramid + Bag of SIFT with linear SVM | 75.5% | 1.6% |
SIFT + GIST + Spatial Pyramid with linear SVM | 80.2% | 1.6% |
SIFT + GIST + Spatial Pyramid + Fisher encoding with linear SVM | 83.3% | 1.7% |
Kernel SVMs do not do significantly better than a linear SVM. This can likely be fixed by further tuning of hyperparameters for the kernels. Adding spatial pyramid or GIST leads to significantly better performance. Fisher encoding is a faster alternative to dense SIFT and performs much better than bag of SIFT features. Finally, combining several sets of complementary features leads to the best performance. We present the detailed results for the best combination.
Confusion matrix for best combination
Category name | Accuracy | Sample training images | Sample true positives | False positives with true label | False negatives with wrong predicted label | ||||
---|---|---|---|---|---|---|---|---|---|
Kitchen | 0.680 | LivingRoom |
Store |
Office |
InsideCity |
||||
Store | 0.720 | LivingRoom |
Kitchen |
Kitchen |
Kitchen |
||||
Bedroom | 0.710 | LivingRoom |
LivingRoom |
LivingRoom |
Industrial |
||||
LivingRoom | 0.680 | Bedroom |
Bedroom |
Kitchen |
Bedroom |
||||
Office | 0.950 | Kitchen |
Bedroom |
Kitchen |
Bedroom |
||||
Industrial | 0.820 | Bedroom |
Store |
Store |
Bedroom |
||||
Suburb | 1.000 | Bedroom |
Industrial |
||||||
InsideCity | 0.870 | Street |
Store |
Kitchen |
TallBuilding |
||||
TallBuilding | 0.910 | Store |
InsideCity |
OpenCountry |
InsideCity |
||||
Street | 0.900 | InsideCity |
InsideCity |
Industrial |
InsideCity |
||||
Highway | 0.870 | Store |
Store |
LivingRoom |
Coast |
||||
OpenCountry | 0.720 | Coast |
Coast |
Highway |
Coast |
||||
Coast | 0.850 | TallBuilding |
OpenCountry |
OpenCountry |
OpenCountry |
||||
Mountain | 0.880 | TallBuilding |
Industrial |
OpenCountry |
Street |
||||
Forest | 0.940 | TallBuilding |
OpenCountry |
Mountain |
TallBuilding |
||||
Category name | Accuracy | Sample training images | Sample true positives | False positives with true label | False negatives with wrong predicted label |