Project 4 / Scene Recognition with Bag of Words

Confusion Matrix of SIFT+SVM

The goal of this project is to introduce us to image recognition. Specifically, we will examine the task of scene recognition, starting with very simple methods -- tiny images and nearest neighbor classification -- and then moving on to more advanced methods -- bags of quantized local features and linear classifiers learned by support vector machines. My work in this project consists of the following three major parts.

  1. Tiny images representation and nearest neighbor classifier.
  2. Bag of SIFT representation and nearest neighbor classifier.
  3. Bag of SIFT representation and linear SVM classifier.

Part I: Tiny images representation and nearest neighbor classifier.

The tiny images representation is straightforward: resize each image to a fixed resolution, then normalize it to zero mean and unit variance. The nearest neighbor classifier is equally simple: we compute the pairwise Euclidean distances between test and training features and assign each test image the label of its closest training example.

code for Tiny images representation


function image_feats = get_tiny_images(image_paths)

dimension = [16 16];
num_images = size(image_paths, 1);
% preallocate: one row per image, one column per pixel
image_feats = zeros(num_images, dimension(1)*dimension(2));
for i = 1:num_images
    img = single(imresize(imread(image_paths{i}), dimension));
    % normalize to zero mean and unit variance over all pixels
    img = img - mean(img(:));
    img = img ./ std(img(:));
    image_feats(i,:) = reshape(img, [1, dimension(1)*dimension(2)]);
end

end

code for nearest neighbor classifier


function predicted_categories = nearest_neighbor_classify(train_image_feats, train_labels, test_image_feats)

% pairwise (squared) Euclidean distances: rows index training images,
% columns index test images
dist_matrix = vl_alldist2(train_image_feats', test_image_feats');
% for each test image, the index of its nearest training image
[~, I] = min(dist_matrix);
num_test_imgs = size(test_image_feats, 1);
predicted_categories = cell(num_test_imgs, 1);
for i = 1:num_test_imgs
    % predict the label of the closest training example
    predicted_categories{i} = train_labels{I(i)};
end

end
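
For reference, here is a minimal sketch of how these two functions fit together in the pipeline; the variable names are assumptions.


train_image_feats = get_tiny_images(train_image_paths);
test_image_feats  = get_tiny_images(test_image_paths);
predicted_categories = nearest_neighbor_classify(train_image_feats, train_labels, test_image_feats);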

Result

Below is the confusion matrix for the tiny images representation + nearest neighbor classifier. Each image is resized to a fixed resolution of 16*16. The accuracy is 20.3%.

Part II: Bag of SIFT representation and nearest neighbor classifier.

Before we can represent our training and testing images as bag-of-feature histograms, we first need to establish a vocabulary of visual words. We form this vocabulary by sampling many local features from the training set and clustering them with kmeans; the number of kmeans clusters is both the size of our vocabulary and the dimensionality of our histogram features. We can then represent each training and testing image as a histogram of visual words: we densely sample many SIFT descriptors and, instead of storing hundreds of descriptors per image, simply count how many fall into each cluster of the visual word vocabulary. The final step of the bag of SIFT representation is normalizing the histograms.
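
The vocabulary-building step itself is not shown in this report. Below is a minimal sketch of how it could look with VLFeat, assuming dense SIFT sampling and kmeans clustering; the function name build_vocabulary and the coarser sampling step of 10 are assumptions, not necessarily what the starter code provides.

code for building the vocabulary (sketch)


function vocab = build_vocabulary(image_paths, vocab_size)
% Hypothetical sketch: sample dense SIFT descriptors from the training
% images and cluster them with kmeans; the centers become the visual words.
num_imgs = size(image_paths, 1);
descriptors = [];
for i = 1:num_imgs
    img = single(imread(image_paths{i}));
    % a coarser step than at encoding time keeps the sample manageable (assumed)
    [~, features] = vl_dsift(img, 'step', 10);
    descriptors = [descriptors, single(features)];
end
% vocab is 128 x vocab_size, matching the size(vocab, 2) convention used below
vocab = vl_kmeans(descriptors, vocab_size);
end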

code for Bag of SIFT representation


function image_feats = get_bags_of_sifts(image_paths)
% vocab is a 128 x vocab_size matrix of visual words (kmeans centers)
load('vocab_400.mat')
vocab_size = size(vocab, 2);

step = 5;
num_imgs = size(image_paths,1);
image_feats_base = zeros(num_imgs, vocab_size);

% smooth each image slightly before extracting dense SIFT
Gaussian = fspecial('gaussian');

for i = 1:num_imgs
    img = single(imread(image_paths{i}));
    img = imfilter(img, Gaussian);
    [~, features] = vl_dsift(img, 'step', step);
    % assign each descriptor to its nearest visual word
    dist_matrix = vl_alldist2(vocab, single(features));
    [~, I] = min(dist_matrix);
    num_features = size(I, 2);
    for feature_id = 1:num_features
        image_feats_base(i, I(feature_id)) = image_feats_base(i, I(feature_id)) + 1;
    end
end
% normalization (by the largest bin count across all images)
image_feats = image_feats_base ./ max(image_feats_base(:));
end

Result

Below is the confusion matrix for Bag of SIFT representation + nearest neighbor classifier. A step size of 5 is used both when building the vocabulary and when extracting SIFT features, and I set the vocabulary size to 400. The accuracy is 50.4%.

Part III: Bag of SIFT representation and linear SVM classifier.

Under a linear SVM, the feature space is partitioned by a learned hyperplane, and test cases are categorized based on which side of the hyperplane they fall on. Despite this model being far less expressive than the nearest neighbor classifier, it often performs better. Since scene recognition is a multi-class problem, we train one 1-vs-all SVM per category; the predicted label comes from the category whose classifier reports the highest confidence.

code for linear SVM classifier


function predicted_categories = svm_classify(train_image_feats, train_labels, test_image_feats)
categories = unique(train_labels);
num_categories = length(categories);
num_imgs = size(test_image_feats, 1);
% regularization factor
lambda = 0.00001;
predicted_categories = cell(num_imgs, 1);
feat_dim = size(train_image_feats, 2);
% one column of W (and one entry of B) per 1-vs-all classifier
W = zeros(feat_dim, num_categories);
B = zeros(1, num_categories);

for i = 1:num_categories
    % +1 for images of this category, -1 for all others
    matching_indices = strcmp(categories{i}, train_labels);
    matching_indices = matching_indices * 2 - 1;
    [W(:,i), B(1,i)] = vl_svmtrain(train_image_feats', double(matching_indices), lambda);
end

for i = 1:num_imgs
    % assign the category whose classifier gives the highest confidence
    score = test_image_feats(i,:) * W + B;
    [~, I] = max(score);
    predicted_categories{i} = categories{I};
end

end

Result

Below is the confusion matrix for Bag of SIFT representation + linear SVM classifier. The hyperparameter lambda influences performance dramatically; I obtained the best result with lambda set to 0.00001. The accuracy is 68.5%.
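
To show how lambda could be tuned, here is a minimal sketch of a grid search over a held-out validation split. The split (val_image_feats, val_labels) and the extra lambda argument to svm_classify are assumptions for illustration, since the function above hardcodes lambda.

code for a lambda sweep (sketch)


% hypothetical sweep: evaluate each candidate lambda on a validation split
lambdas = [1e-2, 1e-3, 1e-4, 1e-5, 1e-6];
best_acc = 0;
best_lambda = lambdas(1);
for k = 1:numel(lambdas)
    % assumes svm_classify is modified to accept lambda as a fourth argument
    predictions = svm_classify(train_image_feats, train_labels, val_image_feats, lambdas(k));
    acc = mean(strcmp(predictions, val_labels));
    if acc > best_acc
        best_acc = acc;
        best_lambda = lambdas(k);
    end
end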

Extra Credit

1. Experiment with many different vocabulary sizes and report performance : 10, 20, 50, 100, 200, 400
2. Add spatial information to your features by implementing the spatial pyramid feature representation.
3. Additionally implement the pyramid match kernel.
4. Use Fisher, one of the more sophisticated feature encoding schemes analyzed in the comparative study of Chatfield et al.
5. Experiment with features at multiple scales.

1. Experiment with many different vocabulary sizes and report performance : 10, 20, 50, 100, 200, 400

I experimented with vocabulary sizes of 10, 20, 50, 100, 200, and 400 using the nearest neighbor classifier, obtaining accuracies of 41.7%, 44.9%, 32.6%, 51.5%, 50.4%, and 50.4%, respectively. From the chart below, we can see that accuracy generally increases with vocabulary size (apart from a dip at size 50), but the growth is negligible once the size exceeds 100. In addition, the running time increases drastically as the vocabulary grows.

2&3. Add spatial information to your features by implementing the spatial pyramid feature representation. Additionally implement the pyramid match kernel.

Instead of a single level, we now add levels 1 and 2. The three pyramid levels partition the image into 1*1, 2*2, and 4*4 grids respectively. The pyramid match kernel then concatenates the per-level histograms with predefined weights (1/4 for levels 0 and 1, 1/2 for level 2, as in the code below). This significantly improves our performance: recall that Bag of SIFT representation + linear SVM classifier gives an accuracy of 68.5%, while adding spatial information yields 75.4%.

code for pyramid match kernel


function image_feature_with_level = get_feature_with_level(image, vocab, level, step)
% number of grid cells at this pyramid level: 2^level x 2^level
max_rows = 2 ^ level;
max_cols = max_rows;
factor = max_rows * max_cols;
[m, n] = size(image);
vocab_size = size(vocab, 2);
image_feature_with_level = zeros(1, factor*vocab_size);

segment_id = 0;
for row = 1:max_rows
    for col = 1:max_cols
        % build a bag-of-SIFT histogram for this grid cell only
        tmp = zeros(1, vocab_size);
        cell_img = image((row-1)*floor(m/max_rows)+1:row*floor(m/max_rows), (col-1)*floor(n/max_cols)+1:col*floor(n/max_cols));
        [~, features] = vl_dsift(cell_img, 'step', step);
        dist_matrix = vl_alldist2(vocab, single(features));
        [~, I] = min(dist_matrix);
        num_features = size(I, 2);
        for feature_id = 1:num_features
            tmp(I(feature_id)) = tmp(I(feature_id)) + 1;
        end
        % concatenate the per-cell histograms
        image_feature_with_level(segment_id*vocab_size+1:(segment_id+1)*vocab_size) = tmp;
        segment_id = segment_id + 1;
    end
end
end

function image_feats = get_bags_of_sifts(image_paths)
load('vocab_200.mat')
vocab_size = size(vocab, 2);

step = 5;
num_imgs = size(image_paths,1);
% level 0: one histogram over the whole image
image_feats_base = zeros(num_imgs, vocab_size);

% level 1 has 2*2 = 4 cells, level 2 has 4*4 = 16 cells
level_1 = 4;
level_2 = 16;
image_feats_level_1 = zeros(num_imgs, level_1 * vocab_size);
image_feats_level_2 = zeros(num_imgs, level_2 * vocab_size);
Gaussian = fspecial('gaussian');

for i = 1:num_imgs
    img = single(imread(image_paths{i}));
    img = imfilter(img, Gaussian);
    [~, features] = vl_dsift(img, 'step', step);
    dist_matrix = vl_alldist2(vocab, single(features));
    [~, I] = min(dist_matrix);
    num_features = size(I, 2);
    for feature_id = 1:num_features
        image_feats_base(i, I(feature_id)) = image_feats_base(i, I(feature_id)) + 1;
    end
    image_feats_level_1(i,:) = get_feature_with_level(img, vocab, 1, step);
    image_feats_level_2(i,:) = get_feature_with_level(img, vocab, 2, step);
end
% concatenate the levels with pyramid match kernel weights
weights = [1/4, 1/4, 1/2];
image_feats = [image_feats_base*weights(1), image_feats_level_1*weights(2), image_feats_level_2*weights(3)];
% normalization
image_feats = image_feats ./ max(image_feats(:));
end

Before and after adding spatial information.

4. Use Fisher, one of the more sophisticated feature encoding schemes analyzed in the comparative study of Chatfield et al.

Fisher encoding is an alternative to the bag-of-words histogram. Rather than only counting visual word occurrences, it fits a Gaussian mixture model to the local SIFT descriptors and stores statistics of the differences between the dictionary elements (GMM components) and the pooled local features. With 50 components and 128-dimensional descriptors, each image is encoded as a 2*128*50 = 12800-dimensional vector.

code for Fisher encoding


function [train_image_feats, test_image_feats] = get_fisher_encoding(train_image_paths, test_image_paths)

step = 5;
numClusters = 50;
num_train_imgs = size(train_image_paths, 1);
num_test_imgs = size(test_image_paths, 1);

% pool dense SIFT descriptors from all training images to fit the GMM
data = [];
for i = 1:num_train_imgs
    img = single(imread(train_image_paths{i}));
    [~, features] = vl_dsift(img, 'norm', 'step', step);
    data = [data, single(features)];
end
[means, covariances, priors] = vl_gmm(data, numClusters);

% encode every image as a Fisher vector (dimension 2 * 128 * numClusters)
train_image_feats = [];
test_image_feats = [];
for j = 1:num_train_imgs
    img_train = single(imread(train_image_paths{j}));
    [~, train_sift] = vl_dsift(img_train, 'norm', 'step', step);
    train_image_feat = vl_fisher(single(train_sift), means, covariances, priors, 'normalized');
    train_image_feats = [train_image_feats; train_image_feat'];
end
for j = 1:num_test_imgs
    img_test = single(imread(test_image_paths{j}));
    [~, test_sift] = vl_dsift(img_test, 'norm', 'step', step);
    test_image_feat = vl_fisher(single(test_sift), means, covariances, priors, 'normalized');
    test_image_feats = [test_image_feats; test_image_feat'];
end
end

Result

Below is the confusion matrix for Fisher encoding + linear SVM classifier. I set the step size to 5 and the number of GMM clusters to 50. The accuracy is 73.2%, which is pretty good.

The best result I obtained is 75.4%, from the Bag of SIFT representation (with spatial pyramid) + linear SVM classifier.

Project 4 Results visualization.


Accuracy (mean of diagonal of confusion matrix) is 75.4%

(Thumbnail images are omitted. For each category, the labels in the third column are the true labels of sample false positives, and those in the fourth column are the wrong predicted labels of sample false negatives.)

Category name   Accuracy   False positives (true label)   False negatives (predicted label)
Kitchen         0.620      LivingRoom, Bedroom            Bedroom, Bedroom
Store           0.620      Suburb, LivingRoom             Highway, InsideCity
Bedroom         0.580      LivingRoom, Kitchen            Kitchen, LivingRoom
LivingRoom      0.610      Bedroom, Bedroom               Bedroom, Kitchen
Office          0.870      Kitchen, Industrial            Bedroom, Kitchen
Industrial      0.620      TallBuilding, LivingRoom       Office, Suburb
Suburb          0.980      OpenCountry, InsideCity        Store, LivingRoom
InsideCity      0.750      TallBuilding, Store            Suburb, Kitchen
TallBuilding    0.740      InsideCity, InsideCity         Kitchen, Industrial
Street          0.860      InsideCity, Bedroom            InsideCity, Industrial
Highway         0.890      Street, Street                 InsideCity, Street
OpenCountry     0.660      Industrial, Coast              Mountain, Coast
Coast           0.710      OpenCountry, Highway           Highway, Highway
Mountain        0.850      Forest, Store                  Highway, OpenCountry
Forest          0.950      OpenCountry, TallBuilding      Mountain, Mountain