Confusion Matrix of SIFT+SVM
The goal of this project is to introduce us to image recognition. Specifically, we will examine the task of scene recognition starting with very simple methods -- tiny images and nearest neighbor classification -- and then move on to more advanced methods -- bags of quantized local features and linear classifiers learned by support vector machines. My work in this project consists of the following three major parts.
The implementation of Tiny images representation is first resizing each image to a fixed resolution, and normalize the image via zero mean and unit variance. On the other hand, the implementation of nearest neighbour classifier is also straightforward. We calculate the pairwise Euclidean distance between features and chooses the closest one as our predicted label.
function image_feats = get_tiny_images(image_paths)
dimension = [16 16];
image_feats = [];
num_images = size(image_paths, 1);
for i = 1:num_images
img = single(imresize(imread(image_paths{i}),dimension));
% normalization
img = bsxfun(@minus,img,mean(img));
img = img ./ std2(img);
image_feats = [image_feats;reshape(img,[1,dimension(1)*dimension(2)])];
end
end
function predicted_categories = nearest_neighbor_classify(train_image_feats, train_labels, test_image_feats)
dist_matrix = vl_alldist2(train_image_feats',test_image_feats');
[Y,I] = min(dist_matrix);
num_test_imgs = size(test_image_feats,1);
predicted_categories = cell(num_test_imgs,1);
for i = 1:num_test_imgs
predicted_categories{i} = train_labels{I(i)};
end
end
Below is the confusion matrix for tiny images representation + nearest neighbour classifier. The resolution of image is fixed to 16*16. The accuracy is 20.3%.
Before we can represent our training and testing images as bag of feature histograms, we first need to establish a vocabulary of visual words. We will form this vocabulary by sampling many local features from our training set and then clustering them with kmeans. The number of kmeans clusters is the size of our vocabulary and the size of our features. After that we are capable of representing our training and testing images as histograms of visual words. For each image we will densely sample many SIFT descriptors. Instead of storing hundreds of SIFT descriptors, we simply count how many SIFT descriptors fall into each cluster in our visual word vocabulary. The final step of Bag of SIFT representation is normalizing the histgrams.
function image_feats = get_bags_of_sifts(image_paths)
load('vocab_400.mat')
vocab_size = size(vocab, 2);
step = 5;
num_imgs = size(image_paths,1);
image_feats_base = zeros(num_imgs,vocab_size);
Gaussian=fspecial('gaussian');
for i=1:num_imgs
img = single(imread(image_paths{i}));
[m,n] = size(img);
img = imfilter(img, Gaussian);
[locations, features] = vl_dsift(img, 'step', step);
dist_matrix = vl_alldist2(vocab, single(features));
[Y,I] = min(dist_matrix);
num_features = size(I,2);
for feature_id=1:num_features
image_feats_base(i, I(feature_id)) = image_feats_base(i, I(feature_id)) + 1;
end
end
% normalization
image_feats = image_feats ./ max(image_feats(:));
end
Below is the confusion matrix for Bag of SIFT representation + nearest neighbour classifier. The step used in building the vocabulary and extracting SIFT features is both 5, and I set the vocabulary size to be 400. The accuracy is 50.4%.
Under SVM, the feature space is partitioned by a learned hyperplane and test cases are categorized based on which side of that hyperplane they fall on. Despite this model being far less expressive than the nearest neighbor classifier, it will often perform better. The predicted label is generated from the category with the highest confidence
function predicted_categories = svm_classify(train_image_feats, train_labels, test_image_feats)
categories = unique(train_labels);
num_categories = length(categories);
num_imgs = size(test_image_feats, 1);
% regularization factor
lambda = 0.00001;
predicted_categories = cell(num_imgs, 1);
vocab_size = size(train_image_feats, 2);
% initialize W and B matrix
W = zeros(vocab_size, num_categories);
B = zeros(1, num_categories);
for i = 1:num_categories
matching_indices = strcmp(categories{i},train_labels);
matching_indices = matching_indices * 2-1;
[W(:,i), B(1,i)] = vl_svmtrain(train_image_feats', double(matching_indices), lambda);
end
for i = 1:num_imgs
score = test_image_feats(i,:) * W + B;
[Y, I] = max(score);
predicted_categories{i}=categories{I};
end
end
Below is the confusion matrix for Bag of SIFT representation + linear SVM classifier. The hyper parameter lambda influences the performance dramatically, and I obtained the best result when set lambda to be 0.00001. The accuracy is 68.5%.
1. Experiment with many different vocabulary sizes and report performance : 10, 20, 50, 100, 200, 400
2. Add spatial information to your features by implementing the spatial pyramid feature representation.
3. Additionally implement the pyramid match kernel.
4. Use Fisher, one of the more sophisticated feature encoding schemes analyzed in the comparative study of Chatfield et al.
5. Experiment with features at multiple scales.
I experiment with vocabulary size 10, 20, 50, 100, 200, 400 + nearest neighbour classifier and the accuracy is 41.7%, 44.9%, 32.6%, 51.5%, 50.4%, 50.4%, respectively. From the chart generated below, we can see there's a increase from size 10 to 50, yet the growth is negligible after size exceeds 100. In addition, the running time increases drastically when the size of vocabulary grows.
Instead of single level, we now add level 1 and level 2. The three levels of pyramid have size 1*1, 2*2, and 4*4 respectively. Pyramid match kernel then concatenate the individual histgrams using a predefined weight. This significantly improves our performance. Recall that Bag of SIFT representation + linear SVM classifier gives us an accuracy of 68.5%, and adding spatial information yields an accuracy of 75.4%
function image_feature_with_level = get_feature_with_level(image, vocab, level, step)
if level == 1
factor = 4;
else
factor = 16;
end
max_rows = 2 ^ level;
max_cols = max_rows;
[m, n] = size(image);
vocab_size = size(vocab, 2);
image_feature_with_level = zeros(1, factor*vocab_size);
segment_id = 0;
for row = 1:max_rows
for col = 1:max_cols
tmp = zeros(1, vocab_size);
[locations, features] = vl_dsift(image((row-1)*floor(m/max_rows)+1:row*floor(m/max_rows), (col-1)*floor(n/max_cols)+1:col*floor(n/max_cols)), 'step', step);
dist_matrix = vl_alldist2(vocab,single(features));
[Y,I] = min(dist_matrix);
num_features = size(I,2);
for feature_id = 1:num_features
tmp(I(feature_id)) = tmp(I(feature_id))+1;
end
image_feature_with_level(segment_id*vocab_size+1:(segment_id+1)*vocab_size) = tmp;
segment_id = segment_id + 1;
end
end
end
function image_feats = get_bags_of_sifts(image_paths)
load('vocab_200.mat')
vocab_size = size(vocab, 2);
step = 5;
num_imgs = size(image_paths,1);
% level 0
image_feats_base = zeros(num_imgs,vocab_size);
level_1 = 4;
level_2 = 16;
image_feats_level_1 = zeros(num_imgs,level_1 * vocab_size);
image_feats_level_2 = zeros(num_imgs,level_2 * vocab_size);
Gaussian=fspecial('gaussian');
for i=1:num_imgs
img = single(imread(image_paths{i}));
[m,n] = size(img);
img = imfilter(img, Gaussian);
[locations, features] = vl_dsift(img, 'step', step);
dist_matrix = vl_alldist2(vocab, single(features));
[Y,I] = min(dist_matrix);
num_features = size(I,2);
for feature_id=1:num_features
image_feats_base(i, I(feature_id)) = image_feats_base(i, I(feature_id)) + 1;
end
image_feats_level_1(i,:) = get_feature_with_level(img, vocab, 1, step);
image_feats_level_2(i,:) = get_feature_with_level(img, vocab, 2, step);
end
% concatenate
weights = [1/4, 1/4, 1/2];
image_feats = [image_feats_base*weights(1), image_feats_level_1*weights(2), image_feats_level_2*weights(3)];
% normalization
image_feats = image_feats ./ max(image_feats(:));
end
Fisher encoding works similarly to SIFT. However, rather than storing visual word occurrences only, fisher encoding store a statistics of the difference between dictionary elements and pooled local features.
function [train_image_feats, test_image_feats] = get_fisher_encoding(train_image_paths, test_image_paths)
num_imgs = size(train_image_paths, 1);
step = 5;
numClusters = 50;
data = [];
for i = 1:num_imgs
img = single(imread(train_image_paths{i}));
[locations, features] = vl_dsift(img, 'norm','step',step);
data = [data, single(features)];
end
[means, covariances, priors] = vl_gmm(data, numClusters);
train_image_feats = [];
test_image_feats = [];
for j = 1:num_imgs
img_train = single(imread(train_image_paths{j}));
img_test = single(imread(test_image_paths{j}));
[locations, train_sift] = vl_dsift(img_train, 'norm','step',step);
[locations, test_sift] = vl_dsift(img_test, 'norm','step',step);
train_image_feat = vl_fisher(single(train_sift), means, covariances, priors, 'normalized');
test_image_feat = vl_fisher(single(test_sift), means, covariances, priors, 'normalized');
train_image_feats = [train_image_feats; train_image_feat'];
test_image_feats = [test_image_feats; test_image_feat'];
end
end
Below is the confusion matrix for Fisher encoding + linear SVM classifier. I set the step to be 5, and the number of clusters to be 50. The accuracy is 73.2%, which is pretty good.
The best result I obtained is 75.4%, from Bag of SIFT representation(with spatial pyramid feature representation) + linear SVM classifier.
Accuracy (mean of diagonal of confusion matrix) is 75.4%
Category name | Accuracy | Sample training images | Sample true positives | False positives with true label | False negatives with wrong predicted label | ||||
---|---|---|---|---|---|---|---|---|---|
Kitchen | 0.620 | LivingRoom |
Bedroom |
Bedroom |
Bedroom |
||||
Store | 0.620 | Suburb |
LivingRoom |
Highway |
InsideCity |
||||
Bedroom | 0.580 | LivingRoom |
Kitchen |
Kitchen |
LivingRoom |
||||
LivingRoom | 0.610 | Bedroom |
Bedroom |
Bedroom |
Kitchen |
||||
Office | 0.870 | Kitchen |
Industrial |
Bedroom |
Kitchen |
||||
Industrial | 0.620 | TallBuilding |
LivingRoom |
Office |
Suburb |
||||
Suburb | 0.980 | OpenCountry |
InsideCity |
Store |
LivingRoom |
||||
InsideCity | 0.750 | TallBuilding |
Store |
Suburb |
Kitchen |
||||
TallBuilding | 0.740 | InsideCity |
InsideCity |
Kitchen |
Industrial |
||||
Street | 0.860 | InsideCity |
Bedroom |
InsideCity |
Industrial |
||||
Highway | 0.890 | Street |
Street |
InsideCity |
Street |
||||
OpenCountry | 0.660 | Industrial |
Coast |
Mountain |
Coast |
||||
Coast | 0.710 | OpenCountry |
Highway |
Highway |
Highway |
||||
Mountain | 0.850 | Forest |
Store |
Highway |
OpenCountry |
||||
Forest | 0.950 | OpenCountry |
TallBuilding |
Mountain |
Mountain |
||||
Category name | Accuracy | Sample training images | Sample true positives | False positives with true label | False negatives with wrong predicted label |