Confusion matrix for vocab_size = 400, using SVM classifier.
This project implements image recognition with several methods. First, we use a simple baseline: tiny images (16x16) with a nearest neighbor classifier. Then we replace the image representation with a bag of features. Finally, we use an SVM instead of nearest neighbor as the classifier. So there are three steps: tiny images with nearest neighbor, bag of SIFT features with nearest neighbor, and bag of SIFT features with a linear SVM.
To represent images with a bag of features, we need to build a vocabulary from the training images. The default vocabulary size is 200. For the extra credit part, I tested sizes of 10, 50, 100, 200, and 400. The results are shown in a later section.
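% tiny image
for i = 1 : num
    % resize to dim x dim and convert to single precision
    image_feat = single(imresize(imread(image_paths{i}), [dim, dim]));
    % subtract the mean of each column (mean() of a matrix works column-wise)
    image_feat = bsxfun(@minus, image_feat, mean(image_feat));
    % scale by the standard deviation of the whole patch
    image_feat = image_feat ./ std2(image_feat);
    % flatten into one row of the feature matrix
    image_feats(i, :) = reshape(image_feat, [1, dim * dim]);
end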
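% nearest neighbor
% Dist(j, i) is the distance from training image j to test image i
Dist = vl_alldist2(train_image_feats', test_image_feats');
% I(i) is the index of the closest training image for test image i
[~, I] = min(Dist);
...
for i = 1 : num
    predicted_categories{i} = train_labels{I(i)};
end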
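The nearest-neighbor rule above can also be tried without VLFeat. The following is a minimal sketch in base MATLAB that uses squared Euclidean distance; the toy features and labels are hypothetical and only illustrate the shapes involved (one row per image).

```matlab
% Hypothetical toy features: rows are images, columns are feature dimensions.
train_image_feats = [0 0; 10 10; 0 10];
train_labels = {'Coast'; 'Forest'; 'Street'};
test_image_feats = [1 1; 9 8];

% Squared Euclidean distances: one row per training image, one column per test image.
train_sq = sum(train_image_feats .^ 2, 2);
test_sq = sum(test_image_feats .^ 2, 2)';
Dist = bsxfun(@plus, train_sq, test_sq) - 2 * (train_image_feats * test_image_feats');

% For each test image (column), take the label of the closest training image.
[~, I] = min(Dist);
predicted_categories = reshape(train_labels(I), [], 1);
```

Here the first test image lands closest to the 'Coast' example and the second to the 'Forest' example.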
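% vocabulary construction
sample = [];
for i = 1 : num
    data = single(imread(image_paths{i}));
    % dense SIFT descriptors, one 128-D column per sampled location
    [~, features] = vl_dsift(data, 'step', 5);
    sample = horzcat(sample, features);
end
% cluster the descriptors; each column of vocab is one visual word
[vocab, ~] = vl_kmeans(single(sample), vocab_size);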
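% bag of SIFT
for i = 1 : num
    A = single(imread(image_paths{i}));
    [~, features] = vl_dsift(A, 'step', 5);
    % D(k, j) is the distance from visual word k to descriptor j
    D = vl_alldist2(vocab, single(features));
    % I(j) is the nearest visual word for descriptor j
    [~, I] = min(D);
    num_features = size(I, 2);
    % count how many descriptors fall on each visual word
    for j = 1 : num_features
        image_feats_0(i, I(j)) = image_feats_0(i, I(j)) + 1;
    end
end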
% linear svm
for i = 1 : num_categories
    % one-vs-all labels: +1 for this category, -1 for all others
    matching_indices = strcmp(categories{i}, train_labels);
    matching_indices = matching_indices * 2 - 1;
    % vl_svmtrain expects one column per training example, hence the transpose
    [W(:, i), B(1, i)] = vl_svmtrain(train_image_feats', double(matching_indices), LAMBDA);
end
for i = 1 : num_images
    for j = 1 : num_categories
        score(1, j) = test_image_feats(i, :) * W(:, j) + B(1, j);
    end
    % predict the category whose classifier is most confident
    [~, index] = max(score);
    predicted_categories{i} = categories{index};
end
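The double loop over images and categories in the scoring step can be written as a single matrix product. This is a minimal sketch with hypothetical toy values for W, B, and the test features; it assumes the same layout as above (one row per test image, one column of W per category).

```matlab
% Hypothetical toy data: 3 test images, 2-D features, 2 categories.
test_image_feats = [1 0; 0 1; 2 2];   % one row per test image
W = [1 -1; -1 1];                     % one column of weights per category
B = [0 0.5];                          % one bias per category
categories = {'Kitchen', 'Store'};

% Scores for all images and categories at once (num_images x num_categories).
scores = bsxfun(@plus, test_image_feats * W, B);

% Pick the most confident one-vs-all classifier for each image.
[~, index] = max(scores, [], 2);
predicted_categories = reshape(categories(index), [], 1);
```

The result matches what the loop version would produce for the same inputs, since each entry of scores is the same dot product plus bias.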
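% spatial_pyramid.m
% one level of the pyramid: an sp_size x sp_size grid of cells
% length and width hold the image's row and column counts;
% part and temp_feats must be initialized before this loop
sp_size = 2 ^ level;
for row_part = 1 : sp_size
    for col_part = 1 : sp_size
        % dense SIFT on the sub-image for this grid cell
        [~, features] = vl_dsift(image((row_part - 1) * floor(length / sp_size) + 1 : row_part * floor(length / sp_size), ...
            (col_part - 1) * floor(width / sp_size) + 1 : col_part * floor(width / sp_size)), 'step', 5);
        D = vl_alldist2(vocab, single(features));
        [~, I] = min(D);
        num_features = size(I, 2);
        % histogram of visual words for this cell
        for j = 1 : num_features
            temp_feats(I(j)) = temp_feats(I(j)) + 1;
        end
        % store the cell histogram in its slot, then reset for the next cell
        image_feats(part * vocab_size + 1 : (part + 1) * vocab_size) = temp_feats(1 : vocab_size);
        temp_feats = zeros(1, vocab_size);
        part = part + 1;
    end
end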
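The snippet above builds the histograms for one pyramid level. To combine levels, one common choice is the weighting from Lazebnik et al.'s spatial pyramid matching, sketched below. The weights are an assumption about the scheme, not taken from the original code, and the per-level histograms are stand-ins filled with ones.

```matlab
% Hypothetical sizes, chosen only for illustration.
vocab_size = 4;
max_level = 2;                        % levels 0, 1, 2 -> 1x1, 2x2, 4x4 grids

% Stand-in per-level histograms: level l has 4^l cells of vocab_size bins each.
level_hists = cell(1, max_level + 1);
for level = 0 : max_level
    level_hists{level + 1} = ones(1, 4 ^ level * vocab_size);
end

% Assumed weights: 1/2^L for level 0, 1/2^(L - l + 1) for level l >= 1.
weights = zeros(1, max_level + 1);
weights(1) = 1 / 2 ^ max_level;
for level = 1 : max_level
    weights(level + 1) = 1 / 2 ^ (max_level - level + 1);
end

% Concatenate the weighted histograms into one feature vector.
image_feat = [];
for level = 0 : max_level
    image_feat = [image_feat, weights(level + 1) * level_hists{level + 1}];
end
```

With three levels, the feature vector has vocab_size * (1 + 4 + 16) entries, so finer levels dominate both the length and, under this weighting, the contribution to the match.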
Tiny image + Nearest neighbor. Images were resized to 16x16, and the accuracy was 20.3%, which is poor. This is expected, because resizing the images to such small patches discards most of the high-frequency information.
Bag of SIFT + Nearest neighbor. To speed up the process, I set the step size to 5 in vl_dsift, and the vocabulary size was 200. The accuracy was 46.3%. Compared to the tiny image representation, this one keeps more information and represents the images better. However, the classifier is still too simple to reach a satisfying accuracy.
Bag of SIFT + Linear SVM. Here I set the step size to 5 and the vocabulary size to 200, and I found that the best lambda was 0.00001. With these parameters, I got the best accuracy of the three methods, 67.4%. Although this is better than the two methods above, it can still be improved with a larger vocabulary or a spatial pyramid. I show that in the next section for the extra credit part.
For different vocabulary sizes, I got the following results. The accuracies were 23.6%, 50.7%, 64.1%, 67.4%, and 70.4% for sizes of 10, 50, 100, 200, and 400, respectively. The accuracy increased significantly from size 10 to 50 and from 50 to 100, and more slowly after that. Since the running time increases dramatically with vocab_size, the largest size I tested was 400, which gave acceptable accuracy as shown below.
I then implemented a 3-level spatial pyramid (1x1, 2x2, 4x4) with the pyramid match kernel and tested it with different vocabulary sizes. The accuracies were 56.6%, 66.9%, 74.3%, 76.6%, and 77.8% for sizes of 10, 50, 100, 200, and 400, respectively. Compared to the results above, the most interesting change is the large improvement at size 10, from 23.6% to 56.6%. The best accuracy, at size 400, reached 77.8%, which is good for a linear SVM classifier.
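To make the comparison concrete, the gain from the spatial pyramid at each vocabulary size can be computed directly from the two sets of accuracies reported above. This is only arithmetic on the reported numbers; the variable names are made up for the sketch.

```matlab
vocab_sizes = [10 50 100 200 400];
acc_flat    = [23.6 50.7 64.1 67.4 70.4];   % bag of SIFT + linear SVM
acc_pyramid = [56.6 66.9 74.3 76.6 77.8];   % with the 3-level spatial pyramid

% Gain in percentage points from adding the spatial pyramid.
gain = acc_pyramid - acc_flat;
for k = 1 : numel(vocab_sizes)
    fprintf('vocab_size %3d: %+.1f points\n', vocab_sizes(k), gain(k));
end
```

The gain shrinks steadily as the vocabulary grows, from 33.0 points at size 10 to 7.4 points at size 400, so the pyramid helps most when the vocabulary itself carries little information.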
This is the best result I could get. I implemented it with a 3-level spatial pyramid and a vocabulary size of 400.
Category name | Accuracy | False positives with true label | False negatives with wrong predicted label
---|---|---|---
Kitchen | 0.640 | OpenCountry, InsideCity | Bedroom, LivingRoom
Store | 0.610 | TallBuilding, Industrial | InsideCity, Forest
Bedroom | 0.660 | Store, LivingRoom | Kitchen, Industrial
LivingRoom | 0.630 | Store, Store | Bedroom, Bedroom
Office | 0.890 | LivingRoom, InsideCity | LivingRoom, LivingRoom
Industrial | 0.620 | TallBuilding, Store | OpenCountry, Highway
Suburb | 0.970 | Store, Industrial | Store, Store
InsideCity | 0.780 | Highway, Highway | TallBuilding, Kitchen
TallBuilding | 0.780 | Store, Store | Industrial, Forest
Street | 0.880 | InsideCity, LivingRoom | Store, Highway
Highway | 0.840 | Street, Store | Store, InsideCity
OpenCountry | 0.660 | Highway, Coast | Forest, Mountain
Coast | 0.750 | OpenCountry, OpenCountry | Highway, OpenCountry
Mountain | 0.820 | Kitchen, Store | OpenCountry, OpenCountry
Forest | 0.960 | OpenCountry, Mountain | Mountain, Mountain
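
Per-category accuracies like those in the table are the diagonal of a row-normalized confusion matrix. The following is a minimal sketch of that computation in base MATLAB; the labels are hypothetical toy data, not the project's actual predictions.

```matlab
% Hypothetical labels for six test images across three categories.
categories = {'Kitchen', 'Store', 'Bedroom'};
test_labels          = {'Kitchen', 'Kitchen', 'Store', 'Store', 'Bedroom', 'Bedroom'};
predicted_categories = {'Kitchen', 'Bedroom', 'Store', 'Store', 'Bedroom', 'Kitchen'};

% Rows are true categories, columns are predicted categories.
num_categories = numel(categories);
confusion = zeros(num_categories);
for i = 1 : numel(test_labels)
    row = find(strcmp(test_labels{i}, categories));
    col = find(strcmp(predicted_categories{i}, categories));
    confusion(row, col) = confusion(row, col) + 1;
end

% Normalize each row by the number of test images in that category.
confusion = bsxfun(@rdivide, confusion, sum(confusion, 2));
per_category_accuracy = diag(confusion)';
mean_accuracy = mean(per_category_accuracy);
```

Off-diagonal entries show which categories get confused with each other, which is the same information the false positive and false negative columns summarize.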