In this project, I implement scene recognition algorithms using different feature representation, feature quantization, and classification methods, and evaluate the performance and accuracy of different combinations of these methods. The basic combinations are listed below, and the extra credit features are discussed later.
This combination of the tiny images representation and the nearest neighbor classifier is the simplest configuration of methods; I get 20% accuracy with this setting.
The tiny images representation simply resizes an image to a 16-by-16 array and uses it as a 16 * 16 = 256 dimensional feature. The 256-dimensional feature is also normalized for better performance.
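As a minimal sketch of this representation (variable names such as image_paths are illustrative, and I assume the common zero-mean, unit-length normalization):

% Tiny image representation: resize each image to 16x16 and flatten
d = 16;
image_feats = zeros(length(image_paths), d * d);
for i = 1 : length(image_paths)
    img = imresize(single(imread(image_paths{i})), [d d]);
    feat = img(:)';                         % flatten to a 1 x 256 row vector
    feat = feat - mean(feat);               % zero mean
    image_feats(i, :) = feat / norm(feat);  % unit length
end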
The nearest neighbor classifier simply finds the training example with the smallest feature distance to the test image. In this project, I use a one-nearest-neighbor classifier and take the label of the nearest training example as the predicted class for the test image.
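A sketch of 1-NN classification using VLFeat's vl_alldist2, the same distance routine used later for the bag of SIFT features (variable names are illustrative):

% 1-NN: each test image takes the label of its closest training example
D = vl_alldist2(train_image_feats', test_image_feats'); % train-by-test distances
[~, nn] = min(D, [], 1);                 % index of the nearest training example
predicted_categories = train_labels(nn); % copy its label to the test image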
By using a better feature representation method, I can improve the accuracy to 50%.
The bag of SIFT representation requires building a visual word vocabulary and extracting bag of SIFT features from the training and test images. When building the vocabulary, I sample the SIFT feature descriptors with a step of 20. The step can be relatively large for building the vocabulary, which improves speed: if I use a smaller step such as 15, I get around 53% accuracy but suffer a large increase in processing time. For extracting bag of SIFT features, I use a smaller step of 10 to sample the SIFT features in an image, so that more SIFT features contribute to the visual word histogram. For each SIFT feature, I simply add 1 to the histogram bin representing the visual word cluster centroid closest to it. Finally, the histogram bins are normalized to form the bag of SIFT features, as in the sketch below.
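A sketch of the two stages, assuming VLFeat's vl_dsift and vl_kmeans (vocab_size and the variable names are illustrative):

% Stage 1: build the vocabulary by clustering densely sampled SIFT features
all_feats = [];
for i = 1 : length(train_image_paths)
    img = single(imread(train_image_paths{i}));
    [~, feats] = vl_dsift(img, 'fast', 'step', 20); % coarse step for speed
    all_feats = [all_feats, feats];                 % 128 x N descriptor matrix
end
centers = vl_kmeans(single(all_feats), vocab_size); % 128 x vocab_size centroids
vocab = centers';

% Stage 2: hard-assignment histogram for one image
[~, feats] = vl_dsift(single(img), 'fast', 'step', 10); % finer step
D = vl_alldist2(vocab', single(feats));  % centroid-to-feature distances
[~, nearest] = min(D, [], 1);            % closest visual word per feature
h = histc(nearest, 1 : vocab_size);      % one vote per SIFT feature
image_feats(i, :) = h / norm(h);         % normalize the histogram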
By using SVM classifiers with bag of SIFT representation, I get around 62% accuracy.
I use N binary SVM classifiers, where N is the number of categories. Each classifier is trained on the bag of SIFT features of the training images. Training uses an important regularization parameter LAMBDA. By experimenting, I find that 0.0003 achieves the best accuracy: a smaller value can still achieve an accuracy of 61% - 62%, but a LAMBDA larger than 0.0005 drops the accuracy significantly. For each test image, its bag of SIFT features goes through the N SVM classifiers, and the classifier with the largest output gives the predicted category of the test image, as sketched below.
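One way to realize the one-vs-all scheme with VLFeat's vl_svmtrain (a sketch; the use of vl_svmtrain and the variable names are assumptions, not necessarily the exact code behind the 62% result):

% One-vs-all linear SVMs: one binary classifier per category
categories = unique(train_labels);
num_categories = length(categories);
scores = zeros(num_categories, size(test_image_feats, 1));
for c = 1 : num_categories
    % +1 for this category's training images, -1 for all the others
    y = double(strcmp(categories{c}, train_labels));
    y(y == 0) = -1;
    [W, B] = vl_svmtrain(train_image_feats', y, LAMBDA);
    scores(c, :) = W' * test_image_feats' + B; % decision values on the test set
end
[~, best] = max(scores, [], 1);                % the largest output wins
predicted_categories = categories(best);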
Instead of making each SIFT feature vote for its single nearest cluster centroid, I implemented soft assignment, which adds a value to the histogram bins weighted by the distance between a SIFT feature and the cluster centroids. The weight is calculated as exp(-d^2 / (2 * sigma^2)), where d is the distance between the feature and a centroid. The weight is essentially a Gaussian centered at distance 0: the farther the distance, the smaller the weight. Sigma is the most important parameter, as it controls the shape of the Gaussian. Since the distances are of magnitude 10^5, sigma should be of magnitude 10^5 as well; by experiment, 200000 seems to be the best value here.
However, the accuracy actually drops to around 45%, compared to 62% when using simple assignment. The reason may be that adding weight to all the centroids introduces too much noise. Therefore, I modified the method to add weight only to the N closest centroids, where N is an adjustable parameter; by experiment, I find that 3 is the best value for N. With this improvement, I get an accuracy of 64%, which is slightly better than using simple assignment.
% For each image (i is the image index in the enclosing loop):
[locations, SIFT_features] = vl_dsift(single(img), 'fast', 'step', 10);
% D(c, j) is the distance from visual word c to the j-th SIFT feature
D = vl_alldist2(vocab', single(SIFT_features));
for j = 1 : size(D, 2)
    % find the 3 closest cluster centroids for this feature
    [~, idx] = sort(D(:, j));
    idx = idx(1 : 3);
    % Gaussian weights: closer centroids receive larger votes
    e = exp(-((D(idx, j)' .^ 2) ./ (2 * (SIGMA ^ 2))));
    image_feats(i, idx) = image_feats(i, idx) + e;
end
Since a linear classifier has limitations in separating the classes in a high-dimensional feature space, more complicated kernels such as Gaussian-like kernels may do a better job of separating the points. Instead of using a linear SVM classifier, I compute an RBF kernel for the SVM classifiers, using Olivier Chapelle's MATLAB code: http://olivier.chapelle.cc/primal/. The kernel has a parameter sigma that defines its Gaussian shape; I set it to 50 according to experiments.
Unfortunately, the RBF kernel actually decreases the accuracy to 50%.
% Calculate the RBF kernels. Chapelle's primal_svm reads the data matrix X
% and the kernel matrix K through globals, so the training and test features
% are stacked into one matrix and indexed into (this assumes compute_kernel
% follows Chapelle's global-X convention).
global X K;
hp.type = 'rbf';
hp.sig = 50;
N = size(train_image_feats, 1);
M = size(test_image_feats, 1);
X = [train_image_feats; test_image_feats];
K = compute_kernel(1:N, 1:N, hp);            % train-vs-train kernel
test_K = compute_kernel(N+1 : N+M, 1:N, hp); % test-vs-train kernel
% For each category i (labels holds the +1/-1 labels for category i;
% the first argument 0 selects primal_svm's nonlinear mode):
opt.cg = 1;
[beta, b] = primal_svm(0, labels, LAMBDA, opt);
predicted_labels(i, :) = test_K * beta + b;
Per-category results of the best configuration. For each category, the table lists the accuracy, the true labels of sample false positives, and the wrong predicted labels of sample false negatives.

Category name | Accuracy | False positives with true label | False negatives with wrong predicted label
---|---|---|---
Kitchen | 0.680 | Store, LivingRoom | InsideCity, LivingRoom
Store | 0.450 | Bedroom, InsideCity | Street, Forest
Bedroom | 0.390 | Office, Kitchen | TallBuilding, LivingRoom
LivingRoom | 0.280 | Kitchen, Street | Bedroom, TallBuilding
Office | 0.890 | LivingRoom, Kitchen | Kitchen, Suburb
Industrial | 0.410 | Highway, Kitchen | Highway, Mountain
Suburb | 0.930 | Highway, Forest | Street, Coast
InsideCity | 0.460 | Store, Store | Highway, Industrial
TallBuilding | 0.810 | LivingRoom, InsideCity | Industrial, Mountain
Street | 0.610 | OpenCountry, TallBuilding | InsideCity, Store
Highway | 0.770 | Store, OpenCountry | Coast, Coast
OpenCountry | 0.330 | TallBuilding, Industrial | Suburb, Coast
Coast | 0.810 | Suburb, Highway | OpenCountry, Highway
Mountain | 0.820 | Industrial, Highway | LivingRoom, Suburb
Forest | 0.920 | Store, OpenCountry | Mountain, Mountain