The goal of this project was to perform scene recognition, specifically on the 15-scene database described in Lazebnik et al. 2006. There were two feature-generation methods (tiny images and bags of SIFT features) and two classifiers (K-nearest neighbors and a linear SVM) to implement. I then tested three combinations of classifier + feature-generation method, in addition to the baseline case that uses none of them. The combinations are presented below, from worst performing to best performing.
Tiny images is a very simple image representation, included as a baseline for comparison against the bag-of-SIFT representation. Each image is simply resized to a small fixed size: in this case a 16 x 16 pixel image, which is then flattened into a 1 x 256 vector and finally stored in an N x 256 matrix, where N is the number of images. One reason this representation performs poorly is that it is not robust to spatial shifts; it also discards the high-frequency content of the image.
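The tiny-image representation described above can be sketched as follows. The project's actual code is MATLAB; this is an illustrative NumPy sketch using block averaging for the resize (the function and variable names are my own, not from the project):

```python
import numpy as np

def tiny_image_feature(img, size=16):
    """Downsample a grayscale image to size x size by block averaging,
    then flatten it into a length size*size vector (256 for size=16).
    Illustrative sketch of the tiny-image feature; not the project's code."""
    h, w = img.shape
    # crop so the image divides evenly into size x size blocks
    h2, w2 = h - h % size, w - w % size
    img = img[:h2, :w2].astype(np.float64)
    # group pixels into size x size blocks and average each block
    blocks = img.reshape(size, h2 // size, size, w2 // size)
    small = blocks.mean(axis=(1, 3))        # size x size image
    return small.flatten()                  # flattened feature vector

# stack N images into an N x 256 feature matrix
imgs = [np.random.rand(240, 320) for _ in range(3)]   # dummy grayscale images
feats = np.vstack([tiny_image_feature(im) for im in imgs])
```

Stacking one flattened vector per image yields the N x 256 matrix described above.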
KNN looks at the K nearest neighbors of a test feature among the training features and classifies it by majority vote. It is very fast since it requires no training, but it is vulnerable to noise. That can be partially remedied by using larger K values, so I avoided the 1-NN case in order to increase accuracy. As seen below, I tested many K values up to 10 and found that K = 5 performed the best.
K Value | Accuracy (% Correct) |
---|---|
1 | 19.1 |
2 | 19.2 |
3 | 19.8 |
4 | 20.5 |
5 | 20.8 |
6 | 20.0 |
7 | 19.9 |
8 | 19.4 |
9 | 19.6 |
10 | 19.7 |
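The majority-vote classification described above can be sketched as follows. This is an illustrative Python sketch, not the project's MATLAB implementation, and the names are my own:

```python
import numpy as np
from collections import Counter

def knn_predict(train_feats, train_labels, test_feats, k=5):
    """Classify each test feature by majority vote among its k nearest
    training features (Euclidean distance). Illustrative sketch only."""
    preds = []
    for x in test_feats:
        d = np.linalg.norm(train_feats - x, axis=1)  # distance to every training point
        nearest = np.argsort(d)[:k]                  # indices of the k closest
        votes = Counter(train_labels[i] for i in nearest)
        preds.append(votes.most_common(1)[0][0])     # majority label
    return preds

# toy example with two well-separated classes
train = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], float)
labels = ["a", "a", "a", "b", "b", "b"]
preds = knn_predict(train, labels, np.array([[0.2, 0.2], [10.5, 10.5]]), k=3)
# -> ['a', 'b']
```

With k=1 a single noisy training point can flip a prediction, which is why larger K values are more robust, as noted above.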
```matlab
for index = 1:num_entries
    img = imread(image_paths{index});

    % dense SIFT over the image (step of 7 pixels, fast approximation)
    [~, SIFT_features] = vl_dsift(single(img), 'step', 7, 'fast');

    % distance from every descriptor to every vocabulary word
    D = vl_alldist2(single(SIFT_features), vocab);
    [~, I] = min(D, [], 2);   % index of the nearest word per descriptor

    % calculating the frequency of each index, and creating a histogram out of it
    current_histogram = zeros(1, vocab_size);
    for i = I'
        current_histogram(1, i) = current_histogram(1, i) + 1;
    end

    % normalizing so the histogram sums to 1
    image_feats(index, :) = current_histogram ./ sum(current_histogram);
end
```
One implementation choice I made was passing a 'step' of 7 to vl_dsift. Sampling a descriptor at every single pixel would have required too much memory, causing constant failures when the function could not allocate enough, and a larger step also makes the call run faster. The 'fast' parameter speeds it up further, sacrificing some accuracy for a drastically reduced run time.
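The word-assignment and histogram step in the MATLAB loop above can also be sketched compactly in NumPy. The function name is mine, and the 2-D descriptors in the example stand in for real 128-D SIFT descriptors:

```python
import numpy as np

def bag_of_sift_histogram(descriptors, vocab):
    """Assign each descriptor to its nearest vocabulary word and build an
    L1-normalized histogram of word counts.
    descriptors: (num_desc, d); vocab: (vocab_size, d).
    Sketch of the MATLAB loop above (d = 128 for real SIFT)."""
    # pairwise squared Euclidean distances, shape (num_desc, vocab_size)
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                      # nearest word per descriptor
    hist = np.bincount(nearest, minlength=len(vocab)).astype(float)
    return hist / hist.sum()                         # normalize to sum to 1

# toy example: 3 "words" in 2-D, 4 descriptors
vocab = np.array([[0., 0.], [5., 5.], [10., 10.]])
desc = np.array([[0.1, 0.], [5.2, 4.9], [9.8, 10.], [0., 0.2]])
h = bag_of_sift_histogram(desc, vocab)
# -> [0.5, 0.25, 0.25]
```

`np.bincount` plays the role of the explicit frequency-counting loop in the MATLAB version.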
This combination of features and classifier performed the best, yielding an accuracy of 64.5%. The only difference from the previous combination is that I use a linear SVM classifier instead of KNN. The SVM classifier iterates over every unique category and generates binary labels by setting all indices matching that category to 1 and everything else to -1. I also fine-tuned the value of lambda to yield the highest accuracy. For each category, I passed train_image_feats, the binary labels, and lambda into vl_svmtrain and saved the resulting weights and offsets. Finally, I computed the confidence as W*X + B, where '*' is the dot product and W and B are the learned hyperplane parameters, specifically the weights and offsets calculated before. For each test feature, I then retrieve the index of the maximum confidence and return the corresponding label.
Lambda | Accuracy (% Correct) |
---|---|
0.0001 | 60.0 |
0.00001 | 63.3 |
0.000001 | 64.5 |
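The one-vs-all prediction step described above can be sketched as follows. This is an illustrative NumPy sketch of the decision rule only; the training call (vl_svmtrain) is not reproduced, and the weights, offsets, and category list here are made-up toy values:

```python
import numpy as np

categories = ["Kitchen", "Forest"]           # toy category list
# hypothetical learned hyperplanes: one weight row and one offset per category
W = np.array([[ 2.0, -1.0],
              [-1.0,  2.0]])
b = np.array([0.1, -0.1])

def svm_predict(test_feats, W, b, categories):
    """confidence = W.x + b for each category; return the argmax label."""
    conf = test_feats @ W.T + b              # (num_test, num_categories)
    return [categories[i] for i in conf.argmax(axis=1)]

preds = svm_predict(np.array([[1., 0.], [0., 1.]]), W, b, categories)
# -> ['Kitchen', 'Forest']

# binary labels for one-vs-all training of a single category, as described above:
train_labels = np.array(["Kitchen", "Forest", "Kitchen"])
binary = np.where(train_labels == "Kitchen", 1, -1)   # -> [1, -1, 1]
```

Each category gets its own hyperplane, and the most confident hyperplane wins, which matches the argmax-over-confidences rule described above.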
Per-category results (the sample training and true-positive images from the original results page are omitted; only the text labels of the misclassified examples survived):

Category name | Accuracy | False positives (true label) | False negatives (predicted label)
---|---|---|---
Kitchen | 0.540 | InsideCity, InsideCity | InsideCity, TallBuilding
Store | 0.490 | InsideCity, LivingRoom | Kitchen, Mountain
Bedroom | 0.410 | LivingRoom, LivingRoom | Industrial, Office
LivingRoom | 0.190 | Bedroom, Bedroom | Kitchen, Industrial
Office | 0.850 | Kitchen, Bedroom | Bedroom, LivingRoom
Industrial | 0.470 | Kitchen, OpenCountry | OpenCountry, Street
Suburb | 0.920 | Street, Industrial | Highway, Industrial
InsideCity | 0.560 | Store, Highway | TallBuilding, Kitchen
TallBuilding | 0.790 | LivingRoom, Kitchen | Forest, Coast
Street | 0.620 | Forest, InsideCity | TallBuilding, Bedroom
Highway | 0.810 | Street, Street | Coast, Forest
OpenCountry | 0.500 | Industrial, Highway | Highway, Suburb
Coast | 0.790 | OpenCountry, TallBuilding | OpenCountry, Highway
Mountain | 0.830 | OpenCountry, OpenCountry | LivingRoom, OpenCountry
Forest | 0.910 | OpenCountry, OpenCountry | Mountain, Suburb