The goal of this project is to recognize 15 categories of scenes using two different image representations (tiny images and bag of SIFT) and two classification methods (nearest neighbor and a linear SVM classifier). The items below are the required combinations for this project, in order:
In addition, the following items have been implemented for extra credit:
For the tiny image representation, I resized each image to 16x16 and then normalized it to zero mean and unit standard deviation to form the feature vector. The nearest neighbor classifier compares the feature vector of each test image against every image in the training set and assigns the label of the nearest one as the prediction.
Using this approach, an accuracy of 21.3% was obtained.
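The core of this pipeline can be sketched as follows (a minimal sketch, assuming grayscale inputs and MATLAB's pdist2 for the distance computation; image_paths is the list of image file paths and the other variable names follow the conventions used in the later snippets):

% tiny image feature: resize to 16x16, then zero mean / unit standard deviation
image_feats = zeros(length(image_paths), 16 * 16);
for i = 1:length(image_paths)
    img = imread(image_paths{i});
    if size(img, 3) == 3, img = rgb2gray(img); end
    tiny = imresize(im2single(img), [16 16]);
    tiny = tiny(:)';                         % flatten to a 1 x 256 row vector
    tiny = (tiny - mean(tiny)) / std(tiny);  % zero mean, unit standard deviation
    image_feats(i, :) = tiny;
end

% 1-nearest neighbor: each test image takes the label of its closest training feature
D = pdist2(test_image_feats, train_image_feats);   % num_test x num_train Euclidean distances
[~, nearest] = min(D, [], 2);
predicted_categories = train_labels(nearest);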
For the bag of SIFT implementation, I extracted dense SIFT descriptors from each image using the VLFeat function "vl_dsift" with a certain STEP SIZE. I then clustered the descriptors into a certain number (VOCAB SIZE) of clusters. Using the cluster centers as the visual vocabulary, a histogram-based representation of each training image and each test image was created.
Note that the following results were obtained after parameter tuning. The parameters used were: vocabulary size = 400, step size for histogram creation = 5.
NOTE: The above parameters were chosen using cross validation over multiple vocabulary sizes; this is described in the extra credit section.
Using this approach, an accuracy of 51.1% was obtained.
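A minimal sketch of the histogram construction, assuming the vocabulary of cluster centers has already been built (e.g. with vl_kmeans) and saved as vocab.mat (the file name is an assumption); vl_alldist2 assigns each descriptor to its nearest visual word:

load('vocab.mat');                          % assumed: vocab is VOCAB_SIZE x 128 cluster centers
vocab_size = size(vocab, 1);
step_size = 5;                              % STEP SIZE for histogram creation

image_feats = zeros(length(image_paths), vocab_size);
for i = 1:length(image_paths)
    img = im2single(imread(image_paths{i}));
    [~, descriptors] = vl_dsift(img, 'Step', step_size, 'Fast');   % 128 x N dense SIFT
    % distance from every descriptor to every visual word, then nearest-word assignment
    dists = vl_alldist2(single(descriptors), single(vocab'));      % N x vocab_size
    [~, assignments] = min(dists, [], 2);
    counts = histc(assignments, 1:vocab_size);
    image_feats(i, :) = counts' / sum(counts);                     % L1-normalized histogram
end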
To verify the effectiveness of the classifier, the nearest neighbor classifier was replaced with 1-vs-all linear Support Vector Machines (SVMs). At evaluation time, a test image is assigned the label of the classifier with the highest response. The parameters used were: vocabulary size = 400, step size for histogram creation = 5, and LAMBDA = 0.0000003.
Using SVMs instead of the nearest neighbor classifier increased the accuracy to 70.2%.
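A minimal sketch of the 1-vs-all training and evaluation, assuming VLFeat's vl_svmtrain and the train_image_feats / test_image_feats matrices produced above (rows are images):

categories = unique(train_labels);
num_categories = length(categories);
lambda = 0.0000003;

d = size(train_image_feats, 2);
W = zeros(d, num_categories);
B = zeros(1, num_categories);

% train one binary linear SVM per category (that category vs. all others)
for c = 1:num_categories
    binary_labels = -ones(size(train_image_feats, 1), 1);
    binary_labels(strcmp(categories{c}, train_labels)) = 1;
    [W(:, c), B(c)] = vl_svmtrain(train_image_feats', binary_labels, lambda);
end

% each test image gets the label of the classifier with the highest response
scores = test_image_feats * W + repmat(B, size(test_image_feats, 1), 1);
[~, best] = max(scores, [], 2);
predicted_categories = categories(best);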
I varied the vocabulary size from 10 to 3000 in 12 steps (non-equidistant intervals) while keeping all other parameters the same, and plotted the resulting accuracy. In this plot, one can clearly see that accuracy is best at VOCAB SIZE = 400.
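The sweep itself can be driven by a loop of roughly the following shape (a sketch: build_vocabulary stands for the k-means vocabulary construction routine, and the listed sizes are only illustrative of a 12-value non-equidistant grid from 10 to 3000):

vocab_sizes = [10 20 50 100 200 300 400 600 800 1000 2000 3000];  % illustrative grid
accuracies = zeros(size(vocab_sizes));
for v = 1:length(vocab_sizes)
    vocab = build_vocabulary(train_image_paths, vocab_sizes(v));  % k-means over dense SIFT
    save('vocab.mat', 'vocab');
    % ... rebuild bag-of-SIFT features and classify as described above ...
    accuracies(v) = mean(strcmp(predicted_categories, test_labels));
end
plot(vocab_sizes, accuracies, '-o');
xlabel('vocabulary size'); ylabel('accuracy');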
I implemented cross validation by making 15 sets out of the dataset (training and testing), as suggested in the project document. For each set, I selected 100 random images from both the training and the test data. Since this was run inside a cross-validation framework, I did not plot a confusion matrix for this step. After 15 iterations with my best-tuned parameters, I obtained: "MEAN ACCURACY = 43.2%", "STANDARD DEVIATION OF ACCURACY = 3.2%".
Pseudo-code for cross-validation
fprintf('Getting paths and labels for all train and test data\n')
[train_image_paths_whole, test_image_paths_whole, train_labels_whole, test_labels_whole] = ...
    get_image_paths(data_path, categories, num_train_per_cat);

n = 15;                                    % number of cross-validation sets
order = 1:1500;                            % 1500 images total -> 100 images per set
neworder = order(randperm(1500));          % a random permutation of the data
neworder = reshape(neworder, numel(order)/n, n);  % each column is one 'set'
% NOTE: the reshape assumes the data divides evenly into the given number of sets.
for i = 1:n
    train_image_paths = train_image_paths_whole(neworder(:, i));
    test_image_paths  = test_image_paths_whole(neworder(:, i));
    train_labels      = train_labels_whole(neworder(:, i));
    test_labels       = test_labels_whole(neworder(:, i));
    % [INSERT REGULAR SCENE RECOGNITION CODE HERE]
end
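The reported mean and standard deviation can be computed by accumulating the per-fold accuracy inside the loop and summarizing afterwards (fold_accuracy is a hypothetical variable name):

% inside the loop, after classification of fold i:
fold_accuracy(i) = mean(strcmp(predicted_categories, test_labels));

% after all n folds:
fprintf('Mean accuracy = %.1f%%, std = %.1f%%\n', ...
    100 * mean(fold_accuracy), 100 * std(fold_accuracy));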
Based on the paper [1], the algorithm I followed (Naive Bayes Nearest Neighbor, NBNN) is as follows: compute local descriptors for the test image; for each descriptor, find its nearest neighbor among the descriptors of every class; sum those squared distances per class; and assign the test image to the class with the smallest total distance.
The parameters I used were the same as in case 2 (vocabulary size: 400, step size for histogram creation: 5). Compared to the standard nearest neighbor classifier, accuracy increased from 51.1% to 56.1%.
case 'naive bayes nearest neighbor'
    % implemented in the function 'naive_bayes_nearest_neighbor.m'
    predicted_categories = naive_bayes_nearest_neighbor(train_image_paths, train_labels, test_image_paths);
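A minimal sketch of what naive_bayes_nearest_neighbor.m could look like under the algorithm above (an assumption: dense SIFT via vl_dsift with the same step size, and MATLAB's knnsearch for the per-class nearest-neighbor lookups):

function predicted_categories = naive_bayes_nearest_neighbor(train_image_paths, train_labels, test_image_paths)
% NBNN [1]: classify each test image by the class that minimizes the summed
% squared distance from its descriptors to their per-class nearest neighbors.
categories = unique(train_labels);
num_categories = length(categories);
step_size = 5;

% collect the dense SIFT descriptors of all training images of each class
class_descriptors = cell(num_categories, 1);
for c = 1:num_categories
    paths_c = train_image_paths(strcmp(categories{c}, train_labels));
    descs_c = [];
    for j = 1:length(paths_c)
        img = im2single(imread(paths_c{j}));
        [~, d] = vl_dsift(img, 'Step', step_size, 'Fast');
        descs_c = [descs_c, d]; %#ok<AGROW>
    end
    class_descriptors{c} = single(descs_c)';     % rows are 128-D descriptors
end

% classify each test image
predicted_categories = cell(length(test_image_paths), 1);
for i = 1:length(test_image_paths)
    img = im2single(imread(test_image_paths{i}));
    [~, d] = vl_dsift(img, 'Step', step_size, 'Fast');
    d = single(d)';                              % M x 128 test descriptors
    class_dist = zeros(num_categories, 1);
    for c = 1:num_categories
        % distance from each test descriptor to its nearest neighbor in class c
        [~, nn_dist] = knnsearch(class_descriptors{c}, d);
        class_dist(c) = sum(nn_dist .^ 2);
    end
    [~, best] = min(class_dist);
    predicted_categories{i} = categories{best};
end
end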
Intuitively, GIST summarizes the gradient information (scales and orientations) for different parts of an image, which provides a rough description (the gist) of the scene.
While keeping the parameters the same (vocabulary size: 400, step size for histogram creation: 5, and LAMBDA = 0.0000003), I did two things: 1) replaced SIFT with GIST (+ SVM) to compare SIFT with GIST directly, and 2) concatenated GIST with SIFT (+ SVM) to achieve higher accuracy.
% with the function code 'get_gist_features.m'
% SIFT + GIST
train_sift = get_bags_of_sifts(train_image_paths);
test_sift = get_bags_of_sifts(test_image_paths);
train_gist = get_gist_features(train_image_paths);
test_gist = get_gist_features(test_image_paths);
train_image_feats = horzcat(train_sift, train_gist);
test_image_feats = horzcat(test_sift, test_gist);
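A minimal sketch of what get_gist_features.m could look like, assuming the publicly available LMgist MATLAB code from Oliva and Torralba [2]; the parameter values below are the defaults commonly suggested with that code and are an assumption, not necessarily the ones used here:

function image_feats = get_gist_features(image_paths)
% one GIST descriptor per image, computed with the LMgist implementation of [2]
param.imageSize = [256 256];             % images are rescaled to this size
param.orientationsPerScale = [8 8 8 8];  % Gabor orientations per scale
param.numberBlocks = 4;                  % 4x4 spatial grid
param.fc_prefilt = 4;

num_images = length(image_paths);
gist_dim = sum(param.orientationsPerScale) * param.numberBlocks ^ 2;  % 512-D
image_feats = zeros(num_images, gist_dim);
for i = 1:num_images
    img = imread(image_paths{i});
    image_feats(i, :) = LMgist(img, '', param);
end
end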
The combination of bag of (SIFT + GIST) features and a linear SVM yields the highest accuracy, with per-category results shown below.
Category name | Accuracy | False positives (true label) | False negatives (wrong predicted label)
---|---|---|---
Bedroom | 0.520 | LivingRoom, LivingRoom | Office, LivingRoom
Coast | 0.770 | OpenCountry, Mountain | OpenCountry, OpenCountry
Forest | 0.880 | OpenCountry, Mountain | Mountain, Store
Highway | 0.870 | Street, InsideCity | Coast, Bedroom
Industrial | 0.570 | Kitchen, Store | Suburb, OpenCountry
InsideCity | 0.750 | Street, Industrial | Forest, OpenCountry
Kitchen | 0.450 | Bedroom, Bedroom | Store, Bedroom
LivingRoom | 0.500 | Bedroom, Kitchen | Bedroom, Store
Mountain | 0.780 | Forest, OpenCountry | TallBuilding, OpenCountry
Office | 0.810 | Bedroom, Kitchen | Kitchen, Bedroom
OpenCountry | 0.730 | Coast, Coast | Coast, Forest
Store | 0.540 | Industrial, InsideCity | Mountain, TallBuilding
Street | 0.820 | Highway, LivingRoom | Bedroom, InsideCity
Suburb | 0.900 | LivingRoom, Store | LivingRoom, Industrial
TallBuilding | 0.760 | Mountain, Industrial | Kitchen, Store
[1] Boiman, Oren, Eli Shechtman, and Michal Irani. "In defense of nearest-neighbor based image classification." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[2] Oliva, Aude, and Antonio Torralba. "Modeling the shape of the scene: A holistic representation of the spatial envelope." International Journal of Computer Vision 42(3), pp. 145-175, 2001.