In this project, I examine the task of scene recognition starting with very simple methods- tiny images and nearest neighbor classification and then move on to more advanced methods like bags of quantized local features and linear classifiers learned by support vector machines.
This project can hence be divided into 2 parts for representation:
And 2 parts for classification:
The "tiny image" representation (inspired by the work of Torralba, Fergus, and Freeman) is the simplest possible image representation. We simply resizes each image to 16*16 and make sure the images have zero mean and unit length (normalize them).
The matlab code to do the same is provided below:
image_feats = zeros(size(image_paths,1),16*16);
for imageIndex = 1:size(image_paths,1)
current_image = imread(char(image_paths(imageIndex)));
new_image = imresize(current_image, [16 16]);
new_image_single_row = reshape(new_image, [1 16*16]);
new_image_single_row = double(new_image_single_row);
new_image_single_row = new_image_single_row - mean(new_image_single_row);
new_image_single_row = new_image_single_row./norm(new_image_single_row);
image_feats(imageIndex,:) = new_image_single_row;
end
we first need to establish a vocabulary of visual words. We form this vocabulary by sampling SIFT features from our training set and then clustering them with kmeans. The number of clusters would be equal to the vocubulary size that we chose. To classify new SIFT feature we observe, we can figure out which class it belongs to as long as we save the centroids of our original clusters. These centroids are our visual word vocabulary. The code illustrates how we can build the vocabular in matlab
num_images = size(image_paths,1);
sampled_sift = zeros(128,1);
counter = 1;
for image_index = 1:num_images
cur_image = imread(char(image_paths(image_index)));
cur_image_single = im2single(cur_image);
[locs, sift_feats] = vl_dsift(cur_image_single, 'Step', 15, 'Fast');
num_curr_feats = size(sift_feats,2);
sampled_sift(:,counter:counter+num_curr_feats) = sift_feats(:,:);
counter = counter + num_curr_feats;
end
size(samples_sift)
[centers, assignments] = vl_kmeans(sampled_sift, vocab_size);
vocab = single(centers');
For each image, we sample SIFT descriptors and count how many SIFT descriptors fall into each cluster in our visual vocabulary. We do this by finding the nearest neighbor kmeans centroid for every SIFT feature. The following matlab code illustrates this very easily:
vocab_size = size(vocab, 1);
num_images = size(image_paths,1);
image_feats = zeros(num_images, vocab_size);
num_dists_to_consider = 15;
for image_index = 1:num_images
histogram = zeros(vocab_size,1);
z = image_index;
cur_image = imread(char(image_paths(image_index)));
cur_image_single = im2single(cur_image);
[locs, sift_feats] = vl_dsift(cur_image_single, 'Step', 15, 'Fast');
all_dist = (vl_alldist2(uint8(vocab'),uint8(sift_feats)));
[Y, sortedIndices] = sort(all_dist,1);
for j = 1:num_dists_to_consider
for feature_i = 1:size(all_dist,2)
vocab_to_vote = sortedIndices(j,feature_i);
histogram(vocab_to_vote) = histogram(vocab_to_vote)+1;
end
end
%normalize and store
image_feats(image_index, :) = (histogram/norm(histogram))';
end
end
This is a very popular classification technique. When tasked with classifying a test sample into a particular category, we simply find the "nearest" K training examples and assign the test case the label of the most frequently occuring class. We can also implement more advanced strategies where we weigh each class depending on how "far" the training example was, from our test sample.
The next classifier is built by training multiple 1-vs-all class linear SVMS to operate in the bag of SIFT feature space. Linear classifiers are one of the simplest learning models. The feature space is partitioned by a learned hyperplane and test cases are categorized based on which side of that hyperplane they fall on. We simply query each model for each test example and check which one is the most confident about its prediction. That model will determine the class for that example. The matlab code illustrates the same:
categories = unique(train_labels);
num_categories = length(categories);
num_training = size(train_image_feats, 1);
num_test = size(test_image_feats,1);
num_each_test = size(test_image_feats,2);
train_weights = zeros(num_categories, num_each_test);
train_offsets = zeros(num_categories, 1);
for category_index = 1 : num_categories
curr_labels = zeros(num_training,1);
curr_labels = curr_labels -1;
curr_labels(strcmp(char(categories(category_index)), train_labels)) = 1;
[W, B] = vl_svmtrain(train_image_feats', curr_labels, 0.0001);
train_weights(category_index,:) = W';
train_offsets(category_index) = B;
end
predicted_categories = cell(num_test,1);
for test_image = 1:num_test
maxScore = intmin('int8') ;
bestCategory = '';
for category_index = 1 : num_categories
cur_score = train_weights(category_index,:)*(test_image_feats(test_image,:)') + train_offsets(category_index);
if cur_score > maxScore
maxScore = cur_score;
bestCategory = char(categories(category_index));
end
end
predicted_categories{test_image} = bestCategory;
end
end
I resized images to 16*16 and set the K to 10. I also normalized the images and made them zero mean. Normalization improved the accuracy by around 3%. Extra credit results are discussed later.
I used K = 10 here as well. I set the step to 15 and used the Fast parameter while determining the SIFT features. I also normalized the histograms before storing them as features. These results are without any of the extra credit modifications.
Everything said earlier above bag of SIFT features still holds here. For the Linear SVM, I set the lambda to 0.0001 and trained a classifier for each class. This particular pipeline's results also includes the extra credit that is described in more detail at the end of the document.
Category name | Accuracy | Sample training images | Sample true positives | False positives with true label | False negatives with wrong predicted label | ||||
---|---|---|---|---|---|---|---|---|---|
Kitchen | 0.310 | Bedroom |
Store |
Mountain |
Bedroom |
||||
Store | 0.260 | Industrial |
InsideCity |
TallBuilding |
Highway |
||||
Bedroom | 0.430 | Office |
InsideCity |
Office |
Kitchen |
||||
LivingRoom | 0.080 | Kitchen |
Bedroom |
Industrial |
Office |
||||
Office | 0.970 | Bedroom |
InsideCity |
Bedroom |
Street |
||||
Industrial | 0.240 | LivingRoom |
LivingRoom |
InsideCity |
Mountain |
||||
Suburb | 0.950 | Store |
Mountain |
Street |
InsideCity |
||||
InsideCity | 0.510 | LivingRoom |
TallBuilding |
Industrial |
Suburb |
||||
TallBuilding | 0.700 | Bedroom |
Industrial |
Bedroom |
InsideCity |
||||
Street | 0.600 | Suburb |
LivingRoom |
LivingRoom |
Suburb |
||||
Highway | 0.720 | OpenCountry |
OpenCountry |
Suburb |
Mountain |
||||
OpenCountry | 0.500 | Mountain |
Coast |
Forest |
Mountain |
||||
Coast | 0.810 | OpenCountry |
OpenCountry |
OpenCountry |
OpenCountry |
||||
Mountain | 0.770 | Store |
Highway |
Suburb |
Suburb |
||||
Forest | 0.910 | OpenCountry |
Mountain |
Mountain |
Mountain |
||||
Category name | Accuracy | Sample training images | Sample true positives | False positives with true label | False negatives with wrong predicted label |
I tried to improve upon my best results above by trying these following approaches. I ran each of these independent of each other as I wanted to compare the improvements.
I modified my code so that it considers the distance to the centroid while building the histogram. Hence, each feature adds some weight proportional to how far it is from each centroid.
distance_of_vote = all_dist(j,feature_i)+1;
actual_score = min(0,1/distance_of_vote);
histogram(vocab_to_vote) = histogram(vocab_to_vote)+actual_score;
This modestly increased the accuracy by 2% over the existing result, to 23%.
I added the code required for performing a 10-fold cross-validation on the whole pipeline. It prints the mean accuracy in the end. Cross validation gives better results at times because of the change in the train and test sets between folds. The results for the first two folds for KNN and Tiny features were: 24.7% and 26%.
The initially reported results were for a vocab size of 200. Changing the vocab size to 50 resulted in 2% increase in accracy, 60.6%. Using a vocab size of 100 results in 59.7% and a vocab size of 150 resulted in 60.8%. I also tried a vocab size of 250, which resulted in an accuracy of 61.2%. Overall, all of them seemed to achieved better results than the initial size.
For chosing a value of lambda in SVM, the code tried different values of lambda and picks the one with the highest training accuracy. It then uses this lambda to build the classifiers. It picked the value of 0.0001 in my tests.
I also tried a similar approach for chosing the number of neighbors in KNN. It tries various values for K and choses the one that it finds the highest training accuracy for. It always picked 10 in my tests.
I attempted to improve the KNN classifier by using an approach similar to soft assignment. Instead of just picking the most frequent class, I assigned scores to each class depending on the distance and picked the one with the highest score. This resulted in the tiny images + KNN pipeline performing slightly better, with an accuracy of 23.3%.