The aim of this project was to approach scene recognition using different forms of image representation and different classifiers, such as the nearest-neighbour classifier and linear SVMs.
Tiny images representation - Each image was resized to 16x16 pixels and flattened; the resulting vector was then shifted to have zero mean. This representation serves as a baseline.
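As a rough illustration (in Python rather than the project's MATLAB, with a hypothetical `tiny_image_feature` helper), the tiny-image feature amounts to downsampling, flattening, and mean subtraction:

```python
def tiny_image_feature(img, size=16):
    """Downsample a grayscale image (a list of pixel rows) to size x size
    by nearest-neighbour sampling, flatten it, and shift to zero mean."""
    h, w = len(img), len(img[0])
    feat = [float(img[(i * h) // size][(j * w) // size])
            for i in range(size) for j in range(size)]
    mean = sum(feat) / len(feat)
    return [v - mean for v in feat]

# A constant 32x32 image becomes a 256-dimensional all-zero vector.
img = [[10.0] * 32 for _ in range(32)]
feat = tiny_image_feature(img)
print(len(feat), max(abs(v) for v in feat))  # 256 0.0
```

(A production version would typically also crop to a centre square and normalize to unit length; this sketch only shows the resize-and-zero-mean core.)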
Nearest Neighbor classifier - For each test image, the classifier computes the distance from its feature vector to every training feature vector, finds the index of the closest one, and assigns that training image's label to the test image. This requires no training step, and our implementation is deliberately simple in that it uses only a single nearest neighbour (1-NN). This method yielded about 20% accuracy in scene recognition.
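The 1-NN decision rule can be sketched as follows (a toy, brute-force Python version for illustration; the real implementation operates on the full image feature matrices):

```python
def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest_neighbor_classify(train_feats, train_labels, test_feats):
    """For each test feature, find the closest training feature and
    return that training example's label (1-NN: no training step)."""
    preds = []
    for t in test_feats:
        dists = [euclidean(t, f) for f in train_feats]
        preds.append(train_labels[dists.index(min(dists))])
    return preds

train = [[0.0, 0.0], [10.0, 10.0]]
labels = ["Coast", "Forest"]
print(nearest_neighbor_classify(train, labels, [[1.0, 0.5], [9.0, 9.5]]))
# ['Coast', 'Forest']
```

Because every test image is compared against every training image, the cost grows linearly with the training set size, which is one motivation for the k-d tree speedup used later in the pipeline.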
(The sample training images and sample true positives in the original results table are images and are not reproduced here; the category labels attached to the misclassified examples are listed below.)

Category name | Accuracy | False positives (true labels) | False negatives (predicted labels)
---|---|---|---
Kitchen | 0.110 | Bedroom, Mountain | InsideCity, TallBuilding
Store | 0.010 | Office, Kitchen | Coast, Bedroom
Bedroom | 0.090 | Mountain, Kitchen | Coast, Coast
LivingRoom | 0.080 | Kitchen, Bedroom | TallBuilding, InsideCity
Office | 0.200 | Kitchen, Bedroom | Coast, Forest
Industrial | 0.090 | Bedroom, TallBuilding | OpenCountry, Mountain
Suburb | 0.210 | Store, Mountain | Highway, Coast
InsideCity | 0.090 | OpenCountry, Kitchen | Coast, Coast
TallBuilding | 0.130 | Bedroom, Store | Coast, LivingRoom
Street | 0.380 | Suburb, Kitchen | Coast, TallBuilding
Highway | 0.660 | Mountain, Store | OpenCountry, Coast
OpenCountry | 0.330 | Coast, Forest | Highway, Coast
Coast | 0.310 | Forest, LivingRoom | Forest, Forest
Mountain | 0.120 | Store, Office | Coast, Coast
Forest | 0.210 | Kitchen, Store | Coast, TallBuilding
This implementation improved upon our baseline by using a Bag of SIFT representation of images. The first step is to build a vocabulary of visual words: local SIFT features are sampled from the training set and clustered with the k-means algorithm, and the resulting cluster centroids become the visual words. The number of clusters is the specified vocabulary size.
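A minimal sketch of this vocabulary-building step, written in plain Python with a simplistic deterministic initialization (standing in for the MATLAB k-means used in the project, not the actual implementation):

```python
def build_vocabulary(features, vocab_size, iters=20):
    """Cluster sampled local features with plain k-means; the final
    cluster centroids are the visual words of the vocabulary."""
    # Simplistic deterministic init: take the first vocab_size features.
    centroids = [list(f) for f in features[:vocab_size]]
    for _ in range(iters):
        # Assign every feature to its nearest centroid.
        clusters = [[] for _ in range(vocab_size)]
        for f in features:
            d2 = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in centroids]
            clusters[d2.index(min(d2))].append(f)
        # Move each centroid to the mean of its assigned features.
        for k, members in enumerate(clusters):
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

# Two well-separated blobs of toy 2-D "descriptors" yield two centroids,
# one near each blob.
feats = [[0.0, 0.1], [5.0, 5.1], [0.1, 0.0], [5.1, 5.0]]
vocab = build_vocabulary(feats, 2)
print(sorted(vocab))
```

Real SIFT descriptors are 128-dimensional and the feature sample is much larger, but the clustering logic is the same.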
We then represent each image as a 'bag of SIFT' by first densely sampling its SIFT descriptors (a small sampling step size provides the density), finding the vocabulary cluster each descriptor most closely corresponds to, and building a histogram of the visual-word matches for the image. The histogram is normalized, which ultimately gives us our feature representation.
To speed up computation in this section, MATLAB's knnsearch was used, which computes nearest neighbours internally via k-d tree queries.
This representation was a vast improvement on our baseline, with accuracy increasing to 53.5%.
% Build a normalized visual-word histogram for each image.
for i = 1:numImages
    img = imread(image_paths{i});
    word_hist = zeros(1, vocab_size);
    % Densely sample SIFT descriptors with a step size of 6 pixels.
    [locations, SIFT_features] = vl_dsift(single(img), 'step', 6);
    % Find the nearest vocabulary word (centroid) for each descriptor.
    nearest = knnsearch(vocab', SIFT_features');
    for j = 1:size(nearest, 1)
        vocabMatch = nearest(j); % index of the nearest visual word
        word_hist(vocabMatch) = word_hist(vocabMatch) + 1;
    end
    % L2-normalize the histogram to form the image feature.
    image_feats(i, :) = word_hist / norm(word_hist);
end
Category name | Accuracy | False positives (true labels) | False negatives (predicted labels)
---|---|---|---
Kitchen | 0.350 | Industrial, Bedroom | InsideCity, Office
Store | 0.410 | LivingRoom, InsideCity | Mountain, Forest
Bedroom | 0.320 | Office, TallBuilding | Street, Suburb
LivingRoom | 0.380 | Bedroom, Bedroom | Bedroom, Kitchen
Office | 0.750 | LivingRoom, LivingRoom | Bedroom, Bedroom
Industrial | 0.360 | TallBuilding, TallBuilding | Office, InsideCity
Suburb | 0.850 | TallBuilding, Industrial | Coast, Highway
InsideCity | 0.380 | TallBuilding, Suburb | TallBuilding, Store
TallBuilding | 0.400 | Mountain, Industrial | InsideCity, Industrial
Street | 0.570 | TallBuilding, Store | Bedroom, TallBuilding
Highway | 0.750 | Coast, Street | Coast, OpenCountry
OpenCountry | 0.420 | Coast, Mountain | TallBuilding, Store
Coast | 0.560 | OpenCountry, TallBuilding | Highway, OpenCountry
Mountain | 0.600 | OpenCountry, Industrial | Bedroom, OpenCountry
Forest | 0.920 | OpenCountry, Mountain | Suburb, Suburb
This part involved training a 1-vs-all linear SVM for each of the 15 scene categories on the bag-of-SIFT features. This classifier has an edge over nearest neighbour in that it is less influenced by frequent occurrences of some visual words, and can better leverage what it learns to correctly downweight 'irrelevant' dimensions. Fifteen binary 1-vs-all SVMs were trained, and each test image was then scored by all of them; the SVM with the maximum confidence provided the label assigned to that test image. Accuracy improved once more, reaching 66.5%. The entire pipeline took about 9 minutes to run, using the feature-approximation version of vl_dsift() (its 'fast' option) while building the vocabulary. Different values of the lambda regularization parameter were tried while training the SVMs, and 0.0001 provided good results.
categories = unique(train_labels);
num_categories = length(categories);
confidences = zeros(num_categories, size(test_image_feats, 1));
for i = 1:num_categories
    % Build +1/-1 labels: +1 for images of this category, -1 otherwise.
    matching_indices = double(strcmp(categories(i), train_labels));
    matching_indices(matching_indices == 0) = -1;
    % Train a binary 1-vs-all linear SVM with lambda = 0.0001.
    [W, B] = vl_svmtrain(train_image_feats', matching_indices, 0.0001);
    % Score every test image under this category's SVM.
    confidences(i, :) = W' * test_image_feats' + B;
end
% Assign each test image the category whose SVM is most confident.
[~, indices] = max(confidences);
predicted_categories = categories(indices);
Category name | Accuracy | False positives (true labels) | False negatives (predicted labels)
---|---|---|---
Kitchen | 0.660 | Bedroom, Office | Mountain, TallBuilding
Store | 0.440 | Kitchen, Industrial | InsideCity, Mountain
Bedroom | 0.480 | OpenCountry, LivingRoom | LivingRoom, Suburb
LivingRoom | 0.210 | Street, Store | Office, Bedroom
Office | 0.870 | LivingRoom, LivingRoom | Store, LivingRoom
Industrial | 0.540 | Bedroom, TallBuilding | Office, Store
Suburb | 0.950 | Mountain, Bedroom | Street, Street
InsideCity | 0.520 | Street, Store | TallBuilding, TallBuilding
TallBuilding | 0.760 | Industrial, InsideCity | Street, Store
Street | 0.720 | InsideCity, LivingRoom | InsideCity, LivingRoom
Highway | 0.800 | Coast, Store | Coast, Coast
OpenCountry | 0.460 | Coast, Highway | Coast, Forest
Coast | 0.790 | Highway, OpenCountry | OpenCountry, Industrial
Mountain | 0.820 | Store, Store | Coast, Office
Forest | 0.950 | OpenCountry, OpenCountry | Mountain, Mountain
Through the three parts of this project, we implemented progressively more sophisticated versions of the pipeline and saw how better image representations and different kinds of classifiers affect the final accuracy. Parameters such as the step size used when densely sampling SIFT features greatly affected the runtime of the entire pipeline; the step size chosen above allowed for reasonably good accuracy while not compromising much on speed.
Extra credit experimentation: assessing the impact of vocabulary size on accuracy. The following graph plots scene recognition accuracy against the size of the vocabulary used.
The trend in this graph is clear: larger vocabularies yield greater accuracy. The reason is that with more visual words to compare against, each descriptor finds a better/closer matching vocabulary word when we compute features. There is a tradeoff with time, though, as larger vocabularies took longer to compute (176 seconds at size 200 versus 34 seconds at size 25); this is attributable to k-means having to create, and assign words to, a larger number of clusters.