In this project we perform scene recognition on a set of test images. We implement several feature extraction methods and several classification methods. For feature extraction, tiny images and bag of SIFT features are implemented. For classification, we implement nearest neighbor and a 1-vs-all linear SVM. The flow is simple: we extract features from every image, training and test, and then classify each test image by comparing its features against those of the training images.
Tiny images is a simple but weak way to extract features. It resizes the image to 16x16 and then flattens that matrix into a vector 256 elements long. This vector is the 'feature' that represents the image. Because so much detail is discarded, the method does not produce very accurate results.
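As a rough illustration of the resize-and-flatten step, here is a minimal sketch. The synthetic input and the variable names are assumptions made for the example, and it relies on `imresize` from the Image Processing Toolbox being available.

```matlab
% Stand-in for a grayscale image with values in [0, 1] (hypothetical input)
image = rand(240, 320);
% Shrink to 16x16; the aspect ratio is not preserved
tinyImage = imresize(image, [16 16]);
% Flatten the 16x16 matrix into a 1x256 row vector
tinyFeature = reshape(tinyImage, 1, []);
```

Stacking one such row per image gives a feature matrix with one row per image and 256 columns.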
Nearest neighbor is the simplest type of classifier. In this classifier, we compute the distance from each test image to every training image. For every test image we find the closest training image, and that training image's label becomes the label for the test image.
To extend this to K-NN, each of the closest neighbors votes on what the label should be. The parameter k determines how many of the closest neighbors get a vote. The label with the most votes among the k closest neighbors is assigned to the test image.
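The voting scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the toy features, labels, and variable names are hypothetical, and the L1 distance is used because it is the metric listed in the results table.

```matlab
% Toy data: each row is one feature vector (hypothetical values)
train_feats  = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];
train_labels = {'Forest'; 'Forest'; 'Forest'; 'Street'; 'Street'; 'Street'};
test_feats   = [0.2 0.1; 5.5 5.2];
k = 3;

% Map label strings to integer ids so that votes can be counted
[categories, ~, trainIds] = unique(train_labels);

predicted_labels = cell(size(test_feats, 1), 1);
for t = 1:size(test_feats, 1)
    % L1 distance from this test feature to every training feature
    diffs = train_feats - repmat(test_feats(t, :), size(train_feats, 1), 1);
    dists = sum(abs(diffs), 2);
    % Ids of the k closest training features
    [~, order] = sort(dists);
    nearestIds = trainIds(order(1:k));
    % Majority vote; mode breaks ties toward the smallest id
    predicted_labels{t} = categories{mode(nearestIds)};
end
```

Setting `k = 1` reduces this to the plain nearest neighbor classifier.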
In order to extract features using bag of SIFT, we must first create a vocabulary of image "words" from our training images. To do this, we sample a certain number of SIFT features across all the training images. I chose to take an equal number of features from every training image: for every image, I compute its set of SIFT features, then randomly sample from that set. Once I've gathered tens of thousands of SIFT features, I cluster them with K-Means using VLFeat's vl_kmeans function, and the cluster centers form the vocabulary.
```matlab
% Read each image, get SIFT features, sample a random set
sampledSiftFeatures = [];
for i = 1:size(image_paths, 1)
    image = im2single(imread(image_paths{i}));
    [~, siftFeatures] = vl_dsift(image, 'step', siftStepSize);
    % Sample along dimension 2, since each column is one descriptor
    sampledSiftFeatures = [sampledSiftFeatures datasample(siftFeatures, siftFeaturesPerImage, 2)];
end
% Generate vocabulary; each column of vocab is one cluster center
[vocab, ~] = vl_kmeans(single(sampledSiftFeatures), vocab_size);
```
To extract features, we must describe each image as a bag of words using our pre-built vocabulary. For every image, we compute its SIFT features. We then find the distance from every extracted SIFT feature to every word in the vocabulary. A word and a SIFT feature have the same dimensions, since we built the vocabulary from SIFT features. Each feature is then assigned to its closest word. We build a histogram counting how often each word occurs, and this histogram becomes the feature descriptor for the image.
```matlab
% Generate SIFT features
[~, siftFeatures] = vl_dsift(image, 'step', siftStepSize);
% siftDistances(i,j) = local feature i distance to vocab word j
siftDistances = vl_alldist2(single(siftFeatures), vocab);
% Get words for each SIFT feature
[~, indecies] = min(siftDistances, [], 2);
% Create histogram of words based on local features
siftHistogram = zeros(1, vocab_size);
for j = 1:size(indecies, 1)
    wordIndex = indecies(j, 1);
    siftHistogram(1, wordIndex) = siftHistogram(1, wordIndex) + 1;
end
```
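The counting loop can also be written without an explicit loop. The sketch below uses MATLAB's `accumarray` on hypothetical word assignments. It is offered as an equivalent formulation, not as what the project code does.

```matlab
% Hypothetical word assignments for 6 local features and a 4-word vocabulary
vocab_size = 4;
indecies = [2; 4; 2; 1; 2; 4];
% Add 1 to the bin of each assigned word, then transpose to a row vector
siftHistogram = accumarray(indecies, 1, [vocab_size 1])';
% siftHistogram is [1 3 0 2]
```

Passing the size argument keeps the histogram at `vocab_size` bins even when the last words are never assigned.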
The final part was to create a linear Support Vector Machine (SVM) classifier. Since we have 15 different categories, our method uses a 1-vs-all approach. For every category, we train an SVM and then evaluate each test image with it. At the end we have a matrix holding one value for each test image under each SVM. Each test image is assigned to the category whose SVM produced the most positive value.
```matlab
% Find binary results with train labels, change 0s to -1s
matchingIndices = double(strcmp(currentCategory, train_labels));
matchingIndices(matchingIndices == 0) = -1;
% Get classifier for category
[W, B] = vl_svmtrain(train_image_feats', matchingIndices, LAMBDA);
% Evaluate each test feature with this classifier
distances(:, i) = W' * test_image_feats' + B;
```
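Once every category's classifier has filled its column, the final label is the column with the largest value in each row. A minimal sketch of that last step, using a hypothetical three-category subset and made-up scores:

```matlab
categories = {'Kitchen', 'Store', 'Forest'};  % hypothetical subset of the 15
% distances(t, c) = value for test image t from the classifier for category c
distances = [ 0.8 -0.3 -1.2;
             -0.5 -0.1 -0.9;
             -1.4 -0.7  0.6];
% Column index of the most positive value in each row
[~, bestIndex] = max(distances, [], 2);
predicted_categories = categories(bestIndex);
```

The second row shows why "most positive" matters: every classifier rejects that image, yet it is still assigned to the least negative one, `Store`.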
Feature | Classifier | Feature Params | Classify Params | Accuracy |
---|---|---|---|---|
Tiny Images | K-NN | 16x16 | k=10, L1 | 0.220 |
Bag of SIFT | K-NN | step size=8, fast enabled, vocab size=400, vocab samples=50,000, vocab step size=1 | k=10, L1 | 0.515 |
Bag of SIFT | SVM | step size=8, fast enabled, vocab size=400, vocab samples=50,000, vocab step size=1 | Lambda=0.001 | 0.630 |

Feature | Classifier | Feature Params | Classify Params | Accuracy |
---|---|---|---|---|
Bag of SIFT | SVM | step size=5, vocab size=400, vocab samples=50,000, vocab step size=1 | Lambda=0.00001 | 0.691 |
Bag of SIFT | SVM | step size=5, vocab size=10, vocab samples=50,000, vocab step size=2 | Lambda=0.001 | 0.38867 |
Bag of SIFT | SVM | step size=5, vocab size=25, vocab samples=50,000, vocab step size=2 | Lambda=0.001 | 0.52667 |
Bag of SIFT | SVM | step size=5, vocab size=50, vocab samples=50,000, vocab step size=2 | Lambda=0.001 | 0.57867 |
Bag of SIFT | SVM | step size=5, vocab size=100, vocab samples=50,000, vocab step size=2 | Lambda=0.001 | 0.58733 |
Bag of SIFT | SVM | step size=5, vocab size=250, vocab samples=50,000, vocab step size=2 | Lambda=0.001 | 0.62133 |
Bag of SIFT | SVM | step size=5, vocab size=500, vocab samples=50,000, vocab step size=2 | Lambda=0.001 | 0.65333 |
Bag of SIFT | SVM | step size=5, vocab size=1000, vocab samples=50,000, vocab step size=2 | Lambda=0.001 | 0.64867 |
Bag of SIFT | SVM | step size=5, vocab size=2500, vocab samples=50,000, vocab step size=2 | Lambda=0.001 | 0.65467 |
Bag of SIFT | SVM | step size=5, vocab size=5000, vocab samples=50,000, vocab step size=2 | Lambda=0.001 | 0.64333 |
I also implemented a file named "multi_vocab.m" that creates vocabularies of varying sizes and then runs bag of SIFT and SVM with each of them. Accuracy increases with vocabulary size but levels off at about 0.65 once the vocabulary reaches roughly 500 to 1000 words. A graph of the response is shown below, as well as a gif of all the confusion matrices generated.
Accuracy (mean of diagonal of confusion matrix) is 0.691
Category name | Accuracy | False positives with true label | False negatives with wrong predicted label |
---|---|---|---|
Kitchen | 0.650 | Bedroom, Office | Store, Suburb |
Store | 0.530 | Kitchen, Street | InsideCity, Industrial |
Bedroom | 0.470 | LivingRoom, Store | LivingRoom, LivingRoom |
LivingRoom | 0.290 | Kitchen, Street | Bedroom, Office |
Office | 0.960 | Bedroom, LivingRoom | Bedroom, Kitchen |
Industrial | 0.690 | Bedroom, LivingRoom | TallBuilding, Highway |
Suburb | 1.000 | Mountain, Bedroom | |
InsideCity | 0.610 | Highway, Street | Store, Industrial |
TallBuilding | 0.720 | Industrial, InsideCity | InsideCity, Industrial |
Street | 0.540 | Highway, TallBuilding | InsideCity, TallBuilding |
Highway | 0.830 | Street, Coast | Suburb, Coast |
OpenCountry | 0.510 | Mountain, Mountain | Suburb, Forest |
Coast | 0.770 | OpenCountry, OpenCountry | OpenCountry, OpenCountry |
Mountain | 0.860 | Highway, OpenCountry | Coast, Forest |
Forest | 0.930 | OpenCountry, OpenCountry | Street, Mountain |
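The per-category accuracies and the overall accuracy (the mean of the diagonal of the confusion matrix) can be computed along the following lines. This is a sketch with hypothetical labels for three categories, not the project's evaluation code.

```matlab
% Hypothetical ground truth and predictions as category ids (1..3)
trueIds      = [1 1 1 1 2 2 2 2 3 3 3 3];
predictedIds = [1 1 1 2 2 2 3 3 3 3 3 3];
num_categories = 3;

% confusion(r, c) = number of images of category r predicted as category c
confusion = accumarray([trueIds' predictedIds'], 1, [num_categories num_categories]);
% Normalize each row so the diagonal holds the per-category accuracy
rowSums = sum(confusion, 2);
confusion = confusion ./ repmat(rowSums, 1, num_categories);
perCategoryAccuracy = diag(confusion);
% Overall accuracy is the mean of the diagonal
accuracy = mean(perCategoryAccuracy);
```

With these toy labels the per-category accuracies are 0.75, 0.5, and 1.0, so `accuracy` is 0.75.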