I built the tiny images representation as a baseline for comparison with the bag of SIFT representation. It is not a particularly good representation, because it discards all of the high-frequency image content and is not especially invariant to spatial or brightness shifts.
My code:
num_images = size(image_paths,1);
d = 16;  % tiny images are d x d pixels
image_feats = zeros(num_images, d*d);
for i = 1:num_images
    img = imread(image_paths{i});
    % cast to double first: integer arithmetic would clip negative values to 0
    X = double(imresize(img, [d d]));
    % normalizing to zero mean and unit variance
    % X = X/sum(X(:));
    X = X - mean2(X);
    X = X / std2(X);
    image_feats(i,:) = X(:)';
end
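To see what the zero mean, unit variance step buys, here is a minimal sketch in base MATLAB. The `normalize_feat` helper and the `magic(16)` stand-in patch are illustrative assumptions, not part of the project code. It checks that a global brightness offset or contrast scaling leaves the normalized feature unchanged; spatial shifts are still not handled.

```matlab
% Hypothetical helper: flatten to a row vector, then zero mean and unit variance.
normalize_feat = @(x) (double(x(:))' - mean(double(x(:)))) / std(double(x(:)));

patch = magic(16);                      % stand-in for a 16x16 resized image
f1 = normalize_feat(patch);
f2 = normalize_feat(2 * patch + 10);    % same patch, brighter and with more contrast

fprintf('mean = %.2g, std = %.2g\n', mean(f1), std(f1));
fprintf('max difference after brightness/contrast change = %.2g\n', max(abs(f1 - f2)));
```

The two features agree up to floating-point error because the offset cancels in the mean subtraction and the scale cancels in the division by the standard deviation.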
I then built the 1-NN classifier, which finds the nearest training example for every test case and assigns the test case the label of that training example. As the dimensionality increases, the classifier cannot learn which dimensions are relevant and which are irrelevant. I achieved higher accuracy using the CHI2 distance metric than with the default L2 (Euclidean) metric, a difference of about 3% in accuracy.
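My code:
% vl_alldist2 expects one example per column
train = train_image_feats';
test = test_image_feats';
%D = vl_alldist2(train, test);  % default L2 distance
D = vl_alldist2(train, test, 'CHI2');
% D(i,j) is the distance from training example i to test example j
[~, min_i] = min(D, [], 1);
predicted_categories = train_labels(min_i);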
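The same nearest-neighbor rule can be sketched without VLFeat. This is a toy illustration only: the histograms and labels are made up, and the chi-squared distance is written in one common form, sum((x - y).^2 ./ (x + y)), which I am assuming here rather than quoting from VLFeat.

```matlab
% Toy histograms, one example per row (hypothetical data).
train_feats  = [0.7 0.2 0.1;
                0.1 0.8 0.1;
                0.2 0.2 0.6];
train_labels = {'forest'; 'kitchen'; 'street'};
test_feats   = [0.6 0.3 0.1;
                0.1 0.3 0.6];

num_test  = size(test_feats, 1);
predicted = cell(num_test, 1);
for j = 1:num_test
    % Chi-squared distance from test example j to every training example.
    delta = bsxfun(@minus, train_feats, test_feats(j, :));
    total = bsxfun(@plus,  train_feats, test_feats(j, :));
    d = sum(delta.^2 ./ (total + eps), 2);   % eps guards against 0/0 in empty bins
    [~, nearest] = min(d);
    predicted{j} = train_labels{nearest};
end
disp(predicted);
```

Each test histogram receives the label of its single closest training histogram; nothing is learned, so every dimension counts equally toward the distance.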
I achieved an accuracy of only 21.3%. Please look at Test 1 for more details.
Getting paths and labels for all train and test data
Using tiny image representation for images
Elapsed time is 5.235921 seconds.
Using nearest neighbor classifier to predict test set categories
Elapsed time is 2.643928 seconds.
Creating results_webpage/index.html, thumbnails, and confusion matrix
Accuracy (mean of diagonal of confusion matrix) is 0.213
I built the vocabulary of visual words by sampling about 900,000 SIFT features from the training set and then clustering them with k-means. For each image I randomly picked 600 features (thus 600*1500 = 900,000). The cluster centroids are the visual word vocabulary. I tried two values of k: 50 and 200 (performance reported below). Since this was taking about 20 minutes with vl_dsift and a step size of 3, I used the 'fast' parameter and a step size of 5 for a faster run time.
My code:
num_features = 600;  % descriptors sampled per image
N = size(image_paths, 1);
combined_features = [];
for i = 1:N
    img = im2single(imread(image_paths{i}));
    [~, SIFT_features] = vl_dsift(img, 'step', 5, 'fast');
    % pick num_features descriptors at random (or all of them if fewer exist)
    perm = randperm(size(SIFT_features, 2));
    rand_i = perm(1:min(num_features, numel(perm)));
    % add them to a big set of features
    combined_features = horzcat(combined_features, SIFT_features(:, rand_i));
end
[centers, ~] = vl_kmeans(im2single(combined_features), vocab_size);
vocab = centers';  % one visual word per row
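For intuition about the clustering step, here is a minimal sketch of plain Lloyd's k-means on toy 2-D points. It is not VLFeat's implementation: the data, the fixed initialization, and the fixed iteration count are assumptions made for illustration, and empty clusters are not handled.

```matlab
% Toy 2-D "descriptors", one per column (hypothetical data): two clear groups.
X = [0.0 0.2 0.1 5.0 5.2 5.1;
     0.0 0.1 0.2 5.0 5.1 5.2];
k = 2;
centers = X(:, [1 4]);            % deterministic initialization for the sketch

for iter = 1:10
    % Assignment step: squared distance from every point to every center.
    D = zeros(k, size(X, 2));
    for c = 1:k
        D(c, :) = sum(bsxfun(@minus, X, centers(:, c)).^2, 1);
    end
    [~, assign] = min(D, [], 1);
    % Update step: each center moves to the mean of its assigned points.
    for c = 1:k
        centers(:, c) = mean(X(:, assign == c), 2);
    end
end
vocab = centers';                 % one "visual word" per row
disp(vocab);
```

With real SIFT descriptors the points are 128-dimensional and k is the vocabulary size, but the two alternating steps are the same.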
Next, for every image, I counted how many SIFT descriptors fall into each cluster of the visual word vocabulary (by nearest centroid) and built a histogram. I also normalized the histogram so that image size does not dramatically change the magnitude of the bag of features.
My code:
load('vocab.mat')
nbins = size(vocab,1);
N = size(image_paths,1);
image_feats = zeros(N,nbins);
tic
for i = 1:N
    img = im2single(imread(image_paths{i}));
    [~, SIFT_features] = vl_dsift(img, 'step', 5, 'fast');
    % distance from every visual word to every descriptor
    D = vl_alldist2(vocab', im2single(SIFT_features));
    [~, min_i] = min(D, [], 1);  % nearest word for each descriptor
    num_features = size(min_i, 2);
    % bin centers 1:nbins keep bin k aligned with word k for every image
    image_feats(i, :) = hist(min_i, 1:nbins) / num_features;
end
toc
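One detail worth checking in the histogram step is that bin k always corresponds to visual word k, even when some words never occur in an image. A minimal sketch with made-up assignments (the `min_i` values and the vocabulary size of 6 are hypothetical), using `accumarray` from base MATLAB:

```matlab
nbins = 6;                       % hypothetical vocabulary size
min_i = [2 2 3 5 5 5];           % hypothetical nearest-word index per descriptor

% Count occurrences of each word index; unused words get an explicit zero.
counts = accumarray(min_i(:), 1, [nbins 1])';

% Normalize so the histogram sums to 1 regardless of how many descriptors there are.
bag = counts / numel(min_i);
disp(bag);
```

Words 1, 4 and 6 never occur here, yet they still occupy their own bins, so histograms from different images stay comparable entry by entry.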
I achieved an accuracy of 54.1% for k=200 and 49.7% for k=50. Please look at Test 2 (k=200) and Test 2 (k=50) for more details.
k = 200, Accuracy = 54.1% | k = 50, Accuracy = 49.7%
Getting paths and labels for all train and test data
Using bag of sift representation for images
building vocab
Elapsed time is 126.566850 seconds.
getting sift bag
Elapsed time is 99.041953 seconds.
Elapsed time is 99.596549 seconds.
Using nearest neighbor classifier to predict test set categories
Elapsed time is 2.149116 seconds.
Creating results_webpage/index.html, thumbnails, and confusion matrix
Accuracy (mean of diagonal of confusion matrix) is 0.541
I trained 15 binary 1-vs-all SVMs. 1-vs-all means that each classifier is trained to recognize 'forest' vs. 'non-forest', 'kitchen' vs. 'non-kitchen', etc. All 15 classifiers are evaluated on each test case, and the classifier that is most confidently positive "wins". For lambda I tried several values, such as 0.1 and 0.001, and ultimately went ahead with 0.000001 (highest observed accuracy). Here again, I've reported results for k=50 and k=200 below.
My code:
categories = unique(train_labels);
num_categories = length(categories);
LAMBDA = 0.000001;
test_score = zeros(num_categories, size(test_image_feats, 1));
for i = 1:num_categories
    % +1 for images of category i, -1 for all others
    labels = 2 * double(strcmp(categories(i), train_labels)) - 1;
    [W, B] = vl_svmtrain(train_image_feats', labels, LAMBDA);
    % row i holds the confidence of classifier i on every test image
    test_score(i, :) = W' * test_image_feats' + B;
end
% the most confidently positive classifier wins
[~, ind] = max(test_score, [], 1);
predicted_categories = categories(ind');
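The "most confidently positive classifier wins" rule can be shown on its own, without training anything. In this sketch the weights `W`, biases `B`, categories, and test histograms are all made up for illustration; in the real pipeline they would come from the trained SVMs.

```matlab
categories = {'forest'; 'kitchen'; 'street'};

% Hypothetical linear classifiers: column c of W and entry c of B belong to category c.
W = [ 1.0 -0.5 -0.5;
     -0.5  1.0 -0.5;
     -0.5 -0.5  1.0];
B = [0; 0; 0];

test_feats = [0.7 0.2 0.1;
              0.1 0.3 0.6];      % one test example per row

% scores(c, j) is the confidence of classifier c on test example j.
scores = bsxfun(@plus, W' * test_feats', B);

% For each test example, pick the category whose classifier scores highest.
[~, winner] = max(scores, [], 1);
predicted = categories(winner);
disp(predicted);
```

Only the ordering of the scores within a column matters, so the classifiers do not need calibrated outputs, but they do need to be on comparable scales.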
I achieved an accuracy of 67.3% for k=200 and 66.3% for k=50. Please look at Test 3 (k=200) and Test 3 (k=50) for more details.
k = 200, Accuracy = 67.3% | k = 50, Accuracy = 66.3%
Getting paths and labels for all train and test data
Using bag of sift representation for images
No existing visual word vocabulary found. Computing one from training images
Elapsed time is 114.909931 seconds.
getting sift bag
Elapsed time is 102.436123 seconds.
Elapsed time is 103.10002 seconds.
done sift bag
Using support vector machine classifier to predict test set categories
Elapsed time is 7.491537 seconds.
Creating results_webpage/index.html, thumbnails, and confusion matrix
Accuracy (mean of diagonal of confusion matrix) is 0.673
I attempted the following, with results shown below:
I achieved an accuracy of 72% for SVM and only 58.5% for Nearest Neighbour. Please look at GIST and SVM and GIST and NN for more details.
SVM, k = 200, Accuracy = 72% | 1-NN, k = 200, Accuracy = 58.5%
I achieved an accuracy of 72.6%. Please look at Spatial Pyramid for more details.
SVM, k = 200, Accuracy = 72.6%