This report analyzes various scene recognition techniques. In all the techniques implemented as part of the assignment, being conscious of the free parameters and understanding the impact of changing them was crucial. Most of the report highlights how accuracy varies as the free parameters change.
Tiny images confusion matrix. Accuracy 0.209
Tiny images have been implemented as the baseline scene recognition method. A tiny image is essentially a thumbnail: the center square of the image is cropped out and resized to a 16x16 patch. Creating tiny images is a lossy operation because high-frequency content is lost. Zero-centering and normalizing each tiny image raises accuracy only to 0.209, with K = 1 for the KNN classifier.
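A minimal sketch of this baseline is given below; the function signature and the exact cropping arithmetic are my assumptions rather than the assignment's verbatim code.
function image_feats = get_tiny_images(image_paths)
D = 16; % side length of the tiny image
num_images = size(image_paths, 1);
image_feats = zeros(num_images, D * D);
for idx = 1:num_images
    image = im2single(imread(image_paths{idx}));
    if size(image, 3) == 3
        image = rgb2gray(image); % guard; the 15-scene images are grayscale
    end
    % Crop the central square, then resize to a 16x16 patch.
    [h, w] = size(image);
    s = min(h, w);
    r0 = floor((h - s) / 2) + 1;
    c0 = floor((w - s) / 2) + 1;
    patch = imresize(image(r0:r0 + s - 1, c0:c0 + s - 1), [D D]);
    % Zero-mean and unit-length normalization.
    feat = reshape(patch, 1, []);
    feat = feat - mean(feat);
    image_feats(idx, :) = feat / (norm(feat) + eps);
end
end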
Bag of words is a methodology borrowed from text analysis, with documents replaced by images and words replaced by features. For this section, the features used are 128-dimensional SIFT features. The main steps of bag of words are:
- Build a vocabulary by clustering SIFT features from the training images with k-means; the cluster centers become the visual words.
- Encode each training and test image as a normalized histogram over the vocabulary.
- Classify the encoded images.
The vocabulary is constructed using the training data set. If the training data is representative of the entire data set, the vocabulary should contain good cluster centers for the test data to match against. A sketch of the vocabulary construction follows the parameter list below.
% Free parameters for tweaking
STEP_SIZE = 10 % Step size for the VLFeat dense SIFT sampling.
% Retained at 10 across all vocabulary sizes since it gives reasonable performance and builds the vocabulary in reasonable time.
VOCAB_SIZE = 500 % Size of the vocabulary to be built.
% Not strictly necessary for evaluating bag of words with KNN alone, but used for the analysis in the extra credit section.
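A minimal sketch of the vocabulary construction with these parameters is shown below. It uses the standard VLFeat vl_dsift and vl_kmeans calls; sampling every descriptor from every training image and the vocab.mat file name are my assumptions.
STEP_SIZE = 10;
VOCAB_SIZE = 500;
data = [];
for idx = 1:size(image_paths, 1)
    image = im2single(imread(image_paths{idx}));
    [~, SIFT_features] = vl_dsift(image, 'fast', 'step', STEP_SIZE);
    data = [data SIFT_features]; %#ok<AGROW>
end
% vl_kmeans takes one descriptor per column and returns one center per
% column, so the result is transposed before saving.
centers = vl_kmeans(single(data), VOCAB_SIZE);
vocab = centers'; % VOCAB_SIZE x 128
save('vocab.mat', 'vocab');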
Once the vocabulary has been built, the images must be made comparable inside a classifier. To achieve this, each image is represented as an encoded feature vector. Feature encoding is a simple process: find the nearest cluster centroid in the vocabulary for each SIFT feature and build a histogram from the features' votes for the centroids. The histogram is then normalized. Once both testing and training images have been encoded, they can be classified. A sketch of the encoding follows the parameter list below.
% Free parameters for tweaking
STEP_SIZE = 5 % After multiple tweaks, 5 yields reasonable results. This parameter has not been varied greatly.
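A minimal sketch of this hard-assignment encoding follows; knnsearch comes from the Statistics and Machine Learning Toolbox, and loading the vocabulary from vocab.mat is an assumption.
load('vocab.mat'); % vocab is VOCAB_SIZE x 128
STEP_SIZE = 5;
num_images = size(image_paths, 1);
image_feats = zeros(num_images, size(vocab, 1));
for idx = 1:num_images
    image = im2single(imread(image_paths{idx}));
    [~, SIFT_features] = vl_dsift(image, 'fast', 'step', STEP_SIZE);
    % Each feature votes for its single nearest cluster center.
    nearest = knnsearch(vocab, single(SIFT_features'));
    histogram = accumarray(nearest, 1, [size(vocab, 1) 1])';
    % Normalize so that image size does not skew the comparison.
    image_feats(idx, :) = histogram / (norm(histogram) + eps);
end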
The analysis for this part was done with vocabulary sizes ranging from 20 to 500. The best accuracy found was 0.529 for a vocabulary size of 500; the same configuration without normalization gave an accuracy of just 0.505. While tiny images worked best with K = 1 for KNN, bag of SIFT gives its best performance with K = 5. This shows that K is an important free parameter that must be tuned to balance overfitting against generalization. K also varies with the vocabulary size, since the vocabulary size determines the density of the clustering and hence the centroids. Other parameters that had to be tweaked were the VLFeat dense SIFT parameters, namely the step size and the bin size. The step size was 10 while building the vocabulary and 5 while generating the bag of SIFT features. The bin sizes tested were 8, 16, and 32; the results below are for bin size 16, and a sketch of the KNN vote follows the table.
K value | Accuracy |
---|---|
1 | 0.418 |
2 | 0.507 |
4 | 0.515 |
5 | 0.529 |
10 | 0.444 |
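For reference, a minimal sketch of the KNN vote behind these numbers; the variable names are illustrative, and train_labels is assumed to be a cell array of category strings.
K = 5; % best value from the table above
[idx_array, ~] = knnsearch(train_image_feats, test_image_feats, 'K', K);
num_test = size(test_image_feats, 1);
predicted_categories = cell(num_test, 1);
for i = 1:num_test
    % Majority vote among the K nearest training images.
    neighbor_labels = train_labels(idx_array(i, :));
    [unique_labels, ~, label_ids] = unique(neighbor_labels);
    predicted_categories{i} = unique_labels{mode(label_ids)};
end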
Bag of SIFT with KNN results webpage
Best bag of SIFT, accuracy 0.529
The linear SVM produces far better results than the KNN classifier. As a one-vs-all classifier, the linear SVM tries to learn the hyperplane parameters that separate one category from all the other categories. Apart from the vocabulary size and the SIFT free parameters that apply to every method mentioned so far, one of the most important parameters for the linear SVM is lambda. It was tweaked across several powers of 10; the best accuracy of 0.699 was reached at lambda = 0.00001, as tabulated below, and a sketch of the one-vs-all training loop follows the table.
Results page can be viewed here
Most reliable SVM (without likely overfitting): accuracy 0.655
% Free parameters for tweaking
LAMBDA = 0.001 % Influences regularization and thereby overfitting or underfitting.
Lambda | Accuracy |
---|---|
10 | 0.279 |
1 | 0.279 |
0.1 | 0.068 |
0.01 | 0.068 |
0.001 | 0.655 |
0.0001 | 0.669 |
0.00001 | 0.699 |
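A minimal sketch of the one-vs-all training loop with VLFeat's vl_svmtrain; the variable names are illustrative.
LAMBDA = 0.00001;
categories = unique(train_labels);
num_categories = numel(categories);
scores = zeros(num_categories, size(test_image_feats, 1));
for c = 1:num_categories
    % +1 for images of this category, -1 for all the others.
    binary_labels = 2 * double(strcmp(train_labels, categories{c})) - 1;
    % vl_svmtrain expects one sample per column.
    [W, B] = vl_svmtrain(train_image_feats', binary_labels, LAMBDA);
    scores(c, :) = W' * test_image_feats' + B;
end
% Each test image takes the category with the highest decision value.
[~, best] = max(scores, [], 1);
predicted_categories = categories(best);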
A specific disadvantage of quantizing vectors is that when two cluster centers are very close to each other, two nearby features may be hard-assigned so that one cluster is strictly favored over the other. When those two features were actually meant to be assigned to the same cluster, this hard assignment reduces accuracy. The remediation is soft assignment, i.e. weighted binning while creating the histogram in the bag of words model: a feature votes for more than one cluster centroid. The weighting I have used is a Gaussian window over the 3-5 nearest clusters.
Code is implemented as part of get_bags_with_soft_assign.m
% Gaussian-window soft assignment: every feature votes for its
% NUM_WEIGHTS nearest cluster centers with decaying weights.
% gausswin comes from the Signal Processing Toolbox.
NUM_WEIGHTS = 3;
weightage = transpose(gausswin(2 * NUM_WEIGHTS - 1));
% Keep the decaying half of the window, peak first, so the nearest
% cluster receives the largest weight.
weightage = weightage(NUM_WEIGHTS:end);
load('vocab.mat'); % vocab is VOCAB_SIZE x 128
image_feats = zeros(size(image_paths, 1), size(vocab, 1));
for idx = 1:size(image_paths, 1)
    if mod(idx, 100) == 0
        disp(idx); % progress indicator
    end
    image = im2single(imread(image_paths{idx}));
    [~, SIFT_features] = vl_dsift(image, 'fast', 'step', 10);
    % Indices of the NUM_WEIGHTS nearest cluster centers for every feature.
    [idx_array, ~] = knnsearch(vocab, single(SIFT_features'), 'K', NUM_WEIGHTS);
    for n = 1:size(SIFT_features, 2)
        for k = 1:NUM_WEIGHTS
            image_feats(idx, idx_array(n, k)) = ...
                image_feats(idx, idx_array(n, k)) + weightage(k);
        end
    end
end
% Normalize each histogram, as in the hard-assignment encoding.
image_feats = image_feats ./ (vecnorm(image_feats, 2, 2) + eps);
Vocabulary size becomes a free parameter because it defines the granularity of the clusters: a very high vocabulary size implies that plenty of cluster centers are present. I experimented with vocabulary sizes ranging from 20 to 5000. Since this dataset contains a small number of categories (15), vocabulary size does not have a huge impact beyond a certain point. The major performance increase is visible when moving from 200 to 500; beyond 500, for vocabulary sizes like 1000 and 5000 (with modifications to other free parameters such as the SIFT bin size, the step size, and K of KNN), no great performance improvements are visible. Average accuracy for bag of words, calculated over 5 runs for the smaller vocabulary sizes and 3 runs for the larger ones (which take very long to run since they were run without the 'fast' parameter), is tabulated below.
Vocabulary Size | Average Accuracy |
---|---|
20 | 0.408 |
100 | 0.429 |
200 | 0.46 |
500 | 0.529 |
1000 | 0.605 |
5000 | 0.623 |
Steps to perform Fisher encoding:
- Build a vocabulary using a GMM.
- Use the GMM to encode the training and testing features.
- Use a linear SVM to classify the encoded training and testing features.
Code is implemented as part of build_gmm_vocab.m and get_fisher_vector.m
num_images = size(image_paths, 1);
% Aim for roughly 10k descriptors in total for fitting the GMM.
features_per_image = ceil(10000 / num_images);
data = [];
for idx = 1:num_images
    image = im2single(imread(image_paths{idx}));
    [~, SIFT_features] = vl_dsift(image, 'fast', 'step', 10);
    % Guard against images that yield fewer descriptors than requested.
    num_take = min(features_per_image, size(SIFT_features, 2));
    data = [data SIFT_features(:, 1:num_take)]; %#ok<AGROW>
end
% Fit a GMM over the sampled descriptors; the Gaussians take the role
% that the k-means centers play in the plain bag of words model.
[means, covariances, priors] = vl_gmm(single(data), vocab_size);
% Persist the GMM so get_fisher_vector.m can load it below.
save('means.mat', 'means');
save('covariances.mat', 'covariances');
save('priors.mat', 'priors');
% Load the GMM vocabulary produced by build_gmm_vocab.m.
load('means.mat')
load('covariances.mat')
load('priors.mat')
image_feats = [];
for idx = 1:size(image_paths, 1)
    image = im2single(imread(image_paths{idx}));
    [~, SIFT_features] = vl_dsift(image, 'fast', 'step', 5);
    % Encode this image's dense SIFT descriptors as a single Fisher vector.
    fisher = vl_fisher(single(SIFT_features), means, covariances, priors);
    image_feats = [image_feats; fisher']; %#ok<AGROW>
end
Detailed results: Fisher results webpage
Accuracy of 0.765