Scene recognition using bag of words builds on the idea of recognizing real-world scenes while bypassing segmentation and the processing of individual objects or regions. The procedure is based on a very low-dimensional representation of the scene: the model generates a feature descriptor for each image in a multidimensional space in which scenes share membership in semantic categories. The initial work in this area was introduced in 2001 with 8 scene categories; this was extended to 13 categories in work published in 2005, and finally to 15 categories in 2006. In this project we use the 15 categories to perform scene recognition.
Figure 1: Example images from the scene category database.
In this project we are given a set of training images (with correct labels) and a set of test images. We use several ways to represent the images, followed by two different ways to classify them.
Representation of images:
- Tiny images
- Bag of SIFT
- Spatial pyramid of SIFT
- Concatenated GIST and SIFT
- Fisher vector encoding
- Kernel codebook encoding
The two different classification schemes used are:
- k-nearest neighbours
- Linear one vs all SVM
In this project we run various combinations of the image representation schemes and classification techniques and provide a comparison of the performance of each of them.
Tiny images downsizes each image and uses the downsized image as its representation. An important benefit of working with tiny images is that it becomes practical to store and manipulate datasets orders of magnitude bigger than those typically used in computer vision. For the tiny image representation, the image is resized using imresize to the target size, and the resized image is then reshaped into a row vector. This row vector is the image's representation. We represent all the train and test images the same way.
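The resizing and flattening described above can be sketched as follows. The project itself uses MATLAB's imresize; this is a minimal Python stand-in that downsizes by block averaging, with the zero-mean/unit-norm step added as an illustrative (assumed) normalization rather than part of the original pipeline:

```python
import numpy as np

def tiny_image(img, size=16):
    """Downsize a grayscale image to size x size by block averaging
    (a stand-in for MATLAB's imresize), then flatten to a row vector."""
    h, w = img.shape
    ys = np.arange(size + 1) * h // size
    xs = np.arange(size + 1) * w // size
    out = np.empty((size, size))
    for i in range(size):
        for j in range(size):
            out[i, j] = img[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
    vec = out.ravel()
    # Assumed normalization: zero mean and unit length.
    vec = vec - vec.mean()
    return vec / (np.linalg.norm(vec) + 1e-12)
```

Every train and test image is passed through the same function, so all representations live in the same size*size dimensional space.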
We use nearest neighbour classification as our classification technique. Each test vector is assigned to a class by considering its k nearest neighbours in the training set; the image is assigned to the most common class among the neighbours considered.
Parameters considered: size of tiny image, number of nearest neighbours.
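The k-nearest-neighbour vote described above can be sketched as follows (a Python illustration; the function name is ours, not the project code):

```python
import numpy as np
from collections import Counter

def knn_classify(train_feats, train_labels, test_feats, k=7):
    """Assign each test vector the majority label among its k nearest
    training vectors under Euclidean distance."""
    preds = []
    for t in test_feats:
        d = np.linalg.norm(train_feats - t, axis=1)   # distance to every train image
        nn = np.argsort(d)[:k]                        # indices of the k closest
        votes = Counter(train_labels[i] for i in nn)
        preds.append(votes.most_common(1)[0][0])      # most common class wins
    return preds
```

Both free parameters named above appear here: the tiny image size fixes the feature dimensionality, and k fixes the number of voters.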
Figure : Examples of nearest neighbour classification using k = 5 and k = 8.
Figure 2a: Table of accuracies for different tiny image sizes and number of nearest neighbors.
Figure 2b: Confusion matrix of best accuracy = 22.7% with image size 16x16 and 7 nearest neighbors.
The free parameters in this combination of image descriptor and classifier are the tiny image size, and number of nearest neighbours considered for test image classification.
In Bag of SIFT we use the dense scale-invariant feature transform (SIFT) to obtain feature descriptors. Once we have the descriptors, we build a histogram based on their closeness to a predefined vocabulary. We first build this vocabulary from the training dataset: we generate SIFT descriptors for the training images, randomly sample a few thousand of them, and use k-means to cluster the sampled descriptors into a fixed number of clusters. In our project we use 200 clusters; with a lower number of clusters we observed a slight drop in accuracy across all test cases. The means obtained by k-means clustering are saved as the vocabulary in 'vocab.mat'.
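The vocabulary construction can be sketched with a plain Lloyd's k-means loop. This is a minimal Python illustration (the project uses MATLAB with VLFeat's kmeans); the function name and empty-cluster handling are our assumptions:

```python
import numpy as np

def build_vocab(descriptors, k=200, iters=20, seed=0):
    """Cluster sampled SIFT descriptors with plain Lloyd's k-means;
    the resulting cluster means form the visual vocabulary."""
    rng = np.random.default_rng(seed)
    # Initialize the means with k distinct randomly chosen descriptors.
    means = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest mean.
        d = np.linalg.norm(descriptors[:, None, :] - means[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Recompute each mean; keep the old mean if a cluster goes empty.
        for j in range(k):
            members = descriptors[assign == j]
            if len(members):
                means[j] = members.mean(axis=0)
    return means
```

The returned means play the role of vocab.mat in the MATLAB pipeline.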
Figure : Example of k-means clustering. The red points are the means found after clustering.
Once we have our vocabulary in place, we build each image's feature vector as a histogram counting the SIFT features closest to each mean in our vocabulary. Such a representation is generated for all the test and train data. We then use nearest neighbours to assign classes to the test vectors, and calculate accuracy against the ground truth.
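The histogram step can be sketched as below (Python illustration; the L1 normalization is an assumption we add so that images with different numbers of descriptors remain comparable):

```python
import numpy as np

def bag_of_words_hist(descriptors, vocab):
    """Count how many of the image's SIFT descriptors fall closest to
    each vocabulary word, then L1-normalize the counts."""
    d = np.linalg.norm(descriptors[:, None, :] - vocab[None, :, :], axis=2)
    words = d.argmin(axis=1)                      # nearest word per descriptor
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()
```

This single fixed-length histogram per image is what the nearest-neighbour (and later SVM) classifiers consume.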
Figure : Accuracy of various combinations of parameters for SIFT representation and classification using k-nearest neighbours with k = 5 and k = 8.
Figure : Confusion matrix for best accuracy = 48% obtained using SIFT step size = 6 and 7 nearest neighbours.
The free parameters in this combination of image descriptor and classifier are the SIFT step size and the number of nearest neighbours considered for test image classification.
In Bag of SIFT we use dense SIFT to transform the test and train images and obtain their feature descriptors. We build a histogram of counts using a pre-built vocabulary: for each SIFT descriptor of a test or train image, we find the closest vector in the vocabulary and increment that bin.
Once we have a SIFT feature vector for each image in the train and test datasets, we classify it using a linear SVM. A linear SVM is a binary classifier; since we have 15 classes, we perform 15 one vs all classifications, giving us 15 binary classifiers. The linear classifiers are of the form sign(Wx + b) = class, where the class is either +1 or -1, denoting "belongs to the class" and "does not belong to the class" respectively. W is the matrix of weights, and b is the vector of biases. An example of a linear classifier is shown in the figure below.
Figure : Example of linear classification using one vs all SVM.
These classifiers are trained on the train images. The same classifiers are applied to each test image to calculate its scores Wx + b, where W and b are as learned from the train images. Since it is one vs all classification, the final label of each test image depends on the Wx + b scores from all classes: the highest-scoring classifier gives the final class assigned to the test image.
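The one vs all decision rule above can be sketched as follows. W and b are assumed to come from training (vl_svmtrain in the MATLAB pipeline); this Python fragment only shows the scoring and argmax step:

```python
import numpy as np

def one_vs_all_predict(W, b, X, class_names):
    """Score each image against all binary classifiers (w_c . x + b_c)
    and assign the class whose classifier scores highest."""
    scores = X @ W.T + b                      # shape: (n_images, n_classes)
    return [class_names[i] for i in scores.argmax(axis=1)]
```

Note that the raw scores are compared directly across the 15 classifiers; no probability calibration is applied.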
Figure : Accuracy of various combinations of parameters for SIFT representation and classification using SVM.
Figure : Confusion matrix for best accuracy = 60.2% obtained using SIFT step size = 6 and lambda = 0.000557.
The free parameters in this combination of image descriptor and classifier are the SIFT step size, and lambda, the regularization parameter that controls the degree of importance given to misclassifications. We performed the experiment with and without the 'fast' parameter in vl_dsift. With 'fast' the runtime was much lower, and accuracy was slightly better. This may be because 'fast' approximates the Gaussian spatial window with a flat window, which slightly changes the descriptors.
For extra credit, we train vocabularies of different sizes using SIFT descriptors. For a comparable analysis, we use SIFT step size = 6 and SVM classifier lambda = 0.000557, as these correspond to the best accuracy with vanilla SIFT and SVM at vocabulary size 200. The accuracies we observe are presented in the table below. Analyzing it, we observe that for lower codebook sizes, the accuracy falls steeply below the accuracy for the corresponding step size and SVM regularization parameter.
Figure : Accuracy for various vocabulary sizes of SIFT descriptors with classification using SVM.
In the spatial pyramid, we take spatial information about the image into account as well. Here we divide the image into four or sixteen smaller images and take an individual histogram for each of them. This is shown in the figure below. The histograms are then weighted by 1/4 for level 1 and 1/16 for level 2.
Figure : Example of level 0, 1 and 2 of the spatial pyramid.
In this method of image representation, we take into account the histograms of gradients in different parts of the image and thus retain some information about spatial arrangement. After weighting, the different histograms must be combined; the combination can be done in many ways, such as concatenating, stacking, or adding. In this project, we add the weighted histograms to finally have one histogram per image.
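The weighted sum over pyramid levels can be sketched as below. This Python illustration assumes a precomputed map of visual-word ids on a grid of SIFT locations; the level-0 weight of 1 is our assumption (the text only specifies 1/4 for level 1 and 1/16 for level 2):

```python
import numpy as np

def spatial_pyramid(word_map, vocab_size, weights=(1.0, 0.25, 0.0625)):
    """Add weighted bag-of-words histograms over a 3-level pyramid
    (1x1, 2x2, 4x4 grids). word_map is an HxW array of word ids at
    dense SIFT locations; histograms are combined by addition."""
    h, w = word_map.shape
    total = np.zeros(vocab_size)
    for level, wt in enumerate(weights):
        cells = 2 ** level
        for i in range(cells):
            for j in range(cells):
                patch = word_map[i * h // cells:(i + 1) * h // cells,
                                 j * w // cells:(j + 1) * w // cells]
                hist = np.bincount(patch.ravel(),
                                   minlength=vocab_size).astype(float)
                total += wt * hist / max(hist.sum(), 1)   # normalized cell histogram
    return total / np.linalg.norm(total)
```

Because the weighted histograms are added rather than concatenated, the final vector keeps the same length as a plain bag-of-SIFT histogram.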
Figure : Accuracy of various combinations of parameters for spatial pyramid representation and classification using SVM.
Figure : Confusion matrix for best accuracy = 60.2% obtained using SIFT step size = 6 and lambda = 0.000557.
The free parameters in this combination of image descriptor and classifier are the SIFT step size, and lambda, the regularization parameter that controls the degree of importance given to misclassifications. Additionally, we have the free parameters associated with the spatial pyramid, viz. the number of levels, the weight associated with each level, and the method of combining the levels.
SIFT descriptors are local descriptors, i.e. they describe local patches of the image around corner points. GIST is a global descriptor: intuitively, GIST summarizes the gradient information (scales and orientations) for different parts of an image, which provides a rough description (the gist) of the scene. Thus, in this project we use the GIST descriptor together with SIFT. There are various ways of combining the two descriptors, viz. adding them, weighting them during classification, etc. In this project we concatenate them horizontally so that they enhance the classification accuracy of vanilla SIFT.
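The horizontal concatenation can be sketched as below. The per-descriptor normalization is our assumption (added so that neither descriptor dominates purely by scale); the source only states that the vectors are concatenated:

```python
import numpy as np

def concat_features(gist_feats, sift_feats):
    """Horizontally concatenate per-image GIST and bag-of-SIFT vectors,
    L2-normalizing each block first so neither dominates by scale."""
    g = gist_feats / (np.linalg.norm(gist_feats, axis=1, keepdims=True) + 1e-12)
    s = sift_feats / (np.linalg.norm(sift_feats, axis=1, keepdims=True) + 1e-12)
    return np.hstack([g, s])   # shape: (n_images, gist_dim + sift_dim)
```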
The parameters that we can alter in GIST are the number of orientations per scale, the number of blocks, and the pre-filtering. In order to get a 128-length feature vector like SIFT, we use a total of 32 scales, 2 blocks, and filter size 4. The other free parameters are those of the linear SVM classifier: the cost of misclassified data points within the support vector margin (denoted by lambda) is tuned with values ranging from 10 to 0.00001. A table of accuracies obtained for different combinations of free parameters is shown below.
Figure : Accuracy of various combinations of parameters for concatenated GIST-SIFT representation and classification using SVM.
Figure : Confusion matrix for best accuracy = 62.2% obtained using SIFT step size = 6 and lambda = 0.00065.
The free parameters in this combination of image descriptor and classifier are the SIFT step size, and lambda, the regularization parameter that controls the degree of importance given to misclassifications. Additionally, the parameters that we can tune in GIST are the number of orientations per scale, the number of blocks, and the pre-filtering.
Until now we have used descriptors like SIFT or GIST in raw form. From here on we look at coding schemes that encode the feature vectors and help us obtain an image representation. The Fisher vector is an image representation obtained by pooling local image features. A Gaussian mixture model (GMM) is fitted to the distribution of image descriptors (here, SIFT descriptors); the GMM associates each descriptor with a mode k in the mixture. A GMM can be represented by its means, covariances, and priors, as shown in the figure below. Thus, we save these three parameters as means.mat, covariances.mat, and priors.mat, which are loaded and applied to the test and train images. This generates a Fisher vector, which is our final representation of a single image.
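The pooling step can be sketched as a simplified (improved) Fisher vector computation. This Python fragment is an illustration only; it assumes diagonal covariances and adds the standard power and L2 normalization, which the source does not describe:

```python
import numpy as np

def fisher_vector(descs, means, covs, priors):
    """Posterior-weighted first- and second-order differences of the
    descriptors from each GMM component (diagonal covariances).
    Output length is 2 * K * D."""
    K, D = means.shape
    # Soft-assign each descriptor to each Gaussian component.
    diff = descs[:, None, :] - means[None, :, :]              # (N, K, D)
    logp = (-0.5 * (diff ** 2 / covs).sum(2)
            - 0.5 * np.log(covs).sum(1) + np.log(priors))     # log posteriors (unnorm.)
    logp -= logp.max(1, keepdims=True)
    q = np.exp(logp)
    q /= q.sum(1, keepdims=True)                              # posteriors (N, K)
    N = len(descs)
    sigma = np.sqrt(covs)
    # First-order (mean) and second-order (variance) gradient statistics.
    fu = (q[:, :, None] * diff / sigma).sum(0) / (N * np.sqrt(priors)[:, None])
    fv = (q[:, :, None] * (diff ** 2 / covs - 1)).sum(0) / (N * np.sqrt(2 * priors)[:, None])
    phi = np.hstack([fu.ravel(), fv.ravel()])
    # Power and L2 normalization (standard for Fisher vectors; an assumption here).
    phi = np.sign(phi) * np.sqrt(np.abs(phi))
    return phi / (np.linalg.norm(phi) + 1e-12)
```

With K = 30 components and D = 128-dimensional SIFT descriptors, this gives a 2 * 30 * 128 = 7680-length vector, consistent with the roughly 7000-dimensional Fisher vectors discussed later.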
Figure : Example of Gaussian mixture model clustering. The mean of each cluster is shown as a '+'.
Figure : Example of how the constituent Gaussian distributions appear in a GMM.
Classification is done using a linear SVM. In this method, the Fisher vectors are the image representations of the test and train images. Once we have both sets of Fisher vectors, we train a linear SVM to obtain the weights and offset of a classifier for each class. As a linear SVM is a binary classifier, we train 15 classifiers, one for each category. The test images are scored using the trained classifiers. In Fisher encoding, the number of clusters in the Gaussian mixture model is tuned to find the model that gives the best accuracy, as shown in the table of accuracies below.
Figure : Accuracy of various combinations of parameters for Fisher encoding of SIFT descriptors and classification using SVM.
Figure : Confusion matrix for best consistent accuracy = 72.9% obtained using GMM cluster size 30, SIFT step size = 6, and lambda = 0.000557.
Using this image representation technique, we switch gears slightly from the previous test cases. Fisher encoding requires a Gaussian mixture model, whose free parameter is the number of clusters. As we fit the GMM to SIFT features, we also have the free parameters of SIFT. We continue to use the SVM classifier, which contributes its regularization parameter to be tuned.
Kernel codebook encoding is a soft assignment technique that overcomes the hard assignment of vocabulary codewords to SIFT image features. Hard assignment overlooks codeword uncertainty and may label image features with non-representative codewords: the codebook approach merely selects the single best representing codeword, ignoring the relevance of other candidates. Here we apply techniques from kernel density estimation to allow a degree of ambiguity in assigning codewords to image features. Kernel density estimation is a robust alternative to histograms for estimating a probability density function: it uses a kernel function to smooth the local neighbourhood of data samples. We use a Gaussian kernel to smooth the distances from the SIFT descriptors of an input image to the vocabulary words; the Euclidean distance is paired with the Gaussian-shaped kernel shown below.
Figure : Gaussian Kernel used for Kernel Code book smoothing of the distances between SIFT descriptors and vocabulary.
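The soft assignment can be sketched as below. This Python illustration weights every codeword by a Gaussian kernel on the distance; the per-descriptor normalization (so each descriptor contributes a total mass of 1) is our assumption:

```python
import numpy as np

def kernel_codebook_hist(descriptors, vocab, sigma=0.1):
    """Soft-assign each descriptor to every codeword with a Gaussian
    kernel on the Euclidean distance, instead of a hard nearest-word vote."""
    d = np.linalg.norm(descriptors[:, None, :] - vocab[None, :, :], axis=2)
    w = np.exp(-d ** 2 / (2 * sigma ** 2))        # Gaussian kernel on distances
    w /= w.sum(1, keepdims=True) + 1e-12          # each descriptor has total mass 1
    hist = w.sum(0)
    return hist / hist.sum()
```

With a very small sigma this reduces to the hard-assignment histogram; larger sigma spreads each descriptor's vote over more codewords.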
In this project we use the Gaussian kernel, which has one free parameter, the standard deviation (denoted by sigma). The figure below shows the different accuracies obtained using kernel codebook encoding.
Figure : Accuracy of various combinations of parameters for kernel codebook encoding of SIFT descriptors and classification using SVM.
Figure : Confusion matrix for best consistent accuracy = 54.2% obtained using kernel standard deviation sigma = 0.1, SIFT step size = 6, and SVM lambda = 0.000557.
Using this image representation technique, we again switch gears slightly. Kernel codebook encoding requires a Gaussian kernel, whose free parameter is its standard deviation. As we kernelize SIFT features, we also have the free parameters of SIFT. We continue to use the SVM classifier, which contributes its regularization parameter to be tuned.
In an attempt to understand the behaviour of the various techniques, we run one test case with all the image representations combined. As the features have different widths, we horizontally concatenate them into one long vector per image. We use three levels of spatial encoding as done in section 4. We observe that the results are uncannily similar to those of Fisher encoding alone. To understand this, we inspect the feature vector of each image: the Fisher vector length is about 7000 (depending on the number of GMM clusters), so the major chunk of information in each vector comes from Fisher encoding, whereas SIFT and GIST add a negligible 400 dimensions. This is possibly the reason we observe no change in accuracy when all image representation methods are combined. The table of accuracies is shown below, with the best test case's confusion matrix.
Figure : Accuracy of all combinations of image representations, with classification using SVM.
Figure : Confusion matrix for best consistent accuracy = 72.9% obtained using Fisher encoding with GMM clusters = 30, SIFT step size = 6, and SVM lambda = 0.000557.
Combination: Fisher Encoding (GMM cluster size 30, SIFT step size 6) with Linear SVM Classification (lambda = 0.000557).
Category name | Accuracy | False positives with true label | False negatives with wrong predicted label
---|---|---|---
Kitchen | 0.620 | Bedroom, Bedroom | Industrial, Bedroom
Store | 0.500 | LivingRoom, InsideCity | Bedroom, Bedroom
Bedroom | 0.380 | Kitchen, Industrial | LivingRoom, Coast
LivingRoom | 0.310 | Bedroom, Industrial | Bedroom, Suburb
Office | 0.700 | Kitchen, Store | Kitchen, InsideCity
Industrial | 0.270 | InsideCity, Store | TallBuilding, Kitchen
Suburb | 0.810 | Highway, OpenCountry | LivingRoom, Store
InsideCity | 0.400 | Suburb, TallBuilding | Street, LivingRoom
TallBuilding | 0.740 | OpenCountry, Bedroom | Coast, Mountain
Street | 0.700 | Forest, Industrial | Industrial, InsideCity
Highway | 0.710 | Industrial, Industrial | Office, LivingRoom
OpenCountry | 0.480 | Industrial, Highway | Suburb, Coast
Coast | 0.700 | Industrial, OpenCountry | Highway, Suburb
Mountain | 0.680 | Industrial, Store | Suburb, Street
Forest | 0.880 | OpenCountry, OpenCountry | OpenCountry, Suburb