The aim of this project was to understand the basic bag-of-words pipeline for image/scene recognition. Images were represented either as downsampled tiny images or as bags of SIFT features. Two classifiers were implemented: k-nearest neighbor and linear SVM (1-vs-all SVMs). The dataset contained fifteen scene categories. The pipeline was implemented in the following order:

Image Features Representation

Tiny Images

The images were downsampled to the recommended size of 16x16 and then normalized.
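The tiny-image feature can be sketched as follows. This is a minimal illustration, not the project's code: it assumes a grayscale image stored as a list of rows, downsamples by block averaging, and assumes that "normalized" means zero mean and unit length. The function name `tiny_image` is hypothetical.

```python
import math

def tiny_image(pixels, size=16):
    """Downsample a grayscale image (list of rows, at least size x size)
    by block averaging, then normalize to zero mean and unit length.
    The normalization scheme is an assumption, not taken from the report."""
    height, width = len(pixels), len(pixels[0])
    small = []
    for i in range(size):
        r0, r1 = i * height // size, (i + 1) * height // size
        for j in range(size):
            c0, c1 = j * width // size, (j + 1) * width // size
            block = [pixels[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            small.append(sum(block) / len(block))
    mean = sum(small) / len(small)
    centered = [v - mean for v in small]
    norm = math.sqrt(sum(v * v for v in centered))
    # A constant image has zero norm; return the all-zero vector in that case.
    return [v / norm for v in centered] if norm > 0 else centered

# Toy 32x32 gradient image, used only to exercise the function.
image = [[r + c for c in range(32)] for r in range(32)]
feature = tiny_image(image)
print(len(feature))  # 16 * 16 = 256 values
```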

SIFT representation of features

The dictionary of visual words was created using k-means clustering; it contains the centroids of the clusters. The SIFT (Scale-Invariant Feature Transform) descriptors extracted from each image were then represented as a histogram over these visual words. The bin size was set to 8, and each descriptor has 128 dimensions. One of the parameters that could be varied was the number of features retained after running vl_dsift. These parameters were varied and the resulting accuracies were measured.
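The histogram step can be sketched as below, assuming the centroids have already been computed by k-means and the descriptors have already been extracted. The toy 2-D vectors stand in for 128-dimensional SIFT descriptors, and the function names are hypothetical.

```python
def nearest_centroid(descriptor, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    best_index, best_dist = 0, float("inf")
    for index, centroid in enumerate(centroids):
        dist = sum((d - c) ** 2 for d, c in zip(descriptor, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def bag_of_words_histogram(descriptors, centroids):
    """Hard-assign each descriptor to its nearest visual word and
    return the histogram normalized to sum to 1."""
    counts = [0] * len(centroids)
    for descriptor in descriptors:
        counts[nearest_centroid(descriptor, centroids)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(centroids)

# Toy vocabulary of two visual words and four descriptors.
centroids = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [9.0, 9.0], [0.0, 2.0], [11.0, 10.0]]
histogram = bag_of_words_histogram(descriptors, centroids)
print(histogram)  # two descriptors fall on each word
```

Normalizing the histogram makes images with different numbers of descriptors comparable; whether the project normalized this way is an assumption.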

Classifier

k-nearest neighbor

This is the simplest classifier used for classification. A distance metric is used to find the k nearest training examples, and the category that occurs most often among those neighbors is assigned as the final label.
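A minimal sketch of this voting scheme, assuming squared Euclidean distance as the metric (the report does not state which metric was used). The function name and the toy data are illustrative only.

```python
from collections import Counter

def knn_predict(train_features, train_labels, query, k=1):
    """Label the query by majority vote among its k nearest training
    features. Ties in the vote go to the label seen first, i.e. the one
    with the nearest member."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(feature, query)), label)
        for feature, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D features for two categories.
train_features = [[0, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
train_labels = ["Kitchen", "Kitchen", "Forest", "Forest", "Forest"]
print(knn_predict(train_features, train_labels, [0.2, 0.3], k=3))
```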

SVM Linear Classifier

A support vector machine (SVM) works well for two linearly separable classes, yielding a line (hyperplane) that separates them. In this case, however, there were 15 classes to be classified, so the one-vs-all technique was used. One binary SVM was trained per class, and the classification was made using the parameters obtained from all the runs.
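The prediction step of one-vs-all can be sketched as follows. It assumes each binary SVM has already produced a weight vector and a bias, and that the predicted category is the one with the highest score, which is the usual convention but is not spelled out in the report. All names and numbers here are illustrative.

```python
def one_vs_all_predict(weights, biases, categories, feature):
    """Score the feature with every binary classifier (w . x + b) and
    return the category whose classifier gives the highest score."""
    scores = [
        sum(w * x for w, x in zip(weight, feature)) + bias
        for weight, bias in zip(weights, biases)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return categories[best]

# Toy parameters for three categories and 2-D features.
categories = ["Coast", "Forest", "Street"]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.5]
print(one_vs_all_predict(weights, biases, categories, [0.2, 0.9]))
```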

Results in a table

Tiny Images and k-nearest neighbor

Value of k used	Accuracy
1	20.9%
5	20.1%
10	21.4%
15	21.8%
20	21.5%
25	20.9%
30	20.3%

Timing Analysis: As the value of k increased, the program took more time to execute.

The x-axis shows the value of k and the y-axis shows the classification accuracy for each value of k. The maximum accuracy was obtained with k = 15.

Bag of SIFT features and k-nearest neighbor

Size of the bins = 8

Cluster Size/Vocab Size	Value of k used	Accuracy
10	1	37.1%
10	15	44.9%
10	30	46.2%
50	1	45%
50	15	48.7%
50	30	47.1%
100	1	51.6%
100	15	40.9%
100	30	42.3%
200	1	50.1%
200	15	50.2%
200	30	52.0%

In general, the accuracy improved as the cluster (vocabulary) size was increased. Increasing the number of nearest neighbors also tended to improve the accuracy at first, but increasing it further could decrease the accuracy again, as with a vocabulary size of 50. The trend was not uniform: with a vocabulary size of 100, k = 1 gave the best result (51.6%).

Bag of SIFT features and Support Vector Machine

Size of the bins = 8

Cluster Size/Vocab Size	Value of lambda	Accuracy
10	0.0001	45.6%
10	0.001	37.9%
10	0.01	32.9%
10	1	20.1%
50	0.0001	53.6%
50	0.001	50.1%
50	0.01	39.3%
50	1	31.3%
100	0.0001	59.9%
100	0.001	58.6%
100	0.01	40.5%
100	1	38.0%
200	0.0001	53.6%
200	0.0007	56.9%
200	0.0008	59.9%
200	0.0009	57.2%
200	0.00091	56.1%
200	0.00092	55.5%

The x-axis shows the value of lambda and the y-axis shows the classification accuracy for each value of lambda. The maximum accuracy (59.9%) was obtained with cluster size = 200 and lambda = 0.0008; the same accuracy was also reached with cluster size = 100 and lambda = 0.0001.

Confusion matrix for the best result obtained with bag of SIFTs and SVM, for cluster size 200 and lambda = 0.0008.

The SVM's accuracy is very sensitive to the value of lambda. It is very important to choose the correct value of lambda along with the cluster size in order to get the best accuracy. There are a lot of free parameters which can be tuned to get good results. Because the results with a cluster size of 100 were not satisfactory, a beam search over the values of lambda was done for a cluster size of 200, and the values were narrowed down to the range 0.0007-0.001. Within that range, the value of lambda was tuned further to get the final value.
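One way to carry out this kind of narrowing is a coarse-to-fine search, sketched below. This is an assumption about the general idea, not the exact procedure used in the project. The `evaluate` argument is a hypothetical callback that would train the SVM with a given lambda and return validation accuracy; here it is replaced by a toy function with a known peak so the sketch can run on its own.

```python
def coarse_to_fine_search(evaluate, low, high, steps=5, rounds=3, floor=1e-8):
    """Evaluate evenly spaced lambdas in [low, high], then shrink the
    range around the best one and repeat."""
    best_lambda, best_score = None, float("-inf")
    for _ in range(rounds):
        width = (high - low) / (steps - 1)
        for i in range(steps):
            lam = low + i * width
            score = evaluate(lam)
            if score > best_score:
                best_lambda, best_score = lam, score
        # Keep lambda strictly positive when narrowing the range.
        low, high = max(best_lambda - width, floor), best_lambda + width
    return best_lambda, best_score

# Toy stand-in for "train and validate the SVM": peaks at lambda = 0.0008.
def toy_score(lam):
    return -(lam - 0.0008) ** 2

best_lambda, best_score = coarse_to_fine_search(toy_score, 0.0007, 0.001)
print(best_lambda)
```

Each round costs `steps` evaluations, so with a real SVM the number of rounds and steps trades tuning time against resolution.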

Extra Credit

Features at multiple scales

For the baseline, the features were extracted from the images at their original scale. The results were decent, but a larger number of features can give better results. The features were therefore extracted at multiple scales, and results were obtained for each scale individually and for the features from all the scales combined. The results changed, but not drastically.

Level of the pyramid	Accuracy
Level 0	59.7%
Level 1	61.5%
Level 2	58.8%

The confusion matrix shows the results obtained by combining the features of the original image with those of the first-level downsampled image. The accuracy obtained was 61.5%.

Add complementary features

GIST features were appended to the SIFT features before sending the features to the classifier. The GIST descriptors were calculated using the LMgist function. This function normalizes the features, so separate normalization is not required. Adding these features improved the categorization accuracy. The cluster size was 200.
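Combining the two descriptors amounts to concatenating the two vectors for each image, as in the sketch below. The `gist_weight` parameter is a hypothetical extra, not something the report mentions; it only shows where a relative weighting could be applied.

```python
def combine_features(sift_histogram, gist_descriptor, gist_weight=1.0):
    """Concatenate the bag-of-SIFT histogram with the (already normalized)
    GIST descriptor into one feature vector."""
    return list(sift_histogram) + [gist_weight * g for g in gist_descriptor]

# Toy vectors: a 4-bin histogram and a 3-value GIST descriptor.
combined = combine_features([0.1, 0.4, 0.3, 0.2], [0.5, 0.25, 0.25])
print(len(combined))  # 4 + 3 = 7 values
```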

k-nearest neighbor

Number of neighbors (k) considered	Accuracy
1	61.1%
15	62.4%
30	62.7%

SVM classifier

Value of lambda	Accuracy
0.0009	68.6%
0.001	69.8%
0.002	69.5%
0.003	66.9%

Kernel Codebook Encoding

This technique allows a degree of flexibility when assigning visual features to codewords, as opposed to hard assignment to a single codeword. Soft assignment is generally expected to improve categorization accuracy. In this technique, the distance of each feature to all the centroids was computed, weights were assigned according to the distance to each centroid (the smaller the distance, the larger the weight), and the weights were summed over all the descriptors to create the histogram. Although I expected the accuracy to increase with this encoding, it surprisingly decreased. The categorization accuracy achieved with the SVM, a vocab size of 200 and lambda = 0.0007 was 48.5%. For 100 clusters it came down to 48.0%.
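The soft assignment can be sketched as follows. The report only says that closer centroids receive larger weights; the Gaussian weighting, the `sigma` parameter, and the per-descriptor normalization used here are assumptions made for illustration.

```python
import math

def kernel_codebook_histogram(descriptors, centroids, sigma=1.0):
    """Soft-assign each descriptor to all centroids with Gaussian weights
    that sum to 1 per descriptor, then average over the descriptors."""
    histogram = [0.0] * len(centroids)
    for descriptor in descriptors:
        sq_dists = [
            sum((d - c) ** 2 for d, c in zip(descriptor, centroid))
            for centroid in centroids
        ]
        # Subtracting the smallest distance avoids underflow and does not
        # change the normalized weights.
        nearest = min(sq_dists)
        weights = [math.exp(-(sq - nearest) / (2.0 * sigma ** 2)) for sq in sq_dists]
        total = sum(weights)
        for index, weight in enumerate(weights):
            histogram[index] += weight / total
    count = len(descriptors)
    return [h / count for h in histogram] if count else histogram

# Toy example: one descriptor on a centroid, one exactly between the two.
centroids = [[0.0, 0.0], [1.0, 0.0]]
soft = kernel_codebook_histogram([[0.0, 0.0], [0.5, 0.0]], centroids)
print(soft)
```

The choice of `sigma` controls how soft the assignment is: a very small value approaches hard assignment, while a very large value spreads every descriptor almost evenly over all words.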

Feature Encoding using Fisher Encoding

The features can be encoded using more sophisticated feature encoding techniques. The vocabulary was built using a GMM (Gaussian mixture model), and the features were then encoded using Fisher encoding. This improved the results drastically.

Value of lambda	Accuracy with Fisher Encoding	Accuracy with Fisher Encoding and Chi-Squared Kernel for SVM
0.00072	73.3%	77.9%
0.00075	73.5%	77.2%
0.0009	73.1%	77.7%

SVM with more sophisticated kernels (Kernels used : Chi-squared and RBF kernel)

Using more sophisticated kernels improved the accuracy, but not drastically.

The following table presents the results using the Chi-squared kernel.

Value of lambda	Accuracy without the GIST descriptors	Accuracy with the GIST descriptors
0.0004	50.30%	59.10%
0.0005	48.30%	63.50%
0.0007	48.70%	63.33%
0.0009	53.20%	63.10%
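For reference, one common form of the chi-squared kernel on histograms is sketched below. Several variants exist (additive and exponential), and the report does not say which one was used; this sketch assumes the exponential form with a hypothetical `gamma` parameter and non-negative histogram entries.

```python
import math

def chi_squared_distance(x, y, eps=1e-10):
    """Chi-squared distance between two non-negative histograms.
    eps guards against division by zero for empty bins."""
    return sum((a - b) ** 2 / (a + b + eps) for a, b in zip(x, y))

def chi_squared_kernel(x, y, gamma=1.0):
    """Exponential chi-squared kernel: exp(-gamma * chi2_distance)."""
    return math.exp(-gamma * chi_squared_distance(x, y))

# Toy histograms.
h1 = [0.5, 0.5]
h2 = [1.0, 0.0]
print(chi_squared_kernel(h1, h1))  # identical histograms give 1.0
print(chi_squared_kernel(h1, h2))  # differing histograms give a value below 1
```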

Cross-validation to measure performance

100 images were randomly picked from the training dataset and 100 images from the testing dataset. With a step size of 8 for feature extraction and a vocab size of 200, the features were extracted and the categorization accuracy was measured over 10 iterations, each time with a different random set of 100 images. Finally, the average and the standard deviation of the performance were computed. The value of lambda used for the SVM classifier was 0.00072.

Iteration	Accuracy
1	62.6%
2	60.1%
3	59.3%
4	60.7%
5	60.5%
6	60.5%
7	61.4%
8	62.0%
9	60.7%
10	62.3%

The x-axis shows the iteration number and the y-axis shows the categorization accuracy. The average accuracy was 61.01% and the standard deviation was 0.0104 (about 1.04 percentage points). The accuracy varies between iterations because a different set of images is picked each time: sometimes the distinctive features of a particular category are captured, and sometimes they are missed. As a result, the accuracy differs from run to run.
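The reported average and standard deviation can be reproduced from the ten accuracies in the table with the short check below. The reported 0.0104 matches the sample standard deviation (dividing by n - 1); the population version is shown only for comparison.

```python
import statistics

# Accuracies (in percent) of the 10 iterations, copied from the table.
accuracies = [62.6, 60.1, 59.3, 60.7, 60.5, 60.5, 61.4, 62.0, 60.7, 62.3]

mean_accuracy = statistics.mean(accuracies)
sample_std = statistics.stdev(accuracies)       # divides by n - 1
population_std = statistics.pstdev(accuracies)  # divides by n

print(f"mean = {mean_accuracy:.2f}%")
print(f"sample std = {sample_std:.2f} percentage points")
print(f"population std = {population_std:.2f} percentage points")
```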

Also, for cross-validation and for the final fine-tuning of the parameters, 100 images from the test dataset were kept separate. The pipeline was trained on 1500 training images and tested on 1400 test images. Once the cluster size was set to 200 and the value of lambda was set to 0.00072, the final results were obtained on the 100 held-out images.

Experimentation with different vocabulary size

Vocabulary sizes of 10, 50, 100 and 200 were used to check the results. The results are reported above with the baseline pipeline.

Scene classification results visualization

Accuracy (mean of diagonal of confusion matrix) is 0.776
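As a quick check, the overall figure is the mean of the per-category accuracies listed in this section. The sketch below recomputes it from those values and also shows how the same quantity would be derived from a confusion matrix of counts; the small matrix is a toy example, not project data.

```python
def mean_per_class_accuracy(confusion):
    """Mean of the diagonal of the row-normalized confusion matrix, where
    confusion[i][j] counts images of true class i predicted as class j."""
    return sum(row[i] / sum(row) for i, row in enumerate(confusion)) / len(confusion)

# Per-category accuracies copied from the results table in this section.
per_category = {
    "Kitchen": 0.610, "Store": 0.810, "Bedroom": 0.580, "LivingRoom": 0.530,
    "Office": 0.900, "Industrial": 0.620, "Suburb": 1.000, "InsideCity": 0.810,
    "TallBuilding": 0.830, "Street": 0.880, "Highway": 0.890,
    "OpenCountry": 0.640, "Coast": 0.750, "Mountain": 0.840, "Forest": 0.950,
}
overall = sum(per_category.values()) / len(per_category)
print(f"{overall:.3f}")

# Toy 2-class confusion matrix to exercise the helper.
toy_accuracy = mean_per_class_accuracy([[8, 2], [1, 9]])
```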

Category name	Accuracy	False positives with true label	False negatives with wrong predicted label
Kitchen	0.610	LivingRoom, Industrial	Bedroom, Bedroom
Store	0.810	LivingRoom, LivingRoom	LivingRoom, InsideCity
Bedroom	0.580	LivingRoom, TallBuilding	Industrial, LivingRoom
LivingRoom	0.530	Bedroom, Bedroom	Bedroom, Kitchen
Office	0.900	LivingRoom, Industrial	Kitchen, LivingRoom
Industrial	0.620	InsideCity, Coast	TallBuilding, Mountain
Suburb	1.000	InsideCity, OpenCountry	
InsideCity	0.810	Industrial, Store	Suburb, Store
TallBuilding	0.830	Industrial, Store	LivingRoom, Mountain
Street	0.880	InsideCity, Mountain	InsideCity, Industrial
Highway	0.890	Industrial, Coast	Kitchen, Coast
OpenCountry	0.640	Coast, Coast	Forest, Mountain
Coast	0.750	OpenCountry, TallBuilding	OpenCountry, OpenCountry
Mountain	0.840	OpenCountry, TallBuilding	OpenCountry, Coast
Forest	0.950	Mountain, OpenCountry	Mountain, Mountain

[Sample training images and sample true positives: images not reproduced]

Agrata Kumar

Project 4 / Scene Recognition with Bag of Words