Project 4 : Scene Recognition with Bag of Words

Scene recognition is a classic problem in computer vision. In this project, we examine the task of scene recognition starting with very simple methods -- tiny images and nearest-neighbor classification -- and then moving on to more advanced methods -- bags of quantized local features and linear classifiers learned by support vector machines. Training set: 100 labelled images for each of the 15 scene categories. Test set: a different 100 images for each of the 15 scene categories. The feature representation techniques used are

  1. Tiny images
  2. Bag of SIFT words
  3. GIST along with SIFT descriptors
And the classification techniques used are
  1. Nearest Neighbor classifier
  2. Support Vector Machine

Tiny Images and 1-Nearest Neighbor

Both the training and the test images are converted into tiny images. Each image is read and resized so that both the width and the height are 16 pixels. The resized 2D image array is then reshaped into a one-dimensional array of 16x16 = 256 elements. The tiny image is zero-centered by subtracting its mean and then normalized by dividing by its maximum intensity value. The normalized tiny image is stored as the image feature, which is then classified using 1-Nearest Neighbor.
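The project code is MATLAB, but the tiny-image step above can be sketched in NumPy (the function name and the dependency-free nearest-pixel resize are illustrative choices, not the project's actual implementation):

```python
import numpy as np

def tiny_image_feature(img, size=16):
    """Resize a grayscale image to size x size, flatten it,
    zero-center it, and scale by the maximum absolute value."""
    h, w = img.shape
    # Crude nearest-pixel resize; the project used a library resize.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    tiny = img[np.ix_(rows, cols)].astype(np.float64)
    feat = tiny.ravel()              # 16*16 = 256 elements
    feat -= feat.mean()              # make the mean 0
    m = np.abs(feat).max()
    return feat / m if m > 0 else feat  # normalize by max intensity
```

After this step every image, whatever its original resolution, is a 256-dimensional zero-mean vector, so Euclidean distances between images are directly comparable.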

The nearest-neighbor classifier finds the training image feature that is closest in distance to each test image feature. The distances between the test image features and the training image features are computed using the vl_alldist2 function. For each test image feature, the training image feature with the minimum distance is found, and the label of that training image is assigned to the test image.
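A minimal NumPy sketch of this 1-NN step (the pairwise-distance expansion stands in for vl_alldist2; the function name is hypothetical):

```python
import numpy as np

def nearest_neighbor_labels(train_feats, train_labels, test_feats):
    """For each test feature, return the label of the closest training
    feature under Euclidean distance."""
    # Pairwise squared distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = (np.sum(test_feats ** 2, axis=1)[:, None]
          + np.sum(train_feats ** 2, axis=1)[None, :]
          - 2.0 * test_feats @ train_feats.T)
    nearest = d2.argmin(axis=1)      # index of closest training image
    return [train_labels[i] for i in nearest]
```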

Confusion Matrix for Tiny Images and 1-Nearest Neighbor

This technique is quite fast. The accuracy was found to be 0.207, which is within the suggested accuracy range (15-25%).

Bag of Sift words and 1-Nearest Neighbor

A bag of visual words is simply a collection of quantized SIFT features (words) gathered from a set of training images. In order to create the bag of SIFT words, a vocabulary is first built. The SIFT features of the training images are computed using the vl_dsift function, and k-means clustering is performed using the vl_kmeans function with a given vocabulary size, yielding the cluster centroids. These centroids are later used to quantize descriptors and build the bag of words for each image. The vocabulary size was chosen as 400; a size of 200 was also tested. The step size that gave the maximum accuracy was found to be 8; the other step sizes experimented with were 4, 16 and 32.
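The vocabulary-building step can be sketched as plain Lloyd's k-means over the stacked SIFT descriptors (vl_kmeans does this far more efficiently; the function name and iteration count here are illustrative):

```python
import numpy as np

def build_vocabulary(descriptors, vocab_size, iters=20, seed=0):
    """Cluster stacked local descriptors into vocab_size centroids
    with plain Lloyd's k-means; returns the vocabulary (centroids)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen descriptors.
    centroids = descriptors[rng.choice(len(descriptors), vocab_size, replace=False)]
    for _ in range(iters):
        # Assign every descriptor to its nearest centroid.
        d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        # Move each centroid to the mean of its members.
        for k in range(vocab_size):
            members = descriptors[assign == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return centroids
```

In the project, `vocab_size` = 400 and the descriptors are 128-dimensional dense SIFT features.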

The training and test images are read and their SIFT descriptors are calculated. The distance between each descriptor and the vocabulary built earlier is computed; the nearest centroid in the vocabulary determines the visual word of the descriptor, and the count for that word is incremented. The word counts form the feature of the image, and the resulting histogram is normalized. A step size of 10 is used so that SIFT descriptors are sampled densely; a step size of 4 was found to be too slow, and accuracy did not increase significantly compared to 10. The nearest-neighbor classifier is then used to find the type of each test image from the histograms (image features) of the training and test images, as explained earlier.
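The per-image histogram step above reduces to a nearest-centroid assignment plus a normalized count; a NumPy sketch (illustrative, not the project's MATLAB code):

```python
import numpy as np

def bag_of_words_histogram(descriptors, vocab):
    """Assign each local descriptor to its nearest vocabulary centroid
    and return the L1-normalized count histogram (the image feature)."""
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                 # nearest centroid per descriptor
    hist = np.bincount(words, minlength=len(vocab)).astype(np.float64)
    return hist / hist.sum()                  # normalize the counts
```

Normalization matters because a larger image yields more descriptors; without it, image size rather than content would dominate the distances.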

Confusion Matrix for Bags of SIFTs and 1-Nearest Neighbor

This technique is noticeably slower. The accuracy was found to be 0.515, which is within the suggested accuracy range.

Bag of Sift words and Support Vector Machine

The vocabulary built earlier was reused, and the bag of SIFT words was calculated as described above. The support vector machine classifies images based on confidence scores. The score is calculated using the formula W*X + B, where W and B are learned for every image category using the vl_svmtrain function. For each category, all training images of that category are given label 1 and all other images label -1 as input to vl_svmtrain. Once W and B are obtained for every category, each test image feature is substituted for X in the formula above and the confidence score is calculated. For every test image, the category with the highest confidence score is selected as its predicted category. The accuracy of the SVM depends strongly on the chosen lambda value; trying 0.00001, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1, 1 and 10, it was found that 0.001 gave the highest accuracy.
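The one-vs-all scheme above has two small pieces besides the training itself (which vl_svmtrain handles): building the +1/-1 labels per category, and scoring W*X + B for every category to pick the argmax. A hedged NumPy sketch of just those pieces (function names are illustrative):

```python
import numpy as np

def binary_labels(train_labels, category):
    """Labels fed to the one-vs-all SVM trainer for one category:
    +1 for images of that category, -1 for everything else."""
    return np.where(np.array(train_labels) == category, 1.0, -1.0)

def one_vs_all_predict(W, b, X, categories):
    """Score every test feature against every category's linear model
    (the W*X + B in the text) and pick the highest-scoring category.
    W has one row per category, b one bias per category."""
    scores = X @ W.T + b                       # (num_images, num_categories)
    return [categories[i] for i in scores.argmax(axis=1)]
```

Note that the raw scores of independently trained binary SVMs are compared directly; this is the standard one-vs-all heuristic and is what makes the regularization parameter lambda so influential.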

This technique is again not fast. The accuracy was found to be 0.624, which is within the suggested accuracy range.

Scene classification results visualization


Accuracy (mean of diagonal of confusion matrix) is 0.624

(Sample training images and sample true positives omitted; the labels attached to the sample false positives and false negatives are tabulated below.)

Category       Accuracy   False positives (true label)   False negatives (predicted label)
Kitchen        0.500      Office, Store                  Bedroom, Bedroom
Store          0.600      Bedroom, Industrial            LivingRoom, Bedroom
Bedroom        0.560      Office, LivingRoom             Kitchen, Kitchen
LivingRoom     0.250      Bedroom, Industrial            Office, Bedroom
Office         0.830      InsideCity, Store              Bedroom, Kitchen
Industrial     0.390      Mountain, Kitchen              Office, Store
Suburb         0.930      OpenCountry, OpenCountry       Street, TallBuilding
InsideCity     0.350      Store, Street                  Kitchen, Bedroom
TallBuilding   0.750      Industrial, LivingRoom         Bedroom, Coast
Street         0.580      Bedroom, TallBuilding          InsideCity, TallBuilding
Highway        0.760      Street, Street                 OpenCountry, Office
OpenCountry    0.340      Street, Coast                  Suburb, Coast
Coast          0.810      Highway, OpenCountry           Highway, Industrial
Mountain       0.800      Forest, Forest                 Bedroom, Coast
Forest         0.910      Mountain, TallBuilding         Store, Street

Extra Credits

Using Gist Descriptors along with SIFT descriptors

GIST features bypass the processing of individual objects or regions and instead describe the entire scene as a set of global features (512 values). The procedure is based on a very low-dimensional representation of the scene, the Spatial Envelope. I used the LMgist method from the gistdescriptor library, which is attached with the original project code. I chose 8 orientations per scale over 4 scales, so 32 filters were used; a 4x4 block layout gives 16 regions, so in all a 32*16 = 512-element GIST descriptor was obtained. The 512-element GIST descriptor was appended to the SIFT features before classification. This technique was very slow, since the GIST descriptors took time to compute. The accuracy was found to be 0.690 (with SVM), a clear improvement over 0.624 without the GIST features. The performance of the Spatial Envelope model shows that specific information about object shape or identity is not required for scene categorization, and that modeling a holistic representation of the scene informs its probable semantic category.
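The real LMgist code convolves with a bank of Gabor filters; as a rough, assumption-laden NumPy stand-in, the same 8 orientations x 4 scales x 16 regions = 512 layout can be illustrated by pooling gradient-orientation energy over a 4x4 grid at several downsampled scales (this is not the actual GIST computation, only a structural sketch):

```python
import numpy as np

def gist_like_descriptor(img, orientations=8, scales=4, grid=4):
    """Pool gradient-orientation energy over a grid x grid layout at
    several scales: orientations * scales * grid^2 = 512 numbers,
    mirroring the 8 orientations x 4 scales x 16 regions quoted above."""
    feats = []
    im = img.astype(np.float64)
    for _ in range(scales):
        gy, gx = np.gradient(im)
        mag = np.hypot(gx, gy)                       # edge energy
        ang = np.mod(np.arctan2(gy, gx), np.pi)      # orientation in [0, pi)
        bins = np.minimum((ang / np.pi * orientations).astype(int),
                          orientations - 1)
        h, w = im.shape
        for r in range(grid):
            for c in range(grid):
                cell = np.s_[r * h // grid:(r + 1) * h // grid,
                             c * w // grid:(c + 1) * w // grid]
                feats.extend(np.bincount(bins[cell].ravel(),
                                         weights=mag[cell].ravel(),
                                         minlength=orientations))
        im = im[::2, ::2]                            # next (coarser) scale
    return np.asarray(feats)
```

The 512-vector produced this way would then simply be concatenated onto the bag-of-SIFT histogram before classification, as the text describes.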

Using Fisher Kernel along with SIFT descriptors

The Fisher kernel extends the bag of visual words by going beyond count statistics: it also uses means, covariances and priors estimated from the SIFT descriptors alone. The Fisher kernel has shown superior results in some experiments with Flickr and ImageNet, but it is generally not known to yield a large accuracy gain. I wanted to see the results on our set of images.

build_vocab_fisher.m


% Fisher extra attributes: fit a GMM to the stacked dense SIFT descriptors
num_of_clusters = 35;
[Means, Covariances, Priors] = vl_gmm(all_features, num_of_clusters);

get_bags_of_sifts_with_fisher.m


function image_feats = get_bags_of_sifts_with_fisher(image_paths, Means, Covariances, Priors)
    % vl_fisher expects the raw per-image SIFT descriptors (a D x N matrix
    % whose dimension D matches the GMM), not the bag-of-words histograms,
    % so each image is encoded separately.
    for i = 1:length(image_paths)
        [~, sift] = vl_dsift(single(imread(image_paths{i})), 'step', 10, 'fast');
        image_feats(i, :) = vl_fisher(single(sift), Means, Covariances, Priors);
    end
end

(Results for the Fisher kernel could not be reported, as the run had not yet finished.)