Project 4: Scene Recognition with Bag of Words

The goal of this project is to recognize 15 scene categories using two image representations (tiny images and bag of SIFT) and two classification methods (nearest neighbor and linear SVM). The required combinations for this project are, in order:

  1. Tiny images representation and nearest neighbor classifier.
  2. Bag of SIFT representation and nearest neighbor classifier.
  3. Bag of SIFT representation and linear SVM classifier.

In addition, the following items were implemented for extra credit:

  1. Experiment with many different vocabulary sizes and report performance. (up to 3 pts)
  2. Cross-validation to measure performance. (up to 3 pts)
  3. Improved nearest neighbor classifier: Naive-Bayes Nearest Neighbors (NBNN). (up to 5 pts)
  4. Additional GIST descriptor. (up to 5 pts)

1. Tiny images + Nearest neighbor

For the tiny image representation, I resized each image to 16x16 and normalized it to zero mean and unit standard deviation to form the feature. The nearest neighbor classifier compares the feature of each test image with the feature of every image in the training set and uses the label of the nearest one as the prediction.
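The normalization and nearest neighbor steps described above can be illustrated with a minimal sketch. It is written in plain Python rather than the MATLAB used in the project, it assumes the image has already been resized, and the function names and toy 2x2 "images" are hypothetical stand-ins for the 16x16 tiny images. Squared Euclidean distance is assumed as the distance measure.

```python
import math

def tiny_image_feature(pixels):
    """Flatten an already-resized grayscale image (list of rows) into a
    zero-mean, unit-standard-deviation feature vector."""
    flat = [float(v) for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    centered = [v - mean for v in flat]
    std = math.sqrt(sum(v * v for v in centered) / len(centered))
    if std == 0:  # constant image: nothing to scale
        return centered
    return [v / std for v in centered]

def nearest_neighbor_label(query, train_feats, train_labels):
    """Return the label of the training feature closest to `query`
    (squared Euclidean distance, an assumption of this sketch)."""
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(query, train_feats[i]))
    best = min(range(len(train_feats)), key=dist)
    return train_labels[best]

# Toy 2x2 "images" stand in for the 16x16 tiny images.
train = [tiny_image_feature([[0, 0], [10, 10]]),
         tiny_image_feature([[0, 10], [0, 10]])]
labels = ["Coast", "Street"]
query = tiny_image_feature([[1, 2], [9, 12]])
print(nearest_neighbor_label(query, train, labels))
```

Because every feature is normalized, two images that differ only in overall brightness or contrast map to the same feature vector, which is the point of the zero-mean, unit-deviation step.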

This approach achieved 21.3% accuracy.


Tiny images + Nearest neighbor: 21.3% accuracy


2. Bag of SIFT + Nearest neighbor

To implement bag of SIFT, I extracted dense SIFT descriptors from each image with the "vl_dsift" function from VLFeat, sampling at a fixed "STEP SIZE". I then clustered the descriptors into a fixed number ("VOCAB SIZE") of clusters. Using the cluster centers as the visual vocabulary, I built a histogram-based representation of each training image and each testing image.
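As a rough illustration of the histogram step, the sketch below assigns each descriptor to its nearest cluster center and counts the assignments. It is plain Python, not the project's MATLAB code. The 2-D toy descriptors, the three-word vocabulary, and the L1 normalization of the histogram are assumptions made for the example (real SIFT descriptors are 128-dimensional, and the report does not say how the histograms were normalized).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bag_of_words_histogram(descriptors, vocab):
    """Count how many descriptors fall nearest to each cluster center, then
    L1-normalize so images with different descriptor counts are comparable."""
    counts = [0] * len(vocab)
    for d in descriptors:
        nearest = min(range(len(vocab)),
                      key=lambda k: squared_distance(d, vocab[k]))
        counts[nearest] += 1
    total = sum(counts)
    if total == 0:  # image with no descriptors
        return [0.0] * len(vocab)
    return [c / total for c in counts]

# Hypothetical 3-word vocabulary and 4 descriptors from one image.
vocab = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
descriptors = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [4.0, 6.0]]
hist = bag_of_words_histogram(descriptors, vocab)
print(hist)  # [0.25, 0.5, 0.25]
```

The length of the histogram equals the vocabulary size, so with a vocabulary of 400 each image would be described by a 400-dimensional vector regardless of how many descriptors were sampled.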

The following results were obtained after parameter tuning. The parameters I used were: Size of Vocabulary: 400 and Step Size for Histogram Creation: 5.

NOTE: These parameters were chosen using cross-validation over multiple vocabulary sizes, as described in the extra credit section.

This approach achieved 51.1% accuracy.


Bag of SIFT + Nearest neighbor: 51.1% accuracy


3. Bag of SIFT + Linear SVM

To measure the effect of the classifier, I replaced the nearest neighbor classifier with 1-vs-all linear Support Vector Machines (SVMs). In the evaluation phase, a test image is assigned the label of the classifier with the highest response. The parameters I used were: Size of Vocabulary: 400, Step Size for Histogram Creation: 5, and LAMBDA = 0.0000003.
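The 1-vs-all decision rule can be sketched as follows. This is plain Python rather than the project's MATLAB code, and it covers only the evaluation phase: the weights and biases here are made-up numbers standing in for the output of whatever linear SVM trainer is used, and the names `svm_score` and `predict_one_vs_all` are hypothetical.

```python
def svm_score(w, b, x):
    """Response of one linear classifier: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict_one_vs_all(classifiers, x):
    """`classifiers` maps category -> (w, b). Return the category whose
    classifier gives the highest response for feature vector x."""
    return max(classifiers, key=lambda c: svm_score(*classifiers[c], x))

# Made-up weights for three categories over a 2-D feature.
classifiers = {
    "Forest":  ([1.0, -1.0], 0.0),
    "Highway": ([-1.0, 1.0], 0.2),
    "Office":  ([0.1, 0.1], -0.5),
}
print(predict_one_vs_all(classifiers, [0.8, 0.2]))  # Forest
print(predict_one_vs_all(classifiers, [0.1, 0.9]))  # Highway
```

Each of the 15 classifiers is trained to separate one category from all the others, so the responses are compared directly even when every one of them is negative.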

Using SVMs instead of the nearest neighbor classifier increased accuracy from 51.1% to 70.2%.


Bag of SIFT + Linear SVM: 70.2% accuracy


Extra Credit

1. Varying vocabulary sizes (up to 3 pts)

I varied the vocabulary size from 10 to 3000 in 12 steps (at non-equidistant intervals) while keeping all other parameters the same. The plot of accuracy against vocabulary size shows that accuracy is highest at "VOCAB SIZE" = 400.




2. Cross-validation (up to 3 pts)

I implemented cross-validation by splitting the dataset (training and testing) into 15 sets, as suggested in the project document. Each set contains 100 randomly selected images for training and 100 for testing. Because this is a cross-validation run, I did not plot a confusion matrix for this step. After 15 iterations with my best tuned parameters, I got the following results: "MEAN ACCURACY = 43.2%", "STANDARD DEVIATION OF ACCURACY = 3.2%".

Pseudo-code for cross-validation


fprintf('Getting paths and labels for all train and test data\n');
[train_image_paths_whole, test_image_paths_whole, train_labels_whole, test_labels_whole] = ...
    get_image_paths(data_path, categories, num_train_per_cat);
n = 15;                                         % number of sets
num_images = numel(train_image_paths_whole);    % 1500 when num_train_per_cat = 100
neworder = randperm(num_images);                % a random permutation of the data
% NOTE: the reshape assumes that the data divides evenly into the given
% number of sets.
neworder = reshape(neworder, num_images/n, n);  % each column is one 'set'
for i = 1:n
    % pick the images and labels that belong to set i
    train_image_paths = train_image_paths_whole(neworder(:,i));
    test_image_paths  = test_image_paths_whole(neworder(:,i));
    train_labels      = train_labels_whole(neworder(:,i));
    test_labels       = test_labels_whole(neworder(:,i));
    % [INSERT REGULAR SCENE RECOGNITION CODE HERE]
end
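For readers who want to see the split and the summary statistics in one place, here is a minimal sketch in plain Python (not the project's MATLAB). It assumes, as the code above does, that the data divides evenly into the sets. The per-set accuracies in the usage lines are illustrative placeholders, not results from this project, and the sample standard deviation is an assumption because the report does not say which estimator was used.

```python
import random
import statistics

def make_folds(num_items, num_folds, seed=0):
    """Randomly partition indices 0..num_items-1 into equally sized sets."""
    if num_items % num_folds != 0:
        raise ValueError("data must divide evenly into the number of sets")
    order = list(range(num_items))
    random.Random(seed).shuffle(order)  # seeded so the split is repeatable
    size = num_items // num_folds
    return [order[i * size:(i + 1) * size] for i in range(num_folds)]

def summarize(accuracies):
    """Mean and sample standard deviation of the per-set accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

folds = make_folds(1500, 15)
print(len(folds), len(folds[0]))  # 15 100

# Placeholder accuracies; in practice each value would come from running
# the full recognition pipeline on one set.
mean_acc, std_acc = summarize([0.40, 0.45, 0.44])
print(mean_acc, std_acc)
```

Every image index appears in exactly one set, so the 15 accuracy values are computed on disjoint subsets of the data.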


3. Naive-Bayes Nearest Neighbors (NBNN). (up to 5 pts)

Following the paper [1], I used this algorithm:

  1. Compute SIFT descriptors (d1, ..., dn) of the query image Q.
  2. For every descriptor di and every class C, compute the nearest neighbor of di in C.
  3. For every class, sum the distances from each descriptor to its nearest neighbor.
  4. Pick the class with the smallest total distance.
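The four steps can be sketched in a few lines of plain Python (not the project's MATLAB implementation). The sketch assumes that the descriptors of all training images of a class are pooled into one list per class and that squared Euclidean distance is used, as in [1]. The function name `nbnn_classify` and the 2-D toy descriptors are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nbnn_classify(query_descriptors, class_descriptors):
    """`class_descriptors` maps label -> descriptors pooled from all training
    images of that class. Returns (best_label, per-class total distances)."""
    totals = {}
    for label, pool in class_descriptors.items():
        # For each query descriptor, distance to its nearest neighbor in this
        # class, summed over all query descriptors.
        totals[label] = sum(min(squared_distance(d, p) for p in pool)
                            for d in query_descriptors)
    return min(totals, key=totals.get), totals

# Toy example with 2-D descriptors instead of 128-D SIFT.
class_descriptors = {
    "Kitchen": [[0.0, 0.0], [1.0, 0.0]],
    "Forest":  [[5.0, 5.0], [6.0, 5.0]],
}
query = [[0.5, 0.0], [1.0, 1.0]]
label, totals = nbnn_classify(query, class_descriptors)
print(label, totals)
```

Note that the descriptors are compared directly, without quantizing them into a histogram first, which is the main difference from the bag-of-words nearest neighbor classifier.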

The parameters were the same as in case 2 (Size of Vocabulary: 400 and Step Size for Histogram Creation: 5). Compared to the plain nearest neighbor classifier, accuracy increased from 51.1% to 56.1%.


    case 'naive bayes nearest neighbor'
        % implemented in the function file 'naive_bayes_nearest_neighbor.m'
        predicted_categories = naive_bayes_nearest_neighbor(train_image_paths, train_labels, test_image_paths);

Bag of SIFT + NBNN: 56.1% accuracy


4. GIST descriptor. (up to 5 pts)

Given an input image, a GIST descriptor [2] is computed as follows:
  1. Convolve the image with 32 Gabor filters (4 scales x 8 orientations), producing 32 feature maps of the same size as the input image.
  2. Divide each feature map into 16 regions (a 4x4 grid), and average the feature values within each region.
  3. Concatenate the 16 averaged values from all 32 feature maps, giving a 16x32 = 512-dimensional GIST descriptor.

Intuitively, GIST summarizes the gradient information (scales and orientations) for different parts of an image, which provides a rough description (the gist) of the scene.
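The pooling and concatenation steps, and the 16x32 = 512 arithmetic, can be checked with a small sketch in plain Python (not the project's 'get_gist_features.m'). The Gabor filtering itself is not shown: the 32 constant 8x8 maps below are placeholders for real filter responses, and the function names are hypothetical.

```python
def grid_average(feature_map, grid=4):
    """Average a 2-D feature map over a grid x grid layout of regions.
    Assumes the map has at least `grid` rows and `grid` columns."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for gy in range(grid):
        r0, r1 = gy * rows // grid, (gy + 1) * rows // grid
        for gx in range(grid):
            c0, c1 = gx * cols // grid, (gx + 1) * cols // grid
            block = [feature_map[r][c]
                     for r in range(r0, r1) for c in range(c0, c1)]
            pooled.append(sum(block) / len(block))
    return pooled

def gist_from_maps(feature_maps, grid=4):
    """Concatenate the pooled values of every feature map into one vector."""
    descriptor = []
    for fmap in feature_maps:
        descriptor.extend(grid_average(fmap, grid))
    return descriptor

# 32 placeholder 8x8 maps (map k is filled with the value k).
maps = [[[float(k)] * 8 for _ in range(8)] for k in range(32)]
gist = gist_from_maps(maps)
print(len(gist))  # 512
```

Because each map contributes exactly 16 values regardless of image size, the descriptor length depends only on the number of filters and the grid size.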

Keeping the parameters the same (Size of Vocabulary: 400, Step Size for Histogram Creation: 5, and LAMBDA = 0.0000003), I did two things: 1) replaced SIFT with GIST (+SVMs) to compare SIFT with GIST directly, and 2) added GIST to SIFT (+SVMs) to achieve higher accuracy.


        % implemented in the function file 'get_gist_features.m'
        % SIFT + GIST
        train_sift = get_bags_of_sifts(train_image_paths);
        test_sift  = get_bags_of_sifts(test_image_paths);
        train_gist = get_gist_features(train_image_paths);
        test_gist  = get_gist_features(test_image_paths);
        train_image_feats = horzcat(train_sift, train_gist);
        test_image_feats  = horzcat(test_sift, test_gist);

Bag of GIST + Linear SVM: 63.8% accuracy



Bag of (SIFT and GIST) + Linear SVM: 71.0% accuracy


The combination of Bag of (SIFT and GIST) + Linear SVM yields the highest accuracy; its results are shown below.

Scene classification results visualization


Accuracy (mean of diagonal of confusion matrix) is 0.710

Category name   Accuracy   False positives with true label   False negatives with wrong predicted label
Bedroom         0.520      LivingRoom, LivingRoom            Office, LivingRoom
Coast           0.770      OpenCountry, Mountain             OpenCountry, OpenCountry
Forest          0.880      OpenCountry, Mountain             Mountain, Store
Highway         0.870      Street, InsideCity                Coast, Bedroom
Industrial      0.570      Kitchen, Store                    Suburb, OpenCountry
InsideCity      0.750      Street, Industrial                Forest, OpenCountry
Kitchen         0.450      Bedroom, Bedroom                  Store, Bedroom
LivingRoom      0.500      Bedroom, Kitchen                  Bedroom, Store
Mountain        0.780      Forest, OpenCountry               TallBuilding, OpenCountry
Office          0.810      Bedroom, Kitchen                  Kitchen, Bedroom
OpenCountry     0.730      Coast, Coast                      Coast, Forest
Store           0.540      Industrial, InsideCity            Mountain, TallBuilding
Street          0.820      Highway, LivingRoom               Bedroom, InsideCity
Suburb          0.900      LivingRoom, Store                 LivingRoom, Industrial
TallBuilding    0.760      Mountain, Industrial              Kitchen, Store

References

[1] Boiman, Oren, Eli Shechtman, and Michal Irani. "In defense of nearest-neighbor based image classification." IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

[2] Oliva, Aude, and Antonio Torralba. "Modeling the shape of the scene: A holistic representation of the spatial envelope." International Journal of Computer Vision 42(3), pp. 145-175, 2001.