Project 4: Scene Recognition with Bag of Words

For this project, the goal was to perform scene recognition, specifically on the 15-scene database from Lazebnik et al. 2006. There were two feature-generation methods (Tiny Images and Bag of SIFT) and two classifiers to implement (KNN and SVM). I then tested three combinations of feature generation + classifier, in addition to the base case of using none of them. The combinations, ordered from worst to best performing, are shown below.

  1. None: 6.8%
  2. Tiny Images + KNN Classifier: 20.8%
  3. Bag of SIFT + KNN Classifier: 51.1%
  4. Bag of SIFT + linear SVM: 64.5%
Above are the best results in each of the combination categories. Evidently the worst performer was the case with none of the methods implemented, which yielded about 7%. This is essentially random guessing, since there are 15 categories to choose from and 1/15 is about 7%.

Tiny Images + KNN Classifier

Tiny Images is a very simple image representation and serves as a baseline for comparison against the bag of SIFT representation. It is computed by simply resizing each image to a smaller size. In this case, I resize each image to 16 x 16 pixels, flatten it into a 1 x 256 vector, and store it in an N x 256 matrix, where N is the number of images. One reason this representation performs poorly is that it is not robust to spatial shifts; it also discards the high frequencies in the image.
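
A minimal sketch of this step, assuming image_paths is a cell array of grayscale image file paths (the helper name get_tiny_images is illustrative):

function image_feats = get_tiny_images(image_paths)
    num_images = length(image_paths);
    image_feats = zeros(num_images, 256);
    for index = 1:num_images
        img = imread(image_paths{index});
        tiny = imresize(img, [16 16]);                  % downsampling discards high frequencies
        image_feats(index, :) = reshape(tiny, 1, 256);  % flatten to a 1 x 256 vector
    end
end

Zero-centering and unit-normalizing each vector is a common refinement to this baseline.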

KNN looks at the 'K' nearest training neighbors of a test feature and classifies it based on their labels. It is very fast since it requires no training, but it is vulnerable to noise; that can be partially remedied by using larger K values. For that reason, I avoided the 1NN case in order to increase accuracy. As seen below, I tested K values up to 10 and found that 5 performed the best (a sketch of the prediction step follows the table).

K Value    Accuracy (% Correct)
1          19.1
2          19.2
3          19.8
4          20.5
5          20.8
6          20.0
7          19.9
8          19.4
9          19.6
10         19.7
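
A sketch of the KNN prediction step, assuming train_image_feats is N x d, test_image_feats is M x d, and train_labels is a cell array of category names (the variable names are assumptions, not the exact starter code), classifying by majority vote among the k nearest labels:

k = 5;  % best value from the table above
% pairwise distances: rows index training images, columns index test images
D = vl_alldist2(train_image_feats', test_image_feats');
[~, sorted_indices] = sort(D, 1);  % nearest training image first, per test column
num_test = size(test_image_feats, 1);
predicted_categories = cell(num_test, 1);
for j = 1:num_test
    neighbors = train_labels(sorted_indices(1:k, j));  % labels of the k nearest
    [unique_labels, ~, ic] = unique(neighbors);
    [~, winner] = max(accumarray(ic, 1));              % majority vote
    predicted_categories{j} = unique_labels{winner};
end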

Bag of SIFT + KNN Classifier

For this combination, I switched out the Tiny Image representation for a Bag of SIFT representation, which yielded an accuracy of 51.1%. This consists of two parts: building the vocabulary, and computing the bag of SIFT features as a histogram representation. To build the vocabulary, we compute SIFT descriptors for each image and run kmeans on them, with the number of cluster centers being the vocabulary size. The histogram is then built by finding the index of the closest vocabulary word for each descriptor, counting the frequency of each index, and storing the normalized counts as that image's feature vector.
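
A minimal sketch of the vocabulary step, assuming VLFeat's vl_dsift and vl_kmeans (the sampling step of 10 and the helper name build_vocabulary are illustrative, not necessarily the exact values I used):

function vocab = build_vocabulary(image_paths, vocab_size)
    all_features = [];
    for index = 1:length(image_paths)
        img = single(imread(image_paths{index}));
        % coarse sampling keeps the pooled descriptor matrix small enough for kmeans
        [~, SIFT_features] = vl_dsift(img, 'step', 10, 'fast');
        all_features = [all_features, SIFT_features];
    end
    % the kmeans cluster centers become the visual words (128 x vocab_size)
    vocab = vl_kmeans(single(all_features), vocab_size);
end

With the vocabulary in hand, the histogram construction looks like this: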

for index = 1:num_entries
    img = imread(image_paths{index});
    % vl_dsift expects a single-precision grayscale image
    [~, SIFT_features] = vl_dsift(single(img), 'step', 7, 'fast');
    % distance from every descriptor to every vocabulary word
    D = vl_alldist2(single(SIFT_features), vocab);
    [~, I] = min(D, [], 2);  % index of the closest word for each descriptor

    % count the frequency of each index to form the histogram
    current_histogram = zeros(1, vocab_size);
    for i = I'
        current_histogram(1, i) = current_histogram(1, i) + 1;
    end
    % normalize so the histogram is invariant to the number of descriptors
    image_feats(index, :) = current_histogram ./ sum(current_histogram);
end

One implementation choice I made was to pass a 'step' parameter of 7 when calling vl_dsift. Computing a descriptor at every single pixel would have consumed too much memory, causing constant allocation failures, and the coarser sampling also makes the function run faster. The 'fast' parameter speeds it up further, sacrificing some accuracy for a drastically reduced running time.

Bag of SIFT + linear SVM

This combination of features and classifier performed the best, yielding an accuracy of 64.5%. The only difference from the previous combination is that I use a linear SVM classifier instead of KNN. The classifier iterates over every unique category and generates binary labels, setting the indices that match the current category to +1 and everything else to -1. I also fine-tuned the value of lambda to maximize accuracy. I then passed train_image_feats, the binary labels, and lambda into vl_svmtrain and saved all of the weights and offsets. Finally, I calculated the confidence using the equation W*X + B, where '*' is the dot product and W and B are the learned hyperplane parameters, specifically the weights and offsets calculated before. For each test feature, I then take the index of the maximum confidence and return the corresponding label.
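
A sketch of the one-vs-all training and prediction loop, assuming train_labels is a cell array of category names and the feature matrices are N x d (variable names are assumptions; the fixed lambda is the best value from the table that follows):

categories = unique(train_labels);
num_categories = length(categories);
d = size(train_image_feats, 2);
W = zeros(d, num_categories);
B = zeros(1, num_categories);
lambda = 0.000001;

for cat = 1:num_categories
    % +1 for images of this category, -1 for everything else
    binary_labels = -1 * ones(1, size(train_image_feats, 1));
    binary_labels(strcmp(categories{cat}, train_labels)) = 1;
    [W(:, cat), B(cat)] = vl_svmtrain(train_image_feats', binary_labels, lambda);
end

% confidence of every test feature under every hyperplane: W*X + B
confidences = test_image_feats * W + repmat(B, size(test_image_feats, 1), 1);
[~, best] = max(confidences, [], 2);  % most confident category per test image
predicted_categories = categories(best);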

Lambda      Accuracy (% Correct)
0.0001      60.0
0.00001     63.3
0.000001    64.5
As seen above, a lambda value of 0.000001 yielded the best results. One important thing to point out is that while smaller lambda values yielded better accuracy, they were also dramatically slower to train.

Scene classification results visualization

Bag of SIFT + SVM


Accuracy (mean of diagonal of confusion matrix) is 0.645

(Sample training and true-positive images omitted; only the captioned false-positive and false-negative labels are reproduced below.)

Category name   Accuracy   False positives (true label)   False negatives (predicted label)
Kitchen         0.540      InsideCity, InsideCity         InsideCity, TallBuilding
Store           0.490      InsideCity, LivingRoom         Kitchen, Mountain
Bedroom         0.410      LivingRoom, LivingRoom         Industrial, Office
LivingRoom      0.190      Bedroom, Bedroom               Kitchen, Industrial
Office          0.850      Kitchen, Bedroom               Bedroom, LivingRoom
Industrial      0.470      Kitchen, OpenCountry           OpenCountry, Street
Suburb          0.920      Street, Industrial             Highway, Industrial
InsideCity      0.560      Store, Highway                 TallBuilding, Kitchen
TallBuilding    0.790      LivingRoom, Kitchen            Forest, Coast
Street          0.620      Forest, InsideCity             TallBuilding, Bedroom
Highway         0.810      Street, Street                 Coast, Forest
OpenCountry     0.500      Industrial, Highway            Highway, Suburb
Coast           0.790      OpenCountry, TallBuilding      OpenCountry, Highway
Mountain        0.830      OpenCountry, OpenCountry       LivingRoom, OpenCountry
Forest          0.910      OpenCountry, OpenCountry       Mountain, Suburb