Project 4 / Scene Recognition with Bag of Words

The purpose of this project was to build a classifier that recognizes the scene category of an image. We implemented two methods of extracting features and two classifiers, described below.

Features

Tiny Images

Much like the name suggests, this method represents each image with a much smaller version of itself, say 16x16 pixels. There are two ways to approach this: crop a small patch of the image and use it as the representation, or scale the whole image down to a small size. Because I believed the overall structure of a scene would survive downscaling better than any single patch, I opted to resize the images rather than crop them. To make it easier to store multiple images in the same matrix, I reshaped each 16x16 image into a 1x256 row vector.
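As a rough illustration, a minimal sketch of this step in MATLAB is below; the variable names (image_paths, image_feats) and the zero-mean, unit-length normalization are assumptions, not necessarily the exact code I used.

    % Minimal sketch (assumed names): build an N x 256 matrix of tiny images.
    N = length(image_paths);                    % image_paths: N x 1 cell array of file names
    image_feats = zeros(N, 16 * 16);
    for i = 1:N
        img = im2single(imread(image_paths{i}));
        tiny = imresize(img, [16 16]);          % scale the whole image down, do not crop
        feat = reshape(tiny, 1, []);            % flatten 16x16 to 1x256
        feat = feat - mean(feat);               % optional: zero mean (assumed, not required)
        feat = feat / (norm(feat) + eps);       % optional: unit length (assumed, not required)
        image_feats(i, :) = feat;
    end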

Bag of SIFT

This method involves extracting SIFT features from each image with vl_dsift, assigning each SIFT feature to its nearest vocabulary word, and constructing a histogram of how often each word is used. The vocabulary is a clustering of SIFT features extracted from the training images. There are quite a few parameters that can be tweaked in this step to achieve different accuracies, namely the step size and bin size used while extracting the SIFT features and the size of the vocabulary. I began with a vocab size of 250, a step size of 22, and a bin size of 4 to prove that my code worked, and reduced the step size in increments of 4 until the accuracy of my outputs and the running time of my code were both within the requirements. Below is the code for how I extract the features and create the histogram for each image.


    % Densely sample SIFT descriptors from the image (128 x N matrix).
    [locations, SIFT_features] = vl_dsift(single(img), 'step', 8, 'size', 4, 'fast');
    % Distance from every descriptor to every vocabulary word.
    distances = vl_alldist2(double(SIFT_features), double(vocab));
    % Index of the nearest vocabulary word for each descriptor.
    [val, indices] = min(distances, [], 2);
    % Build a histogram of visual word counts.
    histogram = zeros(1, vocab_size);
    for j = 1:length(indices)
       histogram(indices(j)) = histogram(indices(j)) + 1;
    end
    % Normalize so that image size does not affect the histogram.
    histogram = histogram / norm(histogram);
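
The vocabulary used above comes from clustering SIFT descriptors sampled from the training images. A hedged sketch of that step is below; the variable names (train_image_paths, vocab_size) and the sampling step of 22 are assumptions, not necessarily the exact code behind my results.

    % Hedged sketch of vocabulary construction (assumed names and parameters).
    all_features = [];
    for i = 1:length(train_image_paths)
        img = single(imread(train_image_paths{i}));
        % Sample descriptors coarsely so clustering stays tractable.
        [~, feats] = vl_dsift(img, 'step', 22, 'size', 4, 'fast');
        all_features = [all_features, feats];
    end
    % Cluster into vocab_size visual words. vl_kmeans returns the centers as
    % a 128 x vocab_size matrix, matching the orientation used with vl_alldist2 above.
    vocab = vl_kmeans(single(all_features), vocab_size);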

Through this tuning process I ultimately settled on a step size of 6 for feature extraction. Below are some of the confusion matrices produced along the way, ordered from worst to best accuracy.

[Confusion matrices from the parameter sweep, worst to best accuracy]

Classifiers

1-Nearest Neighbor

This classifier is aptly named: for each test image it simply finds the single training image whose features are most similar and assigns that training image's category. The features being compared are either the tiny images or the bag-of-SIFT histograms described above.
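A hedged sketch of the 1-NN prediction step is below; it assumes the feature matrices are N x D with one row per image and that train_labels is a cell array of category names, which may differ from my actual implementation in minor details.

    % Hedged sketch of 1-nearest-neighbor classification (assumed names).
    % vl_alldist2 expects one feature per column, hence the transposes.
    dists = vl_alldist2(test_image_feats', train_image_feats');
    % For each test image, find the single closest training image...
    [~, nearest] = min(dists, [], 2);
    % ...and predict that training image's category.
    predicted_categories = train_labels(nearest);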

SVM classifier

This classifier is structurally a lot more complicated than nearest neighbor. First, a one-vs-all SVM has to be trained for each of the 15 categories, and the hyperplane parameters, W and B, have to be stored for each one. There is a regularization parameter, lambda, that can be tweaked here to improve accuracy. I began with .001 and lowered it in steps of .0002 until reaching a value of .0005, which I found to give the best accuracy. Below is the code for training one of these SVMs.


    % +1 for training images of category i, -1 for everything else.
    bin_labels = double(strcmp(categories{i}, train_labels));
    bin_labels(bin_labels == 0) = -1;
    % vl_svmtrain expects features in columns, hence the transpose.
    [W, B] = vl_svmtrain(train_image_feats', bin_labels', lambda);
    % Store the hyperplane parameters for this category.
    W_matrices(i, :) = W;
    B_vals(i) = B;

After training, I iterate over all of the test image features and compute a confidence for each category using the hyperplane parameters found above. The category with the highest confidence value is the one predicted for that test feature. The code for this loop is below.


    % Confidence of each category's SVM for test image 'feat': W . x + B.
    confidences = zeros(1, num_categories);
    for j = 1:num_categories
        confidences(j) = dot(W_matrices(j, :), test_image_feats(feat, :)) + B_vals(j);
    end
    % Predict the category whose SVM is most confident.
    [~, ind] = max(confidences);
    predicted_categories(feat) = categories(ind);

Results for Different Feature and Classifier Pairs

The parameters for these results are the same as discussed above.

SVM & Tiny Images: accuracy 0.144

NN & Tiny Images: accuracy 0.201

SVM & Bag of SIFT: accuracy 0.652

NN & Bag of SIFT: accuracy 0.501

Results visualization for a decently performing recognition pipeline (NN & Bag of SIFT).


Accuracy (mean of diagonal of confusion matrix) is 0.501

(Sample training images and sample true positive images from the original table are not reproduced; the last two columns list the caption labels of the false positive and false negative examples.)

Category name | Accuracy | False positives (true label) | False negatives (predicted label)
Kitchen | 0.410 | InsideCity, TallBuilding | Store, Office
Store | 0.460 | Bedroom, LivingRoom | Kitchen, Suburb
Bedroom | 0.160 | Kitchen, OpenCountry | Industrial, LivingRoom
LivingRoom | 0.300 | Bedroom, Office | Bedroom, Store
Office | 0.720 | LivingRoom, Kitchen | InsideCity, Kitchen
Industrial | 0.310 | Store, TallBuilding | Store, Kitchen
Suburb | 0.820 | Office, LivingRoom | Office, Mountain
InsideCity | 0.340 | TallBuilding, TallBuilding | Coast, TallBuilding
TallBuilding | 0.350 | Industrial, Kitchen | Store, Bedroom
Street | 0.520 | Store, TallBuilding | Coast, Store
Highway | 0.580 | OpenCountry, Bedroom | Industrial, Coast
OpenCountry | 0.350 | Mountain, Highway | Mountain, Forest
Coast | 0.530 | Industrial, Highway | Suburb, Office
Mountain | 0.600 | Suburb, OpenCountry | Industrial, Forest
Forest | 0.940 | Store, Mountain | OpenCountry, Mountain