Project 4 / Scene Recognition with Bag of Words

In this project, I implement scene recognition algorithms using different feature representation, feature quantization, and classification methods, and evaluate the performance and accuracy of different combinations of these methods. The basic combinations are listed below; the extra credit features are discussed later.

  1. Tiny images representation and nearest neighbor classifier.
  2. Bag of SIFT representation and nearest neighbor classifier.
  3. Bag of SIFT representation and 1-vs-all linear SVM classifier.

Tiny images representation and nearest neighbor classifier

This is the simplest combination of methods. I get 20% accuracy with this setting.

Tiny images representation

The tiny images representation simply resizes an image to a 16-by-16 array and uses it as a 16 * 16 = 256 dimensional feature. The feature is also normalized for better performance.
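
Below is a minimal sketch of this step, assuming MATLAB's imresize and zero-mean, unit-length normalization (image_paths and image_feats are placeholder names):

% Tiny image features: resize, flatten, then normalize.
for i = 1 : size(image_paths, 1)
    img = imread(image_paths{i});
    tiny = imresize(single(img), [16 16]);  % shrink to 16-by-16
    feat = tiny(:)';                        % flatten to a 1-by-256 row vector
    feat = feat - mean(feat);               % zero mean
    feat = feat / norm(feat);               % unit length
    image_feats(i, :) = feat;
end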

Nearest neighbor classifier

The nearest neighbor classifier simply finds the training example whose feature has the smallest distance to that of the test image. In this project, I use a 1-nearest-neighbor classifier: the label of the nearest training example becomes the predicted class of the test image.
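
A minimal 1-NN sketch using VLFeat's vl_alldist2, with one feature per row as elsewhere in the project (variable names are placeholders):

% D(m, n): distance between training feature m and test feature n
D = vl_alldist2(train_image_feats', test_image_feats');
[~, nn_idx] = min(D, [], 1);                 % nearest training example per test image
predicted_categories = train_labels(nn_idx); % adopt its label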

Bag of SIFT representation and nearest neighbor classifier

By using a better feature representation, I can improve the accuracy to around 50%.

Bag of SIFT representation

The bag of SIFT representation requires building a visual word vocabulary and then extracting bag of SIFT features from the training and test images. When building the vocabulary, I sample the SIFT feature descriptors with a step of 20. The step can be relatively large at this stage so as to improve speed: with a smaller step such as 15, I can get around 53% accuracy but suffer a large increase in processing time. When extracting the bag of SIFT features, I use a smaller step of 10 to sample the SIFT features in an image, so that more SIFT features contribute to the visual word histogram. For each SIFT feature, I simply add 1 to the histogram bin of the visual word (cluster centroid) closest to it. Finally, the histogram bins are normalized to give the bag of SIFT feature.
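
A minimal sketch of the vocabulary step, assuming VLFeat's vl_dsift and vl_kmeans (image_paths and vocab_size are placeholder names); the feature extraction step follows the same sampling pattern as the soft assignment code shown later:

% Pool densely sampled SIFT descriptors from all training images,
% then cluster them into vocab_size visual words.
all_sift = [];
for i = 1 : size(image_paths, 1)
    img = single(imread(image_paths{i}));
    [~, SIFT_features] = vl_dsift(img, 'fast', 'step', 20);
    all_sift = [all_sift, SIFT_features];    % descriptors are columns
end
centers = vl_kmeans(single(all_sift), vocab_size);
vocab = centers';                            % one visual word per row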

Bag of SIFT representation and 1-vs-all linear SVM classifier

By using SVM classifiers with the bag of SIFT representation, I get around 62% accuracy.

1-vs-all linear SVM classifier

I use N binary SVM classifiers, where N is the number of categories. Each classifier is trained on the bag of SIFT features of the training images. It uses an important regularization parameter, LAMBDA. By experimenting, I find that 0.0003 achieves the best accuracy: smaller values still achieve 61% - 62%, but a LAMBDA larger than 0.0005 drops the accuracy significantly. For each test image, its bag of SIFT features go through the N SVM classifiers, and the classifier with the largest output gives the predicted category, as sketched below.
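
A minimal sketch of the 1-vs-all setup, assuming VLFeat's vl_svmtrain (categories, train_labels, and the other names are placeholders):

% Train one binary SVM per category; vl_svmtrain expects one feature per
% column and labels in {-1, +1}.
for i = 1 : num_categories
    binary_labels = 2 * double(strcmp(train_labels, categories{i})) - 1;
    [W(:, i), B(i)] = vl_svmtrain(train_image_feats', binary_labels, LAMBDA);
end
% Classify each test image with the most confident classifier.
scores = W' * test_image_feats' + repmat(B', 1, size(test_image_feats, 1));
[~, best] = max(scores, [], 1);
predicted_categories = categories(best);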

Extra credit

Soft assignment for bag of SIFT features

Instead of making each SIFT feature vote for the single nearest cluster centroid, I implemented soft assignment, which adds to the histogram bins values weighted by the distance between the SIFT feature and the cluster centroids. The weight is computed as exp(-d^2 / (2 * sigma^2)), where d is the distance between the feature and a centroid; it is essentially a Gaussian centered at zero distance, so the farther the centroid, the smaller the weight. Sigma is the most important parameter, as it controls the width of the Gaussian. Since the distances are of magnitude 10^5, sigma should be of magnitude 10^5 as well; by experiment, 200000 seems to be the best value.

However, the accuracy actually drops to around 45%, compared to 62% with simple assignment. The reason may be that adding weight to all the centroids introduces too much noise. Therefore, I modified the method to add weight only to the N closest centroids, where N is an adjustable parameter; by experiment, N = 3 works best. With this improvement, I get an accuracy of 64%, slightly better than simple assignment.

Code example

Code that computes the weights and does the soft assignment:


% For each image i:
    % densely sample SIFT descriptors with a step of 10
    [locations, SIFT_features] = vl_dsift(single(img), 'fast', 'step', 10);
    % D(c, j): distance from visual word c to SIFT descriptor j
    D = vl_alldist2(vocab', single(SIFT_features));
    for j = 1 : size(D, 2)
        [~, idx] = sort(D(:, j));
        idx = idx(1 : 3);           % keep the N = 3 closest centroids
        % Gaussian weights exp(-d^2 / (2 * sigma^2)) for those 3 bins
        e = exp(-((D(idx, j)' .^ 2) ./ (2 * (SIGMA ^ 2))));
        image_feats(i, idx) = image_feats(i, idx) + e;
    end

SVM classifiers with RBF kernel

Since a linear classifier has limitations in a high dimensional feature space, more complicated kernels, such as Gaussian-like (RBF) kernels, may do a better job of separating the points. Instead of using a linear SVM classifier, I compute an RBF kernel for the SVM classifiers, using Olivier Chapelle's MATLAB code: http://olivier.chapelle.cc/primal/. The kernel has a parameter sigma that defines its Gaussian shape; I set it to 50 based on experiments.

Unfortunately, my RBF kernel actually decreases the accuracy to 50%.

Code example

Code that computes the RBF kernel and trains the SVM classifiers:


% calculate the rbf kernels (X and K are the globals read by Chapelle's code)
global X K;
hp.type = 'rbf';
hp.sig = 50;
X = test_image_feats;
test_K = compute_kernel(1:N, 1:N, hp);   % kernel over the test features
X = train_image_feats;
K = compute_kernel(1:N, 1:N, hp);        % training kernel used by primal_svm
opt.cg = 1;                              % train with conjugate gradient
for i = 1 : num_categories               % one binary SVM per category
    % +1/-1 labels marking the training images of category i
    labels = 2 * double(strcmp(train_labels, categories{i})) - 1;
    [beta, b] = primal_svm(0, labels, LAMBDA, opt);  % 0 = nonlinear: use K
    predicted_labels(i, :) = test_K * beta + b;
end

Scene classification results visualization


Accuracy (mean of diagonal of confusion matrix) is 0.637

Per-category accuracy, with the true labels of the sampled false positives and the predicted labels of the sampled false negatives (the sample training images and true positives were shown as thumbnails):

Category      Accuracy   False positives (true label)   False negatives (predicted as)
Kitchen       0.680      Store, LivingRoom              InsideCity, LivingRoom
Store         0.450      Bedroom, InsideCity            Street, Forest
Bedroom       0.390      Office, Kitchen                TallBuilding, LivingRoom
LivingRoom    0.280      Kitchen, Street                Bedroom, TallBuilding
Office        0.890      LivingRoom, Kitchen            Kitchen, Suburb
Industrial    0.410      Highway, Kitchen               Highway, Mountain
Suburb        0.930      Highway, Forest                Street, Coast
InsideCity    0.460      Store, Store                   Highway, Industrial
TallBuilding  0.810      LivingRoom, InsideCity         Industrial, Mountain
Street        0.610      OpenCountry, TallBuilding      InsideCity, Store
Highway       0.770      Store, OpenCountry             Coast, Coast
OpenCountry   0.330      TallBuilding, Industrial       Suburb, Coast
Coast         0.810      Suburb, Highway                OpenCountry, Highway
Mountain      0.820      Industrial, Highway            LivingRoom, Suburb
Forest        0.920      Store, OpenCountry             Mountain, Mountain