Project 4 / Scene Recognition with Bag of Words

This project evaluates different methods and pipelines for doing scene recognition using a classic bag of words approach.

Tiny Images and Nearest Neighbor

This is the simplest approach. The features are input images scaled down to 16x16 pixel and then zero meaned and normalized to a unit vector.


image = imresize(image, [image_size image_size]);
image = image(:);

% zero mean and unit length
image = double(image);
image = image / sum(image);
image = image - mean(image);

image_feats(i,:) = image;

By calculating the K-Nearest Neighbors for a test image to the training images and using the most common label of the neighbors we can predict the label of the test image. Using this approach with K=1 I got 0.199 accuracy.


K = 1;
[D, I] = pdist2(train_image_feats, test_image_feats, 'euclidean','Smallest',K);
if K > 3
    I = mode(I);
end
predicted_categories = train_labels(I);

Bag of SIFT and Nearest Neighbor

By extracting features using SIFT rather than just miniturizations the accuracy will improve further. This is done in two steps.

First we build a vocabulary of less dense SIFT features from all our training images. These are then clustered into K clusters.


FEATURE_STEP_SIZE=5;
FEATURE_SIZE=3;

all_features = [];
for i=1:size(image_paths)
    path = image_paths{i};
    image = single(imread(path));
    [positions, features] = vl_dsift(image, 'fast', 'Step', FEATURE_STEP_SIZE, 'Size', FEATURE_SIZE);
    % features is d x N
    all_features = [all_features, features];
end

[centers, assignments] = vl_kmeans(single(all_features), vocab_size);
vocab = centers';

K here is the given parameter vocab_size. I use a FEATURE_STEP_SIZE=5 and FEATURE_SIZE=3 this gives a balance between speed and accuracy.

The next step is to extract dense featurs from the test image and their nearest neighbors in amongst the training image vocabulary.


path = image_paths{i};
image = single(imread(path));
[positions, features] = vl_dsift(image, 'Fast', 'Step', FEATURE_STEP_SIZE, 'Size', FEATURE_SIZE);
features = double(features);

[D, I] = pdist2(vocab, features', 'euclidean','Smallest', 1);
% [I, D] = vl_kdtreequery(forest, vocab', features);

% create histogram
hist = zeros([vocab_size 1]);
for j=1:size(I, 2)
		cluster = I(j);
		hist(cluster) = hist(cluster) + 1;
end
hist = hist / sum(hist);
image_feats = [image_feats, hist];

Experimentally I found that FEATURE_STEP_SIZE=5 and FEATURE_SIZE=3 gave a good balance between speed and accuracy. This yields an accuracy of 0.509.

Bag of SIFT and Linear SVM

This is the final improvement using 1 vs all linear SVM classifiers. That is, for each of the 15 categories a SVM is trained to classify between "category X" and "not category X". Then for each test image all SVM give their prediction and the most certain result is used.


% build classifiers
W = [];
B = [];
for i=1:num_categories
    matching_indices = strcmp(categories(i), train_labels);
    labels = ones([N 1]) * -1;
    labels(matching_indices) = 1;
    labels = double(labels);
    [w b] = vl_svmtrain(train_image_feats', labels, LAMBDA);
    W = [W, w];
    B = [B, b];
end

% classify data
M = size(test_image_feats, 1);
predicted_categories = cell(M, 1);
for i=1:M
    Y = W' * test_image_feats(i,:)' + B';
    [M, label_id] = max(Y);
    predicted_categories{i} = categories{label_id};
end
predicted_categories

LAMBDA = 0.000001 was experimentally showing to give the best result together with FEATURE_STEP_SIZE=2 and FEATURE_SIZE=3 when extracting the dense SIFT features from the test images. The resulting accuracy was 0.709 and the result is vizualized below. This however is a bit slow, and by setting FEATURE_STEP_SIZE=5 we decrease the time drastically but still get an acceptable accuracy of 0.68.

Scene classification results visualization

Accuracy (mean of diagonal of confusion matrix) is 0.705

Category name Accuracy Sample training images Sample true positives False positives with true label False negatives with wrong predicted label

Kitchen 0.670
Store
Bedroom
LivingRoom
Bedroom

Store 0.590
Street
Kitchen
Kitchen
Mountain

Bedroom 0.500
Kitchen
Kitchen
Office
Office

LivingRoom 0.310
Store
Kitchen
Kitchen
Kitchen

Office 0.910
LivingRoom
LivingRoom
Kitchen
Kitchen

Industrial 0.560
LivingRoom
Bedroom
Kitchen
OpenCountry

Suburb 0.980
Mountain
Highway
InsideCity
Highway

InsideCity 0.650
Street
Coast
Store
TallBuilding

TallBuilding 0.820
Industrial
Industrial
Mountain
Industrial

Street 0.690
TallBuilding
Store
InsideCity
InsideCity

Highway 0.800
Industrial
Industrial
Mountain
Store

OpenCountry 0.520
Coast
Coast
Forest
Coast

Coast 0.810
OpenCountry
OpenCountry
Highway
OpenCountry

Mountain 0.820
Highway
Highway
Forest
OpenCountry

Forest 0.950
Store
Store
Store
Mountain

Category name Accuracy Sample training images Sample true positives False positives with true label False negatives with wrong predicted label

Category name	Accuracy	Sample training images	Sample true positives	False positives with true label		False negatives with wrong predicted label
Kitchen	0.670			Store	Bedroom	LivingRoom	Bedroom
Store	0.590			Street	Kitchen	Kitchen	Mountain
Bedroom	0.500			Kitchen	Kitchen	Office	Office
LivingRoom	0.310			Store	Kitchen	Kitchen	Kitchen
Office	0.910			LivingRoom	LivingRoom	Kitchen	Kitchen
Industrial	0.560			LivingRoom	Bedroom	Kitchen	OpenCountry
Suburb	0.980			Mountain	Highway	InsideCity	Highway
InsideCity	0.650			Street	Coast	Store	TallBuilding
TallBuilding	0.820			Industrial	Industrial	Mountain	Industrial
Street	0.690			TallBuilding	Store	InsideCity	InsideCity
Highway	0.800			Industrial	Industrial	Mountain	Store
OpenCountry	0.520			Coast	Coast	Forest	Coast
Coast	0.810			OpenCountry	OpenCountry	Highway	OpenCountry
Mountain	0.820			Highway	Highway	Forest	OpenCountry
Forest	0.950			Store	Store	Store	Mountain
Category name	Accuracy	Sample training images	Sample true positives	False positives with true label		False negatives with wrong predicted label

Erik Gartner

Project 4 / Scene Recognition with Bag of Words

Tiny Images and Nearest Neighbor

Bag of SIFT and Nearest Neighbor

Bag of SIFT and Linear SVM

Scene classification results visualization