Project 4 / Scene Recognition with Bag of Words

The aim of this project was to approach scene recognition using different image representations and different classifiers, such as the nearest neighbor classifier and linear SVMs.

Tiny Images Representation and Nearest Neighbor Classifier

Tiny images representation - Each image was resized to 16x16 and then shifted to have zero mean. This representation serves as a baseline.
Nearest neighbor classifier - For each test image, the classifier computes the distance from its feature to every training image feature, finds the index of the closest one, and assigns that training image's label to the test image. This requires no training phase and is straightforward since only the single nearest neighbor (1-NN) is used. This method yielded about 20% accuracy in scene recognition.
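The two steps above can be sketched in Python (the project code itself is MATLAB); this is a minimal NumPy illustration assuming grayscale inputs whose sides are multiples of 16, not the project's exact implementation:

```python
import numpy as np

def tiny_image_features(images):
    """Tiny-image representation: downsample each image to 16x16 by block
    averaging, flatten to a 256-vector, and shift to zero mean."""
    feats = []
    for img in images:
        h, w = img.shape
        # Crop to multiples of 16, then average over a 16x16 grid of blocks
        # (a real pipeline would interpolate instead of cropping)
        bh, bw = h // 16, w // 16
        tiny = img[:16 * bh, :16 * bw].reshape(16, bh, 16, bw).mean(axis=(1, 3))
        vec = tiny.ravel().astype(float)
        feats.append(vec - vec.mean())  # zero mean, as in the report
    return np.array(feats)

def nearest_neighbor_predict(train_feats, train_labels, test_feats):
    """1-NN: each test feature gets the label of its closest training feature."""
    # Pairwise Euclidean distances, shape (num_test, num_train)
    dists = np.linalg.norm(test_feats[:, None, :] - train_feats[None, :, :], axis=2)
    return [train_labels[i] for i in dists.argmin(axis=1)]
```

Computing the full pairwise distance matrix is fine at this dataset's scale; larger feature sets would warrant a KD-tree, as used in the later sections.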

Scene classification results visualization for Tiny Images and Nearest Neighbor:


Accuracy (mean of diagonal of confusion matrix) is 0.201

[Sample training images and true positive images are omitted here; each row lists the true labels of two sample false positives and the predicted labels of two sample false negatives.]

Category name | Accuracy | False positives (true label) | False negatives (predicted label)
Kitchen | 0.110 | Bedroom, Mountain | InsideCity, TallBuilding
Store | 0.010 | Office, Kitchen | Coast, Bedroom
Bedroom | 0.090 | Mountain, Kitchen | Coast, Coast
LivingRoom | 0.080 | Kitchen, Bedroom | TallBuilding, InsideCity
Office | 0.200 | Kitchen, Bedroom | Coast, Forest
Industrial | 0.090 | Bedroom, TallBuilding | OpenCountry, Mountain
Suburb | 0.210 | Store, Mountain | Highway, Coast
InsideCity | 0.090 | OpenCountry, Kitchen | Coast, Coast
TallBuilding | 0.130 | Bedroom, Store | Coast, LivingRoom
Street | 0.380 | Suburb, Kitchen | Coast, TallBuilding
Highway | 0.660 | Mountain, Store | OpenCountry, Coast
OpenCountry | 0.330 | Coast, Forest | Highway, Coast
Coast | 0.310 | Forest, LivingRoom | Forest, Forest
Mountain | 0.120 | Store, Office | Coast, Coast
Forest | 0.210 | Kitchen, Store | Coast, TallBuilding

Bag of SIFT representation and Nearest Neighbor Classifier

This implementation improved on the baseline by using a bag of SIFT representation of images. The first step is to build a vocabulary of visual words: local SIFT features are sampled from the training set and clustered with the k-means algorithm, and the resulting cluster centroids become the visual words of the vocabulary. The number of clusters used was the specified vocabulary size.
We then represent each image as a 'bag of SIFT' by densely sampling SIFT descriptors (a small step size during sampling provides the density), finding the vocabulary word each descriptor most closely corresponds to, and building a histogram of vocabulary matches for the image. The histogram is normalized, and this ultimately gives us our feature representation.
To speed up computation in this section, MATLAB's knnsearch was used, which internally computes nearest neighbors via KD-tree queries.

This representation was a vast improvement on the baseline, with accuracy increasing to 53.5%.
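The vocabulary-building step described above can be sketched as a plain Lloyd's k-means in Python (illustrative only; the project itself is in MATLAB, and the exact k-means implementation used is not shown in this report):

```python
import numpy as np

def build_vocabulary(descriptors, vocab_size, n_iter=20, rng=None):
    """Cluster sampled descriptors with Lloyd's k-means; the centroids become
    the visual-word vocabulary (vocab_size x 128 for SIFT descriptors)."""
    rng = np.random.default_rng(rng)
    # Initialize centroids from distinct randomly chosen descriptors
    centroids = descriptors[rng.choice(len(descriptors), vocab_size, replace=False)]
    for _ in range(n_iter):
        # Assign each descriptor to its nearest centroid
        dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned descriptors
        for k in range(vocab_size):
            members = descriptors[assign == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return centroids
```

Each iteration alternates assignment and centroid updates; the cost of the assignment step is what makes larger vocabularies slower to build, as noted in the vocabulary-size experiment at the end.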

Code snippet to calculate the bag of SIFT representation:


for i = 1:numImages
    img = imread(image_paths{i});
    img_hist = zeros(1, vocab_size);
    % Densely sample SIFT descriptors; the step size of 6 controls the density
    [~, SIFT_features] = vl_dsift(single(img), 'step', 6);
    % Nearest vocabulary word (KD-tree query) for each descriptor;
    % cast the uint8 descriptors to double for knnsearch
    nearest = knnsearch(vocab', double(SIFT_features'));
    for j = 1:size(nearest, 1)
        vocabMatch = nearest(j); % index of the nearest vocab word for this descriptor
        img_hist(vocabMatch) = img_hist(vocabMatch) + 1;
    end
    % L2-normalize the histogram to form the final feature
    image_feats(i, :) = img_hist / norm(img_hist);
end
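The knnsearch speedup amounts to querying a KD-tree built over the vocabulary centroids. A Python/SciPy sketch of the same histogram computation (hypothetical function name; assumes each row of vocab is a centroid):

```python
import numpy as np
from scipy.spatial import cKDTree

def bag_of_words_histogram(descriptors, vocab):
    """Assign each descriptor to its nearest visual word via a KD-tree
    (what MATLAB's knnsearch does internally) and return an L2-normalized
    histogram of word counts."""
    tree = cKDTree(vocab)                # build once per vocabulary
    _, word_idx = tree.query(descriptors)
    hist = np.bincount(word_idx, minlength=len(vocab)).astype(float)
    return hist / np.linalg.norm(hist)
```

Building the tree once and querying all descriptors avoids the brute-force comparison of every descriptor against every vocabulary word.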

Scene classification results visualization for Bag of SIFT representation and Nearest Neighbor Classifier


Accuracy (mean of diagonal of confusion matrix) is 0.535

Category name | Accuracy | False positives (true label) | False negatives (predicted label)
Kitchen | 0.350 | Industrial, Bedroom | InsideCity, Office
Store | 0.410 | LivingRoom, InsideCity | Mountain, Forest
Bedroom | 0.320 | Office, TallBuilding | Street, Suburb
LivingRoom | 0.380 | Bedroom, Bedroom | Bedroom, Kitchen
Office | 0.750 | LivingRoom, LivingRoom | Bedroom, Bedroom
Industrial | 0.360 | TallBuilding, TallBuilding | Office, InsideCity
Suburb | 0.850 | TallBuilding, Industrial | Coast, Highway
InsideCity | 0.380 | TallBuilding, Suburb | TallBuilding, Store
TallBuilding | 0.400 | Mountain, Industrial | InsideCity, Industrial
Street | 0.570 | TallBuilding, Store | Bedroom, TallBuilding
Highway | 0.750 | Coast, Street | Coast, OpenCountry
OpenCountry | 0.420 | Coast, Mountain | TallBuilding, Store
Coast | 0.560 | OpenCountry, TallBuilding | Highway, OpenCountry
Mountain | 0.600 | OpenCountry, Industrial | Bedroom, OpenCountry
Forest | 0.920 | OpenCountry, Mountain | Suburb, Suburb

Bag of SIFT representation and Linear SVM Classifier

This part involved training a 1-vs-all linear SVM for each of the 15 scene categories on the bag of SIFT features. This classifier has an edge over the nearest neighbor in that it is less influenced by frequently occurring visual words and can learn to correctly downweight 'irrelevant' feature dimensions. Fifteen binary 1-vs-all SVMs were trained, and each test image was run through all of them; the SVM with the maximum confidence provided the label assigned to that test image. Accuracy improved once more, reaching 66.5%. The entire pipeline took about 9 minutes to run, using the feature approximation in vl_dsift() (its 'fast' parameter) while building the vocabulary. Different values of the lambda regularization parameter were tried while training the SVM; 0.0001 gave good results.

Code snippet to implement SVM classification:


categories = unique(train_labels);
num_categories = length(categories);
confidences = zeros(num_categories, size(test_image_feats, 1));
for i = 1:num_categories
    % Binary labels for this category: +1 for its images, -1 for all others
    binary_labels = double(strcmp(categories(i), train_labels));
    binary_labels(binary_labels == 0) = -1;
    % Train a binary linear SVM with lambda = 0.0001
    [W, B] = vl_svmtrain(train_image_feats', binary_labels, 0.0001);
    % Signed distance to the hyperplane = confidence for this category
    confidences(i, :) = W' * test_image_feats' + B;
end
% Assign each test image the category whose SVM is most confident
[~, indices] = max(confidences);
predicted_categories = categories(indices);

Scene classification results visualization with Bag of SIFT representation and Linear SVM classifier


Accuracy (mean of diagonal of confusion matrix) is 0.665

Category name | Accuracy | False positives (true label) | False negatives (predicted label)
Kitchen | 0.660 | Bedroom, Office | Mountain, TallBuilding
Store | 0.440 | Kitchen, Industrial | InsideCity, Mountain
Bedroom | 0.480 | OpenCountry, LivingRoom | LivingRoom, Suburb
LivingRoom | 0.210 | Street, Store | Office, Bedroom
Office | 0.870 | LivingRoom, LivingRoom | Store, LivingRoom
Industrial | 0.540 | Bedroom, TallBuilding | Office, Store
Suburb | 0.950 | Mountain, Bedroom | Street, Street
InsideCity | 0.520 | Street, Store | TallBuilding, TallBuilding
TallBuilding | 0.760 | Industrial, InsideCity | Street, Store
Street | 0.720 | InsideCity, LivingRoom | InsideCity, LivingRoom
Highway | 0.800 | Coast, Store | Coast, Coast
OpenCountry | 0.460 | Coast, Highway | Coast, Forest
Coast | 0.790 | Highway, OpenCountry | OpenCountry, Industrial
Mountain | 0.820 | Store, Store | Coast, Office
Forest | 0.950 | OpenCountry, OpenCountry | Mountain, Mountain

Through the three parts of this project, we see increasingly sophisticated versions of the pipeline, and how better image representations and different kinds of classifiers affect the final accuracy. Parameters such as the step size used when sampling SIFT features greatly affected the runtime of the entire pipeline; the step size chosen above gave reasonably good accuracy without compromising much on speed.

Extra credit experimentation: assessing the impact of vocabulary size on accuracy. The following graph plots scene recognition accuracy against the size of the vocabulary used.

The trend in this graph is clear: larger vocabulary sizes yield greater accuracy. The reason is that with more visual words to compare against, each feature finds a better, closer match when assigned to its nearest vocabulary word. There is a tradeoff with time, though, as larger vocabularies took longer to compute (size 200: 176 seconds vs. size 25: 34 seconds); this is attributable to k-means running with a larger number of clusters to create and assign words to.