Project 4 / Scene Recognition with Bag of Words

In this project, we look into using simple classifiers and feature extractors to accomplish some basic scene/category recognition.

Part 1. Tiny Images

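The tiny image representation creates a descriptor for each image by resizing the image to a small fixed size (16x16 here) and then reshaping the result into a vector. The following procedure is repeated for each image. The code also shifts each vector to zero mean and rescales it to make later computation easier; note that it divides by the standard deviation, which gives unit variance rather than strictly unit length.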


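% convert to double first: uint8 arithmetic saturates, so subtracting the mean would clip negatives to 0
tiny_img = imresize(im2double(img), [16 16]);
tiny_img_vec = reshape(tiny_img, [1, 256]);          % flatten to a 1x256 row vector (assumes a grayscale image)
tiny_img_vec = tiny_img_vec - mean(tiny_img_vec);    % zero mean
tiny_img_vec = tiny_img_vec / std(tiny_img_vec);     % unit variance
image_feats(i,:) = tiny_img_vec;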
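
As a minimal sketch of the normalization step, the snippet below applies zero-mean, unit-length scaling to a synthetic 16x16 patch using only base MATLAB. The patch values and the names `patch` and `v` are made up for illustration. It divides by the Euclidean norm, which is one way to obtain a unit-length vector; dividing by the standard deviation, as in the code above, gives unit variance instead.

```matlab
% Hypothetical 16x16 patch stored as uint8, as imread typically returns it.
patch = uint8(reshape(0:255, [16 16]));

% Convert to double first: integer arithmetic in MATLAB saturates,
% so subtracting the mean from a uint8 array would clip negatives to 0.
v = reshape(double(patch), [1, 256]);

v = v - mean(v);      % zero mean
v = v / norm(v);      % unit Euclidean length (assumes v is not constant)

fprintf('mean = %.3g, norm = %.3g\n', mean(v), norm(v));
```

Either scaling removes the effect of overall brightness and contrast, so two images of the same scene under different lighting end up with similar descriptors.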

Part 2. K Nearest Neighbors and Bag of Words

First, we build a vocabulary of visual words (the vocab) from the training images. Given that vocabulary, the following code turns each image into a bag-of-words histogram:


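img = imread(image_paths{i, 1});
% dense SIFT descriptors, one per column
[locations, SIFT_features] = vl_dsift(im2single(img), 'step', step);
% distance from every vocabulary word (rows) to every descriptor (columns)
D = vl_alldist2(vocab.', single(SIFT_features));
% index of the nearest vocabulary word for each descriptor
[~, I] = min(D);
% explicit bin edges so that bin j always counts word j, even if some words never occur
image_feats(i,:) = histcounts(I, 0.5:1:vocab_size+0.5);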

The bag-of-words representation simply counts how often each vocabulary word occurs in an image. The resulting histogram for a test image is then compared against the histograms of the training images.
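
To make the counting step concrete, here is a small sketch in base MATLAB (no VLFeat) with a made-up 2-D vocabulary of three words and five made-up descriptors. The names `features`, `nearest`, `hist_counts` and `hist_norm` are hypothetical, and the final normalization is an optional extra that the project code does not necessarily perform.

```matlab
% Toy vocabulary: 3 visual words, each a 2-D descriptor (one word per row).
vocab = [0 0; 10 0; 0 10];
vocab_size = size(vocab, 1);

% Toy descriptors from one image (one descriptor per column).
features = [1 9 0 1 8; 1 0 9 0 1];

% Squared Euclidean distance between every word and every descriptor.
D = zeros(vocab_size, size(features, 2));
for w = 1:vocab_size
    diffs = features - repmat(vocab(w, :)', 1, size(features, 2));
    D(w, :) = sum(diffs .^ 2, 1);
end

% Nearest word for each descriptor, then count occurrences per word.
[~, nearest] = min(D, [], 1);
hist_counts = accumarray(nearest(:), 1, [vocab_size, 1])';

% Optional: normalize so images with different numbers of descriptors are comparable.
hist_norm = hist_counts / sum(hist_counts);
disp(hist_counts);
```

Using `accumarray` with an explicit output size guarantees one bin per vocabulary word, including words that never occur in the image.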

This code uses k nearest neighbors (KNN) to classify each image. KNN is a supervised classifier with no explicit training step: each test image is compared against the descriptors of the labeled training images, and it is assigned the label that occurs most often among its k most similar training images.


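for i = 1:N
    % collect the labels of the k nearest training images
    for j = 1:k
        index = double(I(j, i));
        labels(j) = train_labels(index);
    end
    % majority vote over the k labels
    max_count = 0;            % renamed from max to avoid shadowing the built-in
    max_label = labels(1);
    for j = 1:k
        l = labels(j);
        count = sum(strcmp(l, labels));
        if count > max_count
            max_count = count;
            max_label = l;
        end
    end
    predicted_categories(i) = max_label;
end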
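
The same majority vote can be sketched more compactly by mapping label strings to integer ids and using `mode`. This is an illustrative alternative under assumed inputs, not the code used for the reported results: the toy `train_labels`, the k-by-N index matrix `I`, and the helper names `categories` and `label_ids` are all made up here.

```matlab
% Toy data: 4 training labels and the sorted neighbor indices of 2 test images.
train_labels = {'Forest'; 'Kitchen'; 'Forest'; 'Street'};
I = [1 2; 3 4; 2 2];     % k-by-N, column i lists the k nearest training images
k = size(I, 1);
N = size(I, 2);

% Map label strings to integer ids so the vote can use mode().
[categories, ~, label_ids] = unique(train_labels);

predicted_categories = cell(N, 1);
for i = 1:N
    votes = label_ids(I(1:k, i));                       % ids of the k nearest neighbors
    predicted_categories{i} = categories{mode(votes)};  % most frequent id wins
end
disp(predicted_categories);
```

Note that `mode` breaks ties by returning the smallest id, so a tied vote goes to the alphabetically first category rather than to the closest neighbor.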

Using the tiny image feature extractors and the KNN classifier, the accuracy for scene recognition increases to around 47%. This takes quite a while, since the step size I used was 5 for creating the bag of SIFT features and 10 for KNN.

Part 3. Linear SVM

A linear SVM is a different classifier: it categorizes images by learning a linear decision boundary (a hyperplane) in feature space that separates the classes. The combination of SVM and bag of SIFT features brought the accuracy up to 57% with a lambda of 0.01. Here is the expression that computes the confidence of each candidate model on the test images:


conf = w*test_image_feats'+repmat(b,1,num_test);
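
As a sketch of how such confidences could be turned into predictions, the snippet below assumes one linear model per category (a one-vs-all setup, which is common but not spelled out above) and picks the most confident model for each image. The weights, biases, features, and the names `categories` and `best` are made-up toy values for illustration.

```matlab
% Toy setup: 3 categories, 2-D features, 4 test images (one per row).
categories = {'Forest'; 'Kitchen'; 'Street'};
w = [1 0; 0 1; -1 -1];          % one weight row per category (hypothetical values)
b = [0; 0; 0.5];                % one bias per category
test_image_feats = [2 0; 0 3; -1 -2; 1 0.5];
num_test = size(test_image_feats, 1);

% Confidence of every model on every test image (categories x images).
conf = w * test_image_feats' + repmat(b, 1, num_test);

% Pick the category whose model is most confident for each image.
[~, best] = max(conf, [], 1);
predicted_categories = categories(best);
disp(predicted_categories');
```

Taking the maximum over rows of `conf` means each image always receives exactly one label, even when every model's confidence is negative.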