Project 4 / Scene Recognition with Bag of Words

In this project, we are tasked to perform scene recognition using three different combinations of image representation methods and classification methods. They are implemented in the following order:

  1. Tiny image representation and nearest neighbor classifier
  2. Bag of SIFT representation and nearest neighbor classifier
  3. Bag of SIFT representation and linear SVM classifier

Tiny image representation and nearest neighbor classifier

Computation time = <1 minute

Accuracy = ~20.5%

In the first combination, the tiny image representation method and nearest neighbor classifier were implemented. Tiny image is a very basic representation of an image through image resizing. Each image is resized to a 16 pixel by 16 pixel tiny image feature before every pixel is subtracted by its mean, normalized, and arranged into a 256-entry row vector, which would become the feature representation of the image. However, the high frequency features are ignored and it is not invariant to spatial and brightness shifts. The nearest neighbor classifier classifies a test feature(image) into a particular category based on the L2 distance between each test image and the training images. The test image is assigned the label of the single nearest training image if parameter is k. If k is more than one, then the test image is assigned the label based on voting among the k training images "nearest" to the test image. k is assigned the value 1 as a start.

Example of code for get_tiny_images.m

for i=1:1:length(image_paths)
image = imread(cell2mat(image_paths(i)));
tinyimage = double(imresize(image,[16 16]));
tinyimage = tinyimage - mean(mean(tinyimage));
tinyimage = tinyimage/norm(tinyimage);
image_feats(i,:) = reshape(tinyimage,[1 256]);

Example of code for nearest_neighbor_classify.m

predicted_categories = cell(length(test_image_feats), 1);
k = 1;
dis = vl_alldist2(test_image_feats',train_image_feats');
for i=1:1:size(dis,1)
    [mindis,index] = sort(dis(i,:));
    nearest = index(1:k);
    x = train_labels(nearest);
    % x = {'apple','lemon','apple','peach'};
    y = unique(x);
    n = zeros(length(y), 1);
        for iy = 1:length(y)
            n(iy) = length(find(strcmp(y{iy}, x)));
    [frequency, itemp] = max(n);
    predicted_categories{i,1} = y{itemp};

Bag of SIFT representation (level 0) and nearest neighbor classifier

Computation time = ~5 minute

Accuracy = ~50.3%

In the second combination, nearest neighbor (k=1) and bag of SIFT representation was implemented. Before we are able to represent image of our images as a bag of feature histograms, we need to form a group of vocabulary words. This can be done by sampling the 128 dimentional SIFT features from a set of training images and clustering the SIFT features using kmeans. The vl package of SIFT feature extraction function was used with the sampling density set to 10 pixels. The number of kmeans cluster centers then becomes the size of our vocabulary and therefore also the size of our feature histogram that represents each image. The cluster center number is free to adjust and was set to 200 vocab entries as a start. Then the SIFT features from each training and test images are grouped into the various cluster center to form a feature histogram that represents each image. The vl package of SIFT feature extraction was again used to obtain the SIFT features and their corresponding locations in each image. The nearest neighbor classifier was then used to process the training and test images. There are several parameters that we can adjust when using the SIFT feature extraction function, such as the sampling density/step size. Step size was set to 10 pixels as a start.

Example of code for build_vocabulary.m

sum_feats = [];
for i = 1 : size(image_paths, 1)
image = single(imread(cell2mat(image_paths(i))));
[locations, SIFT_features] = vl_dsift(image,'fast', 'Step', 10); %[location, features] = vl_dsift(single(image), 'fast', 'Step', 5);
SIFT_features = single(SIFT_features);
sum_feats = horzcat(sum_feats,SIFT_features);
[centers, assignments] = vl_kmeans(sum_feats,vocab_size);
vocab = centers';

Example of code for get_bags_of_sifts.m

vocab_size = size(vocab, 1);

image_feats = [];

for i = 1:size(image_paths, 1)
clear SIFT_features
clear locations
histogram = [];

image = single(imread(cell2mat(image_paths(i))));
histo_cluster = zeros(1,vocab_size);
[locations, SIFT_features] = vl_dsift(image,'fast', 'Step', 5); %[location, features] = vl_dsift(single(image), 'fast', 'Step', 5);
SIFT_features = single(SIFT_features);

dis = vl_alldist2(SIFT_features,vocab');
for j=1:1:size(dis,1)
    [mindis,index] = sort(dis(j,:));
    histo_cluster(index(1)) = histo_cluster(index(1))+1;
histo_cluster = double(histo_cluster);
histo_cluster = histo_cluster/norm(histo_cluster);

image_feats = vertcat(image_feats,histo_cluster);

Bag of SIFT representation (level 0) and linear SVM classifier

Computation time = ~5-6 minutes

Accuracy = ~60.5%

In the third combination, bag of SIFT representation (vocab size = 200; step size = 10 pixels) was implemented while the nearest neighbor classifier is replaced by the 1 vs all SVM classifier. The way to code this classifier can basically be divided into a training section and a testing section. The training section uses vlfeat function that trains a linear support vector machine based on the training image set and relate them through two parameters W and B to the training labels (binary, either 1 or -1) that are already defined. In other words, we try to find a specific group of W vector and B value that would define (either yes or no (binary definition)) images that belong to the same label. Lambda, a parameter for the vl_svmtrain function, is an important parameter that is to be experimented with and was set as lambda = 0.001 as a start. The testing section involves calculating the product between parameter w, each image feature representation and B and determining which group of W and B would allow this product for a particular image to have the positive sign. Once this group of W and B is determined, then we would predict that the image belong to the label that is related to the group of W and B.

Example of code for svm_classify.m

categories = unique(train_labels);
num_categories = length(categories);

predicted_categories = cell(size(test_image_feats,1), 1);
LAMBDA = 0.001;
W_all = [];
B_all = [];
for i = 1:1:num_categories
%TRAINING and RELATING the training images to train label
binary_label = strcmp(categories{i}, train_labels); 
binary_label = double(binary_label);
for k=1:1:length(binary_label)
    if binary_label(k)==0
    binary_label(k) = -1;

[W B] = vl_svmtrain(train_image_feats', binary_label, LAMBDA);
W_all = vertcat(W_all, W')
B_all = vertcat(B_all, B)

%TESTING and PREDICTING the test images
for i = 1 : size(test_image_feats, 1)
    testlabelall = [];
    for j = 1 : size(W_all, 1)
        testlabel = W_all(j, :)*test_image_feats(i, :)' + B_all(j);
        testlabelall = vertcat(testlabelall,testlabel);
    [labelsorted INDEX] = sort(testlabelall,'descend');
    predicted_categories{i, 1} = categories{INDEX(1)};

Analysis and Discussion

1) Different k nearest neighbors were attempted as well for the bag of SIFT (level 0; step size = 10 pixels; vocab size = 200) and nearest neighbor classifier and the results are as follows:

The k value does not have a significant impact on the result within my test range. The accuracy result is also very consistent over three trials for each k value.

k Accuracy (%)
1 50.3
2 46.8
3 50.1
4 51.5
5 51.9
10 53.3
20 51.5
50 48.8
100 47.5

2) Different step sizes in the vl SIFT function were attempted and the smaller the step size, the higher the sampling density, the more SIFT features were obtained. When the step size decreases from 10 pixels to 5 pixels for bag of SIFT and nearest neighbor (k=1), the accuracy increases by ~4% from 50.3% to 54%. For bag of SIFT (level 0) and linear SVM (vocab size = 200, lambda = 0.001), the results are as follows:
Step size (pixels) Accuracy (%)
20 53.1
15 56
10 60.5
5 62.2
3 62.6

3) Different lambda values were attempted as well for the bag of SIFT (level 0; step size = 10 pixels; vocab size = 200) and linear SVM classifier and the results are as follows: It should be noted that the accuracy results are not consistent. For each lambda, multiple trials can lead to accuracy results that fluctuate by ±10%. Three trials were conducted for each lambda and the average was stated in the table.
Lambda Accuracy (%)
0.00001 58.9
0.0001 63.7
0.001 60.5
0.01 52
0.1 48.4
1 38.1
10 44.4

Extra Credit

All the extra credit works are done using the combination of bags of SIFT and linear SVM and spatial pyramid level 1, if not stated otherwise. . Other parameters are as follows: SIFT feature extraction step size = 5 pixels, vocab size = 200, lambda = 0.001, if not stated otherwise.
  1. Spatial pyramid (level 1) and Spatial pyramid (level 2)
  2. The spatial pyramid representation for the SIFT features was implemented by making use of the locations information of each SIFT features extracted from each image. Each image is divided into 4 quadrants and the features for each quadrant are distributed into the cluster centers to form a partial histogram. The complete feature histogram representation for one image concatenates the partial histogram vector from all four quadrants. This allows the accuracy to improve from 62.2% (level 0; step size = 5 pixels) to 73.1% (level 1; step size = 5 pixels). Spatial pyramid level 2 was also implemented but did not lead to improved accuracy. Instead the accuracy dropped to 72.2% The get_bags_of_sifts.m function is divided into three sections and labeled as such in the code. User can uncomment certain section to implement the desired spatial pyramid level (either level 0, level 1 or level 2).

  3. Experiment with different vocabulary sizes
  4. The results clearly show that the increase in vocabulary size improves the accuracy but the accuracy improvement becomes insignificant once the vocabulary size goes past 500.
    Vocabulary size Accuracy (%)
    10 45.6
    20 57.9
    50 66.2
    100 70.6
    200 73.1
    500 76.3
    1000 76.7
  5. Experiment with 100 random training and test images
  6. 10 tests were done using random training and test images for each category shown in the table below. "randperm" function in MATLAB was used to randomize the selection of images.
    %% RANDOM - randomly pick 100 images from each set    
    indexselected = randperm(1500,100);
    train_image_paths = train_image_paths(indexselected);
    test_image_paths = test_image_paths(indexselected);
    train_labels = train_labels(indexselected);
    test_labels = test_labels(indexselected);
    There is not much difference in the accuracy between updating and not updating the image set that is randomly chosen for performance evaluation but building vocabulary takes significantly longer computation time. This result therefore shows that when using this particular way of evaluation, the vocabulary does not necessarily need to be updated.
    Vocabulary built using non-updated images in every iteration Vocabulary built using updated images in every iteration
    Average accuracy= 51.4% ; Standard deviation = 0.0513

    Average accuracy = 52.4% ; Standard deviation = 0.0585

  7. Implement the GIST feature representation
  8. An additional MATLAB function is added, named get_gist.m to obtain an additional feature descriptor besides the SIFT vocabulary descriptor for each training and testing image. A GIST descriptor is a 512-entry row vector, extracted through LMgist.m which is readily available online and provided in the project website. The combination of SIFT and GIST features allows the accuracy to improve to 77.7%, which is my best result of the project (computation time = ~30 minutes). Using only SIFT feature vocabulary representation and only GIST feature representation lead to accuracy of 73.1% and 66.6%, respectively.

    Example of code for get_gist.m

    image_feats = [];
    for z = 1:size(image_paths, 1)
    clear param
    param.imageSize = [256 256]; % it works also with non-square images
    param.orientationsPerScale = [8 8 8 8];
    param.numberBlocks = 4;
    param.fc_prefilt = 4;
    % Computing gist requires 1) prefilter image, 2) filter image and collect
    % output energies
    image = single(imread(cell2mat(image_paths(z))));
    [gist1, param] = LMgist(image, '', param);
    gist1 = double(gist1);
    image_feats = vertcat(image_feats,gist1);