Project 4 / Scene Recognition with Bag of Words

Fig.1. Example of scene recognition with bag of words.

For this project, the goal is an introduction to image recognition: scenes are classified into one of 15 categories by training and testing on the 15 scene database. First, scene recognition is performed with relatively simple methods, namely tiny images and nearest neighbor classification. Then more advanced methods are applied - bags of quantized local features and linear classifiers learned by support vector machines - examining first the combination of the bag of SIFT representation with the nearest neighbor classifier, and then the combination of bag of words with a linear SVM classifier. Before the training and testing images can be represented as bag of feature histograms, a vocabulary of visual words has to be established.
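Concretely, every pairing of representation and classifier in this report follows the same high-level flow. The sketch below illustrates it for the first pairing; it assumes the signatures of the starter files get_tiny_images.m and nearest_neighbor_classify.m referenced later in this report, so the exact calls in proj4.m may differ slightly.


    % Represent every image as a feature vector, then classify the test set
    train_image_feats = get_tiny_images(train_image_paths);
    test_image_feats  = get_tiny_images(test_image_paths);
    predicted_categories = nearest_neighbor_classify(train_image_feats, train_labels, test_image_feats);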

This report will discuss the project in the following aspects:

  1. Tiny Images Representation and Nearest Neighbor Classifier
  2. Bag of SIFT Representation and Nearest Neighbor Classifier
  3. Bag of SIFT Representation and Linear SVM Classifier
  4. Extra Credit - Different Vocabulary Sizes
  5. Extra Credit - Tuning Parameters in build_vocabulary.m and get_bags_of_sifts.m
  6. Extra Credit - Tuning Parameter in svm_classify.m
  7. Extra Credit - Fisher Encoding
  8. Extra Credit - Naive Bayes Nearest Neighbor Classifier
  9. Best Performing Case

1.Tiny Images Representation and Nearest Neighbor Classifier

1.1. Algorithm

In this part, proj4.m, get_tiny_images.m and nearest_neighbor_classify.m are used. For the representation portion, tiny images are used, one of the simplest possible image representations. Each image is simply resized to a small, fixed resolution, 16x16 in this project, and to make the representation work a little better, the resized image is normalized to have zero mean and unit variance (by subtracting the mean and dividing by the standard deviation). The relevant code is shown as follows:


    width = 16;
    image_feats = zeros(length(image_paths), width*width);   % preallocate
    for i = 1:length(image_paths)
        % Resize the image to a fixed 16x16 resolution
        image = im2double(imread(image_paths{i}));
        tiny = imresize(image, [width, width]);
        % Normalize to zero mean and unit variance
        tiny = tiny - mean2(tiny);
        tiny = tiny ./ std2(tiny);
        image_feats(i,:) = reshape(tiny, [1, width*width]);
    end
    

Then the classifier part is implemented; the nearest neighbor classifier is used here. This classifier is quite simple: to classify a test feature into a particular category, one just finds the nearest training example and assigns the test case the label of that training example. The code is displayed below.


    % Pairwise distances between all training and test features (as columns)
    D = vl_alldist2(train_image_feats', test_image_feats');
    % For each test image, find the index of its nearest training example
    [~, I] = min(D);
    num = size(test_image_feats, 1);
    predicted_categories = cell(num, 1);
    for i = 1:num
        % Assign the label of the nearest training example
        predicted_categories{i} = train_labels{I(i)};
    end

1.2. Results

Fig.2. The results after utilizing tiny images representation and nearest neighbor classifier.

The results show that the accuracy is about 22.5 %, which falls within the expected range of 18-25 %. The accuracy is relatively low, most likely because the tiny image representation discards all of the high frequency image content and the nearest neighbor classifier is vulnerable to training noise.

2.Bag of SIFT Representation and Nearest Neighbor Classifier

2.1. Algorithm

In this portion, build_vocabulary.m, get_bags_of_sifts.m and nearest_neighbor_classify.m are used. The first step is to establish a vocabulary of visual words by sampling many local features from the training set and clustering them; the cluster centers form the basis for the histograms. Then many SIFT descriptors are sampled from each image, and the number of SIFT descriptors falling into each cluster of the visual word vocabulary is counted. Afterwards, the histogram is normalized. Part of the code is exhibited below. Finally, the bag of SIFT representation is tested when paired with the nearest neighbor classifier written in the first part.


    d = 1;                                   % sample every d-th training image
    Col = floor((size(image_paths,1)-1)/d + 1);
    FeaturesN = cell(1, Col);                % one cell per sampled image
    StepNum = 4;
    for i = 1:d:size(image_paths,1)
        % Densely sample SIFT descriptors from the training image
        [Locations, Features] = vl_dsift(single(imread(image_paths{i})), 'norm', 'step', StepNum);
        n = floor((i-1)/d + 1);
        FeaturesN{1,n} = Features;
        display(i);
    end
    FN = cell2mat(FeaturesN);
    Sing_FN = single(FN);
    % Cluster all sampled descriptors into vocab_size visual words
    vocab = vl_kmeans(Sing_FN, vocab_size);
    vocab = vocab';
    

    StepNum = 4;                             % dense SIFT sampling step
    row_vocab = size(vocab, 1);
    image_feats = zeros(size(image_paths,1), row_vocab);   % preallocate
    for i = 1:size(image_paths,1)
        [Locations, Features] = vl_dsift(single(imread(image_paths{i})), 'norm', 'step', StepNum);
        % Distance from every descriptor to every visual word
        Dist = vl_alldist2(single(Features), vocab');
        Fig_Feats = zeros(1, row_vocab);
        for j = 1:size(Features, 2)
            % Count each descriptor toward its nearest visual word
            [~, Index] = min(Dist(j,:));
            Fig_Feats(Index) = Fig_Feats(Index) + 1;
        end
        % Normalize the histogram so that image size does not matter
        image_feats(i,:) = Fig_Feats / sum(Fig_Feats);
        display(i)
    end
    

2.2. Results

2.2.1. Ordinary and Best Results

Fig.3. The ordinary results after utilizing bag of SIFT representation and nearest neighbor classifier.

From the results, the accuracy is approximately 50.9 %, a great improvement over the 22.5 % of the first part and within the expected range of 50-60 %. Experimenting with this part shows that the vocabulary size and the StepNum in build_vocabulary.m and get_bags_of_sifts.m are the free parameters that influence the accuracy and can therefore be tuned; this is covered in parts 4 and 5. The vocabulary size and StepNum used for the above results are 200 and 4, respectively.

The best results for this part are shown below, but the computing time is relatively long. The parameters for the best results are a StepNum of 4 in both build_vocabulary.m and get_bags_of_sifts.m and a vocabulary size of 400. The results show that the accuracy is about 53.1 %.

Fig.4. The best results after utilizing bag of SIFT representation and nearest neighbor classifier.

2.2.2. Fast Results

The parameters for a short computing time are as follows: the StepNum in build_vocabulary.m and get_bags_of_sifts.m is 4 and 10 respectively, and the vocabulary size is 200. The results show that the accuracy is about 47.1 %, above the required 45 %, and the total computing time, excluding building the vocabulary, is approximately 8 minutes and 15 seconds.

Fig.5. The fast results after utilizing bag of SIFT representation and nearest neighbor classifier.

3.Bag of SIFT Representation and Linear SVM Classifier

3.1. Algorithm

In this section, build_vocabulary.m, get_bags_of_sifts.m and svm_classify.m are used. The focus of this part is training 1-vs-all linear SVMs to operate in the bag of SIFT feature space; linear classifiers are among the simplest possible learning models. A learned hyperplane partitions the feature space, and test cases are categorized based on which side of the hyperplane they fall on. To decide which of the 15 categories a test case belongs to, all 15 classifiers are evaluated and the most confidently positive one wins. The corresponding code is displayed as follows:


    Lambda = 0.000001;                       % SVM regularization parameter
    W = zeros(size(train_image_feats,2), length(categories));
    B = zeros(1, length(categories));
    for i = 1:length(categories)
        % Build +1/-1 labels for the 1-vs-all problem of category i
        match_indices = strcmp(categories{i}, train_labels);
        Labels = double(match_indices);
        Labels(~match_indices) = -1;
        [W(:,i), B(1,i)] = vl_svmtrain(train_image_feats', Labels, Lambda);
        display(i);
    end

    % Evaluate every classifier on each test image; the most confident wins
    predicted_categories = cell(size(test_image_feats,1), 1);
    Dist = zeros(1, length(categories));
    for j = 1:size(test_image_feats,1)
        for k = 1:length(categories)
            Dist(1,k) = dot(test_image_feats(j,:), W(:,k)) + B(1,k);
        end
        [~, Ind] = max(Dist);
        predicted_categories{j} = categories{Ind};
    end
    

3.2. Results

3.2.1. Ordinary and Best Results

Fig.6. The ordinary results after utilizing bag of SIFT representation and linear SVM classifier.

The accuracy with this combination is 68.8 %, within the given range of 60 % to 70 %. There are several free parameters that can be tuned to achieve different accuracies, such as the StepNum in both build_vocabulary.m and get_bags_of_sifts.m and the Lambda in svm_classify.m; these are tuned in parts 5 and 6. The parameters for the above results are as follows - vocabulary size: 400, StepNum: 4 and Lambda: 0.000001.

The best results for this part are displayed below, but the computing time is extremely long. The parameters for the best results are a StepNum of 4 in both build_vocabulary.m and get_bags_of_sifts.m and a vocabulary size of 1000. The results show that the accuracy is about 71.5 %.

Fig.7. The best results after utilizing bag of SIFT representation and linear SVM classifier.

3.2.2. Fast Results

The parameters for a short computing time are as follows: the StepNum in build_vocabulary.m and get_bags_of_sifts.m is 4 and 10 respectively, and the vocabulary size is 200. The results show that the accuracy is about 60.5 %, higher than the required 55 %, and the total computing time, excluding building the vocabulary, is approximately 8 minutes and 32 seconds.

Fig.8. The fast results after utilizing bag of SIFT representation and linear SVM classifier.

4.Extra Credit - Different Vocabulary Sizes

It is known that the vocabulary size can affect the accuracy, so in this part the combination of the bag of SIFT representation and the linear SVM classifier is used to test its influence. Table.1 below shows the results with the StepNum set to 4 in both files and Lambda set to 0.000001.

Table.1 The relationship between vocabulary size and accuracy.

Vocabulary Size    Accuracy
10                 39.2 %
20                 52.1 %
50                 61.9 %
100                66.8 %
200                67.4 %
400                68.8 %
1000               71.5 %

From the results above, it can be concluded that the accuracy increases with the vocabulary size. Increasing the size from 10 to 50 improves the accuracy by more than 20 percentage points, while beyond a size of about 100 the gains become marginal even though the computing time grows considerably. Setting the vocabulary size to 100 or 200 therefore seems a reasonable balance between accuracy and computing time.
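For reference, the sweep behind Table.1 can be scripted as follows. This is a minimal sketch which assumes that build_vocabulary.m, get_bags_of_sifts.m and svm_classify.m keep the signatures used elsewhere in this report and that get_bags_of_sifts.m reads the saved vocab.mat:


    vocab_sizes = [10 20 50 100 200 400 1000];
    accuracy = zeros(1, length(vocab_sizes));
    for v = 1:length(vocab_sizes)
        % Rebuild the vocabulary for each size, then re-encode all images
        vocab = build_vocabulary(train_image_paths, vocab_sizes(v));
        save('vocab.mat', 'vocab');
        train_image_feats = get_bags_of_sifts(train_image_paths);
        test_image_feats  = get_bags_of_sifts(test_image_paths);
        predicted_categories = svm_classify(train_image_feats, train_labels, test_image_feats);
        % Accuracy = fraction of test images whose prediction matches the truth
        accuracy(v) = mean(strcmp(predicted_categories, test_labels));
        fprintf('vocab_size = %4d : accuracy = %.1f %%\n', vocab_sizes(v), 100*accuracy(v));
    end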

5.Extra Credit - Tuning Parameters in build_vocabulary.m and get_bags_of_sifts.m

5.1. StepNum in build_vocabulary.m

Table.2 The relationship between the StepNum in build_vocabulary.m and accuracy.

StepNum    Accuracy
4          66.8 %
6          65.7 %
16         65.4 %

Table.2 indicates that the accuracy changes little as this StepNum varies, which means the sampling step used when building the vocabulary has only a minor influence.

5.2. StepNum in get_bags_of_sifts.m

Table.3 The relationship between the StepNum in get_bags_of_sifts.m and accuracy.

StepNum    Accuracy
4          66.8 %
6          62.2 %
8          60.6 %
10         60.3 %
16         54.7 %
24         46.7 %

The accuracy increases as the StepNum in get_bags_of_sifts.m decreases. Compared with the StepNum discussed above, the change is much more pronounced; in other words, this StepNum has a bigger effect on the results, presumably because a smaller step produces many more descriptors per image and therefore denser, more reliable histograms.
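This scaling can be checked quickly with the hypothetical snippet below; any grayscale training image from the dataset will do:


    img = single(imread(image_paths{1}));    % any image from the 15 scene database
    for s = [4 8 16 24]
        % Count the dense SIFT descriptors produced at step size s
        [~, F] = vl_dsift(img, 'norm', 'step', s);
        fprintf('step %2d -> %6d descriptors\n', s, size(F,2));
    end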

6.Extra Credit - Tuning Parameter in svm_classify.m

6.1. Lambda

Table.4 The relationship between the Lambda in svm_classify.m and accuracy.

Lambda      Accuracy
0.000001    66.8 %
0.00001     62.9 %
0.0001      58.6 %
0.001       42.7 %

The results above indicate that Lambda is an important parameter that strongly influences the accuracy: within the tested range, decreasing Lambda yields better accuracy.
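This behavior is consistent with the role of Lambda in the objective minimized by vl_svmtrain which, in the form documented by VLFeat (bias handling details aside), weights the regularizer against the average hinge loss; a smaller Lambda lets the hyperplane fit the training histograms more tightly:

    E(\mathbf{w}, b) = \frac{\lambda}{2}\,\lVert\mathbf{w}\rVert^{2}
        + \frac{1}{n}\sum_{i=1}^{n}\max\{0,\ 1 - y_{i}(\mathbf{w}^{\top}\mathbf{x}_{i} + b)\}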

7.Extra Credit - Fisher Encoding

In this portion, a more sophisticated feature encoding scheme, Fisher encoding, has been implemented; it is a generic framework that combines the advantages of generative and discriminative approaches. The relevant code is displayed as follows:


    % fisher_vocab.m: learn a GMM over densely sampled SIFT descriptors
    DataX = cell(1, size(train_image_path,1));
    for i = 1:size(train_image_path,1)
        [~, Features] = vl_dsift(single(imread(train_image_path{i})), 'norm', 'step', StepNum);
        DataX{i} = Features;
        display(i);
    end

    DataX = single(cell2mat(DataX));
    % Fit a Gaussian mixture model with NumClusters components
    [means, covariances, priors] = vl_gmm(DataX, NumClusters);
    

    [means, covariances, priors] = fisher_vocab(train_image_path, NumClusters);
    % The train and test sets contain the same number of images here,
    % so a single loop encodes both
    for i = 1:size(train_image_path,1)
        [~, train_SIFT] = vl_dsift(single(imread(train_image_path{i})), 'norm', 'step', StepNum);
        [~, test_SIFT] = vl_dsift(single(imread(test_image_path{i})), 'norm', 'step', StepNum);
        % Each Fisher vector has dimension 2 * 128 * NumClusters
        train_encoding = vl_fisher(single(train_SIFT), means, covariances, priors, 'normalized');
        test_encoding = vl_fisher(single(test_SIFT), means, covariances, priors, 'normalized');
        train_image_feats{i,1} = train_encoding';
        test_image_feats{i,1} = test_encoding';
        display(i);
    end
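Since the Fisher vectors are collected in cell arrays, they still have to be stacked into matrices before classification. A minimal sketch, assuming svm_classify.m expects one feature row per image as in part 3:


    % Stack the per-image Fisher vectors into N x (2*128*NumClusters) matrices
    train_feats = cell2mat(train_image_feats);
    test_feats  = cell2mat(test_image_feats);
    predicted_categories = svm_classify(train_feats, train_labels, test_feats);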
    

Table.5 The relationship between the NumClusters and accuracy.

NumClusters    Accuracy
100            77.1 %
50             74.2 %
20             72.2 %
10             69.1 %

It can be concluded that increasing NumClusters from 10 to 100 raises the accuracy by approximately 8 percentage points. Moreover, compared with the results of the given method, Fisher encoding improves the accuracy: its best result is around 77.1 %, while the best of the original method is about 71.5 % and requires much more computing time than Fisher encoding.

8.Extra Credit - Naive Bayes Nearest Neighbor (NBNN)

Naive Bayes Nearest Neighbor (NBNN) employs NN-distances in the space of local image descriptors rather than in the space of images. Additionally, it computes Image-to-Class distances directly, without descriptor quantization. The NBNN algorithm can be summarized as follows, and the corresponding code is also shown below.

Fig.9. The algorithm of Naive Bayes Nearest Neighbor.


    [train_image_feats, train_image_label] = sift_2(train_image_paths, train_labels);
    [test_image_feats, test_image_label] = sift_1(test_image_paths);

    function predicted_categories = NBNN(test_image_paths,train_labels,train_image_feats,train_image_label,test_image_feats,test_image_label)
    ...
    for l = 1:size(test_image_paths,1)
        ...
        for k = 1:length(categories)
            ...
            % Distances between the descriptors of class k and those of test image l
            Dist = vl_alldist2((train_image_feats(index_1,:))', (test_image_feats(index_2,:))');
            % Nearest class descriptor for each test descriptor
            [NNC, ~] = min(Dist);
            NNCC(k,:) = NNC;
        end
        % Image-to-Class distance: sum of nearest-neighbor distances per class
        SNNCC_Col = sum(NNCC, 2);
        [~, Index] = sort(SNNCC_Col);
        predicted_categories{l,1} = categories{Index(1)};
        display(l);
    end
    end
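Because parts of the code above are elided, the following is a minimal self-contained sketch of the same decision rule for a single test image. The names class_feats (a cell array holding the 128 x Nk descriptors of class k) and test_feats (the 128 x M descriptors of the test image) are hypothetical:


    function label = nbnn_classify_one(test_feats, class_feats, categories)
        totals = zeros(1, length(categories));
        for k = 1:length(categories)
            % Distance from every test descriptor to every descriptor of class k
            D = vl_alldist2(single(class_feats{k}), single(test_feats));
            % Image-to-Class distance: sum of nearest-neighbor distances
            totals(k) = sum(min(D, [], 1));
        end
        % The class with the smallest total distance wins
        [~, best] = min(totals);
        label = categories{best};
    end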
    

The resulting confusion matrices are shown below. The computing time of this algorithm is without doubt much larger than that of the former algorithms, since every feature of every test image has to be compared with every feature of every training image. In order to reduce the computing time to a reasonable level, the StepNum in sift.m is set to 40. In this situation, the accuracy is around 51.4 %, which is not that high because the number of features is too small. However, with the same parameter setting - both StepNum equal to 40 - the results for the combination of bag of SIFT and linear SVM classifier are also shown below, and the accuracy there is around 36.7 %, which means that NBNN outperforms the linear SVM in this setting.

Fig.10. The results for using NBNN (left) and using the combination of bag of SIFT and linear SVM (right).

9.Best Performing Case

The best performing case achieved in this project uses the combination of Fisher encoding and linear SVM, with an accuracy of around 77.1 %. The corresponding outcomes are shown below:

Scene classification results visualization


Accuracy (mean of diagonal of confusion matrix) is 0.771
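For reference, a minimal sketch of how this number is obtained, assuming predicted_categories and test_labels are cell arrays of category names as in the earlier parts:


    num_cats = length(categories);
    confusion = zeros(num_cats);
    for i = 1:length(test_labels)
        % Row = true category, column = predicted category
        r = find(strcmp(test_labels{i}, categories));
        c = find(strcmp(predicted_categories{i}, categories));
        confusion(r,c) = confusion(r,c) + 1;
    end
    confusion = bsxfun(@rdivide, confusion, sum(confusion, 2));  % row-normalize
    accuracy = mean(diag(confusion));    % 0.771 for the best performing case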

Category name    Accuracy
Kitchen          0.610
Store            0.770
Bedroom          0.590
LivingRoom       0.570
Office           0.940
Industrial       0.750
Suburb           0.990
InsideCity       0.740
TallBuilding     0.840
Street           0.780
Highway          0.870
OpenCountry      0.620
Coast            0.770
Mountain         0.800
Forest           0.930

(In the original visualization, each category row also shows sample training images, sample true positives, false positives with their true labels, and false negatives with their wrong predicted labels.)