Project 4 / Scene Recognition with Bag of Words

In this project I experimented with image recognition. I wrote and analyzed several algorithms in this area, ranging from simple methods, such as tiny-image features with nearest-neighbor classification, to more advanced ones, such as bags of quantized local features with linear classifiers learned by support vector machines. This report is divided into four parts:

  1. Tiny images representation and nearest neighbor classifier
  2. Bag of SIFT representation and nearest neighbor classifier
  3. Bag of SIFT representation and linear SVM classifier
  4. Extra Credits

Let's go through each of them:

1. Tiny images representation and nearest neighbor classifier

  1. Here I first simply resize each image to a small, fixed resolution of 16 x 16. I then transform each image so that it has zero mean and unit length. Because it discards all of the high-frequency image content and is not especially invariant to spatial or brightness shifts, this is not a particularly good representation.

  2. I then use the K-nearest-neighbor method for classification. It has many desirable properties: it requires no training, it can learn arbitrarily complex decision boundaries, and it trivially supports multi-class problems.

  3. To address its main disadvantage, vulnerability to noise in the training data, I implemented voting over the K nearest neighbors, a more general form of 1-NN.

  4. Here are the results I obtained in my experiment:

    K     Accuracy (%)
    1     22.5
    3     21.5
    5     21.3
    7     21.1
    9     21.1
    11    21.8
    13    22.3
    15    22.5

  5. As the results show, with tiny images as the features, varying K did not have any significant effect on accuracy on the test set.
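The project code itself is in MATLAB; purely as an illustrative sketch (not the actual implementation), the tiny-image feature and K-NN voting described above could look like this in Python, with images represented as 2-D lists of floats:

```python
import math
from collections import Counter

def tiny_image(img, size=16):
    """Downsample an image to size x size by block averaging, then
    normalise the feature to zero mean and unit length."""
    h, w = len(img), len(img[0])
    bh, bw = h // size, w // size
    feat = []
    for i in range(size):
        for j in range(size):
            block = [img[i * bh + di][j * bw + dj]
                     for di in range(bh) for dj in range(bw)]
            feat.append(sum(block) / len(block))
    mean = sum(feat) / len(feat)
    feat = [v - mean for v in feat]          # zero mean
    norm = math.sqrt(sum(v * v for v in feat)) or 1.0
    return [v / norm for v in feat]          # unit length

def knn_predict(train_feats, train_labels, query, k=1):
    """Vote over the labels of the k nearest training features."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(f, query)), lbl)
        for f, lbl in zip(train_feats, train_labels))
    votes = Counter(lbl for _, lbl in dists[:k])
    return votes.most_common(1)[0][0]
```

With k = 1 this reduces to plain 1-NN; larger k averages out label noise, which is the motivation for the voting variant above.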

2. Bag of SIFT representation and nearest neighbor classifier

  1. Here I first build a vocabulary of visual words. I sample many local SIFT features from the training set (15000 - 1600000), then cluster them with K-means; the number of clusters is my vocabulary size.

  2. The feature representation of each image is now a histogram of these visual words. I obtain this histogram by simply counting the number of SIFT descriptors falling into each cluster of my vocabulary. I also normalize these histograms to remove the effect of feature magnitude.

  3. I then run K-nearest neighbor on these features for classification. I tuned 4 parameters, spanning both K-NN and SIFT:

  4. Parameter            Description
     Step size            The smaller this is, the more densely we sample SIFT descriptors
     Features per image   Number of SIFT descriptors randomly sampled per image
     Vocab size           Number of clusters formed by K-means
     K                    Number of nearest neighbors in K-NN

  5. Here are the results I obtained in my experiment:

    Vocab Size = 100, Features per image = 100, Step size = 4
    K     Accuracy (%)
    1     50.4
    3     50.3
    5     51.9
    7     51.8
    9     51.6
    11    50.6
    15    50

    Vocab Size = 100, Features per image = 100, Step size = 8
    K     Accuracy (%)
    1     48.1
    3     49.1
    5     51.1
    7     51.7
    9     51.9
    11    51.2
    15    50.9

    Vocab Size = 100, Features per image = 100, Step size = 16
    K     Accuracy (%)
    1     43.7
    3     43.1
    5     44.3
    7     44.4
    9     43.2
    11    43.8
    15    43.7

  6. As the results show, with bag of SIFT as the features, increasing K improved accuracy on the test set, probably due to better denoising with a higher K. The optimum K was around 7.
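To make the quantization step concrete, here is a minimal Python sketch of the histogram construction (the real code is MATLAB; `descriptors` and `vocab` are plain lists of feature vectors standing in for SIFT descriptors and K-means centers):

```python
def bag_of_words_histogram(descriptors, vocab):
    """Assign each local descriptor to its nearest visual word and
    return an L1-normalised histogram of word counts."""
    hist = [0.0] * len(vocab)
    for d in descriptors:
        # hard assignment: index of the closest cluster center
        best = min(range(len(vocab)),
                   key=lambda w: sum((a - b) ** 2
                                     for a, b in zip(d, vocab[w])))
        hist[best] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]   # normalise away feature magnitude
```

The normalisation at the end is what removes the dependence on how many descriptors an image happened to produce.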

3. Bag of SIFT representation and linear SVM classifier

  1. As in Part 2, I first build a vocabulary of visual words: I sample many local SIFT features from the training set (15000 - 1600000) and cluster them with K-means; the number of clusters is my vocabulary size.

  2. The feature representation of each image is again a histogram of these visual words, obtained by counting the number of SIFT descriptors falling into each cluster of my vocabulary and normalizing to remove the effect of feature magnitude.

  3. As this is multi-class classification, I trained 1-vs-all linear SVMs. This method is much better than K-NN not just because it is less expensive, but because it can learn which dimensions of the feature vector are less relevant and downweight them when making a decision. I tuned 4 parameters, spanning both the SVM and SIFT:

  4. Parameter            Description
     Step size            The smaller this is, the more densely we sample SIFT descriptors
     Features per image   Number of SIFT descriptors randomly sampled per image
     Vocab size           Number of clusters formed by K-means
     LAMBDA               Controls the regularization of the model parameters during learning

  5. Here are the results I obtained in my experiment:

    Vocab Size = 100, Features per image = 100, Step size = 4
    LAMBDA     Accuracy (%)
    0.00001    61.9
    0.0001     62.4
    0.001      56.7
    0.01       41.9
    0.1        43.9
    1          30.7
    10         37.3

    Vocab Size = 100, Features per image = 100, Step size = 8
    LAMBDA     Accuracy (%)
    0.00001    58.3
    0.0001     60.3
    0.001      54.9
    0.01       46.7
    0.1        43.9
    1          38.2
    10         39.1

    Vocab Size = 100, Features per image = 100, Step size = 16
    LAMBDA     Accuracy (%)
    0.00001    49.9
    0.0001     50.5
    0.001      48.2
    0.01       41.5
    0.1        36.7
    1          37.5
    10         38.4

  6. As the results show, with bag of SIFT as the features, the SVM performed considerably better than K-NN while also being less expensive, which suggests it is less vulnerable to noise in the data. The optimum LAMBDA was around 0.0001.
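The 1-vs-all scheme can be sketched as follows. This is not the project's MATLAB solver, just a toy Python version: one hinge-loss SVM per class trained by plain sub-gradient descent, where `lam` plays the role of the LAMBDA parameter in the table above:

```python
def train_one_vs_all_svm(feats, labels, lam=0.0001, epochs=50, lr=0.1):
    """Train one linear SVM per class: for class c, relabel its samples
    +1 and everything else -1, then minimise hinge loss + lam * ||w||^2."""
    classes = sorted(set(labels))
    dim = len(feats[0])
    models = {}
    for c in classes:
        w, b = [0.0] * dim, 0.0
        ys = [1.0 if l == c else -1.0 for l in labels]
        for _ in range(epochs):
            for x, y in zip(feats, ys):
                margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
                for i in range(dim):
                    # regularisation pulls w toward 0; hinge term fires
                    # only for samples inside the margin
                    grad = lam * w[i] - (y * x[i] if margin < 1 else 0.0)
                    w[i] -= lr * grad
                if margin < 1:
                    b += lr * y
        models[c] = (w, b)
    return models

def svm_predict(models, x):
    """Pick the class whose 1-vs-all SVM gives the largest decision value."""
    return max(models,
               key=lambda c: sum(wi * xi
                                 for wi, xi in zip(models[c][0], x))
                             + models[c][1])
```

At test time every class's SVM scores the feature vector and the most confident one wins, which is what makes the 1-vs-all reduction work for a 15-way scene problem.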







Extra Credits

Experimental design


1. Experiment with different vocabulary sizes
Look for vocab_{size}.mat files in the code directory

I experimented with bag of SIFT features and SVM learners. From the table we can see that as the vocabulary size increases, performance improves up to a certain size, after which it degrades slightly while runtime keeps increasing.

Step size = 8, LAMBDA = 0.0001, Features per image = 100
Vocab size    Accuracy (%)
50            55.3
100           60.3
200           61.4
300           62.1
400           61.7
500           60.4
1000          59.5
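The vocabulary itself comes from clustering sampled descriptors, so the vocab size is simply the k passed to K-means. As a minimal Python sketch of that step (the project uses MATLAB; the deterministic initialisation here is a simplification for illustration):

```python
def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; the k centres form the visual vocabulary."""
    centres = [list(p) for p in points[:k]]   # naive deterministic init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            best = min(range(k),
                       key=lambda c: sum((a - b) ** 2
                                         for a, b in zip(p, centres[c])))
            clusters[best].append(p)
        for c in range(k):
            if clusters[c]:   # keep the old centre if a cluster empties
                centres[c] = [sum(vals) / len(clusters[c])
                              for vals in zip(*clusters[c])]
    return centres
```

Larger k gives a finer partition of descriptor space, which explains the accuracy/runtime trade-off in the table above.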

2. Cross validation
Code in run_cross_validation.m

I ran this 10 times. Each time I randomly took 100 samples per class for both the train and test sets, then ran my baseline method of bag of SIFT with SVM. The results are quite stable for my final tuned parameters.

Mean accuracy         58.39
Standard deviation    2.69

Step size = 8, Vocab size = 100, LAMBDA = 0.0001, Features per image = 100
Run number    Accuracy (%)
1             55.6
2             59.3
3             62.6
4             58.3
5             61.8
6             60.7
7             55.2
8             56.9
9             55.6
10            57.9
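The resampling loop is easy to express in outline. This Python sketch is only a stand-in for run_cross_validation.m: `evaluate` is a hypothetical callback representing one full train-and-test run of the bag-of-SIFT + SVM pipeline:

```python
import random
import statistics

def cross_validate(samples_by_class, evaluate, runs=10, per_class=100, seed=0):
    """Repeatedly draw `per_class` random train and test samples per
    class, evaluate the classifier, and report mean/std accuracy."""
    rng = random.Random(seed)
    accs = []
    for _ in range(runs):
        train, test = {}, {}
        for cls, samples in samples_by_class.items():
            picks = rng.sample(samples, 2 * per_class)  # disjoint splits
            train[cls], test[cls] = picks[:per_class], picks[per_class:]
        accs.append(evaluate(train, test))
    return statistics.mean(accs), statistics.pstdev(accs)
```

Reporting the standard deviation alongside the mean is what supports the stability claim above.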



3. Validation set to tune the model
Code in run_validation_based_tuning.m

I experimented with tuning the regularization parameter of my linear SVM classifier on a validation set. I draw the validation set from the training set itself, using 10-fold validation.

Final unused test set accuracy: 60.3
Step size = 8, Vocab size = 100, LAMBDA = 0.0001, Features per image = 100
LAMBDA      Mean accuracy (%)    Standard deviation
0.000001    52.9                 4.25
0.00001     55.3                 4.65
0.0001      57.1                 4.26
0.001       55.7                 4.05
0.01        41.4                 3.75
0.1         39.5                 3.78
1           30.6                 3.68
10          35.6                 2.64
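The tuning loop itself is a simple grid search over LAMBDA. In this Python sketch (not the run_validation_based_tuning.m code), `validate` is a hypothetical callback that trains on the remaining folds and returns the per-fold validation accuracies for one candidate value:

```python
def tune_lambda(lambdas, validate):
    """Grid-search the regulariser: `validate(lam)` returns a list of
    per-fold accuracies; keep the value with the best mean."""
    best_lam, best_mean = None, -1.0
    for lam in lambdas:
        scores = validate(lam)
        mean = sum(scores) / len(scores)
        if mean > best_mean:
            best_lam, best_mean = lam, mean
    return best_lam, best_mean
```

Only after the winner is chosen on validation folds is the held-out test set touched once, which is why the final test accuracy above is reported separately.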




Feature quantization and bag of words


1. Kernel codebook encoding
Code in get_kernel_codebook.m

I used 'soft assignment' via kernel codebook encoding for bag of words. In this method, assuming a Gaussian distribution at each cluster, a feature is assigned to several clusters, with weights that fall off with its distance to each cluster center. It performed better than the baseline SVM + bag of SIFT (60%), though not by a large margin. I tried various values of sigma and found the one below to be best.

Sigma = 210
Vocab Size = 100, Features per image = 100, Step size = 8
LAMBDA       Accuracy (%)
10           37.3
1            30.7
0.1          43.9
0.0001       49.7
0.00001      59.5
0.000001     64.2
0.0000001    65.3
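As an illustrative Python sketch of the soft assignment (the actual code is in get_kernel_codebook.m), each descriptor contributes to every word with a Gaussian weight of its distance, and the weights are normalised so each descriptor still adds a total mass of one to the histogram:

```python
import math

def kernel_codebook_histogram(descriptors, vocab, sigma=210.0):
    """Soft-assign each descriptor to every visual word with weight
    exp(-d^2 / (2 sigma^2)), then L1-normalise the histogram."""
    hist = [0.0] * len(vocab)
    for d in descriptors:
        weights = [math.exp(-sum((a - b) ** 2 for a, b in zip(d, w))
                            / (2 * sigma ** 2))
                   for w in vocab]
        total = sum(weights)
        for i, wt in enumerate(weights):
            hist[i] += wt / total   # each descriptor contributes mass 1
    s = sum(hist) or 1.0
    return [h / s for h in hist]
```

Compared with hard assignment, a descriptor that falls between two cluster centers now splits its vote instead of being forced into one bin.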


2. Fisher encoding
Code in get_fisher_vectors.m

In addition to keeping the count of descriptors per word, Fisher encoding also captures information about the distribution of the descriptors around each word. As the results show, it improved accuracy significantly over the baseline SVM + bag of SIFT, which was 60%.

Step size = 4, LAMBDA = 0.0001
Number of clusters    Accuracy (%)
10                    72.0
30                    74.3
50                    76.1

Step size = 8, LAMBDA = 0.0001
Number of clusters    Accuracy (%)
10                    70.3
30                    72.1
50                    75.3
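A simplified Python sketch of the encoding (the project's get_fisher_vectors.m works on real SIFT data; here the GMM parameters are given directly and covariances are diagonal): per Gaussian, descriptors are soft-assigned by posterior probability, and the posterior-weighted first- and second-order deviations from the Gaussian's mean are accumulated:

```python
import math

def fisher_vector(descriptors, means, sigmas, weights):
    """Simplified Fisher encoding w.r.t. a diagonal-covariance GMM.
    Returns, per Gaussian, D first-order then D second-order statistics."""
    K, D, N = len(means), len(means[0]), len(descriptors)
    # soft-assignment posteriors gamma_k(x) for every descriptor
    posts = []
    for x in descriptors:
        lik = []
        for k in range(K):
            e = sum(((x[d] - means[k][d]) / sigmas[k][d]) ** 2
                    for d in range(D))
            norm = 1.0
            for d in range(D):
                norm *= sigmas[k][d] * math.sqrt(2 * math.pi)
            lik.append(weights[k] * math.exp(-e / 2) / norm)
        s = sum(lik) or 1.0
        posts.append([l / s for l in lik])
    fv = []
    for k in range(K):
        for d in range(D):   # first-order: mean deviations
            u = sum(posts[n][k] * (descriptors[n][d] - means[k][d])
                    / sigmas[k][d] for n in range(N))
            fv.append(u / (N * math.sqrt(weights[k])))
        for d in range(D):   # second-order: variance deviations
            v = sum(posts[n][k]
                    * (((descriptors[n][d] - means[k][d])
                        / sigmas[k][d]) ** 2 - 1) for n in range(N))
            fv.append(v / (N * math.sqrt(2 * weights[k])))
    return fv
```

The resulting vector has 2 * K * D dimensions, which is why far fewer clusters (10-50) suffice here than in the plain bag-of-words vocabulary.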




Classifier


1. RBF and Chi-square
Code in get_chisqr_classify.m and get_rbf_classify.m

Here I obtained good results with the chi-square kernel, around 63.1%, but the RBF kernel did not perform as well, coming in at around 53.2%.
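For reference, the two kernel functions compared above can be sketched in Python as follows (the actual classifiers live in get_chisqr_classify.m and get_rbf_classify.m; `gamma` here is an assumed scaling parameter):

```python
import math

def chi_square_kernel(h1, h2, gamma=1.0):
    """Exponential chi-square kernel, well suited to histogram features."""
    chi2 = sum((a - b) ** 2 / (a + b)
               for a, b in zip(h1, h2) if a + b > 0)
    return math.exp(-gamma * chi2)

def rbf_kernel(h1, h2, gamma=1.0):
    """Gaussian RBF kernel on the squared Euclidean distance."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(h1, h2)))
```

The chi-square distance weights each bin's difference by the bin's total mass, which is often a better fit for normalised bag-of-words histograms than plain Euclidean distance, consistent with the gap observed above.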



Spatial Pyramid representation and classifier


1. Spatial pyramid feature representation
Code in get_spatial_pyramid.m

Here I analyze the effect of adding spatial information to the features by placing grids of increasing granularity over the image. At any fixed level, two points are said to match if they fall into the same cell of the grid. As the results show, accuracy improves at higher pyramid levels.

LAMBDA = 0.0001
Step size    Pyramid level    Accuracy (%)
16           0                55.1
16           1                58.8
16           2                62.6
16           3                63.9
8            3                69.7
4            3                70.9
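The pyramid feature can be sketched as concatenated per-cell histograms, one grid per level. This Python illustration stands in for get_spatial_pyramid.m; `keypoints` are (x, y) descriptor positions and `words` their visual-word indices:

```python
def spatial_pyramid(keypoints, words, vocab_size, levels, width, height):
    """Concatenate bag-of-words histograms over a (2^l x 2^l) grid for
    each pyramid level l = 0..levels (level 0 is the plain histogram)."""
    feat = []
    for level in range(levels + 1):
        cells = 2 ** level
        hist = [0.0] * (cells * cells * vocab_size)
        for (x, y), w in zip(keypoints, words):
            cx = min(int(x * cells / width), cells - 1)
            cy = min(int(y * cells / height), cells - 1)
            hist[(cy * cells + cx) * vocab_size + w] += 1.0
        total = sum(hist) or 1.0
        feat.extend(h / total for h in hist)   # normalise per level
    return feat
```

Two keypoints with the same visual word now only reinforce each other when they land in the same cell, which is exactly the level-wise matching criterion described above.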







Best Result

Although I achieved an accuracy of 76.1% with step size 4, I selected a step size of 8 for displaying my results, as it is easier to reproduce given the time it takes to run. Accuracy = 72.5%, Step size = 8, Fisher encoding.

Scene classification results visualization


Accuracy (mean of diagonal of confusion matrix) is 0.725

Category        Accuracy
Kitchen         0.590
Store           0.720
Bedroom         0.470
LivingRoom      0.490
Office          0.930
Industrial      0.710
Suburb          0.990
InsideCity      0.710
TallBuilding    0.790
Street          0.670
Highway         0.820
OpenCountry     0.530
Coast           0.780
Mountain        0.810
Forest          0.860

(The full results page also shows, for each category, sample training images, sample true positives, false positives with their true labels, and false negatives with their predicted labels.)