The goal of this project is to introduce you to image recognition. Specifically, we will examine the task of scene recognition starting with very simple methods -- tiny images and nearest neighbor classification -- and then move on to more advanced methods -- bags of quantized local features and linear classifiers learned by support vector machines.

The tiny image features are very simple: just use imresize() to shrink each image to 16x16 size, and then normalize the resulting features.

This is slightly more complicated: first we need to build vocabulary. By using vl_dsift(), we get a bunch of sift features out of each image. After that, we use k-means
to get the actual "vocabulary".

This vl_dsift() function has several parameters we can change. I choose bin size to be 8 and step size to be 4 or 8 when building the vocabulary, because a small step size
will make the sift features more accurate. When calculating bags of sift features for all images, I change step size to 8 because it will make the calculation a little
faster.

In the k-means step, k is equal to pre-defined vocabulary size. The choices of vocabulary size will be discussed later in extra credit part.

Because I used 1-NN in nearest neighbor, there is not much to say about that. Here I will discuss my linear SVM implementation.

This implementation uses 1-vs-all SVMs to classify each sample, and as there are 15 classes, we need to compute 15 SVMs. Each SVM will decide whether a sample belongs to
a certain class and give out a number. The sample will finally belongs to the class which its corresponding SVM gives out the largest number.

This linear SVM uses the function vl_svmtrain(). There is something interesting: if we set labels to -1s and 1s, the SVM will actually performs a little better than
if we set labels to 0s and 1s. I suppose this is because of the implementation of this specific SVM.

There is another parameter, lambda, which we can play with in this function. I do not tune it too precisely, but rather roughly tested several values by intuition.
Finally I go with lambda = 0.00005, which gives me a quite good result. The actual accuracy is discussed in later sections.

The first table shows accuracies based on the whole dataset(i.e. combination of all images in training and test folder). All samples are randomized.

The last result(Bag of SIFT + SVM w/ Gaussian/RBF kernel, vocab_size = 5000) is trained and tuned using k-fold cross-validation and the final accuracy is calculated on
a separate validation set.

Method | Accuracy |

Tiny Image + Nearest Neighbor Classifier | 23.9% |

Bag of SIFT + Nearest Neighbor Classifier | 55.7% |

Bag of SIFT + Linear SVM Classifier | 71.9% |

Bag of SIFT + SVM w/ Gaussian/RBF kernel, vocab_size = 200 | 73.7% |

Bag of SIFT + SVM w/ Gaussian/RBF kernel, vocab_size = 5000 | 79.6% |

The second table shows the performance and setup of the first three conditions runs on the original 100 sample per class data set, which required by grading rubric:

Method | Confusion Matrix | Accuracy | Execution Time | Parameters |

Tiny Image + Nearest Neighbor Classifier | 21.7% | 19.6s | vocab_size=100; 16x16; Normalized; 1-NN | |

Bag of SIFT + Nearest Neighbor Classifier | 50.3% | 36.6s | vocab_size=100; SIFT: size=8, step=8; 1-NN | |

Bag of SIFT + Linear SVM Classifier | 65.7% | 59.9s | vocab_size=100; SIFT: size=8, step=8; Linear SVM: Lambda = 0.00005 |

And here is the auto-generated confusion matrix and table for the best result(Bag of SIFT + SVM w/ Gaussian/RBF kernel, vocab_size = 5000):

Accuracy (mean of diagonal of confusion matrix) is 0.796

Category name | Accuracy | Sample training images | Sample true positives | False positives with true label | False negatives with wrong predicted label | ||||
---|---|---|---|---|---|---|---|---|---|

Kitchen | 0.345 | Industrial |
Bedroom |
Bedroom |
Store |
||||

Store | 0.892 | Bedroom |
LivingRoom |
LivingRoom |
TallBuilding |
||||

Bedroom | 0.345 | Office |
Kitchen |
LivingRoom |
LivingRoom |
||||

LivingRoom | 0.579 | Kitchen |
Kitchen |
Mountain |
Bedroom |
||||

Office | 0.446 | LivingRoom |
LivingRoom |
Kitchen |
LivingRoom |
||||

Industrial | 0.579 | LivingRoom |
Store |
Store |
Coast |
||||

Suburb | 0.791 | Bedroom |
Industrial |
Industrial |
OpenCountry |
||||

InsideCity | 0.780 | TallBuilding |
Store |
Bedroom |
LivingRoom |
||||

TallBuilding | 1.114 | Street |
Highway |
LivingRoom |
Industrial |
||||

Street | 0.880 | Highway |
InsideCity |
TallBuilding |
InsideCity |
||||

Highway | 0.758 | Coast |
Street |
OpenCountry |
Coast |
||||

OpenCountry | 1.070 | Coast |
Highway |
Forest |
Coast |
||||

Coast | 1.126 | InsideCity |
Highway |
OpenCountry |
OpenCountry |
||||

Mountain | 1.148 | Coast |
Industrial |
Bedroom |
OpenCountry |
||||

Forest | 1.092 | OpenCountry |
OpenCountry |
OpenCountry |
OpenCountry |
||||

Category name | Accuracy | Sample training images | Sample true positives | False positives with true label | False negatives with wrong predicted label |

Besides the linear SVM, I also used an SVM classifier with Gaussian/RBF kernel. I did this with the widely used
LIBSVM. I choose to use C-SVM which makes classifications easier, and also use Gaussian/RBF kernel.

There are two parameters, gamma and penalty parameter C, which we can tune. In the cross-validation section I will discuss how I choose them.

As shown before, using a Gaussian/RBF kernel can improve the accuracy from about 70% to 79%, which is quite good.

I used k-fold cross-validation to tune my parameter sigma(which is related to gamma: gamma = 1/(2*sigma^2)) and penalty parameter C in my RBF SVM.
Instead of the proposed way on assignment webpage, I used the "traditional" way, in which I just divided the whole training set randomly into k parts with roughly
equal sizes. This makes it much longer to compute, though.

The values of sigma and C I tried are the following: [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]. As you may realize, this is roughly a geometric sequence with
a factor of 3, and practice shows this is quite effective. Because there are two variables, altogether I will try 10^2 = 100 combinations, and because I choose k=12 in
k-fold cross-validation, I will compute 100*12 = 1200 models every time. This is definitely a lot to compute, and because I can only run this on CPU, it takes me some
time even if I have parallelized it partially.

Not too surprising, the result is quite uniform when I tried to get the best parameter under different vocabulary sizes: almost every time the best sigma is 1, and
the best C is often 3, 10 or 30.

The following table shows a visualization of average accuracies by cross-validation when tuning parameters. I forgot to save the accuracies, so I cannot show the exact
accuracies at the time these graphs are calculated, but you can definitely get some ideas from the graph:

vocab_size=10. Best at sigma=0.3, C=1, accuracy is roughly 54% |

vocab_size=20. Best at sigma=1, C=30, accuracy is roughly 62% |

vocab_size=50. Best at sigma=1, C=10, accuracy is roughly 67% |

vocab_size=100. Best at sigma=1, C=3, accuracy is roughly 71% |

vocab_size=200. Best at sigma=1, C=3, accuracy is roughly 73% |

vocab_size=400. Best at sigma=1, C=10, accuracy is roughly 77% |

vocab_size=1000. Best at sigma=1, C=3, accuracy is roughly 77% |

vocab_size=5000. Best at sigma=1, C=10, accuracy is roughly 79% |

I cut out 30% of the whole dataset as a validation set to evaluate the performance of the chosen vocabulary size. (Actually this method is also used when tuning sigma and C, but in that process I only use the validation set to see the final performance rather than actually "tuning" sigma and C) Nonetheless, this validation set is just a randomly chosen 30 percent samples of the whole set, and it is used to make sure the model is not overfitting. By using the validation set, I still get a high accuracy with my models, which shows that they are well generalized. The actual accuracy I got on validation set with different vocabulary sizes will be shown in the next section.

I experimented on the following vocabulary sizes: 10, 20, 50, 100, 200, 400, 1000 and 5000. 10000 is too big for my machine to calculate.

The result is as I expected: the larger the vocabulary size is, the higher the accuracy will be.

In the following table is the confusion matrices with different values of vocabulary sizes, and also its accuracy on a separate validation set:

Vocabulary Size | Confusion Matrix | Accuracy |

10 | 52.3% | |

20 | 63.3% | |

50 | 70.0% | |

100 | 72.6% | |

200 | 73.7% | |

400 | 73.8% | |

1000 | 77.4% | |

5000 | 79.6% |

And here is a visualization of the relationship between vocabulary size and accuracy. For clarity, vocabulary sizes are plotted on logarithmic scale: