Project 4: Scene recognition with bag of words

Overview

The goal of this project is to introduce you to image recognition. Specifically, we will examine the task of scene recognition starting with very simple methods -- tiny images and nearest neighbor classification -- and then move on to more advanced methods -- bags of quantized local features and linear classifiers learned by support vector machines.

Basic Implementation

Tiny Images

The tiny image features are very simple: just use imresize() to shrink each image to 16x16, and then normalize the resulting features.
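
A minimal sketch of this feature, assuming the starter-code style interface get_tiny_images(image_paths); the normalization shown (zero mean, unit length) is one common choice, not necessarily the exact one used here:

```matlab
% Minimal sketch of the tiny-image feature (illustrative, not the exact code).
function image_feats = get_tiny_images(image_paths)
    dim = 16;                                    % target size: 16x16
    image_feats = zeros(length(image_paths), dim * dim);
    for i = 1:length(image_paths)
        img  = im2double(imread(image_paths{i}));
        tiny = imresize(img, [dim dim]);         % shrink to 16x16
        v = tiny(:)';                            % flatten to a 256-d row vector
        v = v - mean(v);                         % zero mean
        v = v / (norm(v) + eps);                 % unit length
        image_feats(i, :) = v;
    end
end
```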

Bag of SIFT

This is slightly more complicated: first we need to build a vocabulary. Using vl_dsift(), we extract dense SIFT features from each training image, and then run k-means on them to get the actual "vocabulary" of visual words.
vl_dsift() has several parameters we can change. When building the vocabulary I set the bin size to 8 and the step size to 4 or 8, since a smaller step size samples the image more densely and gives more representative SIFT features. When computing the bag-of-SIFT histogram for every image, I increase the step size to 8 to make the computation a bit faster.
In the k-means step, k equals the pre-defined vocabulary size. The choice of vocabulary size is discussed later in the extra credit section.
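
A rough sketch of the two steps, assuming VLFeat's vl_dsift / vl_kmeans / vl_alldist2 and illustrative variable names; the parameters follow the text (bin size 8, step 4 for the vocabulary, step 8 for the histograms):

```matlab
% Step 1: build the vocabulary by clustering dense SIFT descriptors.
all_sift = [];
for i = 1:length(image_paths)
    img = single(imread(image_paths{i}));                        % grayscale scene images
    [~, feats] = vl_dsift(img, 'Size', 8, 'Step', 4, 'Fast');    % 128 x N descriptors
    % (in practice, subsample feats per image to keep k-means tractable)
    all_sift = [all_sift, single(feats)];
end
vocab = vl_kmeans(all_sift, vocab_size);                         % 128 x vocab_size centers

% Step 2: represent each image as a normalized histogram of visual words.
image_feats = zeros(length(image_paths), vocab_size);
for i = 1:length(image_paths)
    img = single(imread(image_paths{i}));
    [~, feats] = vl_dsift(img, 'Size', 8, 'Step', 8, 'Fast');
    d = vl_alldist2(vocab, single(feats));                       % vocab_size x N distances
    [~, words] = min(d, [], 1);                                  % nearest word per descriptor
    h = histc(words, 1:vocab_size);
    image_feats(i, :) = h / sum(h);                              % L1-normalize the histogram
end
```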

Linear SVM

Since I used plain 1-NN for the nearest neighbor classifier, there is not much to say about it. Here I will discuss my linear SVM implementation.
This implementation uses 1-vs-all SVMs: as there are 15 classes, we train 15 SVMs. Each SVM decides whether a sample belongs to a particular class and outputs a confidence score, and the sample is finally assigned to the class whose SVM outputs the largest score.

This linear SVM uses the function vl_svmtrain(). One interesting observation: if we set the labels to -1 and +1, the SVM actually performs a little better than if we set them to 0 and 1. I suppose this is due to the implementation of this particular SVM solver.
There is another parameter, lambda, which we can play with in this function. I did not tune it very precisely, but rather roughly tested several values by intuition. In the end I went with lambda = 0.00005, which gives quite a good result; the actual accuracy is discussed in later sections.
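
A minimal sketch of the 1-vs-all training and prediction, assuming N x D feature matrices and cell arrays of category names as in the starter code (variable names are illustrative):

```matlab
lambda = 0.00005;
categories = unique(train_labels);
num_categories = length(categories);              % 15 scene classes
D = size(train_image_feats, 2);

W = zeros(D, num_categories);                     % one weight vector per class
B = zeros(1, num_categories);
X = train_image_feats';                           % vl_svmtrain expects D x N

for c = 1:num_categories
    y = -ones(1, size(X, 2));                     % -1 for all other classes...
    y(strcmp(categories{c}, train_labels)) = 1;   % ...+1 for this class
    [W(:, c), B(c)] = vl_svmtrain(X, y, lambda);
end

% Evaluate all 15 decision functions and take the largest score.
scores = test_image_feats * W + repmat(B, size(test_image_feats, 1), 1);
[~, idx] = max(scores, [], 2);
predicted_categories = categories(idx);
```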

Accuracy/Performance Report

The first table shows accuracies on the whole dataset (i.e., the combination of all images in the training and test folders). All samples are shuffled randomly.
The last result (Bag of SIFT + SVM w/ Gaussian/RBF kernel, vocab_size = 5000) is trained and tuned using k-fold cross-validation, and the final accuracy is computed on a separate validation set.

Method Accuracy
Tiny Image + Nearest Neighbor Classifier 23.9%
Bag of SIFT + Nearest Neighbor Classifier 55.7%
Bag of SIFT + Linear SVM Classifier 71.9%
Bag of SIFT + SVM w/ Gaussian/RBF kernel, vocab_size = 200 73.7%
Bag of SIFT + SVM w/ Gaussian/RBF kernel, vocab_size = 5000 79.6%

The second table shows the performance and setup of the first three conditions, run on the original 100-images-per-class dataset as required by the grading rubric:
Method Accuracy Execution Time Parameters
Tiny Image + Nearest Neighbor Classifier 21.7% 19.6s vocab_size=100; 16x16; Normalized; 1-NN
Bag of SIFT + Nearest Neighbor Classifier 50.3% 36.6s vocab_size=100; SIFT: size=8, step=8; 1-NN
Bag of SIFT + Linear SVM Classifier 65.7% 59.9s vocab_size=100; SIFT: size=8, step=8; Linear SVM: Lambda = 0.00005

And here is the auto-generated confusion matrix and per-category table for the best result (Bag of SIFT + SVM w/ Gaussian/RBF kernel, vocab_size = 5000):


Accuracy (mean of diagonal of confusion matrix) is 0.796
Category name   Accuracy   False positives (true label)   False negatives (predicted label)
Kitchen         0.345      Industrial, Bedroom            Bedroom, Store
Store           0.892      Bedroom, LivingRoom            LivingRoom, TallBuilding
Bedroom         0.345      Office, Kitchen                LivingRoom, LivingRoom
LivingRoom      0.579      Kitchen, Kitchen               Mountain, Bedroom
Office          0.446      LivingRoom, LivingRoom         Kitchen, LivingRoom
Industrial      0.579      LivingRoom, Store              Store, Coast
Suburb          0.791      Bedroom, Industrial            Industrial, OpenCountry
InsideCity      0.780      TallBuilding, Store            Bedroom, LivingRoom
TallBuilding    1.114      Street, Highway                LivingRoom, Industrial
Street          0.880      Highway, InsideCity            TallBuilding, InsideCity
Highway         0.758      Coast, Street                  OpenCountry, Coast
OpenCountry     1.070      Coast, Highway                 Forest, Coast
Coast           1.126      InsideCity, Highway            OpenCountry, OpenCountry
Mountain        1.148      Coast, Industrial              Bedroom, OpenCountry
Forest          1.092      OpenCountry, OpenCountry       OpenCountry, OpenCountry

Extra Credit

Train the SVM with more sophisticated kernels

Besides the linear SVM, I also used an SVM classifier with a Gaussian/RBF kernel, implemented with the widely used LIBSVM. I chose C-SVM, which is the most straightforward formulation for classification, together with the Gaussian/RBF kernel.
There are two parameters we can tune: gamma and the penalty parameter C. How I chose them is discussed in the cross-validation section.
As shown above, the Gaussian/RBF kernel improves the accuracy from 71.9% (linear SVM) to 79.6%, which is a substantial gain.
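
A sketch of this setup with LIBSVM's MATLAB interface (svmtrain / svmpredict); the sigma and C values shown are just examples of the kind found by cross-validation below, and variable names are illustrative:

```matlab
sigma = 1;  C = 10;
gamma = 1 / (2 * sigma^2);

% LIBSVM wants numeric labels and double-precision features.
[categories, ~, train_y] = unique(train_labels);
opts  = sprintf('-s 0 -t 2 -c %g -g %g -q', C, gamma);    % C-SVC with RBF kernel
model = svmtrain(train_y, double(train_image_feats), opts);

% Predict class indices for the test histograms, then map back to names.
dummy = zeros(size(test_image_feats, 1), 1);
pred  = svmpredict(dummy, double(test_image_feats), model);
predicted_categories = categories(pred);
```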

Use cross-validation to measure performance

I used k-fold cross-validation to tune the parameter sigma (which is related to gamma by gamma = 1/(2*sigma^2)) and the penalty parameter C of my RBF SVM. Instead of the approach proposed on the assignment webpage, I used the "traditional" scheme, in which the whole training set is randomly divided into k parts of roughly equal size. This does make the computation take much longer, though.
For both sigma and C I tried the values [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]. As you may notice, this is roughly a geometric sequence with a factor of 3, which in practice works quite well for a coarse search. Because there are two variables, each search tries 10^2 = 100 combinations, and because I chose k = 12 for the cross-validation, every run trains 100*12 = 1200 models. This is a lot to compute, and since I can only run it on the CPU, it takes a while even though I have partially parallelized it. A sketch of the grid search follows.
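
In the sketch below, trainRBF and evalAccuracy are hypothetical helpers standing in for the LIBSVM calls shown earlier; everything else follows the description above:

```matlab
sigmas = [0.001 0.003 0.01 0.03 0.1 0.3 1 3 10 30];
Cs     = sigmas;                         % same 10-value grid for C
k = 12;
N = size(train_image_feats, 1);
fold = mod(randperm(N), k) + 1;          % random fold assignment, roughly equal sizes

cv_acc = zeros(length(sigmas), length(Cs));
for i = 1:length(sigmas)
    for j = 1:length(Cs)
        acc = zeros(1, k);
        for f = 1:k
            tr = fold ~= f;  va = fold == f;
            model  = trainRBF(train_image_feats(tr, :), train_labels(tr), ...
                              sigmas(i), Cs(j));               % hypothetical helper
            acc(f) = evalAccuracy(model, train_image_feats(va, :), ...
                                  train_labels(va));           % hypothetical helper
        end
        cv_acc(i, j) = mean(acc);        % average accuracy over the 12 folds
    end
end
[best, idx] = max(cv_acc(:));
[bi, bj] = ind2sub(size(cv_acc), idx);
fprintf('best sigma = %g, C = %g (CV accuracy %.3f)\n', sigmas(bi), Cs(bj), best);
```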

Not too surprisingly, the results are quite consistent when searching for the best parameters under different vocabulary sizes: almost every time the best sigma is 1, and the best C is usually 3, 10 or 30.
The following results summarize the average cross-validation accuracies obtained while tuning the parameters. I forgot to save the full accuracy grids, so I can only list the best settings and approximate accuracies for each vocabulary size:

vocab_size=10. Best at sigma=0.3, C=1, accuracy is roughly 54%
vocab_size=20. Best at sigma=1, C=30, accuracy is roughly 62%
vocab_size=50. Best at sigma=1, C=10, accuracy is roughly 67%
vocab_size=100. Best at sigma=1, C=3, accuracy is roughly 71%
vocab_size=200. Best at sigma=1, C=3, accuracy is roughly 73%
vocab_size=400. Best at sigma=1, C=10, accuracy is roughly 77%
vocab_size=1000. Best at sigma=1, C=3, accuracy is roughly 77%
vocab_size=5000. Best at sigma=1, C=10, accuracy is roughly 79%

Add a validation set to your training process to tune learning parameters

I cut out 30% of the whole dataset as a validation set to evaluate the performance for each chosen vocabulary size. (This method is also used when tuning sigma and C, but there I only use the validation set to check the final performance rather than to actually "tune" sigma and C.) The validation set is simply a randomly chosen 30 percent of the samples, and it is used to make sure the model is not overfitting. On this validation set I still get high accuracy with my models, which suggests they generalize well. The actual accuracies on the validation set for different vocabulary sizes are shown in the next section.
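
A minimal sketch of the random 70/30 split; all_feats and all_labels are illustrative names for the full feature matrix and label list:

```matlab
N = size(all_feats, 1);
perm  = randperm(N);
n_val = round(0.30 * N);                 % hold out 30% as the validation set

val_idx   = perm(1:n_val);
train_idx = perm(n_val + 1:end);

val_feats   = all_feats(val_idx, :);    val_labels   = all_labels(val_idx);
train_feats = all_feats(train_idx, :);  train_labels = all_labels(train_idx);
```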

Experiment with many different vocabulary sizes and report performance

I experimented with the following vocabulary sizes: 10, 20, 50, 100, 200, 400, 1000 and 5000. A vocabulary of 10000 was too large for my machine to compute.
The result is as I expected: the larger the vocabulary, the higher the accuracy.

The following table lists the accuracy for each vocabulary size on a separate validation set:

Vocabulary Size Accuracy
10 52.3%
20 63.3%
50 70.0%
100 72.6%
200 73.7%
400 73.8%
1000 77.4%
5000 79.6%


And here is a visualization of the relationship between vocabulary size and accuracy. For clarity, vocabulary sizes are plotted on a logarithmic scale:
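
The plot can be reproduced with something like the following, using the validation accuracies from the table above:

```matlab
vocab_sizes = [10 20 50 100 200 400 1000 5000];
accuracy    = [52.3 63.3 70.0 72.6 73.7 73.8 77.4 79.6];   % validation accuracy (%)
semilogx(vocab_sizes, accuracy, '-o');                     % log-scale x-axis
grid on;
xlabel('Vocabulary size (log scale)');
ylabel('Validation accuracy (%)');
```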