Project 4 / Scene Recognition with Bag of Words

The coding for the main tasks was straightforward this time, so I will only report the accuracy of the three pipelines. The numbers are close to what we were told to expect, and perhaps a little low, because I preferred to spend my time on the extra credit rather than tune parameters for a suboptimal pipeline.

Accuracy of the three recognition pipelines

| Pipeline | Accuracy (%) | Vocabulary size |
|----------|--------------|-----------------|
| tiny images + nearest neighbor | 19.4 | — |
| bag of SIFT + nearest neighbor | 46.7 | 200 |
| bag of SIFT + 1 vs all linear SVM | 59.1 | 256 |

These were the best results that I was able to achieve. Please note that vl_svmtrain uses a stochastic optimization method, so the accuracy can drop below 55% in some runs, though it lands in the 57-59% range most of the time.
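
A minimal sketch of that 1-vs-all training step, assuming vlfeat's vl_svmtrain is on the path (the variable names feats, train_labels, and categories, as well as the lambda value, are illustrative, not the exact project code):

    % Train one linear SVM per category (1 vs all) with vl_svmtrain.
    % feats: D x N feature matrix, one column per training image.
    lambda = 0.0001;                         % regularization weight (needs tuning)
    C = numel(categories);
    W = zeros(size(feats, 1), C);
    B = zeros(1, C);
    for c = 1:C
        labels = -ones(1, size(feats, 2));   % -1 for every other category
        labels(strcmp(train_labels, categories{c})) = 1;
        [W(:, c), B(c)] = vl_svmtrain(single(feats), labels, lambda);
    end
    % At query time, assign each test image to the category with the
    % highest decision value.
    scores = bsxfun(@plus, W' * single(test_feats), B');   % C x M score matrix
    [~, predictions] = max(scores, [], 1);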

Extra Credit

bag of SIFT + 1 vs all linear SVM at multiple scales

For the bag-of-SIFT setup I sampled features from multiple levels of a Gaussian pyramid (a sketch of the sampling follows the results table below). Due to runtime concerns regarding the k-means clustering, I only ran this with SIFT features of cell size 8 and a step length of 16. Considering features at multiple scales improved the performance mildly; however, using deep Gaussian pyramids (large L) actually decreased the performance. I attribute this to many of the images looking too similar at very small scales.

| Pyramid levels L | Accuracy (%) |
|------------------|--------------|
| 1 | 57.1 |
| 2 | 57.3 |
| 3 | 58.5 |
| 4 | 57.9 |
| 5 | 54.1 |
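
A minimal sketch of the multi-scale sampling, assuming vlfeat's vl_dsift and vl_imsmooth (the function name multiscale_sift and the smoothing sigma are illustrative, not the exact project code):

    function descrs = multiscale_sift(img, L)
    % Collect dense SIFT descriptors from L levels of a Gaussian pyramid.
    descrs = [];
    level = single(img);
    for l = 1:L
        % Dense SIFT with cell size 8 and step length 16, as in the
        % experiments above.
        [~, d] = vl_dsift(level, 'Size', 8, 'Step', 16, 'Fast');
        descrs = [descrs, d];
        % Smooth and downsample by a factor of 2 for the next level.
        level = vl_imsmooth(level, 1.0);
        level = level(1:2:end, 1:2:end);
    end
    end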

Non-linear SVM

Furthermore, I also tried non-linear SVMs with various kernels. This code differs from the linear SVM code in that I used MATLAB's SVM implementation and trained 105 1-vs-1 SVMs instead of 15 1-vs-all SVMs, because that was faster (with 15 classes there are 15*14/2 = 105 pairs). For the data in the following table I applied the non-linear SVM classifier to the best setup from the previous section, i.e. the bag-of-SIFT model with L = 3.

The first case (linear) was actually only meant to check whether the implementation works, but as it turns out, 1-vs-1 SVMs not only train faster but also show superior classification performance (the 1-vs-all linear SVM achieved 58.5% vs. 64.3% for the 1-vs-1 SVM with a linear kernel). Next I tried the polynomial kernel, which is apparently not a good fit for comparing histograms. After that came the Gaussian kernel, which performed really badly (7%), i.e. no better than chance (1/15 ≈ 6.7%). This was very surprising, and the confusion matrix showed that it classified basically every image as class 15. I do not think it is a bug in my code, because I only changed the kernel function and the other kernel functions work well; a plausible culprit is the kernel scale, since a sigma far from the typical pairwise distance between histograms makes the Gram matrix degenerate and pushes every test image towards the same prediction. Lastly, I implemented the chi-squared kernel as a custom kernel using vlfeat's vl_alldist2 function (a sketch follows the table). The accuracy was an astonishing 69.5%, which is 5 percentage points better than the linear kernel and an 11-point improvement over the linear SVM.

| Kernel | Accuracy (%) | Training time (s) | Query time (s) |
|--------|--------------|-------------------|----------------|
| Linear | 64.3 | 2.8 | 0.3 |
| Polynomial of degree 4 | 41.7 | 228 | 2.2 |
| Gaussian | 6.9 | 3.1 | 3.5 |
| Chi-squared | 69.5 | 4.8 | 23.0 |
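
The chi-squared case can be wired into MATLAB's SVM roughly as below. This is a sketch that assumes fitcecoc/templateSVM for the 1-vs-1 ensemble (the report does not name the exact MATLAB interface used) and vl_alldist2's 'kchi2' option; the function name chi2_kernel and the variable names are illustrative:

    function K = chi2_kernel(U, V)
    % Chi-squared kernel K(x,y) = sum_i 2*x_i*y_i / (x_i + y_i).
    % fitcsvm hands over observations as rows; vl_alldist2 expects columns,
    % and 'kchi2' computes the chi-squared kernel (not the distance).
    K = double(vl_alldist2(single(U'), single(V'), 'kchi2'));
    end

    % Train the 105 1-vs-1 SVMs (15 classes) with the custom kernel.
    t = templateSVM('KernelFunction', 'chi2_kernel');
    model = fitcecoc(feats', train_labels, 'Coding', 'onevsone', 'Learners', t);
    predictions = predict(model, test_feats');   % feats are D x N, hence the transpose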

Fisher vector encoding

I have also implemented Fisher vectors for feature quantization. My implementation follows this paper. Learning the "vocabulary", i.e. the GMM, consists of first collecting SIFT features with a cell size of 8 pixels and a step size of 16 pixels at 5 levels of a Gaussian pyramid. Then I use PCA to reduce the 128-dimensional features to 64 dimensions. Finally I fit the GMM with vl_gmm, which supports only diagonal covariances.
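
A minimal sketch of this vocabulary step and the subsequent encoding, assuming vlfeat's vl_gmm and vl_fisher (the number of Gaussians K, the helper collect_multiscale_sift, and the variable names are hypothetical):

    % descrs: 128 x N dense SIFT descriptors pooled over all training images.
    descrs = collect_multiscale_sift(train_images);   % hypothetical helper
    % PCA: project the 128-D descriptors down to 64 dimensions.
    coeff = pca(double(descrs'));                     % columns = principal components
    proj = coeff(:, 1:64)';                           % 64 x 128 projection
    reduced = single(proj * double(descrs));
    % Fit the GMM "vocabulary"; vl_gmm estimates diagonal covariances.
    K = 128;                                          % hypothetical number of Gaussians
    [means, covs, priors] = vl_gmm(reduced, K);
    % Encode one image's descriptors as a 2*64*K-dimensional Fisher vector.
    img_reduced = single(proj * double(img_descrs));
    fv = vl_fisher(img_reduced, means, covs, priors, 'Improved');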

| Classifier | Accuracy (%) | Training time (s) | Query time (s) |
|------------|--------------|-------------------|----------------|
| Non-linear SVM with chi-squared kernel | 7.7 | 1604 | 13712 |
| 1-vs-1 linear SVM | 73.5 | 113 | 14.8 |
| 1-vs-all linear SVM | 78.1 | 8.8 | 2.1 |

First I tested the non-linear SVM classifier with the chi-squared kernel on this, simply because it worked so well for the bag of SIFT words. Here, however, it did not work at all, and the confusion matrix showed that the images were categorized essentially at random. A possible explanation is that the Fisher vector is already an explicit feature map for a non-linear kernel, so stacking a second kernel on top is redundant; moreover, the chi-squared kernel is defined for non-negative histogram entries, while Fisher vectors contain negative components, which makes the kernel values ill-defined. So I went back to linear SVMs, which is what the paper's authors used as well, and actually got the best results with the original 1-vs-all classifier.

Full confusion matrix and classifier results for best setup

I achieved my best results with Fisher vectors and the plain 1-vs-all linear SVM.

Accuracy of 78.1%

(The sample training, true-positive, false-positive, and false-negative images from the original results table are omitted here; only the per-category accuracies and the labels shown under the misclassified images survive.)

| Category name | Accuracy | False positives (true label) | False negatives (wrong predicted label) |
|---------------|----------|------------------------------|------------------------------------------|
| Kitchen | 0.620 | Store, Bedroom | LivingRoom, Bedroom |
| Store | 0.810 | LivingRoom, Industrial | Bedroom, Kitchen |
| Bedroom | 0.630 | LivingRoom, Kitchen | Kitchen, LivingRoom |
| LivingRoom | 0.510 | Office, Bedroom | Store, Office |
| Office | 0.950 | InsideCity, Store | LivingRoom, LivingRoom |
| Industrial | 0.670 | TallBuilding, Coast | Suburb, Coast |
| Suburb | 0.990 | TallBuilding, Industrial | InsideCity |
| InsideCity | 0.770 | Bedroom, Highway | Store, LivingRoom |
| TallBuilding | 0.830 | InsideCity, Industrial | InsideCity, Industrial |
| Street | 0.810 | TallBuilding, Industrial | Industrial, Store |
| Highway | 0.850 | Coast, Industrial | Suburb, InsideCity |
| OpenCountry | 0.700 | Coast, Coast | Coast, Coast |
| Coast | 0.730 | TallBuilding, OpenCountry | Forest, OpenCountry |
| Mountain | 0.920 | Industrial, OpenCountry | Suburb, Bedroom |
| Forest | 0.930 | Mountain, Highway | Mountain, OpenCountry |