The coding for the main tasks was very straightforward this time, so I will only report the accuracies of the three pipelines. They are close to what we were told to expect, and perhaps slightly low, because I preferred to work on the extra credit rather than tune parameters for a suboptimal pipeline.
Pipeline | Accuracy (%) | Vocabulary size |
---|---|---|
Tiny images + nearest neighbor | 19.4 | – |
Bag of SIFT + nearest neighbor | 46.7 | 200 |
Bag of SIFT + 1-vs-all linear SVM | 59.1 | 256 |
These were the best results I was able to achieve. Note that vl_svmtrain uses a stochastic optimization method, so the accuracy of the SVM pipeline may occasionally drop below 55%, though it lands in the 57–59% range most of the time.
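The 1-vs-all scheme behind the third pipeline can be sketched as follows. This is a minimal NumPy illustration of the prediction step, not my MATLAB/vl_svmtrain code; the function name and shapes are my own choices:

```python
import numpy as np

def one_vs_all_predict(W, b, X):
    """Score each sample under K binary classifiers and pick the best.

    W: (K, D) weight vectors, one per class (e.g. from K binary SVMs)
    b: (K,)   bias terms
    X: (N, D) feature matrix (e.g. bag-of-SIFT histograms)
    Returns, for each sample, the index of the highest-scoring class.
    """
    scores = X @ W.T + b          # (N, K) decision values
    return np.argmax(scores, axis=1)
```

Each of the 15 binary SVMs is trained to separate one scene category from the rest; at query time the class with the largest decision value wins.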
For the bag-of-SIFT setup I sampled features from multiple levels of a Gaussian pyramid. Due to runtime concerns regarding the k-means clustering, I only ran this with SIFT features of cell size 8 and a step length of 16. Considering features at multiple scales improved performance mildly; however, using deep Gaussian pyramids (large L) actually decreased performance. I attribute this to many of the images looking too similar at very small scales.
Pyramid levels L | Accuracy (%) |
---|---|
1 | 57.1 |
2 | 57.3 |
3 | 58.5 |
4 | 57.9 |
5 | 54.1 |
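The multi-scale sampling idea can be sketched as follows. This is a NumPy/SciPy illustration under my own naming; the actual feature extraction runs vlfeat's dense SIFT on each pyramid level:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image, levels, sigma=1.0):
    """Return `levels` progressively blurred and downsampled copies.

    Level 0 is the original image; each further level is smoothed with a
    Gaussian and subsampled by a factor of 2, so SIFT features extracted
    with a fixed cell size effectively cover larger image regions.
    """
    pyramid = [image]
    for _ in range(levels - 1):
        blurred = gaussian_filter(pyramid[-1], sigma)
        pyramid.append(blurred[::2, ::2])   # halve each dimension
    return pyramid
```

With L = 5 the coarsest level is 1/16 of the original resolution, which matches the observation above that very small scales make scenes look too similar.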
Furthermore, I also tried non-linear SVMs with various kernels. This code differs from the linear-SVM code in that I used MATLAB's SVM implementation and trained 105 1-vs-1 SVMs instead of 15 1-vs-all SVMs, because that was faster. For the data in the following table I applied the non-linear SVM classifiers to the best setup from the previous section, i.e. the bag-of-SIFT model with L = 3.
The first case (linear kernel) was actually only meant to check whether the implementation works, but as it turns out, 1-vs-1 SVMs not only train faster but also classify better: the 1-vs-all linear SVM achieved 58.5% vs. 64.3% for the 1-vs-1 SVM with a linear kernel. Next I tried the polynomial kernel, which is apparently not a good fit for comparing histograms. After that came the Gaussian kernel, which performed really badly (6.9%), i.e. no better than chance. This was very surprising, and the confusion matrix showed that it classified basically every image as class 15. I do not think it is a bug in my code, because I only changed the kernel function and the other kernels work well. Lastly, I implemented the chi-squared kernel as a custom kernel based on vlfeat's vl_alldist2 function. The accuracy was an astonishing 69.5%, which is 5 percentage points better than the linear kernel and an 11-point improvement over the 1-vs-all linear SVM.
Kernel | Accuracy (%) | Training Time (s) | Query Time (s) |
---|---|---|---|
Linear | 64.3 | 2.8 | 0.3 |
Polynomial of degree 4 | 41.7 | 228 | 2.2 |
Gaussian | 6.9 | 3.1 | 3.5 |
Chi-Squared | 69.5 | 4.8 | 23.0 |
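The chi-squared kernel can be sketched as follows. This is a NumPy illustration of the (additive) chi-squared kernel for non-negative histograms; my actual code builds the Gram matrix via vl_alldist2 and hands it to MATLAB's SVM as a custom kernel:

```python
import numpy as np

def chi2_kernel(X, Y, eps=1e-10):
    """Additive chi-squared kernel between rows of X and rows of Y.

    k(x, y) = sum_d 2 * x_d * y_d / (x_d + y_d), well suited to comparing
    non-negative histograms such as bag-of-SIFT descriptors. `eps` guards
    against division by zero when both histogram bins are empty.
    """
    X = X[:, None, :]              # (N, 1, D)
    Y = Y[None, :, :]              # (1, M, D)
    return np.sum(2.0 * X * Y / (X + Y + eps), axis=2)   # (N, M) Gram matrix
```

For L1-normalized histograms, k(x, x) = 1, so the kernel behaves like a similarity that weights small bins more strongly than the linear kernel does.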
I have also implemented Fisher vectors for feature quantization. My implementation follows this paper. Learning the "vocabulary", i.e. the GMM, consists of first collecting SIFT features with a cell size of 8 pixels and a step size of 16 pixels at 5 levels of a Gaussian pyramid. Then I use PCA to reduce the 128-dimensional features to 64 dimensions. Finally, I fit the GMM with vl_gmm, which supports only diagonal covariances.
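The encoding step can be sketched as follows. This is a NumPy illustration of a Fisher vector under a diagonal-covariance GMM, with the gradients with respect to the means and variances; the paper's additional power- and L2-normalization steps are omitted here:

```python
import numpy as np

def fisher_vector(X, w, mu, sigma2):
    """Encode local descriptors X as a Fisher vector under a diagonal GMM.

    X:      (N, D) local descriptors (e.g. PCA-reduced SIFT)
    w:      (K,)   mixture weights
    mu:     (K, D) component means
    sigma2: (K, D) diagonal variances
    Returns a 2*K*D vector of gradients w.r.t. means and variances.
    """
    N, D = X.shape
    diff = X[:, None, :] - mu[None, :, :]                  # (N, K, D)
    # Posterior responsibilities gamma[n, k] from log-densities.
    log_p = (np.log(w)[None, :]
             - 0.5 * np.sum(np.log(2 * np.pi * sigma2), axis=1)[None, :]
             - 0.5 * np.sum(diff ** 2 / sigma2[None, :, :], axis=2))
    log_p -= log_p.max(axis=1, keepdims=True)              # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Gradients w.r.t. means and variances, normalized by the weights.
    u = diff / np.sqrt(sigma2)[None, :, :]                 # whitened residuals
    g_mu = (gamma[:, :, None] * u).sum(0) / (N * np.sqrt(w)[:, None])
    g_sig = (gamma[:, :, None] * (u ** 2 - 1)).sum(0) / (N * np.sqrt(2 * w)[:, None])
    return np.concatenate([g_mu.ravel(), g_sig.ravel()])
```

With K = 256 components and 64-dimensional descriptors this yields a 32768-dimensional image representation, which is why a linear classifier already works well on top of it.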
Classifier | Accuracy (%) | Training Time (s) | Query Time (s) |
---|---|---|---|
Non-linear SVM with chi-squared kernel | 7.7 | 1604 | 13712 |
1-vs-1 linear SVM | 73.5 | 113 | 14.8 |
1-vs-all linear SVM | 78.1 | 8.8 | 2.1 |
First I tested the non-linear SVM classifier with the chi-squared kernel on this representation, simply because it had worked so well for the bag of SIFT words. Here, however, it did not work at all, and the confusion matrix showed that the images were categorized essentially at random. A possible explanation is that the Fisher vector is already equivalent to working with a non-linear kernel, which makes a second kernel redundant or incompatible. So I went back to linear SVMs, which is what the paper used as well, and actually got the best results with the original 1-vs-all classifier.
I achieved my best results, an accuracy of 78.1%, with Fisher vectors and the plain 1-vs-all linear SVM.
(The sample training images, true positives, and misclassified example images of the original results table are omitted; only the true labels of the false positives and the predicted labels of the false negatives are kept.)

Category name | Accuracy | False positives (true label) | False negatives (predicted label) |
---|---|---|---|
Kitchen | 0.620 | Store, Bedroom | LivingRoom, Bedroom |
Store | 0.810 | LivingRoom, Industrial | Bedroom, Kitchen |
Bedroom | 0.630 | LivingRoom, Kitchen | Kitchen, LivingRoom |
LivingRoom | 0.510 | Office, Bedroom | Store, Office |
Office | 0.950 | InsideCity, Store | LivingRoom, LivingRoom |
Industrial | 0.670 | TallBuilding, Coast | Suburb, Coast |
Suburb | 0.990 | TallBuilding, Industrial | InsideCity |
InsideCity | 0.770 | Bedroom, Highway | Store, LivingRoom |
TallBuilding | 0.830 | InsideCity, Industrial | InsideCity, Industrial |
Street | 0.810 | TallBuilding, Industrial | Industrial, Store |
Highway | 0.850 | Coast, Industrial | Suburb, InsideCity |
OpenCountry | 0.700 | Coast, Coast | Coast, Coast |
Coast | 0.730 | TallBuilding, OpenCountry | Forest, OpenCountry |
Mountain | 0.920 | Industrial, OpenCountry | Suburb, Bedroom |
Forest | 0.930 | Mountain, Highway | Mountain, OpenCountry |