In this project, different methods for scene classification have been explored. Tiny-image and bag-of-words features are used with two different classifiers, and the results are compared.
Here, I downsample the image to 16x16 and use the 256 pixels thus obtained as the feature descriptor for that image. In my code, I first crop the image to an aspect ratio of 1 before resizing to 16x16. After this, I implement the nearest neighbour classifier, which assigns to every query image the category of its closest match in the training data. A slight variation checks the k closest neighbours and assigns the most frequently occurring category among them. Using this pipeline, I got an accuracy of 19.5% for k = 4 using the standard distance metric.
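The tiny-image feature and k-nearest-neighbour classifier described above can be sketched as follows. This is a NumPy sketch, not the submitted MATLAB code; the block-average resize is a stand-in for an image-resize call, and L2 distance is assumed as the "standard" metric:

```python
import numpy as np

def tiny_image_feature(img, size=16):
    """Centre-crop to a square (aspect ratio 1), then block-average down
    to size x size; the flattened 256-dim vector is the feature."""
    h, w = img.shape
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    crop = img[top:top + s, left:left + s]
    block = s // size                       # simple block-average resize
    crop = crop[:block * size, :block * size]
    pooled = crop.reshape(size, block, size, block).mean(axis=(1, 3))
    return pooled.ravel()

def knn_classify(train_feats, train_labels, query, k=4):
    """Assign the most frequent label among the k nearest training features."""
    dists = np.linalg.norm(train_feats - query, axis=1)   # assumed L2 metric
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(np.asarray(train_labels)[nearest],
                               return_counts=True)
    return labels[np.argmax(counts)]
```

With k = 1 this reduces to the plain nearest-neighbour classifier; larger k trades noise robustness against blurring of category boundaries.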
Here, I build a bag of SIFT model by creating a vocabulary of visual words. This is done by densely sampling SIFT descriptors from all images in the training set and then using k-means to find a certain number of cluster centres. These centres form the visual words in the vocabulary. The feature for each image is a histogram of the number of occurrences of each visual word in it. For the submission, I use a vocabulary of 200 words. The table below shows the accuracies obtained for the submitted code with the bag of SIFT model and each of the two classifiers, with all extra credit options turned off.
| Classifier | Accuracy |
|---|---|
| Nearest Neighbour | 61.7% |
| SVM | 64.7% |
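The vocabulary-building and histogram steps can be sketched as follows. This is a plain-NumPy stand-in for the vl_feat dense-SIFT + k-means pipeline used in the submission; descriptor extraction is assumed to have happened already:

```python
import numpy as np

def build_vocabulary(descriptors, vocab_size=200, iters=20, seed=0):
    """Plain k-means over descriptors pooled from all training images.
    `descriptors` is an (N, D) array (D = 128 for SIFT); returns the
    (vocab_size, D) cluster centres that act as visual words."""
    rng = np.random.default_rng(seed)
    centres = descriptors[rng.choice(len(descriptors), vocab_size,
                                     replace=False)]
    for _ in range(iters):
        # Assign every descriptor to its nearest centre, then re-estimate.
        d = ((descriptors[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(vocab_size):
            members = descriptors[assign == j]
            if len(members):
                centres[j] = members.mean(0)
    return centres

def bow_histogram(descriptors, vocab):
    """Hard-assigned histogram of nearest-word counts, L1-normalised."""
    d = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(d.argmin(1), minlength=len(vocab)).astype(float)
    return hist / max(hist.sum(), 1.0)
```

Normalising the histogram matters because images yield different numbers of descriptors depending on their size and the sampling step.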
I implemented the algorithm described by Boiman et al., which does not quantise the descriptors at all and instead uses descriptor-to-class distances to assign categories to query images. The authors argue that SIFT descriptors are sparse, so clustering them throws away a lot of information; hence, I implemented the algorithm without quantisation. The biggest challenge, of course, was the sheer number of descriptors whose distances needed to be calculated. Because of this, I heavily increased the 'step' parameter to a value of 20 (values lower than 10 gave me out-of-memory errors).
Exact Nearest Neighbour: I tried computing the exact distances (i.e. without using a kd-tree). This took about 4 hours to run on my laptop, but gave me excellent results relative to the step size: an accuracy of 60.8% for a step size of 20. In comparison, the normal kNN gave about 43% for the same step size. Below are the confusion matrices for this algorithm and for a kNN run at the same sampling density. This method would likely have done considerably better given a smaller step size or a more efficient way of computing distances.
| Exact nearest neighbour (step 20) | kNN (step 20) |
|---|---|
| Accuracy: 60.8% | Accuracy: 43.2% |
kd-tree: I also implemented the same search with a kd-tree. This was about 3 times faster, but gave a noticeably less accurate result, probably because the kd-tree returns approximate nearest neighbours. I have left this section in the code, but it is turned off by default.
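The descriptor-to-class distance of Boiman et al. can be sketched as follows (NumPy, exact search without a kd-tree; `class_descs`, a mapping from category name to that category's pooled training descriptors, is a hypothetical structure for illustration):

```python
import numpy as np

def nbnn_classify(query_desc, class_descs):
    """Naive-Bayes Nearest Neighbour: no quantisation of descriptors.

    `query_desc`: (M, D) descriptors of the query image.
    `class_descs`: dict mapping category -> (N_c, D) array of all training
    descriptors of that category. The image-to-class distance is the sum,
    over query descriptors, of the squared distance to the closest
    descriptor of the class; the class with the smallest total wins.
    """
    best_class, best_dist = None, np.inf
    for cls, descs in class_descs.items():
        # (M, N_c) pairwise squared distances, min over the class's pool.
        d = ((query_desc[:, None, :] - descs[None, :, :]) ** 2).sum(-1)
        total = d.min(axis=1).sum()
        if total < best_dist:
            best_class, best_dist = cls, total
    return best_class
```

The (M, N_c) distance matrix is exactly what blows up the memory here, which is why a large sampling step (or a kd-tree/chunked search) becomes necessary in practice.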
I implemented a soft binning method by adding weights based on the distances to the four closest visual words. I tried different schemes and finally settled on weighting them as r_i * (1 - d_i / (d_1 + d_2 + d_3 + d_4)), where r_i is largest for the closest neighbour and grows progressively smaller for the other three. This increased my accuracy for SVM by about 5%, to 69.8% with 'fast' settings and a high step size. Here is the confusion matrix of the same:
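One way to read the weighting above as code (a NumPy sketch; the rank weights r_i = 4, 3, 2, 1 are an illustrative assumption, not necessarily the submitted values):

```python
import numpy as np

def soft_assign_histogram(descriptors, vocab, ranks=(4.0, 3.0, 2.0, 1.0)):
    """Soft binning over the four nearest visual words.

    Each descriptor votes for its 4 closest words with weight
    r_i * (1 - d_i / (d_1 + d_2 + d_3 + d_4)), where d_i are the
    distances and the rank factors r_i shrink with rank.
    """
    hist = np.zeros(len(vocab))
    for desc in descriptors:
        d = np.linalg.norm(vocab - desc, axis=1)
        nearest = np.argsort(d)[:4]
        dsum = d[nearest].sum()
        for r, idx in zip(ranks, nearest):
            hist[idx] += r * (1.0 - d[idx] / dsum) if dsum > 0 else r
    return hist / max(hist.sum(), 1.0)
```

Compared with hard assignment, a descriptor that falls near a cluster boundary now spreads its vote across the neighbouring words instead of flipping between them.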
Because vl_gmm takes a comparable amount of time to k-means, I also saved the variables it generated. Since Fisher encoding stores the relative position and variance with respect to each cluster centre, we get a feature matrix of size N x (128*2*K), where K is the number of means. I got the best results overall using Fisher encoding: SVM gave me an accuracy of 76.7% for a feature with 100 means. Somewhat interestingly, nearest neighbour was not just slow but also gave very poor results, probably because of the extremely high dimensionality of each vector.
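A sketch of the Fisher encoding step (NumPy; it assumes a diagonal-covariance GMM like the one vl_gmm fits, and returns the 128*2*K-dimensional gradient vector for D = 128 SIFT descriptors):

```python
import numpy as np

def fisher_vector(descriptors, means, covs, priors):
    """Fisher encoding w.r.t. a diagonal-covariance GMM.

    `descriptors`: (N, D); `means`, `covs`: (K, D); `priors`: (K,).
    Returns the 2*D*K vector of gradients w.r.t. the GMM means and
    variances, which matches the 128*2*K dimensionality quoted above.
    """
    N, D = descriptors.shape
    # Posterior responsibilities q[n, k] of each component for each point.
    log_p = -0.5 * (((descriptors[:, None, :] - means[None]) ** 2
                     / covs[None]).sum(-1)
                    + np.log(covs).sum(1)[None] + D * np.log(2 * np.pi))
    log_w = log_p + np.log(priors)[None]
    q = np.exp(log_w - log_w.max(1, keepdims=True))
    q /= q.sum(1, keepdims=True)
    # Normalised gradients w.r.t. means and variances.
    diff = (descriptors[:, None, :] - means[None]) / np.sqrt(covs)[None]
    g_mu = (q[:, :, None] * diff).sum(0) / (N * np.sqrt(priors)[:, None])
    g_sig = (q[:, :, None] * (diff ** 2 - 1)).sum(0) \
            / (N * np.sqrt(2 * priors)[:, None])
    return np.hstack([g_mu.ravel(), g_sig.ravel()])
```

The high dimensionality that hurts nearest neighbour here is visible directly: with D = 128 and K = 100 means, each image becomes a 25,600-dimensional vector, where L2 distances are far less informative than a linear SVM's learned weights.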
Because this gave me the best results, I have included the entire confusion matrix and the table generated by the starter code for this case.
The sample training-image and true-positive thumbnails from the original results page are omitted here; the table keeps the category accuracies and the labels attached to the misclassified examples.

| Category name | Accuracy | False positives (true label) | False negatives (wrong predicted label) |
|---|---|---|---|
| Kitchen | 0.590 | Bedroom, Store | Store, Bedroom |
| Store | 0.790 | Industrial, Industrial | Kitchen, TallBuilding |
| Bedroom | 0.610 | Kitchen, Kitchen | Office, LivingRoom |
| LivingRoom | 0.510 | Bedroom, Kitchen | Office, Mountain |
| Office | 0.940 | LivingRoom, InsideCity | Kitchen, Kitchen |
| Industrial | 0.650 | Bedroom, LivingRoom | Store, TallBuilding |
| Suburb | 0.990 | Industrial, Industrial | InsideCity |
| InsideCity | 0.750 | TallBuilding, Highway | TallBuilding, Kitchen |
| TallBuilding | 0.830 | InsideCity, Industrial | Street, InsideCity |
| Street | 0.810 | InsideCity, Forest | InsideCity, Mountain |
| Highway | 0.850 | Street, Street | Bedroom, InsideCity |
| OpenCountry | 0.560 | Mountain, InsideCity | Coast, Bedroom |
| Coast | 0.890 | OpenCountry, Industrial | OpenCountry, OpenCountry |
| Mountain | 0.800 | OpenCountry, Forest | Suburb, OpenCountry |
| Forest | 0.930 | OpenCountry, OpenCountry | OpenCountry, Street |