Using tiny images and nearest neighbor gave the worst performance of all the methods. For my implementation I resized images to 16x16 without preserving the original aspect ratio; this way no image data is thrown out by cropping. Nearest neighbor simply uses vl_alldist2() and min() to find the training example with the smallest Euclidean distance. Using these together, I was able to achieve 19.1% accuracy. Below is the confusion matrix for this part.
Accuracy (mean of diagonal of confusion matrix) is 0.191
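The two pieces above can be sketched in pure Python (the original uses MATLAB with VLFeat's vl_alldist2; this is just an illustration, assuming grayscale images are given as lists of pixel rows):

```python
import math

def tiny_image(img, size=16):
    """Shrink a grayscale image (list of rows) to size x size by block
    averaging, ignoring aspect ratio, then zero-mean the feature vector."""
    h, w = len(img), len(img[0])
    feat = []
    for i in range(size):
        for j in range(size):
            # source block of pixels covered by output pixel (i, j)
            r0, r1 = i * h // size, max(i * h // size + 1, (i + 1) * h // size)
            c0, c1 = j * w // size, max(j * w // size + 1, (j + 1) * w // size)
            block = [img[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            feat.append(sum(block) / len(block))
    mu = sum(feat) / len(feat)
    return [v - mu for v in feat]

def nearest_neighbor_label(test_feat, train_feats, train_labels):
    """Label of the training feature with the smallest Euclidean distance
    (the role vl_alldist2 + min play in the MATLAB code)."""
    dists = [math.dist(test_feat, f) for f in train_feats]
    return train_labels[dists.index(min(dists))]
```

Zero-meaning the tiny-image vector is a common (cheap) normalization for this baseline; it makes the Euclidean distances less sensitive to overall brightness.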
Using bag of SIFT and nearest neighbor performed quite a bit better than tiny images and nearest neighbor, but took significantly longer: tiny images with nearest neighbor took about 30 seconds, while bag of SIFT with nearest neighbor took about 8 minutes. I increased the vocabulary size to 500 and used a fairly large bin and step size for the SIFT descriptors when building the vocabulary. Then when computing the 'bags', I used a fairly small bin and step size. This brought my accuracy up to 50.5%. Below is the confusion matrix for this part.
Accuracy (mean of diagonal of confusion matrix) is 0.505
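Once the 500-word vocabulary has been built (k-means over SIFT descriptors sampled from the training images), each image is represented as a normalized histogram of nearest-word assignments. A pure-Python sketch of that assignment step, assuming descriptors and vocabulary words are plain lists of floats:

```python
import math

def bow_histogram(descriptors, vocab):
    """Build a bag-of-words histogram: assign each local descriptor to its
    nearest vocabulary word (Euclidean distance), then L1-normalize the
    counts so images with different numbers of descriptors are comparable."""
    hist = [0.0] * len(vocab)
    for d in descriptors:
        dists = [math.dist(d, word) for word in vocab]
        hist[dists.index(min(dists))] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]
```

The step size used when densely sampling descriptors here dominates runtime, which is why a small bin/step at this stage (as described above) costs several minutes.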
Using bag of SIFT and a linear SVM increased accuracy even more and took only slightly longer than bag of SIFT with nearest neighbor. For my linear SVM, I used a lambda of 0.0001; decreasing lambda further didn't noticeably improve results, while increasing it reduced accuracy drastically. Using the linear SVM instead of nearest neighbor brought accuracy up to 67.2%. Below is the confusion matrix for this part, as well as some sample classifications.
The original results table includes thumbnails for sample training images, true positives, false positives, and false negatives; only the per-category accuracies and the caption labels under the false positive/negative examples are reproduced here.

Category name | Accuracy | False positives (true label) | False negatives (predicted label)
---|---|---|---
Kitchen | 0.640 | LivingRoom, Store | Bedroom, InsideCity
Store | 0.560 | Industrial, LivingRoom | InsideCity, TallBuilding
Bedroom | 0.430 | LivingRoom, LivingRoom | Industrial, Store
LivingRoom | 0.340 | Industrial, Bedroom | Bedroom, Bedroom
Office | 0.910 | Bedroom, Kitchen | Kitchen, Kitchen
Industrial | 0.550 | Street, Kitchen | Highway, Suburb
Suburb | 0.960 | InsideCity, OpenCountry | TallBuilding, Bedroom
InsideCity | 0.550 | Store, Street | Highway, Suburb
TallBuilding | 0.700 | InsideCity, Suburb | Industrial, Industrial
Street | 0.700 | InsideCity, LivingRoom | InsideCity, Highway
Highway | 0.810 | Street, Industrial | Coast, Mountain
OpenCountry | 0.340 | Highway, Highway | Mountain, Mountain
Coast | 0.810 | OpenCountry, OpenCountry | Mountain, InsideCity
Mountain | 0.850 | Coast, Store | Forest, Forest
Forest | 0.930 | OpenCountry, Mountain | Mountain, Mountain
From the results above, this representation and classifier do a good job of separating scenes that look very different; for example, forest trees were not mistaken for city buildings. However, visually similar categories, such as the interior scenes (Kitchen, Bedroom, LivingRoom, etc.), were often confused with one another. Of all the labels, LivingRoom had the lowest accuracy at 34% and was mostly confused with other indoor categories such as Bedroom and Industrial. OpenCountry also had a low accuracy of 34% and was confused with other outdoor scenes such as Highway and Mountain.
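For reference, a toy pure-Python sketch of a binary linear SVM trained by subgradient descent on the L2-regularized hinge loss. This is not the solver used in the project (a MATLAB/VLFeat routine would be used there); it only illustrates the role the lambda regularizer discussed above plays. The learning rate and epoch count are arbitrary choices for this sketch:

```python
def train_linear_svm(X, y, lam=1e-4, epochs=100, lr=0.01):
    """Binary linear SVM via subgradient descent on the regularized
    hinge loss; labels y are in {-1, +1}, lam is the regularizer."""
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:   # hinge active: step toward correct margin
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:              # hinge inactive: only regularization shrinks w
                w = [wj * (1.0 - lr * lam) for wj in w]
    return w, b

def svm_score(w, b, x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b
```

For the 15-way scene task this would be run one-vs-all: train one classifier per category (that category's histograms as +1, everything else as -1) and predict the category whose classifier gives the highest score. A larger lam shrinks the weights harder, which is consistent with the drop in accuracy observed above when lambda was increased.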