Project 4 / Scene Recognition with Bag of Words
Tiny Image + Nearest Neighbor Classifier
For the Tiny Image features, we simply resize each image to 16x16 and reshape it to a vector of 256.
In the nearest neighbor classifier, we use vl_alldist2 to calculate the distances. Thereafter we sort each row using the sort function. We then pick the most frequent label among the nearest k neighbors.
I found that k = 1 to be an appropriate value for this case. This gives an accuracy of 0.19 whereas k = 25 gives an accuracy of 0.185, which is not too different. (Specifically k=25 because that is the tuned value that we use for bag of SIFT).
This condition runs in close to 11 seconds
Bag of SIFT + Nearest Neighbor Classifier
In the build_vocabulary function, we use a step size of 32, and a bin size ('size' parameter) of 16. This seemed to provide better overall accuracy. Previously I was using the fast parameter here and a smaller bin size. It is justified to not use the fast parameter while computing the vocabulary because this is computed only once.
In the get_bags_of_sifts function, we use a step size of 8, a bin size of 16 and the fast parameter. I experimented with different step sizes, with 4 the runtime came to more than 10 minutes (without vocabulary building) and at higher values, the accuracy was just not enough. We use alldist2 to calculate the distance between each sift vector in the image with the vocabulary. We hence compute the histogram of the sift features for this image. Ultimately, we normalize the histogram so that number of features found in an image doesn't tamper with the results. The normalization also helped improve the accuracy of the condition.
As stated above, we use a k=25 value for the nearest neighbor classifier.
The accuracy we find in this condition is: 0.515 with a run time of 193 seconds (just over 3 minutes).
Bag of SIFT + 1-vs-all SVM Classifier
In the SVM classifier, for each category, we generate the binary labels and train the svm for these labels. We store the obtained W and B. Thereafter, for each test data point, we compute the confidence for each of the svms and consequently predict the label based on the highest confidence value.
For the regularization parameter, I found that a value of 0.0017 was most appropriate. I tried various values.
The accuracy we find in this condition is: 0.585 with a run time of 191 seconds (just over 3 minutes).
Scene classification results visualization
Accuracy (mean of diagonal of confusion matrix) is 0.585
Category name |
Accuracy |
Sample training images |
Sample true positives |
False positives with true label |
False negatives with wrong predicted label |
Kitchen |
0.370 |
|
|
|
|
LivingRoom |
Bedroom |
Suburb |
LivingRoom |
Store |
0.340 |
|
|
|
|
Industrial |
InsideCity |
Street |
Kitchen |
Bedroom |
0.280 |
|
|
|
|
LivingRoom |
LivingRoom |
Office |
Kitchen |
LivingRoom |
0.190 |
|
|
|
|
Kitchen |
Kitchen |
Bedroom |
Suburb |
Office |
0.870 |
|
|
|
|
Bedroom |
Kitchen |
Kitchen |
Kitchen |
Industrial |
0.150 |
|
|
|
|
LivingRoom |
Kitchen |
Mountain |
Kitchen |
Suburb |
0.910 |
|
|
|
|
Kitchen |
Store |
Kitchen |
Office |
InsideCity |
0.350 |
|
|
|
|
Kitchen |
Kitchen |
Industrial |
Office |
TallBuilding |
0.750 |
|
|
|
|
Store |
Street |
Forest |
Forest |
Street |
0.820 |
|
|
|
|
InsideCity |
InsideCity |
Bedroom |
Highway |
Highway |
0.770 |
|
|
|
|
Coast |
Industrial |
OpenCountry |
OpenCountry |
OpenCountry |
0.510 |
|
|
|
|
Coast |
Highway |
Suburb |
Highway |
Coast |
0.790 |
|
|
|
|
Bedroom |
Store |
Highway |
OpenCountry |
Mountain |
0.760 |
|
|
|
|
Store |
Industrial |
OpenCountry |
OpenCountry |
Forest |
0.910 |
|
|
|
|
Store |
Industrial |
Mountain |
Mountain |
Category name |
Accuracy |
Sample training images |
Sample true positives |
False positives with true label |
False negatives with wrong predicted label |