The objective of this project, scene recognition with bag of words, is to recognize scenes using tiny-image and bag-of-local-features representations, classified with nearest neighbor and support vector machine (SVM) methods. The best recognition accuracy achieved by this program is 68.5%.
Using the 16×16 tiny image representation with a nearest neighbor classifier, the overall accuracy of the system reaches 22.5%.
Category name | Accuracy | False positives (true labels) | False negatives (wrong predicted labels) |
---|---|---|---|
Kitchen | 0.080 | LivingRoom, Coast | Coast, Bedroom |
Store | 0.020 | Forest, Forest | TallBuilding, InsideCity |
Bedroom | 0.180 | Industrial, Office | LivingRoom, Highway |
LivingRoom | 0.100 | Coast, Bedroom | Mountain, Industrial |
Office | 0.160 | Forest, LivingRoom | Forest, Bedroom |
Industrial | 0.130 | Mountain, InsideCity | Mountain, Bedroom |
Suburb | 0.370 | Kitchen, Street | OpenCountry, Forest |
InsideCity | 0.060 | Kitchen, Bedroom | Forest, Highway |
TallBuilding | 0.230 | Industrial, LivingRoom | Coast, InsideCity |
Street | 0.420 | Store, Suburb | Forest, Coast |
Highway | 0.560 | TallBuilding, LivingRoom | Coast, Mountain |
OpenCountry | 0.350 | Coast, Forest | Coast, Coast |
Coast | 0.400 | Office, Mountain | OpenCountry, InsideCity |
Mountain | 0.180 | InsideCity, Kitchen | Coast, Office |
Forest | 0.130 | Kitchen, Store | Street, Street |
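The tiny-image baseline described above can be sketched in Python as follows. The report does not show its own code, so this is only an illustration: the function names `tiny_image` and `nearest_neighbor_classify` are hypothetical, block averaging stands in for whatever resizing routine was actually used, and the zero-mean, unit-length normalization is an assumption rather than something the report states.

```python
import numpy as np

def tiny_image(image, size=16):
    """Shrink a 2-D grayscale image to size x size and flatten it.

    Assumes the image is at least `size` pixels in each dimension.
    Block averaging and the normalization below are assumptions.
    """
    h, w = image.shape
    bh, bw = h // size, w // size
    # Crop so that both dimensions split evenly into `size` blocks.
    cropped = image[:bh * size, :bw * size].astype(float)
    small = cropped.reshape(size, bh, size, bw).mean(axis=(1, 3))
    feat = small.ravel()
    feat -= feat.mean()                      # zero mean
    norm = np.linalg.norm(feat)
    return feat / norm if norm > 0 else feat  # unit length

def nearest_neighbor_classify(train_feats, train_labels, test_feats):
    """Give each test feature the label of its closest training feature (L2)."""
    d2 = ((train_feats ** 2).sum(axis=1)[None, :]
          - 2.0 * test_feats @ train_feats.T
          + (test_feats ** 2).sum(axis=1)[:, None])
    return [train_labels[i] for i in d2.argmin(axis=1)]
```

Each image becomes a 256-dimensional vector, so the classifier only has coarse layout and brightness to work with, which is consistent with the low accuracy reported for this baseline.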
This part has two steps. First, a vocabulary of size 200 is built from the training data: 20 features are randomly sampled from each image, and k-means clustering is then applied to obtain the 500 centers for the vocabulary. Next, 1000 features are extracted from each image to form the histogram for each training and testing image. Each testing image is assigned the label of its nearest neighbor in the training set. The overall accuracy reaches 52.3%.
Category name | Accuracy | False positives (true labels) | False negatives (wrong predicted labels) |
---|---|---|---|
Kitchen | 0.410 | Bedroom, Street | InsideCity, InsideCity |
Store | 0.400 | LivingRoom, Industrial | LivingRoom, Bedroom |
Bedroom | 0.390 | InsideCity, TallBuilding | Kitchen, TallBuilding |
LivingRoom | 0.240 | TallBuilding, Suburb | Industrial, Office |
Office | 0.710 | Kitchen, Kitchen | LivingRoom, Kitchen |
Industrial | 0.320 | TallBuilding, InsideCity | Forest, InsideCity |
Suburb | 0.760 | Mountain, Store | Store, InsideCity |
InsideCity | 0.360 | Store, Industrial | Coast, Industrial |
TallBuilding | 0.460 | Industrial, InsideCity | Suburb, Industrial |
Street | 0.500 | TallBuilding, InsideCity | LivingRoom, Store |
Highway | 0.750 | OpenCountry, Suburb | Coast, OpenCountry |
OpenCountry | 0.450 | Highway, Coast | Forest, Mountain |
Coast | 0.620 | OpenCountry, Highway | Suburb, Highway |
Mountain | 0.520 | Industrial, Bedroom | Street, Forest |
Forest | 0.950 | OpenCountry, Store | Mountain, Mountain |
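The two bag-of-features steps described above (clustering sampled descriptors into a vocabulary, then turning each image into a histogram of nearest cluster centers) can be sketched as below. This is a minimal sketch under stated assumptions: local descriptors are assumed to be already extracted as rows of a NumPy array (descriptor extraction is not shown), the names `build_vocabulary` and `bag_of_words_histogram` are hypothetical, a plain Lloyd iteration stands in for whichever k-means routine the project used, and normalizing the histogram to sum to 1 is an assumption.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between rows of a (N, D) and b (K, D)."""
    return ((a ** 2).sum(axis=1)[:, None]
            - 2.0 * a @ b.T
            + (b ** 2).sum(axis=1)[None, :])

def build_vocabulary(features, vocab_size, iters=20, seed=0):
    """Cluster sampled descriptors with plain k-means; returns the centers."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(features), vocab_size, replace=False)
    centers = features[start].astype(float)
    for _ in range(iters):
        assign = pairwise_sq_dists(features, centers).argmin(axis=1)
        for k in range(vocab_size):
            members = features[assign == k]
            if len(members):              # empty clusters keep their old center
                centers[k] = members.mean(axis=0)
    return centers

def bag_of_words_histogram(descriptors, vocab):
    """Count how many descriptors fall nearest to each vocabulary center."""
    nearest = pairwise_sq_dists(descriptors, vocab).argmin(axis=1)
    counts = np.bincount(nearest, minlength=len(vocab)).astype(float)
    return counts / counts.sum()          # normalization is an assumption
```

The resulting histograms can then be compared with the same nearest neighbor rule used for tiny images; only the feature representation changes.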
This part differs from part 2 only in the classifier. Instead of a nearest neighbor classifier, 15 support vector machine classifiers are used, one for each class. The algorithm reaches 63.1% accuracy.
Category name | Accuracy | False positives (true labels) | False negatives (wrong predicted labels) |
---|---|---|---|
Kitchen | 0.520 | Office, Store | Bedroom, Industrial |
Store | 0.440 | InsideCity, Kitchen | LivingRoom, Bedroom |
Bedroom | 0.420 | Highway, Kitchen | Kitchen, Store |
LivingRoom | 0.270 | OpenCountry, InsideCity | Bedroom, Store |
Office | 0.820 | Kitchen, LivingRoom | LivingRoom, Kitchen |
Industrial | 0.530 | InsideCity, Street | Coast, Highway |
Suburb | 0.930 | OpenCountry, OpenCountry | InsideCity, Highway |
InsideCity | 0.430 | Highway, TallBuilding | Industrial, Industrial |
TallBuilding | 0.640 | Industrial, Industrial | Industrial, Store |
Street | 0.640 | Bedroom, TallBuilding | LivingRoom, InsideCity |
Highway | 0.790 | OpenCountry, OpenCountry | OpenCountry, InsideCity |
OpenCountry | 0.460 | Coast, Mountain | Suburb, Coast |
Coast | 0.820 | OpenCountry, Industrial | OpenCountry, OpenCountry |
Mountain | 0.790 | Forest, Forest | Highway, Forest |
Forest | 0.960 | OpenCountry, OpenCountry | Mountain, Mountain |
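The one-classifier-per-class scheme described above can be illustrated with a small one-vs-all linear SVM. This is a sketch, not the project's implementation: the report does not say which SVM solver or kernel was used, so full-batch subgradient descent on the regularized hinge loss is swapped in here, and the names `train_one_vs_all_svm` and `predict` as well as the values of `lam`, `epochs`, and `lr` are assumptions.

```python
import numpy as np

def train_one_vs_all_svm(X, labels, lam=1e-3, epochs=200, lr=0.1):
    """Train one linear SVM per class (that class = +1, all others = -1)."""
    classes = sorted(set(labels))
    label_arr = np.array(labels)
    n, d = X.shape
    W = np.zeros((len(classes), d))
    b = np.zeros(len(classes))
    for c, name in enumerate(classes):
        y = np.where(label_arr == name, 1.0, -1.0)
        for _ in range(epochs):
            margins = y * (X @ W[c] + b[c])
            active = margins < 1          # samples that violate the margin
            grad_w = lam * W[c] - (y[active, None] * X[active]).sum(axis=0) / n
            grad_b = -y[active].sum() / n
            W[c] -= lr * grad_w
            b[c] -= lr * grad_b
    return classes, W, b

def predict(X, classes, W, b):
    """Pick, for each row of X, the class whose SVM gives the highest score."""
    scores = X @ W.T + b
    return [classes[i] for i in scores.argmax(axis=1)]
```

With 15 scene categories this trains 15 binary classifiers, and a test histogram is assigned to the category whose classifier responds most confidently.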
Instead of assigning each feature only to its nearest cluster, each feature casts a vote for every cluster whose center is near enough. In practice, the feature gives a vote of value 1 to its nearest cluster center. It also gives every other cluster a vote whose value is the ratio of the distance between the feature and its nearest cluster center to the distance between the feature and that cluster's center. This ratio is thresholded at 0.8 to avoid giving votes to completely irrelevant clusters. The accuracy improves slightly, reaching 67%.
Category name | Accuracy | False positives (true labels) | False negatives (wrong predicted labels) |
---|---|---|---|
Kitchen | 0.500 | Industrial, Bedroom | Bedroom, Industrial |
Store | 0.540 | Kitchen, LivingRoom | Bedroom, Kitchen |
Bedroom | 0.490 | Office, Kitchen | Kitchen, LivingRoom |
LivingRoom | 0.320 | Industrial, Bedroom | Street, Bedroom |
Office | 0.770 | Bedroom, Store | Bedroom, Kitchen |
Industrial | 0.550 | Street, Kitchen | Highway, Suburb |
Suburb | 0.950 | LivingRoom, Bedroom | Industrial, Coast |
InsideCity | 0.580 | Store, Street | Mountain, Kitchen |
TallBuilding | 0.700 | Store, Street | Street, Kitchen |
Street | 0.650 | Highway, TallBuilding | Highway, Industrial |
Highway | 0.850 | Coast, Coast | TallBuilding, InsideCity |
OpenCountry | 0.600 | Industrial, Industrial | TallBuilding, Mountain |
Coast | 0.740 | Highway, OpenCountry | OpenCountry, Mountain |
Mountain | 0.810 | OpenCountry, Industrial | Suburb, Suburb |
Forest | 0.950 | Mountain, Mountain | Industrial, Mountain |
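The soft-assignment voting rule described above can be written out as a short sketch. Several details are assumptions: "thresholded at 0.8" is read as dropping any vote whose ratio is below 0.8, the final histogram is normalized to sum to 1, and the names `pairwise_distances` and `soft_assignment_histogram` are hypothetical.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distances between rows of a (N, D) and rows of b (K, D)."""
    d2 = ((a ** 2).sum(axis=1)[:, None]
          - 2.0 * a @ b.T
          + (b ** 2).sum(axis=1)[None, :])
    return np.sqrt(np.maximum(d2, 0.0))   # clamp tiny negatives from round-off

def soft_assignment_histogram(descriptors, vocab, ratio_threshold=0.8):
    """Histogram where each descriptor votes for all sufficiently close centers."""
    d = pairwise_distances(descriptors, vocab)
    nearest = d.min(axis=1, keepdims=True)
    # Vote value: d_nearest / d_k, which is exactly 1 for the nearest center.
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(d > 0, nearest / d, 1.0)
    # Assumed reading of the threshold: ratios below it cast no vote.
    votes = np.where(ratio >= ratio_threshold, ratio, 0.0)
    hist = votes.sum(axis=0)
    return hist / hist.sum()
```

For example, with centers at 0 and 10 on a line, a descriptor at 4.5 votes 1 for the first center and 4.5/5.5 ≈ 0.82 for the second, while a descriptor at 1 votes only for the first because its ratio to the second center is 1/9.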
I used cross-validation to report the performance of the system. In each iteration, 100 training and testing images are randomly picked, and the average accuracy and standard deviation over all iterations are reported. However, cross-validation is slow because it requires several rounds of training, so I ran it only for tiny images with nearest neighbor and for bags of SIFT features with SVM. The results are shown in the following table.
# | Tiny Images with Nearest Neighbor | Bags of SIFT with Support Vector Machine |
---|---|---|
1 | 0.231 | 0.672 |
2 | 0.243 | 0.683 |
3 | 0.223 | 0.685 |
4 | 0.215 | 0.677 |
5 | 0.223 | 0.669 |
6 | 0.216 | 0.676 |
7 | 0.224 | 0.670 |
8 | 0.231 | 0.677 |
9 | 0.241 | 0.661 |
10 | 0.235 | 0.681 |
Mean | 0.2282 | 0.6751 |
Standard Deviation | 0.0097 | 0.0073 |
The standard deviation is below one percentage point for both pipelines, so the system performs stably across the cross-validation iterations.
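The summary rows of the table can be reproduced from the ten per-iteration accuracies. The sketch below uses the values from the table; the variable and function names are hypothetical. The reported standard deviations match the sample standard deviation (dividing by n - 1), which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

# Per-iteration accuracies copied from the table above.
tiny_nn = [0.231, 0.243, 0.223, 0.215, 0.223, 0.216, 0.224, 0.231, 0.241, 0.235]
sift_svm = [0.672, 0.683, 0.685, 0.677, 0.669, 0.676, 0.670, 0.677, 0.661, 0.681]

def summarize(accuracies):
    """Return (mean, sample standard deviation), rounded to 4 decimals."""
    return round(mean(accuracies), 4), round(stdev(accuracies), 4)

print("Tiny images + NN:", summarize(tiny_nn))
print("Bags of SIFT + SVM:", summarize(sift_svm))
```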
I experimented with the vocabulary size and evaluated the resulting accuracy. The experiment uses bags of SIFT features with soft assignment and an SVM classifier. The results are shown in the following table.
Vocabulary Size | Accuracy | Average time for obtaining feature histogram for 100 images (s) |
---|---|---|
10 | 0.405 | 9.634 |
20 | 0.536 | 9.344 |
50 | 0.595 | 10.672 |
100 | 0.642 | 12.602 |
200 | 0.650 | 17.846 |
500 | 0.669 | 31.363 |
1000 | 0.671 | 50.303 |
10000 | 0.653 | 337.792 |
As the table shows, accuracy increases with vocabulary size but levels off at around 65% to 70%. Once the vocabulary size reaches 100, further increases add little: accuracy rises only from 0.642 at size 100 to 0.671 at size 1000, and it falls back to 0.653 at size 10000. The best accuracy this system can achieve is therefore probably around 65% to 70%. Running time, by contrast, grows sharply with vocabulary size, from about 12.6 s per 100 images at size 100 to about 337.8 s at size 10000. Given that accuracy barely changes, keeping a huge vocabulary is probably unwise.
I also experimented with the scale of the image. Each image is Gaussian-blurred at increasing values of sigma, and the corresponding accuracy is reported in the following table.
Sigma | Accuracy |
---|---|
0 (Unblurred) | 0.671 |
2^0.5 | 0.577 |
2^1.0 | 0.553 |
2^1.5 | 0.509 |
2^2.0 | 0.479 |
2^2.5 | 0.449 |
As the table shows, recognition accuracy drops at every level: by about 9 percentage points at the first blur level, and by between 2 and 5 points at each level after that. Since the images gradually lose high-frequency information as sigma grows, this decreasing trend makes sense.
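The blur levels in the table correspond to sigma = 2^(k/2) for k = 1 to 5. A minimal sketch of such a blur is shown below, assuming a separable Gaussian kernel truncated at three standard deviations and edge-replicated borders. The report does not say which blur routine was used or how borders were handled, so these choices and the names `gaussian_kernel`, `gaussian_blur`, and `SIGMAS` are assumptions.

```python
import numpy as np

# Blur levels from the table: 2^0.5, 2^1.0, 2^1.5, 2^2.0, 2^2.5.
SIGMAS = [2 ** (k / 2) for k in range(1, 6)]

def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at 3 sigma, normalized to sum to 1."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(image, sigma):
    """Blur a 2-D grayscale image; sigma = 0 returns the image unblurred."""
    image = image.astype(float)
    if sigma == 0:
        return image
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(image, r, mode="edge")   # replicate border pixels
    # Separable filter: convolve every row, then every column.
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)
```

Blurring a fine checkerboard with even the smallest sigma in the list nearly flattens it, which illustrates how quickly the high-frequency detail that local descriptors depend on disappears.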