This project focused on the task of recognizing various scenes using different kinds of image features and machine learning methods. The three combinations of features and matching methods used for the project were:

1. Tiny images with a nearest neighbor classifier
2. Bags of SIFT features with a nearest neighbor classifier
3. Bags of SIFT features with 1-vs-all linear SVMs
The first part of the project required implementing a rather ineffective image feature, "tiny images". This involved resizing an image to 16×16, shifting the pixel intensities so their average is zero, normalizing the result to unit length, and returning it as a 1×256 vector. To compare scene features for matching, I started by implementing the nearest neighbor classifier, which gives each test image the label of the training image whose feature is closest to it in terms of pairwise distance. This combination achieved an average accuracy of 22.4%, as shown in the per-category results below:
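The tiny-image feature and nearest neighbor classifier described above can be sketched roughly as follows. This is a minimal NumPy version, not the assignment's actual code: the resize here is crude subsampling rather than proper interpolation, and the distance computation is a straightforward broadcasted Euclidean distance.

```python
import numpy as np

def tiny_image(img, size=16):
    """Shrink a grayscale image to size x size by crude subsampling,
    then zero-mean and unit-normalize, returning a flat 256-d vector."""
    h, w = img.shape
    rows = np.arange(size) * h // size   # crude stand-in for a real resize
    cols = np.arange(size) * w // size
    feat = img[np.ix_(rows, cols)].astype(float).flatten()
    feat -= feat.mean()                  # make the average intensity zero
    norm = np.linalg.norm(feat)
    return feat / norm if norm > 0 else feat

def nearest_neighbor_labels(train_feats, train_labels, test_feats):
    """Assign each test feature the label of the closest training feature
    (pairwise Euclidean distance)."""
    d = np.linalg.norm(test_feats[:, None, :] - train_feats[None, :, :], axis=2)
    return [train_labels[i] for i in d.argmin(axis=1)]
```

Even in this form the feature discards almost all spatial and high-frequency information, which is consistent with the low accuracy reported below.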
*Per-category results (sample training and true-positive images omitted): true labels of two sample false positives and predicted labels of two sample false negatives.*

Category name | Accuracy | False positives (true label) | False negatives (predicted label)
---|---|---|---
Kitchen | 0.080 | Bedroom, Street | Coast, Bedroom
Store | 0.020 | Forest, Forest | TallBuilding, InsideCity
Bedroom | 0.180 | Industrial, Office | LivingRoom, Highway
LivingRoom | 0.100 | Coast, Bedroom | Mountain, Industrial
Office | 0.160 | Forest, LivingRoom | Forest, Bedroom
Industrial | 0.120 | Forest, TallBuilding | Highway, Highway
Suburb | 0.370 | Kitchen, Street | InsideCity, Kitchen
InsideCity | 0.060 | Store, LivingRoom | Street, Coast
TallBuilding | 0.230 | Industrial, Office | Mountain, Industrial
Street | 0.420 | Store, InsideCity | Industrial, LivingRoom
Highway | 0.560 | Street, LivingRoom | Coast, Mountain
OpenCountry | 0.350 | Coast, Forest | Coast, Coast
Coast | 0.400 | Office, Mountain | OpenCountry, InsideCity
Mountain | 0.180 | InsideCity, Kitchen | Coast, Office
Forest | 0.130 | Kitchen, Store | Suburb, Coast
The next part of the project involved a more sophisticated image feature, the bag of words. In the case of images, a "word" is a representative SIFT feature, one that many of the SIFT features sampled from the image set are similar to. The first step in the process was building a vocabulary of "visual words" against which new SIFT features could be compared. This was done with the unsupervised machine learning method known as k-means clustering, with k = 200, which produced 200 SIFT features that are the centers of "clusters" of similar SIFT features. Constructing a bag of SIFTs for each image then amounted to building a histogram of 200 bins, where each bin counts the number of the image's SIFT features closest to that bin's corresponding visual word. The SIFT histogram was normalized and used as the feature representing its image. The bags of SIFTs were then paired with the nearest neighbor classifier to see whether they yielded better results than the tiny images. The result was 50.3% accuracy with a step size of 10 for sampling SIFT features both while building the vocabulary and while building the bags. Using a step size of 7 in the bag building process yielded a higher accuracy of 52.1%, but made the process take longer (a little less than 12 minutes). The table below shows the per-category results of the 52.1% run:
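The vocabulary-building and bag-of-words steps above can be sketched as follows. This is a minimal NumPy illustration, not the project's actual pipeline: it uses a plain Lloyd-style k-means with a fixed iteration count, and it operates on generic descriptor vectors in place of real SIFT descriptors (the test below uses a small k; the project used k = 200).

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=10, seed=0):
    """Plain k-means over descriptor vectors; returns k cluster centers,
    which serve as the 'visual words'."""
    descriptors = np.asarray(descriptors, dtype=float)
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest center
        d = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        assign = d.argmin(axis=1)
        # move each center to the mean of its assigned descriptors
        for j in range(k):
            pts = descriptors[assign == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def bag_of_words(descriptors, vocab):
    """Histogram of nearest visual words, normalized to unit length."""
    d = np.linalg.norm(np.asarray(descriptors, float)[:, None] - vocab[None], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(vocab)).astype(float)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist
```

The resulting normalized histogram is the fixed-length feature vector that replaces the tiny image in the nearest neighbor comparison.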
*Per-category results (sample training and true-positive images omitted): true labels of two sample false positives and predicted labels of two sample false negatives.*

Category name | Accuracy | False positives (true label) | False negatives (predicted label)
---|---|---|---
Kitchen | 0.390 | Office, InsideCity | Office, Office
Store | 0.490 | Bedroom, LivingRoom | Industrial, LivingRoom
Bedroom | 0.280 | LivingRoom, InsideCity | Kitchen, Office
LivingRoom | 0.350 | Bedroom, Store | TallBuilding, Store
Office | 0.730 | Bedroom, Bedroom | Bedroom, LivingRoom
Industrial | 0.270 | Street, Coast | InsideCity, InsideCity
Suburb | 0.890 | OpenCountry, InsideCity | Highway, Street
InsideCity | 0.290 | Store, Store | Kitchen, Highway
TallBuilding | 0.370 | InsideCity, Industrial | Street, LivingRoom
Street | 0.550 | LivingRoom, Suburb | Store, Suburb
Highway | 0.770 | Industrial, Coast | Store, OpenCountry
OpenCountry | 0.450 | Highway, Industrial | Suburb, Highway
Coast | 0.520 | InsideCity, TallBuilding | Highway, Highway
Mountain | 0.550 | Coast, OpenCountry | Highway, OpenCountry
Forest | 0.910 | TallBuilding, OpenCountry | Mountain, Mountain
The final task in this assignment was to further improve scene recognition performance by implementing a more capable classifier. The classifier implemented for this part was a group of 15 support vector machines. A support vector machine is essentially a binary classifier that scores an input vector x as w·x + b, where w is a vector of weights and b is a bias. A 1-vs-all SVM was trained for every category of labels using bags of SIFTs from a set of training images. Bags of SIFTs from the test images were then scored by each SVM, and every test image was assigned the label of whichever SVM yielded the highest score. The highest accuracy of 67.0% came from an SVM regularization factor of 0.0001 and a step size of 7, which allows denser sampling of SIFTs. However, this run proved slow, at close to 11 minutes. Changing the step size to 10 while keeping the regularization factor the same cut the runtime to about 6 minutes at a lower but still respectable accuracy of 63.9%. Keeping the step size at 10 but decreasing the regularization factor to 0.00001 was not helpful, for it yielded an accuracy of 59.3%. Here are the per-category results for the best run, which yielded 67.0%:
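The 1-vs-all scheme above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the assignment's actual solver: each binary SVM is trained by simple subgradient descent on the regularized hinge loss, and the `lam`, `epochs`, and `lr` values are illustrative (the best run reported above used a regularization factor of 0.0001).

```python
import numpy as np

def train_1_vs_all_svms(feats, labels, categories, lam=1e-4, epochs=100, lr=0.1):
    """Train one binary linear SVM (w, b) per category via subgradient
    descent on the hinge loss with L2 regularization weight lam."""
    labels = np.asarray(labels)
    svms = {}
    for cat in categories:
        y = np.where(labels == cat, 1.0, -1.0)   # this category vs. all others
        w = np.zeros(feats.shape[1])
        b = 0.0
        for _ in range(epochs):
            margins = y * (feats @ w + b)
            viol = margins < 1.0                 # examples violating the margin
            gw = lam * w
            gb = 0.0
            if viol.any():
                gw = gw - (y[viol, None] * feats[viol]).mean(axis=0)
                gb = -y[viol].mean()
            w -= lr * gw
            b -= lr * gb
        svms[cat] = (w, b)
    return svms

def predict(svms, feats):
    """Label each test feature by the SVM with the highest score w.x + b."""
    cats = list(svms)
    scores = np.column_stack([feats @ w + b for w, b in svms.values()])
    return [cats[i] for i in scores.argmax(axis=1)]
```

The decision rule is exactly the one described in the text: every test bag of SIFTs is scored by all 15 SVMs, and the category whose SVM gives the largest w·x + b wins.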
*Per-category results (sample training and true-positive images omitted): true labels of two sample false positives and predicted labels of two sample false negatives.*

Category name | Accuracy | False positives (true label) | False negatives (predicted label)
---|---|---|---
Kitchen | 0.580 | LivingRoom, LivingRoom | Bedroom, Bedroom
Store | 0.530 | LivingRoom, Kitchen | Bedroom, Office
Bedroom | 0.390 | LivingRoom, Kitchen | Office, Store
LivingRoom | 0.300 | InsideCity, Bedroom | InsideCity, Kitchen
Office | 0.870 | LivingRoom, Bedroom | LivingRoom, Kitchen
Industrial | 0.540 | InsideCity, Store | TallBuilding, Mountain
Suburb | 0.930 | Highway, Coast | Store, Street
InsideCity | 0.590 | Street, LivingRoom | Industrial, Store
TallBuilding | 0.750 | Bedroom, Store | Mountain, Industrial
Street | 0.690 | Industrial, InsideCity | LivingRoom, Industrial
Highway | 0.770 | Street, OpenCountry | Coast, Suburb
OpenCountry | 0.490 | Coast, Industrial | Forest, Highway
Coast | 0.830 | OpenCountry, Industrial | Mountain, Mountain
Mountain | 0.850 | LivingRoom, TallBuilding | Forest, Suburb
Forest | 0.940 | Mountain, OpenCountry | Mountain, Mountain