Project 4 / Scene Recognition with Bag of Words
We implement two image representations and two classification techniques, and run the code in the following three combinations:
- Tiny images representation and nearest neighbor classifier
- Bag of SIFT representation and nearest neighbor classifier
- Bag of SIFT representation and linear SVM classifier
Tiny images representation
We resize each original image to a very small square image (16*16), irrespective of its original size. The tiny images are made zero mean and unit length (normalized) to improve performance. We tested image sizes of 8*8, 16*16, 32*32 and 64*64. Although 32*32 gives the highest accuracy for a fixed value of K, 16*16 gives the best accuracy once K (the number of nearest neighbors) is also varied. Hence, we select 16*16 as our final image size.
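The resize-and-normalize step described above can be sketched as follows. This is a minimal illustration in plain Python rather than the project's MATLAB code; the function name `tiny_image` and the nearest-neighbor sampling used for resizing are assumptions made for the example, not the method actually used.

```python
import math

def tiny_image(image, size=16):
    """Return a zero-mean, unit-length feature vector of size*size values.

    `image` is a 2D list of grayscale values (rows of pixels).
    Hypothetical helper: the resize here is plain nearest-neighbor sampling,
    which ignores aspect ratio just like the square resize in the text.
    """
    rows, cols = len(image), len(image[0])
    # Sample the source image on a size x size grid.
    pixels = [
        float(image[r * rows // size][c * cols // size])
        for r in range(size)
        for c in range(size)
    ]
    # Subtract the mean so the feature is zero mean.
    mean = sum(pixels) / len(pixels)
    centered = [p - mean for p in pixels]
    # Divide by the Euclidean norm so the feature has unit length.
    norm = math.sqrt(sum(v * v for v in centered))
    if norm == 0:
        return centered  # constant image: nothing to scale
    return [v / norm for v in centered]
```

Calling `tiny_image(img, 16)` on any grayscale image gives a 256-dimensional vector, so images of different sizes become directly comparable.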
Nearest neighbor classifier
We first compute the pairwise distance matrix between test images and training images, which gives the Euclidean distance between each test image and each training image. We then sort each row of this matrix and take a majority vote among the labels of the K closest training images for each test image.
Testing different values of K, we notice that K in the range 20 to 30 gives good results (around 24%-25% accuracy). Hence we pick the middle of that range, K = 25.
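A minimal sketch of the K-nearest-neighbor vote, written in plain Python for illustration. The function name `knn_classify` and its arguments are hypothetical, and distances are computed per test image instead of as one full matrix to keep the example short.

```python
import math
from collections import Counter

def knn_classify(train_feats, train_labels, test_feats, k=25):
    """Label each test feature by majority vote among its k nearest training features."""
    predictions = []
    for x in test_feats:
        # Euclidean distance from this test feature to every training feature.
        dists = [
            (math.sqrt(sum((a - b) ** 2 for a, b in zip(x, t))), label)
            for t, label in zip(train_feats, train_labels)
        ]
        # Sort by distance and keep the labels of the k closest training images.
        dists.sort(key=lambda pair: pair[0])
        votes = Counter(label for _, label in dists[:k])
        predictions.append(votes.most_common(1)[0][0])
    return predictions
```

With K = 1 this reduces to the plain nearest neighbor classifier; larger K smooths out the effect of individual noisy training images.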
Tiny Image Size constant 16*16, varying K
1. Tiny images representation and nearest neighbor classifier
This leads us to our first analysis. Running the code with image size 16*16 and different values of K, we observe the following trend: smaller values of K do not give good accuracies. As we increase K to 20 we see a steady increase in accuracy, and the best accuracies lie between K=20 and K=30. We finally pick the middle value, K=25.
Now we fix the value of K and run the analysis for different tiny image sizes (plot of accuracy against tiny image size).
Here we notice that the accuracy increases as the image size grows from 8*8 to 32*32 and then decreases at 64*64.
Bag of SIFT representation
build_vocabulary.m
We randomly sample 15 descriptors from each training image. Once we have tens of thousands of SIFT features from many training images, we cluster them with k-means. The resulting centroids form the visual word vocabulary.
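The sampling and clustering steps can be sketched as below. This is an illustrative plain-Python version of Lloyd's k-means, not the contents of build_vocabulary.m; the function signature, the fixed iteration count and the random initialization are assumptions, and SIFT extraction itself is taken as given (each image is represented by a list of descriptor vectors).

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_vocabulary(descriptors_per_image, vocab_size,
                     samples_per_image=15, iterations=20, seed=0):
    """Cluster sampled descriptors with k-means and return the centroids."""
    rng = random.Random(seed)
    # Randomly sample up to samples_per_image descriptors from each image.
    pool = []
    for descriptors in descriptors_per_image:
        count = min(samples_per_image, len(descriptors))
        pool.extend(rng.sample(descriptors, count))
    # Initialize centroids from random pooled descriptors
    # (requires vocab_size <= len(pool)).
    centroids = [list(d) for d in rng.sample(pool, vocab_size)]
    for _ in range(iterations):
        # Assignment step: attach each descriptor to its nearest centroid.
        clusters = [[] for _ in centroids]
        for d in pool:
            nearest = min(range(vocab_size),
                          key=lambda i: squared_distance(d, centroids[i]))
            clusters[nearest].append(d)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids
```

The returned list has `vocab_size` entries, each with the same dimensionality as a descriptor (128 for standard SIFT).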
get_bags_of_sifts.m
We extract SIFT descriptors from the input image and assign each descriptor to its closest visual word, i.e. the nearest centroid produced by build_vocabulary. We then build a histogram counting how many times each visual word was used and normalize it for each image.
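A small sketch of the histogram construction, again in plain Python and not taken from get_bags_of_sifts.m. The function name `bag_of_words` is hypothetical, and the normalization shown (dividing by the total count so the bins sum to 1) is one common choice assumed here, since the text only says the histogram is normalized.

```python
def bag_of_words(descriptors, vocabulary):
    """Return a normalized histogram of visual word counts for one image."""
    histogram = [0.0] * len(vocabulary)
    for d in descriptors:
        # Find the index of the nearest visual word (squared Euclidean distance).
        nearest = min(
            range(len(vocabulary)),
            key=lambda i: sum((x - y) ** 2 for x, y in zip(d, vocabulary[i])),
        )
        histogram[nearest] += 1.0
    total = sum(histogram)
    # Assumed normalization: make the bins sum to 1 so images with
    # different numbers of descriptors remain comparable.
    return [h / total for h in histogram] if total else histogram
```

For example, with a two-word vocabulary `[[0, 0], [10, 10]]` and descriptors `[[1, 0], [0, 1], [9, 9], [0, 0]]`, three descriptors fall on the first word and one on the second, giving `[0.75, 0.25]`.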
2. Bag of SIFT representation and nearest neighbor classifier
This leads us to our second analysis, which gives an accuracy of 60%-68% depending on the vocab size.
Linear SVM classifier
We now train 15 binary 1-vs-all linear SVMs that operate in the bag of SIFT feature space, where each 1-vs-all classifier is trained to recognize 'forest' vs 'non-forest', 'kitchen' vs 'non-kitchen', etc. All 15 classifiers are evaluated on each test case, and the classifier that is most confidently positive "wins". When learning the SVMs, we vary the regularization parameter lambda and keep the value that gives the best accuracy.
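The 1-vs-all scheme can be illustrated with the sketch below. It is plain Python written for this explanation: the trainer is a simple stochastic subgradient descent on the regularized hinge loss, which is an assumption and not necessarily the solver used in the project, and the names `train_linear_svm`, `train_one_vs_all` and `predict` as well as the `epochs` and `lr` settings are hypothetical.

```python
def score(model, x):
    """Signed confidence w.x + b of one binary classifier."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_svm(features, targets, lam=0.01, epochs=200, lr=0.1):
    """Train one binary linear SVM; targets are +1 or -1. Returns (w, b)."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, targets):
            if y * score((w, b), x) < 1:
                # Margin violated: step along hinge-loss and regularizer gradients.
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Margin satisfied: only the lambda regularizer shrinks w.
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def train_one_vs_all(features, labels, lam=0.01):
    """Train one 'category vs rest' classifier per category."""
    return {
        c: train_linear_svm(features, [1 if l == c else -1 for l in labels], lam)
        for c in sorted(set(labels))
    }

def predict(models, x):
    """Pick the category whose classifier is most confidently positive."""
    return max(models, key=lambda c: score(models[c], x))
```

Larger values of `lam` pull the weights towards zero more strongly, which is why the parameter has to be tuned against held-out accuracy rather than fixed in advance.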
3. Bag of SIFT representation and Linear SVM classifier
This leads us to our third analysis. Running the code with vocab sizes of 100, 200, 500 and 1000, we observe that the accuracy of the model increases from 60% to 70% as the vocab size grows.
Plot for Accuracy vs Vocab Size ([100,200,500,1000])
Mean accuracy with vocab_size = 200 is 67.1%
Category name | Accuracy | Sample training images | Sample true positives | False positives with true label | False negatives with wrong predicted label |
Kitchen | 0.080 | | | Industrial, Forest | Highway, Forest |
Store | 0.040 | | | Office, TallBuilding | TallBuilding, Coast |
Bedroom | 0.200 | | | InsideCity, Store | OpenCountry, Mountain |
LivingRoom | 0.100 | | | Office, Kitchen | Bedroom, Office |
Office | 0.190 | | | Mountain, Bedroom | Industrial, InsideCity |
Industrial | 0.130 | | | LivingRoom, TallBuilding | InsideCity, Store |
Suburb | 0.470 | | | Coast, InsideCity | Kitchen, Street |
InsideCity | 0.030 | | | Suburb, Coast | TallBuilding, Store |
TallBuilding | 0.180 | | | Store, Forest | LivingRoom, Coast |
Street | 0.600 | | | Kitchen, Store | Highway, Suburb |
Highway | 0.750 | | | Suburb, Industrial | Coast, Mountain |
OpenCountry | 0.310 | | | Forest, Kitchen | Suburb, Coast |
Coast | 0.400 | | | Mountain, Mountain | InsideCity, OpenCountry |
Mountain | 0.290 | | | Store, Bedroom | Kitchen, Forest |
Forest | 0.140 | | | Mountain, Street | Suburb, Kitchen |