Project 4 / Scene Recognition with Bag of Words

We will be implementing 2 image representations and 2 classification techniques and run the code in the following ways.
  1. Tiny images representation and nearest neighbor classifier
  2. Bag of SIFT representation and nearest neighbor classifier
  3. Bag of SIFT representation and linear SVM classifier

Tiny images representation

We resize the original image to a very small square sized image (16x16). The tiny images are zero mean and unit length (normalized) to increase performance. Any image is resized to the same square box, irrespective of its size. We test this on images sized 8*8, 16*16, 32*32 and 64*64. Though 32*32 gives the maximum accuracy for a constant value of K but with varying K (number of nearest neighbors), 16*16 gives us the best accuracy. Hence, we select 16*16 to be our final image size.

Nearest neighbor classifier

We first find the pairwise distance matrix for test images vs train images. This gives us the euclidean distance between each test image and training image. We now sort this array row-wise and find a majority vote among the K closest train images for each test image. On testing for different values of K, we notice that K in the range 20 to 30 gives good results (around 24%-25% accuracy). Hence we pick an average value = 25 finally.

Tiny Image Size constant 16*16, varying K

1. Tiny images representation and nearest neighbor classifier

This leads us to run our first analysis. On running the code with image size = 16*16 and different values of K, we observe the following trends. We notice that for smaller values of K we do not get good accuracies. As we increase k to 20 we see a steady increase in the accuracy and the best accuracies lie between K=20 and K=30. We finally pick an average value of K=25.






Now we fix the value of K and run our analysis on different tiny image sizes. (Tiny Image Size, Accuracy)
Here we notice that the accuracy increases from 8 to 32 and then decreases for 64.

(8*8, 22.5%)                                                               (16*16, 24.3%)                                                               (32*32, 24.5%)                                                               (64*64, 24.3%)

Bag of SIFT representation


build_vocabulary.m
We randomly sample 15 descriptors from each image. Once we have tens of thousands of SIFT features from many training images, we cluster them with k-means. The resulting centroids are now the visual word vocabulary.

get_bags_of_sifts.m
We construct SIFT features here by taking our input image and finding the closest SIFT features that we constructed from build_vocabulary. We now build a histogram indicating how many times each SIFT feature was used and normalize it for each image.

2. Bag of SIFT representation and nearest neighbor classifier

This leads us to run our second analysis. This gives an accuracy ranging from 60%-68% for different vocab sizes.

Linear SVM classifier

We now train 15 binary 1-vs-all linear SVMS to operate in the bag of SIFT feature space where each 1-vs-all classifier will be trained to recognize 'forest' vs 'non-forest', 'kitchen' vs 'non-kitchen', etc. All 15 classifiers will be evaluated on each test case and the classifier which is most confidently positive "wins". When learning an SVM, we modify the lambda parameter (for regularization) and set it to the most optimal value.

3. Bag of SIFT representation and Linear SVM classifier

This leads us to run our third analysis. On running the code with vocab size in the range [100,200,500,1000], we observe that the accuracy of the model increases from 60% to 70%.

Plot for Accuracy vs Vocab Size ([100,200,500,1000])

Accuracy (mean of vocab_size =200) is 67.1%

Category name Accuracy Sample training images Sample true positives False positives with true label False negatives with wrong predicted label
Kitchen 0.080
Industrial

Forest

Highway

Forest
Store 0.040
Office

TallBuilding

TallBuilding

Coast
Bedroom 0.200
InsideCity

Store

OpenCountry

Mountain
LivingRoom 0.100
Office

Kitchen

Bedroom

Office
Office 0.190
Mountain

Bedroom

Industrial

InsideCity
Industrial 0.130
LivingRoom

TallBuilding

InsideCity

Store
Suburb 0.470
Coast

InsideCity

Kitchen

Street
InsideCity 0.030
Suburb

Coast

TallBuilding

Store
TallBuilding 0.180
Store

Forest

LivingRoom

Coast
Street 0.600
Kitchen

Store

Highway

Suburb
Highway 0.750
Suburb

Industrial

Coast

Mountain
OpenCountry 0.310
Forest

Kitchen

Suburb

Coast
Coast 0.400
Mountain

Mountain

InsideCity

OpenCountry
Mountain 0.290
Store

Bedroom

Kitchen

Forest
Forest 0.140
Mountain

Street

Suburb

Kitchen
Category name Accuracy Sample training images Sample true positives False positives with true label False negatives with wrong predicted label