In this project the objective was to train a scene classifier and recognize scenes. To do this we implemented the following solutions:
For the tiny image features I did not do anything fancy: I simply resized the images using MATLAB functions. The nearest neighbor classifier is a plain 1-nearest-neighbor classifier.
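As a rough illustration of the 1-nearest-neighbor rule mentioned above, here is a minimal Python sketch (the project itself used MATLAB). The function name, the toy 2-D features and the use of Euclidean distance are assumptions made for the example, not the project's actual code.

```python
import math

def nearest_neighbor_classify(train_feats, train_labels, test_feats):
    """Give each test feature the label of its closest training feature (1-NN)."""
    predictions = []
    for query in test_feats:
        # Euclidean distance from the query to every training feature.
        distances = [math.dist(query, feat) for feat in train_feats]
        best = min(range(len(distances)), key=distances.__getitem__)
        predictions.append(train_labels[best])
    return predictions

# Toy 2-D features; real tiny-image features would be flattened, resized images.
train_feats = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
train_labels = ["Kitchen", "Kitchen", "Forest"]
print(nearest_neighbor_classify(train_feats, train_labels, [(0.2, 0.1), (4.0, 4.5)]))
```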
I implemented SIFT vocabulary extraction and the SIFT bag of words, which essentially extracts SIFT features from the images and builds a histogram that counts, for each vocabulary word, how many features have it as their nearest neighbor. For a while my accuracies were not good enough because I was using the MATLAB histcounts function, which behaved in a way I did not expect; even with the TAs we could not spot the problem. The function was discretizing the continuous range [1, max_nearest_neighbor_index] into vocab_size bins instead of counting each of the vocab_size integer indices separately, so whenever the max_nearest_neighbor_index of an image was less than vocab_size, features were binned incorrectly. After debugging my code I found that this was the problem and implemented the histogram count on my own, which worked.
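The binning pitfall described above can be sketched in a few lines of Python. `range_binned_counts` is a hypothetical stand-in that imitates the behaviour described (equal-width bins over the data range); it does not reproduce MATLAB's exact edge selection. `integer_counts` is one way to write the per-word count, assuming 1-based word indices.

```python
def range_binned_counts(indices, num_bins):
    """Split the continuous range [min, max] of the data into num_bins equal-width bins."""
    lo, hi = min(indices), max(indices)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for idx in indices:
        # Clamp so that the maximum value lands in the last bin.
        b = 0 if width == 0 else min(int((idx - lo) / width), num_bins - 1)
        counts[b] += 1
    return counts

def integer_counts(indices, vocab_size):
    """Count how often each 1-based visual-word index occurs."""
    counts = [0] * vocab_size
    for idx in indices:
        counts[idx - 1] += 1
    return counts

vocab_size = 8
# Nearest visual word for each descriptor of one image; the maximum (4) is below vocab_size.
nearest_words = [1, 1, 2, 4, 4, 4]

print(integer_counts(nearest_words, vocab_size))       # word 4 is counted in slot 4
print(range_binned_counts(nearest_words, vocab_size))  # word 4 ends up in the last slot
```

In the second histogram the counts for word 4 land in the slot meant for word 8, so histograms from different images are no longer comparable. In MATLAB, passing explicit bin edges to histcounts rather than a bin count would presumably avoid this as well.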
I fine-tuned the SVM regularization parameter (lambda) before fixing the binning problem. The best value, 0.00001, also worked well in the later parts.
lambda | accuracy |
---|---|
1 | 0.395 |
0.1 | 0.316 |
0.01 | 0.307 |
0.001 | 0.427 |
0.0001 | 0.535 |
0.00005 | 0.569 |
0.00001 | 0.572 |
0.000001 | 0.559 |
0.0000001 | 0.492 |
0.00000001 | 0.469 |
I fine-tuned the size of the SIFT bins after fixing the binning issue. I used a step of 10 both for building the vocabulary and for testing, with the fast parameter in both cases, and the SVM classifier. The bin size used for the rest of the project is 8x8.
bin size | accuracy |
---|---|
3x3 | 0.625 |
8x8 | 0.649 |
16x16 | 0.639 |
32x32 | 0.590 |
64x64 | 0.491 |
I tested the performance of my pipeline with different vocabulary sizes. I used a step of 1 to build the vocabularies, which took a long time: I wrote a script to build several vocabularies and let it run overnight. The bin size was 3x3, the default for vl_sift, because I ran this experiment before optimizing the bin size. For testing I used a step of 10 and the fast parameter.
vocab_size | accuracy | time to test |
---|---|---|
10 | 0.344 | 38s |
20 | 0.471 | 45s |
50 | 0.545 | 68s |
100 | 0.589 | 107s |
200 | 0.615 | 194s |
400 | 0.641 | 334s |
1000 | 0.639 | 786s |
As we can see, the accuracy climbs until it hits diminishing returns at a vocab size of around 400, then drops slightly at 1000. The time taken to test is also much higher for larger vocab sizes, and the time to build them is extremely long. A vocab size of 200 is therefore quite adequate.
I implemented SIFT with spatial pyramids. In essence, I extract SIFT features from the whole image and count them into a histogram; then I subdivide the image into four parts, count the SIFT features of each part and concatenate the resulting feature vectors; then I subdivide the image into 16 parts and count and concatenate again. Using SVM with lambda 0.00001, a vocab size of 200, a step of 3 for vocabulary building, 8x8 bins and a step of 5 for SIFT feature extraction, we have the following results.
configuration | accuracy |
---|---|
without pyramids | 0.672 |
with pyramids | 0.701 |
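The spatial pyramid described above can be sketched as follows in Python. This is a simplified illustration under stated assumptions, not the project's MATLAB code: it assumes each feature is already reduced to a hypothetical `(x, y, word)` triple with a 0-based visual-word index, it assigns features to grid cells by their location instead of re-running SIFT per sub-image, and it applies no per-level weighting.

```python
def word_histogram(words, vocab_size):
    """Count occurrences of each 0-based visual-word index."""
    counts = [0] * vocab_size
    for w in words:
        counts[w] += 1
    return counts

def spatial_pyramid(features, width, height, vocab_size, levels=(1, 2, 4)):
    """Concatenate word histograms over 1x1, 2x2 and 4x4 grids of the image.

    features: list of (x, y, word) with 0 <= x < width and 0 <= y < height.
    """
    pyramid = []
    for cells in levels:
        for row in range(cells):
            for col in range(cells):
                # Keep only the features whose location falls in this grid cell.
                in_cell = [w for (x, y, w) in features
                           if int(x * cells / width) == col
                           and int(y * cells / height) == row]
                pyramid.extend(word_histogram(in_cell, vocab_size))
    return pyramid

features = [(10, 10, 0), (90, 10, 1), (90, 90, 1)]
vec = spatial_pyramid(features, width=100, height=100, vocab_size=2)
print(len(vec))  # (1 + 4 + 16) cells * 2 words
```

With a vocabulary of 200 words the same layout would give a vector of (1 + 4 + 16) * 200 = 4200 entries per image.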
I implemented Fisher encoding by creating a get_vl_gmm script and a get_fisher_sifts script. The get_vl_gmm script builds the vocabulary for Fisher encoding using the vl_gmm function. The get_fisher_sifts script creates the feature vectors using the vl_fisher function. I tuned the vocabulary size for my Fisher encoding.
vocab_size | accuracy |
---|---|
30 | 0.703 |
60 | 0.726 |
100 | 0.726 |
150 | 0.725 |
We see the same diminishing returns here: accuracy stops improving beyond a vocabulary of 60, so we pick 60. Note that the pipeline runs much faster, and that we still get good accuracies while using steps of 10 to extract SIFT features. With a step of 5 for the SIFT extraction I got an accuracy of 0.741.
I tried to apply the same idea of spatial pyramids to Fisher encoding. I implemented the program but the results were neither better nor worse. I think Fisher encoding already captures some sort of spatial information that is redundant with spatial pyramids.
I added GIST descriptors for images using Oliva and Torralba's code. The good part is that a vocabulary does not need to be built for GIST and we can compare images directly. I used GIST descriptors alongside SIFT features encoded with Fisher vectors. I had a slight accuracy increase.
features | accuracy |
---|---|
SIFT w/ Fisher Encoding | 0.741 |
SIFT w/ Fisher Encoding + GIST | 0.743 |
Using Olivier Chapelle's MATLAB code I implemented a non-linear SVM classifier with a Gaussian RBF kernel. I compute the kernel on the training data with the kernel function that I found on Wikipedia and train the classifier 1 vs all. I then find the labels of the test data by first computing the kernel for the test features and then computing labels = weights * kernel_test + biases.
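The prediction step described above can be sketched in Python as follows. This is an illustrative sketch, not Chapelle's code or the project's implementation: the function names, the toy weights and the data are made up, and the kernel is assumed to be the usual Gaussian form exp(-||a - b||^2 / (2 sigma^2)). The point it shows is that the test kernel is evaluated between training features and test features, after which each class score is weights * kernel_test + bias.

```python
import math

def rbf_kernel(a, b, sigma):
    """Gaussian RBF kernel between two feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_matrix(rows, cols, sigma):
    """Kernel values for every (row feature, column feature) pair."""
    return [[rbf_kernel(r, c, sigma) for c in cols] for r in rows]

def predict(weights, biases, train_feats, test_feats, sigma, class_names):
    """One-vs-all prediction; weights[c][i] is the coefficient of training sample i for class c."""
    k_test = kernel_matrix(train_feats, test_feats, sigma)  # n_train x n_test
    labels = []
    for j in range(len(test_feats)):
        scores = [sum(w[i] * k_test[i][j] for i in range(len(train_feats))) + b
                  for w, b in zip(weights, biases)]
        # The class with the highest score wins.
        labels.append(class_names[scores.index(max(scores))])
    return labels

# Toy setup: two training samples, one per class, with hand-picked coefficients.
train_feats = [(0.0, 0.0), (4.0, 4.0)]
class_names = ["Office", "Forest"]
weights = [[1.0, -1.0], [-1.0, 1.0]]
biases = [0.0, 0.0]
print(predict(weights, biases, train_feats, [(0.5, 0.0), (4.0, 3.0)], 1.0, class_names))
```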
My final result is 100% accuracy which had me thinking that I was cheating and using the test labels. I checked my code several times and can't find the moment where the classifier is cheating but I am pretty sure that it is because it gets 100% even with tiny image features. I reviewed the code extensively with a TA but couldn't find where the issue lies. If I change the sigma parameter to 100 I don't get 100% anymore but with a sigma parameter of 1 and lambda = 0.00001 I get 100% accuracy.
I present my most accurate configuration: SIFT with Fisher vector encoding plus GIST, using a step of 3, a bin size of 8x8 and the fast parameter, which gives me an accuracy of 0.752.
Category name | Accuracy | False positives with true label | False negatives with wrong predicted label |
---|---|---|---|
Kitchen | 0.650 | LivingRoom, Bedroom | LivingRoom, Bedroom |
Store | 0.730 | Industrial, Bedroom | Suburb, Industrial |
Bedroom | 0.570 | LivingRoom, OpenCountry | LivingRoom, Office |
LivingRoom | 0.570 | Bedroom, TallBuilding | Industrial, Store |
Office | 0.900 | Kitchen, TallBuilding | Bedroom, Kitchen |
Industrial | 0.580 | Coast, Street | InsideCity, Mountain |
Suburb | 0.990 | Industrial, TallBuilding | Forest |
InsideCity | 0.690 | Store, Street | Industrial, Bedroom |
TallBuilding | 0.800 | Store, Street | Mountain, Industrial |
Street | 0.810 | Forest, TallBuilding | Highway, InsideCity |
Highway | 0.830 | Coast, OpenCountry | LivingRoom, Kitchen |
OpenCountry | 0.680 | Coast, Mountain | Mountain, Industrial |
Coast | 0.750 | OpenCountry, Highway | Mountain, Industrial |
Mountain | 0.830 | Forest, TallBuilding | OpenCountry, Forest |
Forest | 0.900 | TallBuilding, Bedroom | Mountain, OpenCountry |