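The aim here is to perform scene recognition using a basic bag-of-words model. Performance is reported for three combinations: tiny images with a nearest neighbor classifier, bag of SIFT with a nearest neighbor classifier, and bag of SIFT with a linear SVM classifier.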
We simply resize each image to a small, fixed resolution (I used 16x16). This is not a particularly good representation, because it discards all of the high-frequency image content and is not especially invariant to spatial or brightness shifts.
When classifying a test feature into a particular category, we simply find the nearest training example using L2 distance as the metric and assign the test case the label of that nearest training example. This is 1-NN. It is quite vulnerable to training noise, though, which can be alleviated by voting over the K nearest neighbours: we find the K training examples closest to the test feature and take the mode (the most frequently occurring label) among their labels. I implemented both 1-NN and K-NN.
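The voting step described above can be sketched in base MATLAB as follows. This is a minimal illustration, not the project code: the toy features and the integer class labels are assumptions made for the example (the project itself uses string labels and `vl_alldist2`), and the distance computation relies on implicit expansion, which needs R2016b or later.

```matlab
% Toy data (hypothetical): 4 training points in 2-D with integer class labels.
train_feats  = [0 0; 0 1; 5 5; 6 5];
train_labels = [1; 1; 2; 2];
test_feats   = [0.2 0.4; 5.5 5.1];
k = 3;  % number of neighbours that vote

% Squared L2 distances, size num_test x num_train.
D = sum(test_feats.^2, 2) + sum(train_feats.^2, 2).' ...
    - 2 * (test_feats * train_feats.');

% Sort each row and keep the indices of the k closest training examples.
[~, idx] = sort(D, 2);
nearest_labels = train_labels(idx(:, 1:k));   % num_test x k

% Majority vote: the most frequent label in each row.
predicted = mode(nearest_labels, 2);
```

Setting `k = 1` reduces this to the plain 1-NN rule. Note that `mode` breaks ties by returning the smallest label, which is one possible tie-breaking choice among several.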
We first need to establish a vocabulary of visual words. I sampled SIFT descriptors with a step size of 5 to create the vocabulary and then clustered them with k-means. I varied the number of clusters (vocab_size) and measured the accuracy for each value. The centroids of the clusters are stored as vocab.mat. We now represent our training and testing images as histograms of visual words. For each image we densely sample many SIFT descriptors, count how many of them fall into each cluster of our visual word vocabulary, and build the corresponding histogram. We normalize the histograms so that image size does not dramatically change the bag-of-features magnitude.
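As a rough illustration of the histogram step, the sketch below assigns each descriptor to its nearest visual word and counts the assignments. The tiny 2-D vocabulary and descriptors are made up for the example (real SIFT descriptors are 128-dimensional), and L1 normalization is an assumption, since the exact normalization used is not stated here.

```matlab
% Hypothetical toy vocabulary: 3 visual words (rows) in a 2-D descriptor space.
vocab = [0 0; 10 0; 0 10];
% Hypothetical descriptors for one image, one per row.
descriptors = [1 1; 9 1; 0 9; 1 0; 8 -1];
vocab_size = size(vocab, 1);

% Squared distance from every descriptor to every visual word
% (num_descriptors x vocab_size, using implicit expansion, R2016b+).
D = sum(descriptors.^2, 2) + sum(vocab.^2, 2).' ...
    - 2 * (descriptors * vocab.');

% Index of the nearest visual word for each descriptor.
[~, word_idx] = min(D, [], 2);

% Count how many descriptors fall into each visual word.
histogram = accumarray(word_idx, 1, [vocab_size 1]).';

% Assumed L1 normalization, so the number of descriptors does not matter.
histogram = histogram / sum(histogram);
```

With this toy input two descriptors land on each of the first two words and one on the third, so the normalized histogram is `[0.4 0.4 0.2]`.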
Linear classifiers are inherently binary and we have a 15-way classification problem. We train 15 binary, 1-vs-all SVMs, i.e. each classifier is trained to recognize 'forest' vs 'non-forest', 'kitchen' vs 'non-kitchen', etc. Scores are calculated for all 15 classifiers on each test case, and the classifier that is most confidently positive, i.e. the one with the highest score, "wins".
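The "highest score wins" rule can be sketched as below. The weights `W`, offsets `B`, the three-category subset and the test features are all hypothetical values chosen for illustration; in the project they would come from training one SVM per category.

```matlab
% Hypothetical subset of categories, one linear classifier per category.
categories = {'forest', 'kitchen', 'office'};

% Hypothetical learned parameters: column i of W and entry i of B
% belong to the classifier for categories{i}.
W = [1 -1  0;
     0  1 -1];                      % feature_dim x num_categories
B = [0 0.5 -0.5];                   % 1 x num_categories

test_feats = [2 0; 0 3; -1 -2];     % num_test x feature_dim

% Score of every classifier on every test case (implicit expansion, R2016b+).
scores = test_feats * W + B;        % num_test x num_categories

% The most confidently positive classifier decides the label.
[~, best] = max(scores, [], 2);
predicted = categories(best);
```

Because all classifiers are compared on their raw scores, this assumes the scores of the separately trained SVMs are on comparable scales.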
Experimental design extra credit: I experimented with many different vocabulary sizes; performance is reported below in RESULTS.
Experimental design extra credit: I randomly picked 100 training and 100 testing images in each iteration; the average performance and standard deviation are reported below in RESULTS.
%example code- get_tiny_images
image=imread(char(image_paths(i)));
image_feats(i,:)=reshape(imresize(image,[16,16]),1,256);
%example code- nearest_neighbor_classify
Distances = vl_alldist2( test_image_feats.',train_image_feats.');
[match, I_match]=sort(Distances,2);
k=1; %taking k nearest neighbours
nearest_neighbours=train_labels(I_match(:,1:k));
%example code- build_vocabulary
[locations, SIFT_ind] = vl_dsift(single(image),'STEP',5,'Fast');
%example code- get_bags_of_sifts
[locations, SIFT_feat] = vl_dsift(single(image),'STEP',10);
Distances = vl_alldist2(vocab_T,single(SIFT_feat));
[minD,c]=min(Distances); %c is 1 x num_descriptors: index of the nearest visual word for each descriptor
binranges=1:K;
[histogram]=histc(c,binranges);
%example code- svm_classify
labels1=strcmp(categories(i),train_labels);
labels2(find(labels1))=1;
[W(:,i), B(i)] = vl_svmtrain(single(train_image_feats'), labels2', lambda);
Category name | Accuracy | False positives with true label | False negatives with wrong predicted label
---|---|---|---
Kitchen | 0.590 | Bedroom, Store | Office, TallBuilding
Store | 0.520 | InsideCity, Bedroom | Highway, Mountain
Bedroom | 0.390 | Industrial, LivingRoom | LivingRoom, Mountain
LivingRoom | 0.110 | Bedroom, Store | Industrial, Office
Office | 0.970 | Store, Bedroom | Kitchen, Kitchen
Industrial | 0.370 | LivingRoom, LivingRoom | Bedroom, Highway
Suburb | 0.980 | LivingRoom, Mountain | InsideCity, TallBuilding
InsideCity | 0.660 | Street, Industrial | TallBuilding, Street
TallBuilding | 0.780 | Industrial, Kitchen | Mountain, Street
Street | 0.630 | Store, InsideCity | InsideCity, Industrial
Highway | 0.760 | Street, OpenCountry | Coast, OpenCountry
OpenCountry | 0.390 | Coast, Coast | Bedroom, Coast
Coast | 0.830 | OpenCountry, Highway | OpenCountry, Suburb
Mountain | 0.820 | Store, OpenCountry | Highway, Suburb
Forest | 0.930 | OpenCountry, OpenCountry | Mountain, Mountain
Tiny images representation and nearest neighbor classifier: Accuracy = 0.191 (1-NN)
Bag of SIFT representation and nearest neighbor classifier: Accuracy = 0.503 (Step size=3 in vl_dsift, no extra credit)
Bag of SIFT representation and linear SVM classifier: Accuracy = 0.649 (vocab_size=200, Step=2 in vl_dsift)
The variation of accuracy with respect to k is:
The variation of accuracy with respect to step-size is:
The variation of accuracy with respect to lambda is:
The variation of accuracy with respect to vocab_size is:
The mean of the accuracy is: 0.241
The standard deviation of the accuracy is: 0.135
Since we are taking only 100 training images, the accuracy is lower than that for 1500 training images
I calculated a histogram for each of the 4 quadrants of the image, in addition to 1 histogram of the full original image. The histograms are concatenated together, giving a total number of bins = 5 x number of clusters.
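A possible way to build such a concatenated feature is sketched below. The image size, descriptor locations and word assignments are hypothetical; the layout assumes the descriptor centres are available as a 2 x N array of (x, y) coordinates, and normalization is left out for brevity.

```matlab
vocab_size = 3;
img_w = 100;  img_h = 80;            % hypothetical image size in pixels

% Hypothetical descriptor centres: row 1 holds x, row 2 holds y.
locations = [10 90 20 70 60
             10 10 70 60 30];
word_idx  = [1 2 3 1 2];             % nearest visual word per descriptor

% Quadrant of each descriptor:
% 1 = top-left, 2 = top-right, 3 = bottom-left, 4 = bottom-right.
right    = locations(1, :) > img_w / 2;
bottom   = locations(2, :) > img_h / 2;
quadrant = 1 + right + 2 * bottom;

% First block: histogram over the full image.
feat = zeros(1, 5 * vocab_size);
feat(1:vocab_size) = sum(word_idx(:) == 1:vocab_size, 1);

% Next four blocks: one histogram per quadrant.
for q = 1:4
    in_q = (quadrant == q);
    bins = q * vocab_size + (1:vocab_size);
    feat(bins) = sum(word_idx(in_q).' == 1:vocab_size, 1);
end
```

The resulting vector has 5 x vocab_size entries, and the four quadrant blocks always sum to the full-image block.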
Step size = 10 in get_bags_of_sifts:
Original accuracy for bags_of_sifts+SVM = 0.621
Accuracy with spatial information included = 0.643 (for 4 quadrants of the image, i.e. 2 x 2).
The weights are calculated as 1/d^2, where d is the distance between a SIFT descriptor and a cluster centre. Each descriptor now contributes its weight to every histogram bin, instead of a count of one to the bin of the nearest cluster centre only.
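A minimal sketch of this soft assignment follows. The toy vocabulary and descriptors are hypothetical, the `eps` guard against a zero distance and the final L1 normalization are assumptions added for the example, and the project code may differ in those details.

```matlab
% Hypothetical toy vocabulary (rows are cluster centres) and descriptors.
vocab = [0 0; 10 0; 0 10];
descriptors = [1 1; 9 1];

% Squared distances d^2 between descriptors and cluster centres
% (num_descriptors x vocab_size, implicit expansion, R2016b+).
D2 = sum(descriptors.^2, 2) + sum(vocab.^2, 2).' ...
     - 2 * (descriptors * vocab.');

% Weight 1/d^2; the eps guard (an assumption) avoids division by zero
% when a descriptor coincides with a cluster centre.
weights = 1 ./ max(D2, eps);

% Every descriptor adds its weight to every bin.
histogram = sum(weights, 1);

% Assumed L1 normalization.
histogram = histogram / sum(histogram);
```

In this toy case each descriptor sits close to one of the first two centres, so those two bins receive equal, large weights while the third bin stays small.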
The accuracy for step size of 10 in get_bags_of_sifts is 0.433.
Since we are taking only 100 training images, the accuracy is lower than that for 1500 training images