Project 4 / Scene Recognition with Bag of Words

The goal of this project was to explore three different pipelines for scene recognition in images.

Tiny Image Representation with Nearest Neighbor Classifier

The first pipeline tested used the "tiny image" feature to describe images, followed by a nearest neighbor (1NN) classifier to match each test image with a training image. Implementing the tiny image representation was simply a matter of resizing each image to 16x16 pixels and vectorizing it so it could be used as a feature. The vector representation of each tiny image was also normalized to zero mean and unit length to improve accuracy.
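
A minimal sketch of get_tiny_images.m along these lines (assuming the starter-code interface where image_paths is a cell array of file paths and the output is an N x 256 feature matrix):

    % Sketch only: the tiny-image feature, assuming image_paths is a cell
    % array of file paths as in the starter code.
    function image_feats = get_tiny_images(image_paths)
        dim = 16;
        image_feats = zeros(length(image_paths), dim * dim);
        for i = 1:length(image_paths)
            img = imread(image_paths{i});
            if size(img, 3) > 1
                img = rgb2gray(img);          % scene images are grayscale; convert just in case
            end
            tiny = imresize(im2double(img), [dim dim]);
            v = tiny(:)';                     % vectorize to 1 x 256
            v = v - mean(v);                  % zero mean
            v = v / norm(v);                  % unit length
            image_feats(i, :) = v;
        end
    end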

The 1NN classifier was also fairly straightforward. The pairwise distances between training and test image features were computed and sorted; for each test image, the top result is its nearest neighbor in the training set. The category of that nearest training image was then assigned to the test image.
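
A sketch of nearest_neighbor_classify.m under the same interface assumptions, using VLFeat's vl_alldist2 for the pairwise distances:

    % Sketch only: 1-NN classification with VLFeat's vl_alldist2, which
    % expects features as columns. D(i, j) is the squared L2 distance
    % between training image i and test image j.
    function predicted_categories = nearest_neighbor_classify(train_image_feats, train_labels, test_image_feats)
        D = vl_alldist2(train_image_feats', test_image_feats');
        [~, nearest] = min(D, [], 1);                   % nearest training image for each test image
        predicted_categories = train_labels(nearest);   % copy over its category label
        predicted_categories = predicted_categories(:); % N x 1 cell array
    end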

This pipeline achieved 23.8% accuracy. There were no parameters to tune in this pipeline.

Bag of SIFT Representation with Nearest Neighbor Classifier

The second pipeline tested used a bag of SIFT features to represent images, followed by the same 1NN classifier from the first pipeline. For the bag of SIFT representation to work, a vocabulary of visual words first had to be created. This was done by sampling tens of thousands of SIFT features (30,000 in my case) from the training images and grouping them with k-means clustering. Larger samples were also tried, but the improvement in accuracy was insignificant and the runtime grew modestly.
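
A sketch of build_vocabulary.m, assuming VLFeat is on the path and the images are grayscale. Here m = 20 is read as a cap on the descriptors kept per image (which is consistent with the roughly 30,000 total mentioned above), and the vl_dsift step size is an assumed value:

    % Sketch only: vocabulary construction. The vl_dsift step size is an
    % assumed value; m caps the number of descriptors kept per image.
    function vocab = build_vocabulary(image_paths, vocab_size)
        m = 20;
        all_sift = [];
        for i = 1:length(image_paths)
            img = single(imread(image_paths{i}));        % vl_dsift needs a single-precision grayscale image
            [~, sift] = vl_dsift(img, 'Step', 10, 'Fast');
            keep = randperm(size(sift, 2), min(m, size(sift, 2)));
            all_sift = [all_sift, sift(:, keep)];        %#ok<AGROW> ~30,000 descriptors in total
        end
        % Cluster the sampled descriptors into vocab_size visual words.
        centers = vl_kmeans(single(all_sift), vocab_size);
        vocab = centers';                                % vocab_size x 128
    end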

Once the vocabulary was created, bags of SIFT features were constructed by sampling SIFT features from each image at a much denser rate than when building the vocabulary. For the sake of runtime, each bag was capped at a maximum of m features, although using more features did improve accuracy. The pairwise distances between the sampled SIFT features and the vocabulary were then computed and sorted, and a histogram was built by counting, for each visual word, how many features in the bag had that word as their nearest neighbor. Finally, the histogram was normalized and stored as the image's representation for classification.
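
A sketch of get_bags_of_sifts.m under the same assumptions; the dense step size of 5 and the unit-length normalization are assumed choices, and m = 1500 is the per-image cap discussed above:

    % Sketch only: bag-of-SIFT histograms. The dense step size and the
    % unit-length normalization are assumed choices.
    function image_feats = get_bags_of_sifts(image_paths)
        loaded = load('vocab.mat');                      % saved by build_vocabulary.m
        vocab = loaded.vocab;                            % vocab_size x 128
        vocab_size = size(vocab, 1);
        m = 1500;                                        % cap on descriptors per image, for runtime
        image_feats = zeros(length(image_paths), vocab_size);
        for i = 1:length(image_paths)
            img = single(imread(image_paths{i}));
            [~, sift] = vl_dsift(img, 'Step', 5, 'Fast');
            keep = randperm(size(sift, 2), min(m, size(sift, 2)));
            sift = single(sift(:, keep));
            % Nearest visual word for every descriptor, then histogram the counts.
            D = vl_alldist2(single(vocab'), sift);       % vocab_size x num_descriptors
            [~, nearest_word] = min(D, [], 1);
            counts = histc(nearest_word, 1:vocab_size);
            image_feats(i, :) = counts / norm(counts);   % normalized histogram
        end
    end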

This pipeline achieved 51.4% accuracy with m = 20 in build_vocabulary.m, m = 1500 in get_bags_of_sifts.m, and vocab_size = 200 in proj4.m.

Bag of SIFT Representation with Linear Support Vector Machine Classifier

The third pipeline tested used the bag of SIFT features from the second pipeline, followed by linear SVM classifiers. The task was to train 1-vs-all linear SVMs to operate in the bag of SIFT feature space, with one linear SVM for each of the 15 scene categories. For each category, the binary label of every training image (whether or not it belongs to that scene) was initialized to -1, and the labels of the training images matching the current category were set to 1. VLFeat's vl_svmtrain function was then used to train the linear SVM given the training image features, the labels, and a regularization parameter λ (some testing showed that 0.00005 gave roughly optimal results). The function returns a weight vector and an offset, which are used to compute each test image's confidence for that scene category. Finally, each test image was assigned the category for which it was most confident.
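
A sketch of svm_classify.m along these lines, assuming the starter-code interface (N x d feature matrices and cell arrays of category names) and λ = 0.00005 as reported below:

    % Sketch only: 1-vs-all linear SVMs with VLFeat's vl_svmtrain.
    function predicted_categories = svm_classify(train_image_feats, train_labels, test_image_feats)
        lambda = 0.00005;                                % regularization parameter
        categories = unique(train_labels);
        num_categories = length(categories);
        W = zeros(size(train_image_feats, 2), num_categories);
        B = zeros(1, num_categories);
        for c = 1:num_categories
            % Binary labels: +1 for the current category, -1 for everything else.
            labels = -ones(1, size(train_image_feats, 1));
            labels(strcmp(train_labels, categories{c})) = 1;
            [W(:, c), B(c)] = vl_svmtrain(train_image_feats', labels, lambda);
        end
        % Confidence of every test image under every category; take the most confident.
        confidences = test_image_feats * W + repmat(B, size(test_image_feats, 1), 1);
        [~, best] = max(confidences, [], 2);
        predicted_categories = categories(best);
    end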

This pipeline achieved 67.4% accuracy with m = 20 in build_vocabulary.m, m = 1500 in get_bags_of_sifts.m, vocab_size = 200 in proj4.m, and λ = 0.00005 in svm_classify.m.

Experimenting with Vocabulary Size

Increasing the vocabulary size was found to improve accuracy, but only to an extent: the improvement tapers off quickly while the runtime of the full pipeline, including building the vocabulary, increases dramatically.


Final scene classification results visualization


Accuracy (mean of diagonal of confusion matrix) is 69.7% with m = 20 in build_vocabulary.m, m = 1500 in get_bags_of_sifts.m, vocab_size = 10,000 in proj4.m, and λ = 0.00005 in svm_classify.m.

Per-category accuracy, with the true labels of sample false positives and the predicted labels of sample false negatives. (Sample training images and sample true positives for each category appear in the results visualization above.)

Category name   Accuracy   False positives (true label)   False negatives (predicted label)
Kitchen         0.560      Bedroom, Industrial            Bedroom, Industrial
Store           0.610      Industrial, Bedroom            InsideCity, Kitchen
Bedroom         0.510      Office, Kitchen                LivingRoom, Kitchen
LivingRoom      0.480      Bedroom, Kitchen               Bedroom, Industrial
Office          0.860      Kitchen, Kitchen               Bedroom, LivingRoom
Industrial      0.540      Store, Street                  InsideCity, TallBuilding
Suburb          0.950      OpenCountry, OpenCountry       OpenCountry, Coast
InsideCity      0.640      Street, Kitchen                Street, Bedroom
TallBuilding    0.830      Industrial, Street             Coast, Industrial
Street          0.650      InsideCity, Bedroom            Industrial, Mountain
Highway         0.820      LivingRoom, Street             Coast, OpenCountry
OpenCountry     0.470      Coast, Mountain                Mountain, Coast
Coast           0.790      OpenCountry, OpenCountry       OpenCountry, Mountain
Mountain        0.820      Forest, TallBuilding           Suburb, Street
Forest          0.930      Mountain, OpenCountry          Mountain, Mountain