CS 6476 Project 4: Scene Recognition with Bag of Words

Ian Buckley

October 26, 2016

1 Introduction

The purpose of this project was to implement scene recognition. Feature encoding and image classification were the primary components of this project. Three distinct feature encoding techniques were used: Tiny Image feature encoding, Bag of SIFT Visual Words, and Fisher feature encoding. Two classifiers, KNN and linear SVM, were used to classify images into categories.

2 Approach

Tiny Images were the first feature encoding used. To generate Tiny Image features, the resolution of each input image was reduced to 16×16 pixels, and the rows were concatenated to form the feature vector. The classifiers were first implemented and tested with this simple feature encoding. Rather than restricting classification to the single nearest neighbour, k-nearest neighbours (KNN) was implemented. A linear SVM was also implemented. Based on the notion of a bag of words, SIFT features were then used to construct a visual vocabulary to represent images. A bag of SIFT visual words was generated for each image and classified; soft assignment of visual words was implemented. Lastly, the Fisher feature encoding was implemented using a Gaussian Mixture Model (GMM) to create the visual vocabulary. After the feature encoders were implemented, a subset of the data was set aside for parameter tuning; the tuned parameters were then validated using repeated random cross validation.
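The Tiny Image encoding described above can be illustrated with a minimal sketch. The report does not state how the resolution was reduced, so block averaging is assumed here; the function name `tiny_image_feature` and the list-of-rows grayscale input are hypothetical choices made for illustration only.

```python
def tiny_image_feature(image, size=16):
    """Reduce a grayscale image (a list of pixel rows) to size x size by
    block averaging, then concatenate the rows into one feature vector.

    Block averaging is an assumption; the report only says that the
    resolution was reduced to 16x16 pixels.
    """
    height, width = len(image), len(image[0])
    feature = []
    for r in range(size):
        # Source rows covered by output row r (always at least one row).
        r0 = r * height // size
        r1 = max((r + 1) * height // size, r0 + 1)
        for c in range(size):
            # Source columns covered by output column c.
            c0 = c * width // size
            c1 = max((c + 1) * width // size, c0 + 1)
            block = [image[i][j] for i in range(r0, r1) for j in range(c0, c1)]
            feature.append(sum(block) / len(block))
    return feature


# Example: a 32x32 vertical gradient becomes a 256-element vector.
gradient = [[float(i)] * 32 for i in range(32)]
vector = tiny_image_feature(gradient)
```

Appending one output row after another is what concatenates the rows, so the vector has 16 × 16 = 256 entries regardless of the input size.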

To summarize, the following extra credit was completed:

3 Results

The results of the following image classifications are presented briefly on this page and in detail in the links provided. The parameter tunings reported in the following sections were determined using the subset of the data set aside for parameter tuning. Parameters were tuned on this subset rather than the entire dataset to avoid overfitting the parameter values to the data.

Tiny Images. The Tiny Image feature encoding with the 3-NN classifier achieved an accuracy of 0.227. The results for Tiny Images are found on the following page: Tiny Image Features with 3-NN.
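A 3-NN classifier of the kind used here can be sketched as a majority vote among the three closest training vectors. The Euclidean distance metric and the function name `knn_predict` are assumptions; the report does not specify either.

```python
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Label a query vector by majority vote among its k nearest
    training vectors (squared Euclidean distance is assumed)."""
    dists = [sum((a - b) ** 2 for a, b in zip(f, query))
             for f in train_features]
    # Indices of the k training vectors closest to the query.
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Example with two well-separated toy classes.
features = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]]
labels = ["a", "a", "a", "b", "b"]
prediction = knn_predict(features, labels, [0.2, 0.2])
```

Setting `k=1` recovers plain nearest-neighbour classification, which is the baseline that KNN generalizes.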

Bag of SIFT Features. The vocabulary for the bag of SIFT visual words feature encoding used 200 visual words identified using K-Means clustering; a SIFT step size of 5 was used in generating the vocabulary. Soft assignment of visual words was used for both the training and testing data. For each descriptor, the histogram bins of the N=3 closest visual words were incremented by the distance to the centroid representing the visual word, scaled by the inverse of the total distance to those N=3 words. The value N=3 was chosen using the subset of the data reserved for tuning, where this modification of the soft assignment procedure was found to perform better than the same procedure with N=200. The bag of SIFT features (step size of 10) achieved an accuracy of 0.503 with the 3-NN classifier and an accuracy of 0.603 with the linear SVM (λ = 0.0001). The results of these classifications can be viewed at: Bag of SIFT Visual Words with 3-NN and Bag of SIFT Visual Words with SVM.
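The soft assignment step can be sketched as follows. This is a literal reading of the description above (increment equal to the distance to the centroid divided by the total distance to the N closest centroids); the report gives no explicit formula, so the exact weighting, the final normalization of the histogram, and the function name `soft_assign_histogram` are all assumptions.

```python
import math


def soft_assign_histogram(descriptors, vocabulary, n_nearest=3):
    """Build a bag-of-words histogram in which each descriptor votes for
    its n_nearest visual words instead of only the closest one."""
    histogram = [0.0] * len(vocabulary)
    for d in descriptors:
        dists = [math.dist(d, word) for word in vocabulary]
        nearest = sorted(range(len(vocabulary)),
                         key=dists.__getitem__)[:n_nearest]
        total = sum(dists[k] for k in nearest)
        for k in nearest:
            if total > 0:
                # Literal reading of the report: distance / total distance.
                histogram[k] += dists[k] / total
            else:
                # Degenerate case (all chosen distances zero): split evenly.
                histogram[k] += 1.0 / len(nearest)
    # Assumption: normalize so images with different numbers of
    # descriptors yield comparable histograms.
    mass = sum(histogram)
    return [h / mass for h in histogram] if mass > 0 else histogram


# Example with a toy 2-D vocabulary of four words.
words = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [100.0, 100.0]]
hist = soft_assign_histogram([[1.0, 1.0], [2.0, 0.5]], words)
```

Because each descriptor's increments sum to one, the unnormalized histogram mass equals the number of descriptors; with `n_nearest` equal to the vocabulary size, every word receives a vote from every descriptor, which corresponds to the N=200 variant mentioned above.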

Fisher Feature Encoding. The vocabulary for the Fisher feature encoding was generated using a GMM with 50 clusters and a SIFT step size of 100. The Fisher feature encoding (SIFT feature descriptor step size of 100) achieved an accuracy of 0.445 with the 3-NN classifier and an accuracy of 0.6867 with the linear SVM (λ = 0.0001). The results of these classifications can be viewed at: Fisher Feature Encoding with 3-NN and Fisher Feature Encoding with SVM.
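For illustration, a Fisher encoding against an already fitted diagonal-covariance GMM can be sketched as below. The report does not say which implementation was used, so this follows the standard formulation (gradients with respect to the component means and variances); the function name `fisher_vector` and its parameters are hypothetical, and the power and L2 normalization steps that are often applied afterwards are omitted.

```python
import math


def fisher_vector(descriptors, priors, means, variances):
    """Encode descriptors against a diagonal-covariance GMM.

    Returns the mean gradients followed by the variance gradients,
    a vector of length 2 * K * D for K components and D dimensions.
    """
    n, k_count, dim = len(descriptors), len(priors), len(means[0])
    grad_mu = [[0.0] * dim for _ in range(k_count)]
    grad_var = [[0.0] * dim for _ in range(k_count)]
    for x in descriptors:
        # Log of (prior * Gaussian density) for each component.
        log_p = []
        for k in range(k_count):
            lp = math.log(priors[k])
            for j in range(dim):
                var = variances[k][j]
                lp -= 0.5 * (math.log(2 * math.pi * var)
                             + (x[j] - means[k][j]) ** 2 / var)
            log_p.append(lp)
        # Posterior of each component, computed stably via log-sum-exp.
        top = max(log_p)
        weights = [math.exp(lp - top) for lp in log_p]
        norm = sum(weights)
        for k in range(k_count):
            q = weights[k] / norm
            for j in range(dim):
                z = (x[j] - means[k][j]) / math.sqrt(variances[k][j])
                grad_mu[k][j] += q * z
                grad_var[k][j] += q * (z * z - 1.0)
    encoding = []
    for k in range(k_count):
        encoding += [g / (n * math.sqrt(priors[k])) for g in grad_mu[k]]
    for k in range(k_count):
        encoding += [g / (n * math.sqrt(2 * priors[k])) for g in grad_var[k]]
    return encoding


# Example: three 2-D descriptors against a toy two-component GMM.
fv = fisher_vector([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]],
                   priors=[0.5, 0.5],
                   means=[[0.0, 0.0], [5.0, 5.0]],
                   variances=[[1.0, 1.0], [1.0, 1.0]])
```

Under this formulation, 50 components and 128-dimensional SIFT descriptors would give an encoding of length 2 × 50 × 128 = 12,800 per image, independent of how many descriptors the image contains.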

Cross Validation. The accuracies presented above do not use repeated random cross validation, because this method required that the training and test sets be random subsets of the available images, which resulted in consistent but lower accuracies. Standard deviations on the order of 0.02 were consistently observed for the bag of SIFT visual words feature encoding with both the 3-NN and SVM classifiers; an average accuracy of 0.56 over 15 folds was observed for the bag of SIFT visual words with the SVM classifier. These numbers are presented separately so as not to distract from the accuracies that will be used to evaluate the result of this project.
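Repeated random cross validation of this kind can be sketched as drawing a fresh random train/test split for each fold and reporting the mean and standard deviation of the accuracy. The split fraction, the `evaluate` callback interface, and the toy nearest-neighbour scorer below are assumptions for illustration; only the count of 15 folds comes from the report.

```python
import math
import random
import statistics


def repeated_random_validation(features, labels, evaluate,
                               train_fraction=0.5, repeats=15, seed=0):
    """Score `evaluate` on `repeats` random train/test splits and return
    the mean and standard deviation of the resulting accuracies."""
    rng = random.Random(seed)
    indices = list(range(len(features)))
    n_train = int(len(indices) * train_fraction)
    accuracies = []
    for _ in range(repeats):
        rng.shuffle(indices)
        train, test = indices[:n_train], indices[n_train:]
        accuracies.append(evaluate(
            [features[i] for i in train], [labels[i] for i in train],
            [features[i] for i in test], [labels[i] for i in test]))
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def nearest_neighbor_accuracy(train_x, train_y, test_x, test_y):
    """Toy scorer (hypothetical): fraction of test points whose single
    nearest training point carries the correct label."""
    correct = 0
    for x, y in zip(test_x, test_y):
        best = min(range(len(train_x)),
                   key=lambda i: math.dist(x, train_x[i]))
        correct += train_y[best] == y
    return correct / len(test_x)


# Example on two well-separated toy classes.
points = ([[float(i % 5), 0.0] for i in range(20)]
          + [[100.0 + i % 5, 0.0] for i in range(20)])
classes = ["a"] * 20 + ["b"] * 20
mean_acc, std_acc = repeated_random_validation(
    points, classes, nearest_neighbor_accuracy)
```

Any classifier from the earlier sections could be passed as `evaluate`, provided it takes the training and test sets and returns an accuracy between 0 and 1.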

4 Conclusion

Image classification was successful. Using Tiny Images and the KNN classifier, image classification achieved an accuracy of 0.227, a significant improvement on random classification. Better still was the bag of SIFT visual words feature encoder with KNN, for which soft assignment of visual words was implemented. SVM further improved classification using the bag of SIFT visual words. A further improvement over the bag of SIFT visual words was achieved by using the Fisher feature encoder with SVM. Overall, the highest accuracy of 0.6867 was achieved by using the Fisher feature encoder with linear SVM. Repeated random cross validation of the results was implemented and performed, but because the training set was reduced, the classification performance was degraded. During experimentation, a subset of the data was used to tune the parameters. While this was successful and the classification performance exceeds the minimum standards, further tuning could be performed to modestly increase the accuracy.