Project 4 / Scene Recognition with Bag of Words

The goal of this project is to perform scene recognition. Two image representations were used, tiny images and bags of SIFT features, combined with two classification techniques: (i) nearest neighbor and (ii) linear SVM.

  1. Tiny image representation and nearest neighbor classifier
  2. Bag of SIFT representation and nearest neighbor classifier
  3. Bag of SIFT representation and linear SVM classifier

Image representations

  • Tiny image representation : In the tiny image representation, each image is resized to a small, fixed resolution (16x16 was used). The tiny image is then normalized to have zero mean and unit length.
  • Bag of SIFT features : Before we can represent our training or testing examples with SIFT features, we first need to establish a vocabulary of visual words. This is done by sampling SIFT descriptors from the training images (to save computation time and speed up the clustering process), clustering them with k-means, and then returning the cluster centers, which form the visual vocabulary. A sampling number of 2000 was used; a higher sampling number might improve accuracy but increases computation time. To build the bag of SIFT representation, the SIFT features for each example are first computed, and each local feature is then assigned to its nearest cluster center by comparing it against the existing vocabulary. A histogram indicating how many times each cluster was used forms the feature representation for each image. The histogram is normalized so that a larger image with more SIFT features will not look very different from a smaller version of the same image.
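The two representations above can be sketched as follows. This is a minimal NumPy sketch, not the project code: the crude nearest-neighbor resize, the toy k-means, and all function names are illustrative assumptions (a real pipeline would use a proper image resizer and SIFT extractor).

```python
import numpy as np

def tiny_image(img, size=16):
    """Shrink a grayscale image to size x size (crude nearest-neighbor resize,
    so the sketch needs no image library), then normalize the flattened
    vector to zero mean and unit length."""
    h, w = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    feat = img[np.ix_(rows, cols)].astype(float).ravel()
    feat -= feat.mean()                      # zero mean
    norm = np.linalg.norm(feat)
    return feat / norm if norm > 0 else feat # unit length

def build_vocab(descriptors, k=200, iters=10, seed=0):
    """Toy k-means over sampled descriptors; the k cluster centers
    form the visual vocabulary."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = descriptors[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(descriptors, vocab):
    """Assign each local feature to its nearest visual word and
    return the normalized histogram of word counts."""
    dists = np.linalg.norm(descriptors[:, None] - vocab[None], axis=2)
    hist = np.bincount(dists.argmin(axis=1), minlength=len(vocab)).astype(float)
    return hist / hist.sum()
```

The histogram normalization at the end is what makes images of different sizes (and hence different SIFT feature counts) comparable.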
Classification techniques

  • Nearest neighbor classifier : The nearest neighbor classifier predicts the category of each test image by finding the training image with the most similar features. Instead of a single nearest neighbor, the vote can be based on the k nearest neighbors, which improves performance.
  • Linear SVM : In a linear classifier, the feature space is partitioned by a learned hyperplane, and test cases are categorized based on which side of the hyperplane they fall on. This function trains a linear SVM for every category (15 categories here) and then uses the learned linear classifiers to predict the category of every test image. Every test feature is evaluated with all 15 SVMs and the most confident SVM "wins". The regularization parameter lambda greatly affects accuracy; a lambda of 0.00001 gave the maximum accuracy.
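Both classifiers above can be sketched in a few lines. This is a hedged NumPy sketch, not the project code: the k-NN vote is implemented directly, while the one-vs-rest step only shows the decision rule, assuming weight matrix `W` and biases `b` come from some external linear SVM trainer (the trainer itself is omitted).

```python
import numpy as np
from collections import Counter

def knn_predict(train_feats, train_labels, test_feats, k=1):
    """Label each test feature by majority vote over its k nearest
    training features (Euclidean distance)."""
    preds = []
    for x in test_feats:
        dists = np.linalg.norm(train_feats - x, axis=1)
        nearest = [train_labels[i] for i in np.argsort(dists)[:k]]
        preds.append(Counter(nearest).most_common(1)[0][0])
    return preds

def one_vs_rest_predict(W, b, test_feats, categories):
    """Given per-category linear SVM weights W (C x D) and biases b (C,),
    score every test feature with all C SVMs and let the most
    confident one win."""
    scores = test_feats @ W.T + b          # (N, C) decision values
    return [categories[i] for i in scores.argmax(axis=1)]
```

With 15 scene categories, `W` would be 15 x D and `scores.argmax` picks the winning SVM for each image, mirroring the "most confident SVM wins" rule described above.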
Accuracy of the following combinations

  • Tiny images representation and k nearest neighbor classifier : 18.4%
  • Bag of SIFT representation and nearest neighbor classifier : 50.6%
  • Bag of SIFT representation and linear SVM classifier (lambda=0.0001) : 65.3%
Results

    Tiny image + K Nearest Neighbor

    Bag of SIFT + K Nearest Neighbor

    Bag of SIFT + linear SVM