CS 6476: Project 4 / Scene Recognition with Bag of Words

Overview

In this project, we experiment with a dataset of 3000 images labeled across fifteen scene categories (e.g., Coast, Forest, Office, Mountain) and explore several methods of feature representation. Once we have obtained feature vectors describing each training image, the base objective of this project is to analyze two supervised learning approaches for classifying the test images: k-nearest neighbors and linear support vector machines.

The following pipelines (feature representation + supervised classifier) were implemented and evaluated on 1500/1500 training and testing images:

  1. Tiny images + kNN
  2. Bag of SIFT + kNN
  3. Bag of SIFT + Linear SVM
  4. Bag of SIFT + Nonlinear SVMs (radial basis function, sigmoid, intersection kernels)
  5. Bag of SIFT + GIST + Radial Basis Kernel SVMs

By implementing the last pipeline and carefully fine-tuning the SVM and bag-of-SIFT hyperparameters, we obtain a final accuracy of 83.5%.

Tiny Images + kNN

The first technique we use for scene recognition involves resizing each image to a small, fixed resolution (16x16) and then normalizing the resulting feature vector to zero mean and unit variance. Next, we apply the k-nearest neighbors algorithm, assigning each image in the test set the majority label of its k nearest neighbors in the training set.
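
The feature step above can be sketched as follows. The project code is in MATLAB; this is a minimal NumPy stand-in, with a simple mean-pooling resize standing in for whatever resampling the actual implementation uses:

```python
import numpy as np

def tiny_image_feature(img, size=16):
    """Resize a grayscale image to size x size by mean pooling,
    flatten it, and normalize to zero mean and unit variance."""
    h, w = img.shape
    # Crop so the image divides evenly into size x size blocks (sketch only).
    h2, w2 = (h // size) * size, (w // size) * size
    img = img[:h2, :w2]
    # Average each block to produce the size x size tiny image.
    pooled = img.reshape(size, h2 // size, size, w2 // size).mean(axis=(1, 3))
    feat = pooled.ravel().astype(np.float64)
    feat -= feat.mean()   # zero mean
    feat /= feat.std()    # unit variance
    return feat
```

The resulting 256-dimensional vectors are compared with plain Euclidean distance in the k-NN step.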

Using the tiny images representation and k-NN with k = 1, the accuracy achieved on the 50-50 train/test split was 22.5%.

Bag of SIFT + kNN

A disadvantage of the tiny images approach is that the low resolution discards most of each image's detail, leaving a noisy representation that is unsuitable when accurate results are desired. To address these shortcomings, we opt for a bag-of-words model, in which the visual words are quantized SIFT feature vectors, combined with the k-NN classifier for scene labeling.

The first step of this technique is to build a visual vocabulary on the training dataset by randomly sampling SIFT descriptors from each training image. Using k-means clustering on the randomly sampled descriptors, we find a set of centroids that minimizes the within-cluster sum of squares (WCSS) from each sampled descriptor to its assigned center. Next, a k-d tree is constructed over the visual vocabulary for efficient search, and we proceed by sampling local SIFT features from each image (in both the training and test data). Each feature vector is then associated with its nearest visual word, the occurrences of each word are counted, and the counts are consolidated into a histogram. Finally, the histogram is normalized and used as the feature representing the image.
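
The vocabulary-building step amounts to Lloyd's k-means over the sampled descriptors. A minimal NumPy sketch (the project uses MATLAB; in practice the descriptors are 128-dimensional SIFT vectors):

```python
import numpy as np

def build_vocabulary(descriptors, vocab_size, n_iters=20, seed=0):
    """Cluster sampled SIFT descriptors with Lloyd's k-means and
    return the vocab_size centroids (the visual vocabulary)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen descriptors.
    idx = rng.choice(len(descriptors), vocab_size, replace=False)
    centroids = descriptors[idx].astype(np.float64)
    for _ in range(n_iters):
        # Assignment step: nearest centroid per descriptor (reduces WCSS).
        d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its members.
        for k in range(vocab_size):
            members = descriptors[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return centroids
```

A production implementation would use an optimized k-means (e.g. VLFeat's) and the k-d tree mentioned above for the nearest-centroid queries.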

Now that each image in the training and test sets is represented as a normalized histogram of visual codewords, we can perform k-NN to predict the scene label for each test image. Programmatically, we specify the number of clusters with the variable vocab_size, which defaults to 400. Further, we can modify the step size used when sampling SIFT features, with smaller values indicating denser sampling. By constructing a vocabulary of 400 visual words (sampled with step size 15) and setting k=7, the bag of SIFT + kNN pipeline obtained a test accuracy of 56.6%.
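
The encoding and classification steps can be sketched as follows (NumPy stand-in for the MATLAB code; brute-force nearest-word search replaces the k-d tree for brevity):

```python
import numpy as np
from collections import Counter

def bag_of_words_histogram(descriptors, vocab):
    """Assign each local descriptor to its nearest visual word and
    return an L1-normalized histogram of word counts."""
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(np.float64)
    return hist / hist.sum()

def knn_predict(train_feats, train_labels, test_feat, k=7):
    """Label a test histogram with the majority label among its
    k nearest training histograms (Euclidean distance)."""
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]
```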

Fig. 1: Test accuracies obtained using tiny images and bag of SIFT features + kNN.
(Defaults: vocab_size = 400, sample_size = 30, bin_size = 3, vocab building step_size = 15, local sift description step_size = 5)

Bag of SIFT + Linear SVM

Since k-NN has a tendency to overfit for small values of k, we now turn to training multiple 1-vs.-all linear SVMs, one for each of the 15 scene labels. Each linear SVM constructs a maximum-margin hyperplane that linearly separates the desired scene label from all other examples. At test time, we evaluate the test image on all 15 of the 1-vs-all linear SVMs. If the feature space has dimension d, the class label is assigned by the SVM with the maximum score, computed as

    score(x) = w * x + b

where w is a 1xd vector of trained weights, x is the test image represented in the d-dimensional feature space, and b is a scalar bias.
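
The 1-vs-all decision rule reduces to an argmax over the per-class scores. A sketch, assuming hypothetical trained weights W (one row per class) and biases b:

```python
import numpy as np

def one_vs_all_predict(W, b, x):
    """Score x against every one-vs-all linear SVM (w_c * x + b_c)
    and return the index of the highest-scoring class."""
    scores = W @ x + b  # W: (n_classes, d), b: (n_classes,), x: (d,)
    return int(np.argmax(scores))
```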

Performance on the 50-50 train/test split using the bag of SIFT features + linear SVM approach gives a test accuracy of 70.9%.

Fig. 2: Confusion matrix with bag of SIFT features + linear SVM.
(Defaults: vocab_size = 400. SVM Configuration: lambda = .0001)

Extra Credit #1: Bag of SIFT + Nonlinear SVMs

Sometimes the data cannot be separated by a simple hyperplane, so it must be projected into a higher-dimensional feature space in which it becomes linearly separable. This can be done implicitly by applying a nonlinear kernel function K(x_i, x_j) between feature vectors in the SVM decision function

    f(x) = sum_i alpha_i * y_i * K(x_i, x) + b

Similar to classifying with the linear SVM, labels are assigned based on which side of the nonlinear decision boundary the test point falls on. We test some of the kernels commonly used in the machine learning literature:

  1. (Gaussian) Radial Basis Function: K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    The scale parameter gamma is inversely related to the standard deviation of the Gaussian. By making gamma large, two feature vectors must be closer together in order to be considered similar. In MATLAB, training a single 1-vs-all SVM with an RBF kernel can be done using the function
    
    		model = fitcsvm(X,Y,'KernelFunction','rbf','KernelScale', 'auto', 'BoxConstraint', 15, 'Standardize',true);
    		
    The function above automatically selects the kernel scale parameter using a heuristic procedure, and 'BoxConstraint' (also known as the regularization parameter C) assigns a penalty for training examples that fall on the wrong side of the nonlinear boundary. We fix this regularization parameter to 15 based on preliminary testing on the 15-scene dataset.
  2. Sigmoid: K(x_i, x_j) = tanh(alpha * (x_i . x_j) + c)
    For training, we fix the parameters alpha = 1/128 and c = -1.
  3. Histogram Intersection: K(x_i, x_j) = sum_k min(x_i(k), x_j(k))
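
The three kernels are simple to write out directly. A NumPy sketch, where x and y stand for two bag-of-SIFT histograms:

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # Gaussian RBF: similarity decays with squared Euclidean distance.
    return np.exp(-gamma * np.sum((x - y) ** 2))

def sigmoid_kernel(x, y, alpha=1/128, c=-1.0):
    # Sigmoid kernel with the parameters fixed in the text.
    return np.tanh(alpha * np.dot(x, y) + c)

def intersection_kernel(x, y):
    # Histogram intersection: total overlap between the two histograms.
    return np.sum(np.minimum(x, y))
```

Note that the intersection kernel of an L1-normalized histogram with itself is exactly 1, which makes it a natural fit for the bag-of-words representation.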

The following table displays the results of training on the three nonlinear kernels with the Bag of SIFT feature representation:

Kernel Type    RBF      Sigmoid   Intersection
Accuracy       0.749    0.621     0.648

Fig. 3: Test accuracies using Bag of SIFT + Nonlinear SVM
(Defaults: vocab_size = 400, 1500 training & 1500 test examples.)

Impressively, by tuning the hyperparameters of the RBF SVM (namely gamma and 'BoxConstraint'), we observe a 4% improvement over the linear SVM result obtained previously on the 50/50 train-test split.

Extra Credit #2: Vocabulary Sizes and Test Performance

As the vocabulary size increases, we observe a noticeable trend: test accuracy on the 50-50 train/test split rises up to a certain size and falls afterwards. A vocabulary size of 1000, combined with the bag of SIFT + RBF SVM configuration from the previous section, shows a 1.6% improvement for a test accuracy of 76.3%. As one might expect, a larger codebook is desirable because it offers a richer description of the feature space to which components of an image belong. However, if the vocabulary size becomes too large, we run into the curse of dimensionality: similar local features are assigned to different cluster centers when they would previously have been grouped together, overfitting the feature representation and causing test accuracy to suffer.

Test Accuracy
Vocabulary Size   Bag of SIFT + RBF SVM   Bag of SIFT + Sigmoid SVM
10                0.405                   0.424
20                0.513                   0.508
50                0.635                   0.588
100               0.692                   0.659
200               0.722                   0.697
400               0.749                   0.621
1000              0.763                   0.430
10000             0.706                   0.256

Fig. 4: Nonlinear SVM Performance with varying SIFT vocabulary sizes.

So the question is, can we do better?

Extra Credit #3: GIST Descriptors

Since the bag-of-words model does not preserve the spatial layout of the features, we turn to the GIST descriptor, which encodes a scene as a combination of several perceptual properties (openness, expansion, symmetry, etc.) that together form the "spatial envelope" of the scene. The rationale behind this representation is that two images with similar spatial envelopes are likely to belong to the same scene category. We augment the bag of SIFT features by concatenating a 512-dimensional GIST descriptor with the normalized histogram vector obtained for every training/test image.
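
The augmentation itself is a simple concatenation of the two vectors (sketch only; computing the actual GIST descriptor requires a Gabor filter bank implementation, which is assumed here):

```python
import numpy as np

def sift_gist_feature(bow_hist, gist):
    """Concatenate the normalized bag-of-SIFT histogram with the
    512-dimensional GIST descriptor into a single feature vector."""
    return np.concatenate([bow_hist, gist])
```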

For example, building a visual codebook of size 400 in combination with GIST features yields a 912-dimensional feature vector for each image in the dataset. Using our proposed Bag of SIFT + GIST + RBF SVM pipeline on the scene dataset, we obtain a final test accuracy of 83.5%.

Test Accuracy
Vocabulary Size   Bag of SIFT + GIST + RBF SVM
400               0.829
800               0.835

Fig. 5: Performance with Bag of SIFT + GIST for feature representation and RBF SVM classifier.

Bag of SIFT parameters: (sample_size = 30, bin_size = 3, vocab building step_size = 15, local sift description step_size = 5)
GIST parameters: (orientationsPerScale = [8 8 8 8], prefilt_fc = 4, numberBlocks = 4, imageSize = 128)
RBF parameters: ('KernelScale': 'auto', 'BoxConstraint': 15, 'Standardize': true)

Fig. 6: Visualization of the GIST descriptor.


Final Scene Classification Results

Scene classification results visualization


Accuracy (mean of diagonal of confusion matrix) is 0.835

Category       Accuracy   Commonly confused with
Kitchen        0.700      LivingRoom, Office
Store          0.780      Kitchen, TallBuilding, LivingRoom, Industrial
Bedroom        0.660      LivingRoom, Kitchen, Office
LivingRoom     0.640      Office, Store, Bedroom
Office         0.940      Bedroom, Kitchen, LivingRoom
Industrial     0.800      Street, LivingRoom, Highway
Suburb         0.990      OpenCountry, TallBuilding
InsideCity     0.890      Industrial, TallBuilding, Kitchen
TallBuilding   0.870      Industrial, Suburb, InsideCity
Street         0.890      LivingRoom, InsideCity, Mountain
Highway        0.920      Store, OpenCountry, InsideCity, LivingRoom
OpenCountry    0.730      Coast, Highway
Coast          0.810      Highway, OpenCountry
Mountain       0.930      OpenCountry, Forest, Coast
Forest         0.970      OpenCountry, Mountain

(The original results page also displayed, for each category, sample training images, sample true positives, false positives with their true labels, and false negatives with their predicted labels; the "Commonly confused with" column above collects those false-positive and false-negative labels.)