Project 4 / Scene Recognition with Bag of Words

The goal of this project was image recognition/classification. Experiments were run on the 15-scene dataset.

The process is split into three parts -

  1. Representing Image Features
  2. Creating a Classification Model using pre-labeled images
  3. Using the model to perform inference to predict labels for an unseen test set

Two formats of image feature representation were explored -
Tiny Images: Resizing the image to 16x16 and reshaping it into a 1x256 vector.
Bag of words: Computing SIFT descriptors and sorting them into bins of precomputed visual words.
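The tiny-image descriptor can be sketched in a few lines of numpy. This is a minimal sketch - nearest-neighbor subsampling stands in for whatever resize routine the actual code used:

```python
import numpy as np

def tiny_image(img, size=16):
    """Resize a grayscale image to size x size (nearest-neighbor
    subsampling) and flatten it into a 1 x size^2 feature vector."""
    h, w = img.shape
    rows = np.linspace(0, h - 1, size).astype(int)
    cols = np.linspace(0, w - 1, size).astype(int)
    small = img[np.ix_(rows, cols)]
    return small.reshape(1, size * size).astype(float)

# Example: a 240x320 image becomes a 1x256 vector.
img = np.random.rand(240, 320)
feat = tiny_image(img)
print(feat.shape)  # (1, 256)
```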

Similarly, two prediction models were used -
K-Nearest Neighbors: Assigning the label of the nearest match (or matches) in the training set.
SVM: A linear classifier that essentially splits the feature space using a hyperplane.

The baseline (picking a category at random) results in ~7% accuracy (1 in 15).

Tiny Images and Nearest Neighbor

Using the tiny images feature representation and a nearest neighbor classifier (k = 1), I obtained 19.1% accuracy. This works by calculating the distance between a test feature and every training feature, then assigning the test image the label of its closest match (by L2 distance). Detailed results here.

When these tiny image vectors were made zero mean and normalized to unit length, performance increased to 22.5%. Detailed results here.
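The normalization step and the 1-NN classifier described above can be sketched as follows (a numpy sketch, not the actual project code):

```python
import numpy as np

def normalize(feats):
    """Make each feature vector zero mean, then unit length."""
    feats = feats - feats.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.maximum(norms, 1e-12)

def nearest_neighbor_labels(train_feats, train_labels, test_feats):
    """Assign each test feature the label of its closest (L2) training feature."""
    # Pairwise squared L2 distances via ||a - b||^2 = ||a||^2 - 2ab + ||b||^2.
    d2 = (np.square(test_feats).sum(1)[:, None]
          - 2.0 * test_feats @ train_feats.T
          + np.square(train_feats).sum(1)[None, :])
    return [train_labels[i] for i in d2.argmin(axis=1)]
```

For example, a test point near a training point labeled "Kitchen" receives the label "Kitchen", regardless of how the rest of the training set is distributed.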

Bag of Words and Nearest Neighbor

The first step for the SIFT bag-of-words representation was to build a vocabulary of visual words. This was done by sampling SIFT features from the training set and clustering them to obtain the required number of words (k = vocab_size = number of words). This vocabulary represents what our desired features look like. Then, for each SIFT feature computed for an image, the distances to all words in the vocabulary are computed, and the nearest word is assigned to that feature. Counting these assignments forms a "bag of words" histogram which represents the image as a collection of visual features. The nearest neighbor in the training set for each bag-of-words feature is picked, and its label applied to the input image.
This process resulted in an accuracy of 51.1% using a step size of 10 for SIFT. Detailed results here.
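The vocabulary-building and histogram steps can be sketched as below. SIFT extraction itself is not shown (the descriptors are assumed to be precomputed, e.g. by VLFeat or OpenCV), and scikit-learn's k-means is an assumption standing in for whatever clustering the original code used:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, vocab_size=200):
    """Cluster sampled SIFT descriptors; the cluster centers are the visual words."""
    km = KMeans(n_clusters=vocab_size, n_init=4, random_state=0).fit(descriptors)
    return km.cluster_centers_

def bag_of_words(descriptors, vocab):
    """Histogram of the nearest visual word for each descriptor, L1-normalized."""
    # Squared L2 distance from every descriptor to every word.
    d2 = (np.square(descriptors).sum(1)[:, None]
          - 2.0 * descriptors @ vocab.T
          + np.square(vocab).sum(1)[None, :])
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / max(hist.sum(), 1.0)
```

The resulting histogram is the image's feature vector; classification then proceeds exactly as before (nearest neighbor, or the SVM in the next section).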

Bag of Words and Support Vector Machines

In this scenario, a linear SVM classifier was used for prediction. A one-vs-all SVM was trained for each category, resulting in 15 classifiers. For each test image, the confidence score against all 15 hyperplanes was evaluated, and the highest-scoring category was picked as the winner (the resulting category).
This process resulted in an accuracy of 63.7% using lambda = 0.0001.
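The one-vs-all scheme can be sketched with scikit-learn (an assumption - the report's lambda suggests a different SVM package, where lambda plays roughly the inverse role of scikit-learn's C parameter):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_all(feats, labels, C=1.0):
    """Train one binary linear SVM per category (that category vs. all others)."""
    clfs = {}
    for c in sorted(set(labels)):
        y = np.array([1 if l == c else -1 for l in labels])
        clfs[c] = LinearSVC(C=C).fit(feats, y)
    return clfs

def predict(clfs, feats):
    """Pick, for each feature, the category whose hyperplane gives the
    highest (most confident) decision value."""
    cats = list(clfs)
    scores = np.column_stack([clfs[c].decision_function(feats) for c in cats])
    return [cats[i] for i in scores.argmax(axis=1)]
```

Each binary classifier only answers "is this category X or not?"; taking the argmax over all 15 decision values turns those answers into a single multi-class prediction.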

Detailed Results:

Scene classification results visualization

Accuracy (mean of diagonal of confusion matrix) is 0.637

Category      Accuracy   False positives (true label)    False negatives (wrong predicted label)
Kitchen       0.550      LivingRoom, Store               InsideCity, LivingRoom
Store         0.500      Street, Highway                 Mountain, LivingRoom
Bedroom       0.350      LivingRoom, Kitchen             Office, Office
LivingRoom    0.310      Store, Bedroom                  Suburb, Bedroom
Office        0.880      Store, LivingRoom               Store, LivingRoom
Industrial    0.450      Kitchen, TallBuilding           Store, TallBuilding
Suburb        0.910      InsideCity, InsideCity          TallBuilding, Mountain
InsideCity    0.450      Store, Store                    Industrial, Street
TallBuilding  0.760      Bedroom, InsideCity             Bedroom, Coast
Street        0.760      InsideCity, InsideCity          Suburb, Mountain
Highway       0.750      OpenCountry, Industrial         Industrial, OpenCountry
OpenCountry   0.430      Suburb, Industrial              Mountain, Coast
Coast         0.800      OpenCountry, Highway            OpenCountry, OpenCountry
Mountain      0.770      Store, Forest                   OpenCountry, Industrial
Forest        0.890      Store, Mountain                 Suburb, Mountain

(The original visualization also showed sample training images and sample true positives for each category; those thumbnails are not reproduced here.)

Further Experimentation

The bag-of-words + SVM scenario yielded the best results so far. I decided to experiment with two parameters - vocabulary size (for SIFT-BOW) and lambda (for SVM) to observe changes in performance.

Vocabulary Size

A constant lambda of 0.0001 for the SVM classifier was used here. Accuracy increased with vocabulary size (for the most part), but computation time increases greatly as well. For extremely large vocabulary sizes, performance drops as the model begins to overfit, treating every tiny variation as its own visual word.
Vocabulary Size Accuracy (%)
10 40.9
50 57.3
100 60.4
200 63.7
400 66.7
1000 67.4
10000 62.2

Lambda

A constant vocabulary size of 200 for the Bag of Words SIFT features was used here.
Lambda Accuracy (%)
0.1 42.1
0.01 49.1
0.005 54.9
0.001 60.7
0.0005 62.5
0.0001 63.7
0.00001 59.2