Project 4 / Scene Recognition with Bag of Words

Introduction

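The purpose of this project was to cover scene recognition, a problem in computer vision that aims to distinguish classes of scenes. In this project there were 15 scene classes, ordered by similarity: Kitchen, Store, Bedroom, LivingRoom, Office, Industrial, Suburb, InsideCity, TallBuilding, Street, Highway, OpenCountry, Coast, Mountain, and Forest.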

The core approach in this project is to extract strong, distinguishing features from each image and use them to describe the scene classes to a machine learning algorithm. After training on features labeled with their scene classes, the algorithm can classify the features of an image that has no label. The pipeline in this project used the following approaches:

Feature Extraction: tiny images, bag of SIFT

Classifiers: nearest neighbor, support vector machine (SVM)



Tiny Image Features + Nearest Neighbor

The simplest version of the pipeline resized each image to 16x16 and used the pixel intensity values of the resulting image as the feature vector for the nearest neighbor classifier. The results were:


Accuracy (mean of diagonal of confusion matrix) is 0.191

The poor results were unsurprising given how tiny image features work. Resizing an image to a very small resolution preserves its low frequencies, and with them the general layout of the scene, but discards potentially important high-frequency details. Without those details, it becomes very difficult for the classifier to distinguish between similar scenes.
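The tiny image pipeline described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the function names are hypothetical, images are assumed to be grayscale lists of pixel rows, and block averaging stands in for whatever image resize routine the project really used.

```python
def tiny_image(pixels, size=16):
    """Shrink a grayscale image (list of pixel rows) to size x size by
    averaging blocks of pixels, then flatten it into one feature vector.
    Block averaging is an assumed stand-in for a library resize call."""
    height, width = len(pixels), len(pixels[0])
    feature = []
    for i in range(size):
        r0 = i * height // size
        r1 = max((i + 1) * height // size, r0 + 1)  # at least one row per block
        for j in range(size):
            c0 = j * width // size
            c1 = max((j + 1) * width // size, c0 + 1)  # at least one column
            block = [pixels[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            feature.append(sum(block) / len(block))
    return feature


def nearest_neighbor(train_features, train_labels, query):
    """Return the label of the training feature closest to the query,
    using squared Euclidean distance."""
    best_label, best_dist = None, float("inf")
    for feature, label in zip(train_features, train_labels):
        dist = sum((a - b) ** 2 for a, b in zip(feature, query))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```

Every 16x16 image becomes a 256-dimensional point, so any detail smaller than one averaged block cannot influence the distance between two images.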



Bag of SIFT + Nearest Neighbor

To combat the poor generalization and lossy nature of the tiny image features, the bag of SIFT method was used instead. The accuracy improved significantly:


Accuracy (mean of diagonal of confusion matrix) is 0.544

The improvement in the results was likely due to the vastly superior features that the bag of SIFT method provided. SIFT is much better at picking out unique and distinguishing points of interest in a scene and describing each with a robust descriptor, so that similar points can be matched across different scenes. With such a large improvement from feature extraction, the focus shifted to the learning algorithm, nearest neighbor. Nearest neighbor is an intuitive learning algorithm: it assumes that datapoints close together exhibit similar features and therefore belong to the same class. However, because it relies on just one neighbor, the algorithm is very susceptible to noise and can overfit the training data aggressively.
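As a rough sketch of the bag of SIFT idea, each local descriptor in an image is assigned to its closest visual word, and the image is then represented by the histogram of word counts. The function names below are hypothetical, the toy descriptors are 2-D rather than the 128-D vectors SIFT produces, and normalizing the histogram by the number of descriptors is an assumption rather than something the report states.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary entry closest to the descriptor
    (squared Euclidean distance)."""
    return min(
        range(len(vocabulary)),
        key=lambda k: sum((d - v) ** 2 for d, v in zip(descriptor, vocabulary[k])),
    )


def bag_of_words(descriptors, vocabulary):
    """Histogram of visual word assignments for one image, normalized so
    images with different numbers of descriptors stay comparable
    (the normalization is an assumption)."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts) or 1  # avoid dividing by zero for an empty image
    return [c / total for c in counts]


# Toy example: two visual words, four descriptors from one image.
vocabulary = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [9.0, 9.0], [10.0, 11.0], [0.0, 1.0]]
histogram = bag_of_words(descriptors, vocabulary)
```

The resulting histogram has one entry per vocabulary word and can be fed to the same nearest neighbor classifier used for the tiny image features.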



Bag of SIFT + Support Vector Machine

The last major improvement to the standard scene recognition pipeline was to use support vector machines to learn and classify the features. The classification performance improved once again:


Accuracy (mean of diagonal of confusion matrix) is 0.649

With an improved learning algorithm, the pipeline was able to identify scenes even more accurately, even with some noise. Even so, there were still difficulties distinguishing between the store, bedroom, and living room. This made logical sense, since those scenes exhibit very similar visual cues and objects that possess many similar points of interest.
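To make the classification step more concrete, the sketch below assumes linear SVMs combined one-vs-rest: one weight vector and bias per category, with the highest decision value winning. The report does not say which multi-class scheme or objective was used, so both the one-vs-rest setup and the regularized hinge objective shown here (a common form in which lambda weights the penalty on the weights) are assumptions, and all names are hypothetical.

```python
def decision_value(w, b, x):
    """Signed score of a linear classifier: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def svm_objective(w, b, features, signs, lam):
    """Assumed objective: lam/2 * ||w||^2 plus the mean hinge loss.
    signs holds +1 for the target category and -1 for all others."""
    regularizer = 0.5 * lam * sum(wi * wi for wi in w)
    hinge = sum(
        max(0.0, 1.0 - y * decision_value(w, b, x))
        for x, y in zip(features, signs)
    )
    return regularizer + hinge / len(features)


def predict_one_vs_rest(models, x):
    """models maps category name -> (w, b); pick the most confident one."""
    return max(models, key=lambda c: decision_value(*models[c], x))


# Toy models with made-up weights, for illustration only.
models = {"Kitchen": ([1.0, 0.0], 0.0), "Forest": ([0.0, 1.0], 0.0)}
predicted = predict_one_vs_rest(models, [0.2, 0.9])
```

Under this objective, a larger lambda favors small weights over fitting every training point, which is one way to read why the classifier tolerates noisy examples better than a single nearest neighbor.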



Bonus: Tuning Parameters with a Validation Set

To tune the parameters for the bag of SIFT + SVM setup, a validation set was used. A Python script, pick_validation_set.py, was written to randomly sample from the test set to create the validation set. The following validation set was used for tuning:


../data/test/Bedroom/image_0158.jpg
../data/test/Bedroom/image_0176.jpg
../data/test/Bedroom/image_0037.jpg
../data/test/Bedroom/image_0196.jpg
../data/test/Coast/image_0177.jpg
../data/test/Coast/image_0124.jpg
../data/test/Coast/image_0349.jpg
../data/test/Coast/image_0331.jpg
../data/test/Forest/image_0198.jpg
../data/test/Forest/image_0101.jpg
../data/test/Forest/image_0196.jpg
../data/test/Forest/image_0319.jpg
../data/test/Highway/image_0184.jpg
../data/test/Highway/image_0220.jpg
../data/test/Highway/image_0232.jpg
../data/test/Highway/image_0211.jpg
../data/test/Industrial/image_0178.jpg
../data/test/Industrial/image_0045.jpg
../data/test/Industrial/image_0123.jpg
../data/test/Industrial/image_0245.jpg
../data/test/InsideCity/image_0002.jpg
../data/test/InsideCity/image_0251.jpg
../data/test/InsideCity/image_0297.jpg
../data/test/InsideCity/image_0048.jpg
../data/test/Kitchen/image_0071.jpg
../data/test/Kitchen/image_0057.jpg
../data/test/Kitchen/image_0097.jpg
../data/test/Kitchen/image_0147.jpg
../data/test/LivingRoom/image_0126.jpg
../data/test/LivingRoom/image_0114.jpg
../data/test/LivingRoom/image_0156.jpg
../data/test/LivingRoom/image_0160.jpg
../data/test/Mountain/image_0186.jpg
../data/test/Mountain/image_0254.jpg
../data/test/Mountain/image_0297.jpg
../data/test/Mountain/image_0146.jpg
../data/test/Office/image_0180.jpg
../data/test/Office/image_0089.jpg
../data/test/Office/image_0047.jpg
../data/test/Office/image_0121.jpg
../data/test/OpenCountry/image_0008.jpg
../data/test/OpenCountry/image_0109.jpg
../data/test/OpenCountry/image_0189.jpg
../data/test/OpenCountry/image_0164.jpg
../data/test/Store/image_0296.jpg
../data/test/Store/image_0259.jpg
../data/test/Store/image_0300.jpg
../data/test/Store/image_0166.jpg
../data/test/Street/image_0161.jpg
../data/test/Street/image_0232.jpg
../data/test/Street/image_0042.jpg
../data/test/Street/image_0106.jpg
../data/test/Suburb/image_0061.jpg
../data/test/Suburb/image_0213.jpg
../data/test/Suburb/image_0182.jpg
../data/test/Suburb/image_0138.jpg
../data/test/TallBuilding/image_0138.jpg
../data/test/TallBuilding/image_0136.jpg
../data/test/TallBuilding/image_0088.jpg
../data/test/TallBuilding/image_0221.jpg
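The original pick_validation_set.py is not reproduced in this report. The following is a minimal sketch of how such a script might work, assuming one subdirectory per category and four images per category, as the listing above suggests. The function name, its parameters, and the seed argument are hypothetical.

```python
import os
import random


def pick_validation_set(test_root, per_category=4, seed=None):
    """Randomly pick per_category .jpg files from each category folder
    under test_root and return their paths."""
    rng = random.Random(seed)
    picked = []
    for category in sorted(os.listdir(test_root)):
        category_dir = os.path.join(test_root, category)
        if not os.path.isdir(category_dir):
            continue  # skip stray files next to the category folders
        images = sorted(
            name for name in os.listdir(category_dir)
            if name.lower().endswith(".jpg")
        )
        # Sample without replacement so no image is picked twice.
        for name in rng.sample(images, min(per_category, len(images))):
            picked.append(os.path.join(category_dir, name))
    return picked
```

Calling pick_validation_set("../data/test") would return a list of paths shaped like the one above. Passing a fixed seed makes the selection repeatable between tuning runs.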

The main parameter adjusted was lambda for the SVM. The logs below show how each experiment performed, with each run preceded by the lambda value used.


0.000001
Getting paths and labels for all train and test data
Using bag of sift representation for images
No existing visual word vocabulary found. Computing one from training images
Using support vector machine classifier to predict test set categories
Creating results_webpage/index.html, thumbnails, and confusion matrix
Accuracy (mean of diagonal of confusion matrix) is 0.600

0.00001
Getting paths and labels for all train and test data
Using bag of sift representation for images
Using support vector machine classifier to predict test set categories
Creating results_webpage/index.html, thumbnails, and confusion matrix
Accuracy (mean of diagonal of confusion matrix) is 0.687

0.000025
Getting paths and labels for all train and test data
Using bag of sift representation for images
No existing visual word vocabulary found. Computing one from training images
Using support vector machine classifier to predict test set categories
Creating results_webpage/index.html, thumbnails, and confusion matrix
Accuracy (mean of diagonal of confusion matrix) is 0.691

0.00005
Getting paths and labels for all train and test data
Using bag of sift representation for images
Using support vector machine classifier to predict test set categories
Creating results_webpage/index.html, thumbnails, and confusion matrix
Accuracy (mean of diagonal of confusion matrix) is 0.649

0.000075
Getting paths and labels for all train and test data
Using bag of sift representation for images
Using support vector machine classifier to predict test set categories
Creating results_webpage/index.html, thumbnails, and confusion matrix
Accuracy (mean of diagonal of confusion matrix) is 0.693

0.0001
Getting paths and labels for all train and test data
Using bag of sift representation for images
Using support vector machine classifier to predict test set categories
Creating results_webpage/index.html, thumbnails, and confusion matrix
Accuracy (mean of diagonal of confusion matrix) is 0.673

0.001
Getting paths and labels for all train and test data
Using bag of sift representation for images
Using support vector machine classifier to predict test set categories
Creating results_webpage/index.html, thumbnails, and confusion matrix
Accuracy (mean of diagonal of confusion matrix) is 0.611

0.01
Getting paths and labels for all train and test data
Using bag of sift representation for images
Using support vector machine classifier to predict test set categories
Creating results_webpage/index.html, thumbnails, and confusion matrix
Accuracy (mean of diagonal of confusion matrix) is 0.522

From the data above, it is clear that there is a sweet spot for lambda between 0.00001 and 0.0001. In these experiments, the best value was 0.000075, which gave an accuracy of 0.693.
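The selection can be reproduced directly from the accuracies in the logs above. The short sketch below only tabulates those logged numbers and picks the lambda with the highest accuracy. It does not rerun any training, and the variable names are illustrative.

```python
# lambda -> accuracy, copied from the experiment logs above.
results = {
    0.000001: 0.600,
    0.00001: 0.687,
    0.000025: 0.691,
    0.00005: 0.649,
    0.000075: 0.693,
    0.0001: 0.673,
    0.001: 0.611,
    0.01: 0.522,
}

# Pick the lambda whose run reported the highest accuracy.
best_lambda = max(results, key=results.get)
best_accuracy = results[best_lambda]

# Print the sweep in ascending order of lambda for a quick overview.
for lam in sorted(results):
    marker = " <- best" if lam == best_lambda else ""
    print(f"lambda={lam:g}  accuracy={results[lam]:.3f}{marker}")
```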



Bonus: Varying Vocabulary Size

All of the previous results used only 200 vocabulary words. This part of the project aimed to experiment with the effect of the vocabulary size on classification and recognition performance.

From the data above, accuracy clearly improves as the vocabulary size grows. However, the time it takes to complete the pipeline grows much faster than linearly, so beyond roughly 700 words the gain in accuracy is not worth the extra time. In fact, the experiment was also run with 5000 words, and after 4 hours the code was still running. In conclusion, a vocabulary of around 700 words offers a good tradeoff between accuracy and time.
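For context on where the vocabulary size enters the pipeline, the sketch below builds a visual vocabulary by clustering descriptors with plain k-means (Lloyd's algorithm), where the number of clusters is the vocabulary size. The report does not state which clustering routine was used, so k-means is an assumption here, and the function and parameter names are hypothetical.

```python
import random


def build_vocabulary(descriptors, vocab_size, iterations=10, seed=0):
    """Cluster descriptors into vocab_size visual words with basic k-means
    and return the cluster centers."""
    rng = random.Random(seed)
    # Start from vocab_size distinct descriptors chosen at random.
    centers = [list(d) for d in rng.sample(descriptors, vocab_size)]
    dims = len(centers[0])
    for _ in range(iterations):
        sums = [[0.0] * dims for _ in centers]
        counts = [0] * len(centers)
        # Assignment step: every descriptor is compared to every center.
        for d in descriptors:
            k = min(
                range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(d, centers[i])),
            )
            counts[k] += 1
            for j, value in enumerate(d):
                sums[k][j] += value
        # Update step: move each non-empty center to the mean of its members.
        for k in range(len(centers)):
            if counts[k]:
                centers[k] = [s / counts[k] for s in sums[k]]
    return centers
```

In this sketch the work in each assignment pass is proportional to the number of descriptors times the vocabulary size, and the same comparison is repeated when building the histogram for every training and test image.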



Best Results

After all the experimentation, here are the best results, with an accuracy of 0.717.

Scene classification results visualization


Accuracy (mean of diagonal of confusion matrix) is 0.717

(The sample training images and sample true positives were shown as thumbnails and are not reproduced here.)

Category name   Accuracy   False positives with true label   False negatives with wrong predicted label
Kitchen         0.520      Bedroom, Bedroom                  Bedroom, LivingRoom
Store           0.570      InsideCity, Street                LivingRoom, LivingRoom
Bedroom         0.500      LivingRoom, Kitchen               Kitchen, LivingRoom
LivingRoom      0.510      Street, Store                     Bedroom, Store
Office          0.930      LivingRoom, Industrial            Kitchen, LivingRoom
Industrial      0.560      Highway, InsideCity               TallBuilding, TallBuilding
Suburb          0.980      LivingRoom, Coast                 OpenCountry, InsideCity
InsideCity      0.560      Street, TallBuilding              Industrial, Industrial
TallBuilding    0.850      InsideCity, Bedroom               Industrial, LivingRoom
Street          0.770      Highway, InsideCity               InsideCity, Store
Highway         0.850      Coast, OpenCountry                Coast, Industrial
OpenCountry     0.550      Mountain, Industrial              Coast, Highway
Coast           0.780      OpenCountry, TallBuilding         OpenCountry, OpenCountry
Mountain        0.860      Coast, Forest                     LivingRoom, TallBuilding
Forest          0.960      Store, Mountain                   Store, Mountain
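The reported accuracy is the mean of the diagonal of the confusion matrix, which is the same as the mean of the per-category accuracies in the table. The sketch below shows that computation and checks it against the table values. It assumes each row of the confusion matrix holds the counts for one true category, and the names are illustrative.

```python
def mean_diagonal(confusion):
    """Mean per-category accuracy from a confusion matrix of counts,
    where row i holds the predictions for true category i (assumed layout)."""
    per_row = [row[i] / sum(row) for i, row in enumerate(confusion)]
    return sum(per_row) / len(per_row)


# Per-category accuracies copied from the results table.
per_category = {
    "Kitchen": 0.520, "Store": 0.570, "Bedroom": 0.500,
    "LivingRoom": 0.510, "Office": 0.930, "Industrial": 0.560,
    "Suburb": 0.980, "InsideCity": 0.560, "TallBuilding": 0.850,
    "Street": 0.770, "Highway": 0.850, "OpenCountry": 0.550,
    "Coast": 0.780, "Mountain": 0.860, "Forest": 0.960,
}

mean_accuracy = sum(per_category.values()) / len(per_category)

# The three categories with the lowest accuracy.
hardest = sorted(per_category, key=per_category.get)[:3]
```

Rounded to three decimals, mean_accuracy matches the reported 0.717. The three lowest-scoring categories are Bedroom, LivingRoom, and Kitchen, all indoor scenes.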