Project 4 / Scene Recognition with Bag of Words

Example scenes from each category in the 15 scene dataset. Figure from Lazebnik et al. 2006.

My scene recognition pipeline has two parts. I used the 15 scene dataset and implemented the following:

Part 1: Feature Extraction

This consists of going through the training and test sets and extracting the relevant features. I experimented with three feature representations:
  1. Tiny image representation. Originally created by Torralba, Fergus, and Freeman. I simply resized each image to 16x16 and used the resulting 256 pixels as the feature vector (see the sketch after this list). I also experimented with normalizing the features and making their mean 0.
  2. SIFT features
  3. gist descriptors. I used the gist descriptor code originally implemented by Aude Oliva and Antonio Torralba.
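As a rough sketch of the tiny image feature (assuming a helper that takes a cell array of image paths; this is illustrative, not my exact submitted code):

    % Tiny image feature: shrink each image to 16x16 and flatten it into a
    % 256-dimensional row vector. image_paths is a cell array of file paths.
    function image_feats = get_tiny_images(image_paths)
        dim = 16;
        N = numel(image_paths);
        image_feats = zeros(N, dim * dim);
        for i = 1:N
            img = imread(image_paths{i});
            if size(img, 3) == 3
                img = rgb2gray(img);            % work on grayscale intensities
            end
            tiny = imresize(single(img), [dim dim]);
            image_feats(i, :) = tiny(:)';       % flatten to a 1x256 row vector
        end
    end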

Part 2: Machine Learning

With the features in hand, I set out to classify them. I used two classification algorithms:
  1. Nearest Neighbor
  2. Linear Support Vector Machine
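As a sketch of the nearest neighbor step (a minimal 1-NN, assuming train_feats and test_feats are N x d feature matrices and train_labels is a cell array of category names; not my exact code):

    % Minimal 1-nearest-neighbor classifier: each test feature gets the
    % label of its closest training feature under Euclidean distance.
    D = pdist2(test_feats, train_feats);        % num_test x num_train distances
    [~, nn_idx] = min(D, [], 2);                % index of the closest training example
    predicted_labels = train_labels(nn_idx);    % copy over its category label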

Experiments

I performed the following experiments:
  1. Tiny Image with Nearest Neighbors

    Here, I ran the tiny image representation with slight variations and then used nearest neighbors on those features. The variations are:
    • Just 256 pixels. The results were:
      
      Using tiny image representation for images
      Elapsed time is 12.498010 seconds.
      Using nearest neighbor classifier to predict test set categories
      Elapsed time is 1.961715 seconds.
      Accuracy = 0.191
      
      This low accuracy was expected, since a 16x16 thumbnail carries very little information. You can see the full results here.

    • Just 256 pixels, but with normalization. I normalized the features to unit length. The results were:
      
      Using tiny image representation for images
      Elapsed time is 12.452074 seconds
      Using nearest neighbor classifier to predict test set categories
      Elapsed time is 1.543277 seconds.
      Accuracy = 0.199
      
      This was expected. I got a small boost in performance (0.199 vs 0.191).
      I did not expect it to take slightly less time, although I might be reading too much into it.
      You can see the full results here.

    • Just 256 pixels, but with normalization and zero mean. I normalized the features to unit length and set the mean to 0 by subtracting the old mean from each element of the feature vector (both variants are sketched at the end of this experiment). The results were:
      
      Using tiny image representation for images
      Elapsed time is 12.946300 seconds.
      Using nearest neighbor classifier to predict test set categories
      Elapsed time is 1.083529 seconds.
      Accuracy = 0.147
      
      This was very unexpected. I got a considerable loss in performance (0.147 vs 0.191), and feature extraction took slightly longer.
      You can see the full results here.
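    The two normalization variants above amount to a couple of lines on top of the raw 256-pixel features (a sketch, assuming image_feats is the N x 256 tiny image matrix; implicit expansion needs MATLAB R2016b or later):

      % Variation 2: scale each feature vector to unit (L2) length.
      norms = sqrt(sum(image_feats .^ 2, 2));
      unit_feats = image_feats ./ norms;

      % Variation 3: first subtract each feature's own mean, then rescale to unit length.
      centered = image_feats - mean(image_feats, 2);
      zm_unit_feats = centered ./ sqrt(sum(centered .^ 2, 2));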

  2. Bag of SIFT with Nearest Neighbors

    I first ran SIFT with a vocab size of 200 and step size of 10. Results:
    
    Using bag of sift representation for images
    Elapsed time is 244.782265 seconds.
    Using nearest neighbor classifier to predict test set categories
    Elapsed time is 2.066009 seconds.
    Accuracy = 0.513
    
    Confusion matrix

    This was expected. As you can see, creating the vocabulary and extracting SIFT features took about 4 minutes.
    I got an accuracy of 51%.
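    At its core, the bag of SIFT representation for one image looks roughly like this (a sketch using VLFeat; vocab is assumed to be a vocab_size x 128 matrix of k-means cluster centers built beforehand, e.g. with vl_kmeans, and the details differ slightly from my actual code):

      % Sketch: turn one image into a normalized histogram of visual words.
      img = single(imread(image_paths{i}));                    % 15 scene images are grayscale
      [~, descriptors] = vl_dsift(img, 'Step', 10, 'Fast');    % 128 x M SIFT descriptors
      % Assign every descriptor to its nearest vocabulary word.
      D = vl_alldist2(single(descriptors), single(vocab'));    % M x vocab_size distances
      [~, word_idx] = min(D, [], 2);
      % Build and normalize the histogram of word counts.
      hist_feat = histc(word_idx, 1:size(vocab, 1));
      hist_feat = hist_feat / sum(hist_feat);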

  3. Bag of SIFT with Linear SVMs

    Again, I used a vocab size of 200 and step size of 10. I varied Lambda (the SVM regularization weight) to get the best performance; the one-vs-all setup is sketched after the fine-tuning results below.
    First I went through orders of magnitude, from 0.1 down to 10^-7. The results were:
    Lambda     10^-1    10^-2    10^-3    10^-4    10^-5    10^-6    10^-7
    Accuracy   0.4413   0.5387   0.5853   0.6313   0.5853   0.5507   0.5133
    Time (s)   0.8738   0.7585   0.9060   1.470    2.2398   6.6108   26.5470
    Plotting these:


    As you can see, there is a clear peak at 10^-4, so I started fine-tuning.
    I ran the sweep again with Lambda varying from 0.0001 to 0.0009 in increments of 0.0001.
    Lambda     0.0001   0.0002   0.0003   0.0004   0.0005   0.0006   0.0007   0.0008   0.0009
    Accuracy   0.6287   0.6247   0.6180   0.6100   0.5987   0.5987   0.5967   0.6027   0.5953
    Time (s)   1.212    1.0172   0.8319   0.8044   0.8769   0.8388   0.8402   0.7931   1.0674
    Plotting these:

    Since 0.0001 performed the best, I used that value for future experiments.
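    The one-vs-all linear SVM setup looks roughly like this (a sketch assuming VLFeat's vl_svmtrain; train_feats and test_feats are N x d matrices, train_labels and categories are cell arrays of category names; not my exact code):

      % Sketch: train one binary linear SVM per category, then label each test
      % image with the category whose SVM gives it the highest score.
      lambda = 1e-4;
      num_cats = numel(categories);
      scores = zeros(size(test_feats, 1), num_cats);
      for c = 1:num_cats
          binary_labels = 2 * double(strcmp(train_labels, categories{c})) - 1;   % +1 / -1
          [W, B] = vl_svmtrain(single(train_feats'), binary_labels, lambda);
          scores(:, c) = test_feats * double(W) + B;     % confidence for category c
      end
      [~, best] = max(scores, [], 2);
      predicted_labels = categories(best);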

    After this, I varied the vocab size. To speed this up I experimented with parallel processing, using parpool to create a cluster locally and parfor loops to sweep the sizes (sketched below). The results were:
    Size       10      20      50      100     200     400     1000
    Accuracy   0.490   0.391   0.616   0.590   0.632   0.634   0.631
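    The parallel sweep looked roughly like this (a sketch; build_vocabulary, get_bags_of_sifts, and svm_classify are stand-ins for the vocabulary-building, feature-extraction, and one-vs-all SVM steps, not necessarily my exact function names):

      % Sketch: evaluate several vocabulary sizes in parallel with parfor.
      vocab_sizes = [10 20 50 100 200 400 1000];
      accuracies = zeros(size(vocab_sizes));
      if isempty(gcp('nocreate'))
          parpool('local');                   % start a local pool of workers
      end
      parfor v = 1:numel(vocab_sizes)
          vocab = build_vocabulary(train_image_paths, vocab_sizes(v));   % k-means over SIFT
          train_feats = get_bags_of_sifts(train_image_paths, vocab);
          test_feats  = get_bags_of_sifts(test_image_paths, vocab);
          predictions = svm_classify(train_feats, train_labels, test_feats);
          accuracies(v) = mean(strcmp(predictions, test_labels));
      end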


    Here is the confusion matrix for Lambda = 0.0002 and Vocab Size = 400:
    The accuracy was 0.627. You can see the full results here.
  4. gist with Nearest Neighbors

    Using only gist features, I got a small increase in accuracy over bag of SIFT with nearest neighbors. Results:
    
    Using bag of gist representation for images
    Elapsed time is 308.598544 seconds.
    Using nearest neighbor classifier to predict test set categories
    Elapsed time is 4.023655 seconds.
    Accuracy (mean of diagonal of confusion matrix) is 0.561
    
    Confusion matrix

    Extracting gist features took longer than extracting SIFT features (about 5 minutes vs 4 minutes). Accuracy went up by about 5 percentage points (0.561 vs 0.513).
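    The gist features come from Oliva and Torralba's LMgist code; computing one descriptor looks roughly like this (a sketch; the parameter values below are the defaults commonly suggested with their code, not necessarily the ones I used):

      % Sketch: compute a gist descriptor for one image with the LMgist package.
      param.orientationsPerScale = [8 8 8 8];   % Gabor filters per scale
      param.numberBlocks = 4;                   % 4x4 spatial grid
      param.fc_prefilt = 4;
      param.imageSize = 256;                    % images are resized internally
      img = imread(image_paths{i});
      [gist_feat, param] = LMgist(img, '', param);   % 512-dimensional descriptor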

  5. gist with Linear SVMs

    I varied Lambda to get the best performance.
    First I went through orders of magnitude, going from 0.1 to 10^-7. The results were:
    Lambda     10^-1    10^-2    10^-3    10^-4    10^-5    10^-6     10^-7
    Accuracy   0.2927   0.5720   0.6513   0.6913   0.6807   0.6493    0.6367
    Time (s)   0.9898   0.9484   1.0137   1.5346   4.9265   17.2567   23.6776
    Plotting these (the Y axis is accuracy in the first plot and time in the second):


    As you can see, there is a clear peak at 10^-4, so I started fine-tuning.
    I ran the sweep again with Lambda varying from 0.0001 to 0.0009 in increments of 0.0001.
    Lambda     0.0001   0.0002   0.0003   0.0004   0.0005   0.0006   0.0007   0.0008   0.0009
    Accuracy   0.6967   0.6967   0.6847   0.6673   0.6573   0.6767   0.6633   0.6520   0.6687
    Time (s)   1.5756   1.268    1.1900   1.1118   1.1181   1.0523   1.1142   1.1592   1.201
    Plotting these (the Y axis is accuracy in the first plot and time in the second):

    Since 0.0001 and 0.0002 performed equally well and 0.0002 took less time, I used 0.0002 as my Lambda value for further experiments.
    Here is the confusion matrix for Lambda = 0.0002:
    The accuracy was 0.687, which is better than bag of SIFT with a linear SVM by about 0.06. You can see the full results here.


  6. Combining features

    Thinking the more the merrier, I first combined all three feature representations above and classified them with a linear SVM.
    The results were dismal: I got only 9%. The confusion matrix is:


    I figured the tiny image representation might be skewing things on account of its low-information features. So I combined only gist and SIFT, and the results were pretty good: around 77% accuracy overall, with more than 90% accuracy for Suburb (95%), Street (91%), and Forest (92%)! The result was:
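    Combining representations is just horizontal concatenation of the per-image feature matrices, with each block normalized so that neither representation dominates by sheer magnitude (a sketch; my exact normalization may differ):

      % Sketch: concatenate bag-of-SIFT and gist features for the same images.
      % Each block is scaled to unit length per row before concatenation.
      sift_n = sift_feats ./ sqrt(sum(sift_feats .^ 2, 2));
      gist_n = gist_feats ./ sqrt(sum(gist_feats .^ 2, 2));
      combined_feats = [sift_n, gist_n];     % N x (d_sift + d_gist)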

    Scene classification results visualization


    Accuracy (mean of diagonal of confusion matrix) is 0.768

    Category name   Accuracy   False positives (true label)    False negatives (predicted label)
    Kitchen         0.650      Bedroom, Bedroom                Bedroom, Store
    Store           0.660      Kitchen, Bedroom                TallBuilding, Kitchen
    Bedroom         0.570      LivingRoom, LivingRoom          OpenCountry, LivingRoom
    LivingRoom      0.450      Bedroom, Bedroom                Bedroom, Bedroom
    Office          0.890      Kitchen, Bedroom                Kitchen, Kitchen
    Industrial      0.750      Store, Bedroom                  Kitchen, Store
    Suburb          0.950      LivingRoom, Bedroom             Kitchen, InsideCity
    InsideCity      0.790      Highway, Industrial             Kitchen, TallBuilding
    TallBuilding    0.880      InsideCity, LivingRoom          InsideCity, Store
    Street          0.910      Store, InsideCity               Mountain, InsideCity
    Highway         0.830      Coast, OpenCountry              Coast, Bedroom
    OpenCountry     0.590      Mountain, LivingRoom            Coast, Highway
    Coast           0.830      Industrial, OpenCountry         OpenCountry, Suburb
    Mountain        0.850      TallBuilding, Kitchen           Coast, Forest
    Forest          0.920      Mountain, OpenCountry           Store, TallBuilding

    (The sample training image and sample true positive columns of the original results table contain only images, so they are omitted here; the labels above are the ones shown under the false positive and false negative examples.)

Weird result

I tried using Olivier Chapelle's SVM MATLAB code with an RBF kernel but it gave me 100% accuracy every time. Something was clearly wrong. Here is the confusion matrix I got:

Results

In the end, I managed to achieve about 76.8% accuracy by combining gist and SIFT features. I could have gotten better results by accounting for different spatial scales, using better-tuned values for step size, image size, etc., and trying different SVM kernels; I will look into these in the future. I also couldn't finish performing cross-validation, which would have given me a more honest estimate of my model's accuracy. I tried running the code on the GT PACE high performance cluster but couldn't get it working.

Data

You can see the confusion matrices and other plots I made while performing the experiments here. I did not include all of them in this webpage for the sake of brevity.