Project 4 / Scene Recognition with Bag of Words

The goal of this project is to introduce us to image recognition. Specifically, we will examine the task of scene recognition starting with very simple methods -- tiny images and nearest neighbor classification -- and then move on to more advanced methods -- bags of quantized local features and linear classifiers learned by support vector machines. The pipeline is run for the following methods:

  1. Tiny images representation and nearest neighbor classifier (accuracy of about 21.5%).
  2. Bag of SIFT representation and nearest neighbor classifier (accuracy of about 58%).
  3. Bag of SIFT representation and linear SVM classifier (accuracy of about 71.3%).

Analysis

The tiny image is one of the simplest possible image representations. Each image is downsampled to 16x16 and the resulting 256-dimensional vector is used as the feature vector. This is not a particularly good representation, because it discards all of the high frequency image content and is not especially invariant to spatial or brightness shifts.
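
As a rough illustration of the tiny image feature described above, here is a minimal sketch in plain Python. It assumes a grayscale image stored as a list of rows that is at least 16x16, and it downsamples by block averaging, which is only one possible resizing method. The function name `tiny_image_feature` is hypothetical and not taken from the project code.

```python
def tiny_image_feature(image, size=16):
    """Downsample a grayscale image (list of rows) to size x size by block
    averaging and flatten it into a size*size feature vector.

    Assumes the image has at least `size` rows and `size` columns.
    """
    rows, cols = len(image), len(image[0])
    feature = []
    for i in range(size):
        # Source rows that map to output row i.
        r0, r1 = i * rows // size, (i + 1) * rows // size
        for j in range(size):
            # Source columns that map to output column j.
            c0, c1 = j * cols // size, (j + 1) * cols // size
            block = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            feature.append(sum(block) / len(block))
    return feature


# Usage: a 32x32 image whose pixel value equals its row index.
image = [[float(r)] * 32 for r in range(32)]
feature = tiny_image_feature(image)
```

Each output value averages a whole block of pixels, which is why the fine detail (high frequency content) is lost.
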
Nearest neighbor and k-NN classification are implemented with a KDTREE, which aims to reduce the search time per query to roughly logarithmic in the number of training examples.
Bag of SIFT uses VL_SIFT to sample dense SIFT features and builds a histogram by finding, for every SIFT feature, the nearest vocabulary centroid through a KDTREE lookup. The Bag of SIFT features are implemented with a cluster size of 200.
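
The histogram step can be sketched as follows. This is a simplified illustration, not the project code: it replaces the KDTREE lookup with a brute-force nearest-centroid search, and the final normalization to unit sum is an assumption. The function names are hypothetical.

```python
def nearest_centroid(descriptor, centroids):
    """Return the index of the centroid closest to the descriptor
    (squared Euclidean distance, brute-force search)."""
    best_index, best_distance = 0, float("inf")
    for k, centroid in enumerate(centroids):
        distance = sum((a - b) ** 2 for a, b in zip(descriptor, centroid))
        if distance < best_distance:
            best_index, best_distance = k, distance
    return best_index


def bag_of_words_histogram(descriptors, centroids):
    """Count how many descriptors fall on each visual word, then
    normalize the counts so the histogram sums to 1 (assumed here)."""
    histogram = [0.0] * len(centroids)
    for descriptor in descriptors:
        histogram[nearest_centroid(descriptor, centroids)] += 1.0
    total = sum(histogram)
    return [h / total for h in histogram] if total > 0 else histogram


# Usage with a toy 2-word vocabulary and 2-D "descriptors".
centroids = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 1.0], [9.0, 9.0], [8.0, 9.0], [0.0, 2.0]]
histogram = bag_of_words_histogram(descriptors, centroids)
```

With a vocabulary of 200 centroids, the same function would produce a 200-dimensional histogram per image.
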
The linear SVM uses VL_SVM. The classifiers are trained with several different values of the regularization parameter lambda.

Highest Accuracy

The highest accuracy achieved so far is 78%, obtained with Fisher encoding for cluster size 200 and a nonlinear SVM with an RBF kernel, with lambda = 10^-4.


Extra Credits Implemented


  1. Vocabulary Size Analysis
  2. Cross Validation Method
  3. Scale-spaced features
  4. Complementary features (GIST)
  5. Fisher Encoding
  6. Gaussian/RBF Kernel
  7. Spatial Pyramids

Vocabulary Size Analysis

This section analyzes the different feature types across the classification algorithms and cluster sizes. The code was also run for cluster size 400, but given the trade-off between run time and accuracy I chose not to display those results.

Cross Validation

Cross validation is run for N = 100, 200, and 1000 iterations with Bag of SIFT features and a linear SVM, reporting a highest accuracy of 68%.
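
One way to organize such repeated runs is sketched below. It assumes each iteration draws a fresh random train/test split and scores it with a caller-supplied function; the split sizes, the `evaluate` callback, and the function name `random_split_cv` are all hypothetical and not taken from the project code.

```python
import random
import statistics


def random_split_cv(samples, evaluate, iterations, train_size, test_size, seed=0):
    """Repeat a random train/test split `iterations` times and return the
    mean and (population) standard deviation of the scores.

    `evaluate(train, test)` is a hypothetical callback that trains a
    classifier on `train` and returns its accuracy on `test`.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        train = shuffled[:train_size]
        test = shuffled[train_size:train_size + test_size]
        scores.append(evaluate(train, test))
    return statistics.mean(scores), statistics.pstdev(scores)


# Usage with a dummy scorer that only checks the split is disjoint.
samples = list(range(20))
mean_score, std_score = random_split_cv(
    samples,
    lambda train, test: 1.0 if set(train).isdisjoint(test) else 0.0,
    iterations=5, train_size=12, test_size=8,
)
```

Reporting the standard deviation alongside the mean shows how much the accuracy depends on the particular split.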

Scale-spaced features

We notice that multi-scale features (4 scales here) change the final accuracy very little.

Complementary features

We implement GIST features with the three classification algorithms and notice a marked jump in accuracy. This may be because GIST provides a low dimensional representation of the scene that does not require any form of segmentation. Concatenating GIST with the bag-of-words SIFT features gives high accuracy for cluster sizes 50-200.

Fisher Encoding

For K = 200, Fisher encoding gave the best results (about 78%), using a Gaussian mixture model trained with 200 clusters.

Gaussian/RBF Kernel

Primal SVM code is used with a self-implemented RBF kernel.
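
For reference, a kernel matrix of this kind can be computed as in the sketch below. It uses the standard Gaussian/RBF form K(x, y) = exp(-gamma * ||x - y||^2); the parameter name `gamma`, its default value, and the function name are assumptions, and this is not the project's own kernel code.

```python
import math


def rbf_kernel_matrix(X, Y, gamma=1.0):
    """Return the matrix K with K[i][j] = exp(-gamma * ||X[i] - Y[j]||^2)."""
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y))) for y in Y]
        for x in X
    ]


# Usage: kernel matrix between two 2-D points and themselves.
points = [[0.0, 0.0], [1.0, 0.0]]
K = rbf_kernel_matrix(points, points, gamma=0.5)
```

The diagonal of such a matrix is always 1, and entries shrink toward 0 as the two feature vectors move apart.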

Spatial Pyramid

Features are concatenated over pyramid levels 0 and 1, giving results comparable to the GIST descriptor features for cluster sizes 200 and 150.
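
The concatenation over levels 0 and 1 can be sketched as follows. This assumes each local feature has already been assigned a visual word index and an (x, y) position, that level 1 splits the image into a 2x2 grid, and that the histograms are concatenated without per-level weights; the function name and the input layout are hypothetical.

```python
def spatial_pyramid_histogram(features, width, height, vocab_size):
    """Build a level-0 histogram (whole image) and four level-1 histograms
    (2x2 grid) of visual word counts, and concatenate them.

    `features` is a list of (x, y, word) tuples with 0 <= x < width,
    0 <= y < height and 0 <= word < vocab_size.
    """
    level0 = [0] * vocab_size
    level1 = [[0] * vocab_size for _ in range(4)]
    for x, y, word in features:
        level0[word] += 1
        col = 1 if x >= width / 2 else 0
        row = 1 if y >= height / 2 else 0
        level1[2 * row + col][word] += 1
    pyramid = level0[:]
    for cell in level1:
        pyramid.extend(cell)
    return pyramid  # length is 5 * vocab_size


# Usage: four features in a 10x10 image with a 2-word vocabulary.
features = [(1, 1, 0), (9, 1, 1), (1, 9, 1), (9, 9, 0)]
pyramid = spatial_pyramid_histogram(features, 10, 10, 2)
```

For a vocabulary of 200 words this gives a 1000-dimensional vector per image, so the feature grows quickly with each added level.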

Results visualization for a poorly performing recognition pipeline.


Accuracy (mean of diagonal of confusion matrix) is 0.261
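
To make the accuracy measure concrete, the sketch below builds a row-normalized confusion matrix from true and predicted labels and averages its diagonal. The labels used are toy values for illustration only and are not the project's results; the function names are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, categories):
    """Return a row-normalized confusion matrix: entry [i][j] is the
    fraction of images of category i that were predicted as category j."""
    index = {category: i for i, category in enumerate(categories)}
    n = len(categories)
    counts = [[0] * n for _ in range(n)]
    for true, predicted in zip(true_labels, predicted_labels):
        counts[index[true]][index[predicted]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def mean_diagonal_accuracy(matrix):
    """Average the per-category accuracies on the diagonal."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


# Usage with toy labels (not the project's data).
categories = ["Coast", "Forest"]
true_labels = ["Coast", "Coast", "Forest", "Forest"]
predicted_labels = ["Coast", "Forest", "Forest", "Forest"]
matrix = confusion_matrix(true_labels, predicted_labels, categories)
accuracy = mean_diagonal_accuracy(matrix)
```

Because each row is normalized separately, every category contributes equally to the mean regardless of how many test images it has.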

Category name | Accuracy | Sample training images | Sample true positives | False positives with true label | False negatives with wrong predicted label
Kitchen | 0.480 | [images] | [images] | Industrial, Forest | Highway, Forest
Store | 0.340 | [images] | [images] | Office, TallBuilding | TallBuilding, Coast
Bedroom | 0.300 | [images] | [images] | InsideCity, Store | OpenCountry, Mountain
LivingRoom | 0.250 | [images] | [images] | Office, Kitchen | Bedroom, Office
Office | 0.590 | [images] | [images] | Mountain, Bedroom | Industrial, InsideCity
Industrial | 0.330 | [images] | [images] | LivingRoom, TallBuilding | InsideCity, Store
Suburb | 0.570 | [images] | [images] | Coast, InsideCity | Kitchen, Street
InsideCity | 0.300 | [images] | [images] | Suburb, Coast | TallBuilding, Store
TallBuilding | 0.580 | [images] | [images] | Store, Forest | LivingRoom, Coast
Street | 0.500 | [images] | [images] | Kitchen, Store | Highway, Suburb
Highway | 0.750 | [images] | [images] | Suburb, Industrial | Coast, Mountain
OpenCountry | 0.310 | [images] | [images] | Forest, Kitchen | Suburb, Coast
Coast | 0.670 | [images] | [images] | Mountain, Mountain | InsideCity, OpenCountry
Mountain | 0.700 | [images] | [images] | Store, Bedroom | Kitchen, Forest
Forest | 0.640 | [images] | [images] | Mountain, Street | Suburb, Kitchen