Scene Recognition with Bag of Words

The goal of this project is to introduce us to image recognition. Specifically, we examine the task of scene recognition, starting with very simple methods -- tiny images and nearest neighbor classification -- and then moving on to more advanced methods -- bags of quantized local features and linear classifiers learned by support vector machines. The pipeline is run for the following methods:

Analysis

Tiny images are one of the simplest possible image representations. Each image is downsampled to 16x16 and the resulting 256-dimensional vector is used as the feature vector. This is not a particularly good representation, because it discards all of the high-frequency image content and is not especially invariant to spatial or brightness shifts.
Nearest neighbor and k-NN classification are implemented with a KDTREE, which brings the search time per query down to roughly logarithmic in the number of training examples.
Bags of SIFT uses VL_SIFT to sample dense SIFT points and builds a histogram by finding, through a KDTREE, the nearest centroid for every SIFT feature. The bag of SIFT features is implemented with a cluster size of 200.
Linear SVM uses VL_SVM. The classifiers are tried over different values of the regularization parameter lambda.
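
The tiny image and bag of SIFT steps described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the project's code: the function names `tiny_image` and `bow_histogram` are hypothetical, the image is assumed to be a grayscale list of rows, the descriptors and centroids are assumed to be precomputed lists of numbers, and a brute-force nearest-centroid search stands in for the KDTREE lookup.

```python
def tiny_image(pixels, size=16):
    """Downsample a grayscale image (list of rows) to size x size by block
    averaging and flatten it into a size*size feature vector."""
    height, width = len(pixels), len(pixels[0])
    feature = []
    for by in range(size):
        y0 = by * height // size
        y1 = max((by + 1) * height // size, y0 + 1)
        for bx in range(size):
            x0 = bx * width // size
            x1 = max((bx + 1) * width // size, x0 + 1)
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            feature.append(sum(block) / len(block))
    return feature


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bow_histogram(descriptors, centroids):
    """Assign every descriptor to its nearest centroid (brute force here,
    standing in for a k-d tree) and return the normalized word histogram."""
    hist = [0.0] * len(centroids)
    for d in descriptors:
        nearest = min(range(len(centroids)),
                      key=lambda k: squared_distance(d, centroids[k]))
        hist[nearest] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

With a vocabulary of 200 centroids, `bow_histogram` would return a 200-bin histogram per image; normalizing by the number of descriptors keeps images with different numbers of features comparable.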

Highest Accuracy

The highest accuracy achieved so far is 78%, with Fisher encoding for cluster size 200 and a non-linear SVM with an RBF kernel and lambda = 10^-4.

Extra Credits Implemented



Vocabulary Size Analysis
Cross Validation Method
Scale-spaced features
Complementary features (GIST)
Fisher Encoding
Gaussian/RBF Kernel
Spatial Pyramids

Vocabulary Size Analysis

This section analyzes different feature types across classification algorithms and cluster sizes. The code was also run for cluster size 400, but given the trade-off between run time and accuracy I chose not to display those results.

Cross Validation

Cross validation is run for N = 100, 200, and 1000 iterations with bags of SIFT features and a linear SVM; the highest accuracy reported is 68%.
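
One way such a repeated evaluation could look is sketched below. It assumes that each iteration draws a fresh random train/test split and that the reported figure is an accuracy over those splits; the function name `random_split_cv`, the `evaluate` callback, and the split sizes are hypothetical and not taken from the project code.

```python
import random
import statistics


def random_split_cv(n_items, n_iters, train_size, test_size, evaluate, seed=0):
    """Repeat a random train/test split n_iters times.

    evaluate(train_idx, test_idx) is a caller-supplied function (hypothetical)
    that trains on train_idx, tests on test_idx, and returns an accuracy.
    Returns the mean and population standard deviation of the accuracies.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_iters):
        # Sample without replacement so the two index sets are disjoint.
        idx = rng.sample(range(n_items), train_size + test_size)
        train_idx, test_idx = idx[:train_size], idx[train_size:]
        accuracies.append(evaluate(train_idx, test_idx))
    return statistics.mean(accuracies), statistics.pstdev(accuracies)
```

Reporting the standard deviation alongside the mean shows how much the accuracy depends on the particular split.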

Scale-spaced features

We notice that multi-scale features (over 4 scales here) make little difference to the final accuracy.

Complementary features

We implement GIST features with the three classification algorithms and notice a marked jump in accuracy. This may be because GIST provides a low-dimensional representation of the scene that does not require any form of segmentation. Concatenated with the bag-of-words SIFT features, it gives high accuracy for cluster sizes 50-200.
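
The concatenation step can be sketched as follows. This is a minimal illustration that assumes both descriptors are already computed as plain lists of numbers; normalizing each part before joining is our assumption (so that neither descriptor dominates by scale), and the names `l2_normalize` and `concat_features` are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def concat_features(bow_hist, gist_desc):
    """Join a bag-of-words histogram and a GIST descriptor into one feature,
    normalizing each part separately first (an assumption, not the source's method)."""
    return l2_normalize(bow_hist) + l2_normalize(gist_desc)
```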

Fisher Encoding

For K = 200, Fisher encoding gave high accuracy, about 78%, with a Gaussian mixture model trained with 200 clusters.
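
For reference, the standard Fisher vector formulation for a diagonal-covariance Gaussian mixture model is given below. The symbols are our own notation, not taken from the project code: x_1, ..., x_N are the D-dimensional local descriptors of one image, and the mixture has K components with priors \pi_k, means \mu_k, and per-dimension standard deviations \sigma_k.

```latex
% Soft assignment (posterior) of descriptor x_i to component k
q_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}
              {\sum_{t=1}^{K} \pi_t \, \mathcal{N}(x_i \mid \mu_t, \sigma_t^2)}

% Mean and variance deviation terms for dimension j and component k
u_{jk} = \frac{1}{N\sqrt{\pi_k}} \sum_{i=1}^{N} q_{ik} \,
         \frac{x_{ji} - \mu_{jk}}{\sigma_{jk}}

v_{jk} = \frac{1}{N\sqrt{2\pi_k}} \sum_{i=1}^{N} q_{ik}
         \left[ \left( \frac{x_{ji} - \mu_{jk}}{\sigma_{jk}} \right)^{2} - 1 \right]
```

Stacking all u_{jk} and v_{jk} gives a vector of dimension 2KD. With K = 200, and assuming standard 128-dimensional SIFT descriptors, that is 51,200 dimensions per image, far larger than the 200-bin histogram of the plain bag of words.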

Gaussian/RBF Kernel

Primal SVM code is used with a self-implemented RBF kernel.
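
A minimal sketch of an RBF kernel matrix follows. It assumes the common form K(x, y) = exp(-gamma * ||x - y||^2); the function name `rbf_kernel` and the parameter `gamma` are hypothetical, and the project's own implementation may parameterize the kernel differently.

```python
import math


def rbf_kernel(X, Y, gamma):
    """Return the matrix K with K[i][j] = exp(-gamma * ||X[i] - Y[j]||^2),
    where X and Y are lists of equal-length feature vectors."""
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y))) for y in Y]
        for x in X
    ]
```

A kernelized SVM is trained on `rbf_kernel(train, train, gamma)` and evaluated on `rbf_kernel(test, train, gamma)`, so every test feature is compared against every training feature.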

Spatial Pyramid

Features are concatenated over levels 0 and 1, giving results comparable to the GIST descriptor features for cluster sizes 200 and 150.
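
The concatenation over levels 0 and 1 can be sketched as below. This is a simplified illustration: it assumes each local feature comes with an (x, y) position and a visual word index, uses plain unweighted concatenation (the per-level weights of the original spatial pyramid formulation are left out), and the name `spatial_pyramid` is hypothetical.

```python
def spatial_pyramid(points, words, width, height, vocab_size, levels=(0, 1)):
    """Build one histogram per grid cell at each pyramid level and concatenate.

    Level 0 is the whole image (1 cell); level 1 splits it into 2 x 2 cells.
    points: list of (x, y) feature positions; words: visual word index per point.
    """
    feature = []
    for level in levels:
        cells = 2 ** level
        hists = [[0.0] * vocab_size for _ in range(cells * cells)]
        for (x, y), word in zip(points, words):
            # Clamp so points on the right or bottom border fall in the last cell.
            cx = min(int(x * cells / width), cells - 1)
            cy = min(int(y * cells / height), cells - 1)
            hists[cy * cells + cx][word] += 1.0
        for hist in hists:
            feature.extend(hist)
    total = sum(feature)
    return [v / total for v in feature] if total else feature
```

For levels 0 and 1 the feature has 5 times the vocabulary size, for example 1000 dimensions for a vocabulary of 200.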

Results visualization for poorly performing recognition pipeline.

Accuracy (mean of diagonal of confusion matrix) is 0.261
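
The accuracy measure named above can be computed as in the following sketch. It assumes the confusion matrix is row-normalized, so that each diagonal entry is the per-category accuracy; the function names are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, categories):
    """Row-normalized confusion matrix: entry [i][j] is the fraction of test
    images of category i that were predicted as category j."""
    index = {name: i for i, name in enumerate(categories)}
    n = len(categories)
    counts = [[0] * n for _ in range(n)]
    for truth, guess in zip(true_labels, predicted_labels):
        counts[index[truth]][index[guess]] += 1
    matrix = []
    for row in counts:
        row_total = sum(row)
        matrix.append([c / row_total if row_total else 0.0 for c in row])
    return matrix


def mean_diagonal(matrix):
    """Average of the per-category accuracies on the diagonal."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)
```

Averaging the diagonal weights every category equally, regardless of how many test images each one has.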

Category name   Accuracy   False positives with true label   False negatives with wrong predicted label
Kitchen         0.480      Industrial, Forest                Highway, Forest
Store           0.340      Office, TallBuilding              TallBuilding, Coast
Bedroom         0.300      InsideCity, Store                 OpenCountry, Mountain
LivingRoom      0.250      Office, Kitchen                   Bedroom, Office
Office          0.590      Mountain, Bedroom                 Industrial, InsideCity
Industrial      0.330      LivingRoom, TallBuilding          InsideCity, Store
Suburb          0.570      Coast, InsideCity                 Kitchen, Street
InsideCity      0.300      Suburb, Coast                     TallBuilding, Store
TallBuilding    0.580      Store, Forest                     LivingRoom, Coast
Street          0.500      Kitchen, Store                    Highway, Suburb
Highway         0.750      Suburb, Industrial                Coast, Mountain
OpenCountry     0.310      Forest, Kitchen                   Suburb, Coast
Coast           0.670      Mountain, Mountain                InsideCity, OpenCountry
Mountain        0.700      Store, Bedroom                    Kitchen, Forest
Forest          0.640      Mountain, Street                  Suburb, Kitchen

(Image columns: Sample training images, Sample true positives.)

Akanksha Bindal

Project 4 / Scene Recognition with Bag of Words