Project 4 / Scene Recognition with Bag of Words

Algorithm Description

Tiny Images Features

The Tiny Images classifier is quite simple, in that each image is just shrunk to a 16x16 image (without accounting for aspect ratio), and the pixel values of that 16x16 image are flattened into a 256 entry vector. This vector is then normalized to unit length and also made to have a zero mean.

Building the Vocabulary for Bag of Sifts

While buiding the vocabulary, to prevent overfitting as well as to decrease the runtime, half of the images are chosen at random to be sampled. From there, the SIFT features for each image are calculated, and 20 of those features are chosen to be added to the SIFT features pool. 20 was chosen as it provided the best accuracy for the memory usage. From there, the K-means clustering algorithm is performed, clustering the pool of SIFT features into "vocabulary size" clusters. The centers of these clusters are then returned as the dictionary. After testing, it became apparent that the performance increase due to a window size of 10 and the "fast" parameter being set for the SIFT algorithm outweighed the slight decrease in accuracy. Because of the randomness in the vocabulary building function, it is virtually impossible to recreate the exact same vocabulary. However, the difference in accuracy between different vocabularies should be 1 to 2% at a maximum.

Bag of SIFTs Feature

After the vocabulary is loaded into the script, the SIFT features for each image is calculated. After testing, it appeared that having a step size of 8 as well as the fast flag provided the best tradeoff in performance and accuracy. Then the Euclidean distance between each SIFT feature in each image and each visual word in the vocabulary is calculated. After this, the nearest words for each SIFT feature in each image are clustered into a histogram for each image. This histogram (with vocabulary size buckets) is then normalized (to add a level of size invariance) and used as the feature for its respective image.

Nearest Neighbor Classifier

This simple classifier calculates the Euclidean distance between the respective image's features and the training image's features. The predicted categories for each image feature are simply the categories that the repsective image's features are closest to in Euclidean distance.

SVM Classifier

After the categories are determined, for each category, the training features' labels are first checked to see if they are part of the specified category. Then, a matrix is created linking each training feature to a 1 if the feature is in the specified category and -1 otherwise. This is then passed into the SVM training function to create the W vector and the B value for the specific category.

Then, for each test image feature passed into the function, the feature is checked using the W and B values for each category. The category that yields the highest value in the function W*Feat + B is then the predicted category.

Results

Results for Tiny Images Features with Nearest Neighbor: Accuracy (mean of diagonal of confusion matrix) is 0.198

Bag of SIFT Features with Nearest Neighbor: Accuracy (mean of diagonal of confusion matrix) is 0.526

Best Results: Bag of SIFT Features with SVM

Accuracy (mean of diagonal of confusion matrix) is 0.639

Scene classification results visualization


Accuracy (mean of diagonal of confusion matrix) is 0.639

Category name Accuracy Sample training images Sample true positives False positives with true label False negatives with wrong predicted label
Kitchen 0.580
Store

LivingRoom

Street

InsideCity
Store 0.560
LivingRoom

Kitchen

Industrial

Highway
Bedroom 0.440
Street

LivingRoom

Office

Kitchen
LivingRoom 0.260
Office

Bedroom

Bedroom

Bedroom
Office 0.790
Bedroom

Kitchen

LivingRoom

Kitchen
Industrial 0.490
Bedroom

Kitchen

Store

Store
Suburb 0.920
TallBuilding

Highway

Store

LivingRoom
InsideCity 0.510
Highway

TallBuilding

Suburb

Store
TallBuilding 0.660
InsideCity

Street

InsideCity

Industrial
Street 0.600
Industrial

InsideCity

Suburb

InsideCity
Highway 0.760
Industrial

Street

Suburb

Store
OpenCountry 0.510
Mountain

Coast

Coast

Highway
Coast 0.740
OpenCountry

OpenCountry

Bedroom

Mountain
Mountain 0.750
InsideCity

TallBuilding

Coast

Coast
Forest 0.920
OpenCountry

OpenCountry

Store

Street
Category name Accuracy Sample training images Sample true positives False positives with true label False negatives with wrong predicted label

In the end, the best-functioning pipeline was the linear SVM coupled with the Bag of SIFTs features. After testing, the classification of the linear SVM yielded the highest accuracy in comparison to the Nearest Neighbor classifier.