Project 4 / Scene Recognition with Bag of Words

In this project, I implemented several image recognition algorithms. I started with simple methods, tiny images and nearest neighbor classification, and then moved to more advanced and better methods: bags of quantized local features and linear classifiers learned by support vector machines. All methods performed much better than chance, and the bag-of-features approach performed much better than the simple baseline.

Tiny Images and Nearest Neighbor Classification

I first implemented tiny images as features and used the nearest neighbor classifier to recognize the test images. The accuracy reached 19.9%, about 12.9% above chance. I set k to 1, which makes the classifier very sensitive to noise; the performance might be better with k set to 3 or 4.
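A minimal sketch of this baseline is shown below (MATLAB, assuming VLFeat is on the path; the function and variable names are illustrative, not the exact starter-code interface). Each image is resized to 16x16, flattened, and normalized, and each test image takes the label of its nearest training image.

    function predicted_categories = tiny_image_nn(train_image_paths, train_labels, test_image_paths)
        % Illustrative sketch of the tiny-image + 1-NN pipeline.
        train_feats = get_tiny_images(train_image_paths);   % N_train x 256
        test_feats  = get_tiny_images(test_image_paths);    % N_test  x 256

        % 1-NN: each test image takes the label of its closest training image.
        D = vl_alldist2(train_feats', test_feats');          % pairwise squared L2 distances
        [~, nn_idx] = min(D, [], 1);
        predicted_categories = train_labels(nn_idx);
    end

    function feats = get_tiny_images(image_paths)
        dim = 16;                                            % resize every image to 16x16
        feats = zeros(length(image_paths), dim * dim);
        for i = 1:length(image_paths)
            img = imread(image_paths{i});
            if size(img, 3) == 3
                img = rgb2gray(img);
            end
            tiny = imresize(im2single(img), [dim dim]);
            tiny = tiny(:)';
            tiny = tiny - mean(tiny);                        % zero mean
            tiny = tiny / (norm(tiny) + eps);                % unit length
            feats(i, :) = tiny;
        end
    end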

Bags of Quantized SIFT Features and Nearest Neighbor Classification

Then I replaced tiny images with quantized SIFT features. To extract the SIFT features, I used the vl_dsift() function from the VLFeat library and enabled 'fast' mode to speed it up considerably. I noticed that a larger step parameter makes the algorithm run much faster at the cost of lower accuracy. I tested step sizes from 30 down to 4, and by setting the step to 4 when building the vocabulary and to 3 in the bag-of-SIFT stage, I achieved the best performance: 51.6% accuracy. I used the accumarray() function to distribute features into histograms, which runs faster than distributing them manually in a for loop.
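A sketch of the histogram step is shown below (MATLAB with VLFeat; the vocabulary is assumed to have been built beforehand, e.g. by clustering densely sampled descriptors with vl_kmeans, and the variable names are illustrative rather than the exact starter-code interface).

    function image_feats = get_bags_of_sifts(image_paths, vocab, step)
        % vocab: vocab_size x 128 matrix of cluster centers (assumed precomputed)
        vocab_size = size(vocab, 1);
        image_feats = zeros(length(image_paths), vocab_size);

        for i = 1:length(image_paths)
            img = imread(image_paths{i});
            if size(img, 3) == 3
                img = rgb2gray(img);
            end
            % Dense SIFT; 'fast' trades a little accuracy for a large speedup,
            % and a larger step means fewer descriptors and a faster run.
            [~, sift] = vl_dsift(im2single(img), 'fast', 'step', step);

            % Assign each descriptor to its nearest vocabulary word.
            D = vl_alldist2(single(vocab'), single(sift));   % vocab_size x num_descriptors
            [~, assignments] = min(D, [], 1);

            % accumarray builds the histogram in one vectorized call instead of
            % a per-descriptor for loop.
            h = accumarray(assignments', 1, [vocab_size 1]);
            image_feats(i, :) = h / sum(h);                  % normalize the histogram
        end
    end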

Bags of Quantized SIFT Features and SVM

Finally, I replaced kNN with 1-vs-all linear SVMs, since SVMs are much more robust to noise than kNN. I used vl_svmtrain from the VLFeat library to train the SVM classifiers, and I tuned the lambda parameter and the step size in the bag-of-SIFT stage to get the best accuracy. After testing lambda from 1 down to 0.00001, I found that smaller lambda values tended to perform better. If running time is not a concern, setting the vocabulary-building step size to 4, the bag-of-SIFT step size to 3, and lambda to 0.00001 gives the best performance: 65.5% accuracy. The detailed classification results are presented below. The mistakes are on images that are genuinely hard to classify; for example, it is hard to tell the difference between a tall building and a large pipe in an industrial scene, since both are tall, straight structures. To make sure the algorithm runs within 10 minutes, I set the step sizes back to 10 and 8, in which case the accuracy is normally between 55% and 60%.
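A sketch of the 1-vs-all training and prediction is shown below (MATLAB with VLFeat; the feature matrices and label cell arrays follow the same layout as above, and the names are illustrative). One binary linear SVM is trained per category, and each test image is assigned the category whose SVM gives the highest score.

    function predicted_categories = svm_classify(train_feats, train_labels, test_feats, lambda)
        categories = unique(train_labels);                   % cell array of category names
        num_categories = length(categories);

        W = zeros(size(train_feats, 2), num_categories);
        B = zeros(1, num_categories);

        % Train one binary linear SVM per category: that category vs. all others.
        for c = 1:num_categories
            binary_labels = -ones(length(train_labels), 1);
            binary_labels(strcmp(train_labels, categories{c})) = 1;
            [W(:, c), B(c)] = vl_svmtrain(train_feats', binary_labels, lambda);
        end

        % Each test image gets the category whose SVM scores it highest.
        scores = test_feats * W + repmat(B, size(test_feats, 1), 1);
        [~, idx] = max(scores, [], 2);
        predicted_categories = categories(idx);
    end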

Scene classification results visualization


Accuracy (mean of diagonal of confusion matrix) is 0.655

Per-category results (the sample training images and sample true positives are image-only and omitted here; for each category, the table lists the true labels of the false-positive examples and the wrong predicted labels of the false-negative examples):

Category name    Accuracy    False positives (true label)    False negatives (wrong predicted label)
Kitchen          0.640       LivingRoom, LivingRoom          Bedroom, InsideCity
Store            0.520       Bedroom, Kitchen                Street, InsideCity
Bedroom          0.330       LivingRoom, LivingRoom          Kitchen, LivingRoom
LivingRoom       0.270       Bedroom, Bedroom                TallBuilding, Office
Office           0.950       LivingRoom, LivingRoom          Bedroom, Kitchen
Industrial       0.380       TallBuilding, InsideCity        TallBuilding, LivingRoom
Suburb           0.970       Industrial, LivingRoom          InsideCity, OpenCountry
InsideCity       0.560       Store, Industrial               Store, LivingRoom
TallBuilding     0.790       Bedroom, Industrial             Kitchen, Office
Street           0.630       Store, OpenCountry              TallBuilding, Highway
Highway          0.770       Industrial, Street              Coast, Street
OpenCountry      0.410       Coast, Coast                    TallBuilding, Coast
Coast            0.820       OpenCountry, OpenCountry        Suburb, Highway
Mountain         0.850       Street, LivingRoom              Forest, Suburb
Forest           0.930       Highway, Store                  Mountain, Mountain