Project 4 / Scene Recognition with Bag of Words

My implementation differed from the instructions only in that I implemented k-nearest neighbors rather than plain nearest neighbor. However, the extra effort clearly does not pay off, as the following table illustrates. It shows the mean accuracy with bag-of-sift features for several values of k:

k              1      2      3      4      5
mean accuracy  0.497  0.476  0.443  0.419  0.390

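I suspect that this decrease in performance is due to the boundaries between classes being highly irregular (discontinuous) in feature space, so that each additional neighbor is more likely to cast a vote for the wrong class.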
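For reference, the voting step can be sketched as follows. This is a minimal illustration and not the code used for the results above: the function name `knn_predict`, the Euclidean distance, and the rule for breaking ties (prefer the closest neighbor) are all assumptions, and the 2-D toy features stand in for real bag-of-sift histograms.

```python
from collections import Counter
import math


def knn_predict(train_feats, train_labels, query, k=1):
    """Label `query` by majority vote among its k nearest training features."""
    # Euclidean distance from the query to every training feature.
    dists = [math.dist(f, query) for f in train_feats]
    # Indices of the k closest training examples, nearest first.
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    top = max(votes.values())
    # Assumed tie-break: among the tied labels, take the closest neighbor's.
    for i in nearest:
        if votes[train_labels[i]] == top:
            return train_labels[i]


# Toy data (hypothetical): two Forest examples near the query, three Coast far away.
train_feats = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [0.9, 1.0]]
train_labels = ["Forest", "Forest", "Coast", "Coast", "Coast"]
query = [0.2, 0.1]

pred_k1 = knn_predict(train_feats, train_labels, query, k=1)
pred_k5 = knn_predict(train_feats, train_labels, query, k=5)
```

In this toy case `pred_k1` is "Forest" and `pred_k5` is "Coast": once k is large enough, the three distant Coast examples outvote the two nearby Forest ones, which is one way a larger k can hurt.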

Here's a full report for k-nearest-neighbors with the best value for k, k=1, with bag-of-sift features.

Confusion Matrix


Accuracy (mean of diagonal of confusion matrix) is 0.497
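The accuracy figure can be reproduced from predictions roughly as follows. This is a sketch under assumptions: the helper names `confusion_matrix` and `mean_accuracy` are hypothetical, and rows are assumed to be true categories normalized to sum to 1. The final check uses the per-category accuracies reported in this write-up.

```python
def confusion_matrix(true_labels, predicted_labels, categories):
    """Row-normalized confusion matrix: rows are true classes, columns predictions."""
    index = {c: i for i, c in enumerate(categories)}
    n = len(categories)
    counts = [[0] * n for _ in range(n)]
    for t, p in zip(true_labels, predicted_labels):
        counts[index[t]][index[p]] += 1
    # Normalize each row so entry (i, j) is the fraction of class i predicted as j.
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def mean_accuracy(matrix):
    """Mean of the diagonal, i.e. the average per-category accuracy."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


# Toy example (hypothetical labels): 3 of 4 Forest and 1 of 2 Coast images correct.
categories = ["Forest", "Coast"]
true_labels = ["Forest", "Forest", "Forest", "Forest", "Coast", "Coast"]
predicted = ["Forest", "Forest", "Forest", "Coast", "Coast", "Forest"]
toy_matrix = confusion_matrix(true_labels, predicted, categories)
toy_accuracy = mean_accuracy(toy_matrix)  # (0.75 + 0.5) / 2

# Per-category accuracies reported in this write-up (k=1, bag-of-sift).
reported = [0.420, 0.460, 0.220, 0.480, 0.770, 0.270, 0.850, 0.260,
            0.330, 0.530, 0.680, 0.440, 0.410, 0.500, 0.840]
reported_mean = round(sum(reported) / len(reported), 3)
```

Averaging the fifteen reported per-category accuracies gives 0.497, which agrees with the figure above.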

Category      Accuracy  False positives with true label  False negatives with wrong predicted label
Kitchen       0.420     Industrial, Office               Office, Office
Store         0.460     TallBuilding, Industrial         Kitchen, LivingRoom
Bedroom       0.220     LivingRoom, Store                LivingRoom, Office
LivingRoom    0.480     Highway, Street                  Street, Bedroom
Office        0.770     Bedroom, Kitchen                 Bedroom, Bedroom
Industrial    0.270     TallBuilding, Street             Street, TallBuilding
Suburb        0.850     Mountain, Bedroom                Industrial, Coast
InsideCity    0.260     Street, Store                    Kitchen, Store
TallBuilding  0.330     Industrial, Bedroom              Industrial, LivingRoom
Street        0.530     Highway, Highway                 Store, TallBuilding
Highway       0.680     Coast, Industrial                Coast, Mountain
OpenCountry   0.440     Mountain, Coast                  Mountain, Suburb
Coast         0.410     Highway, Highway                 Suburb, OpenCountry
Mountain      0.500     Industrial, Highway              Suburb, TallBuilding
Forest        0.840     Mountain, Mountain               Store, OpenCountry

As you can see, nearest-neighbor with bag-of-sift performs well on categories that are "canonical", such as Forest (0.840). They are canonical in that most forests look the same: many trees, many leaves. Forests tend to contain many high-frequency features (leaves, branches) that SIFT encodes well, and thus can be recognized with a high degree of certainty. It performs well on Highway (0.680) as well, for the same reasons.
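To make the feature concrete, here is a minimal sketch of how a bag-of-words histogram can be formed from local descriptors. It is an illustration under assumptions rather than the code behind these results: the name `bag_of_words_histogram` is hypothetical, the vocabulary is assumed to be a list of cluster centers computed beforehand (for example with k-means), the L1 normalization is one common choice, and 2-D toy vectors stand in for 128-dimensional SIFT descriptors.

```python
import math


def bag_of_words_histogram(descriptors, vocabulary):
    """Normalized histogram of nearest-visual-word assignments for one image."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        # Assign each local descriptor to its closest visual word.
        nearest = min(range(len(vocabulary)),
                      key=lambda j: math.dist(d, vocabulary[j]))
        hist[nearest] += 1.0
    total = sum(hist)
    # L1-normalize so images with different descriptor counts are comparable.
    return [h / total for h in hist] if total else hist


# Toy vocabulary of two "visual words" and four descriptors from one image.
vocabulary = [[0.0, 0.0], [1.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 1.0], [1.0, 0.8], [1.2, 1.1]]
hist = bag_of_words_histogram(descriptors, vocabulary)
```

Here one descriptor falls on the first word and three on the second, so `hist` is `[0.25, 0.75]`. An image dominated by one kind of texture, as forest images are, concentrates its mass on a few words, which makes its histogram easy to match.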

The Suburb images, however, show inexplicably high performance (0.850, the highest of any category). Most suburban houses do not look the same from the standpoint of the algorithm, so I have no explanation for this result.

Running 1-nearest-neighbor on tiny images gave a mean accuracy of about 13%. I will not say more about this particular algorithm because it is not interesting.

Here's the mean accuracy of 1-vs-all SVM linear classifiers using bag-of-sift features, varying the value of lambda.

lambda         10     0.1    0.001  0.0001  0.00001  0.000001
mean accuracy  0.345  0.172  0.419  0.545   0.635    0.649

The accuracy tapers off for values of lambda below 1e-6. This optimum probably has to do with the scale of the feature data, and the same value probably works for any feature data on the same scale.
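For readers unfamiliar with the setup, a 1-vs-all linear SVM can be sketched as below. This write-up does not say which solver was used, so everything here is an assumption: the objective is taken to be lambda/2 * ||w||^2 plus the mean hinge loss, it is minimized by plain batch subgradient descent with a fixed step size (which is only stable when `lr * lam` is well below 1), and the function names and one-hot toy histograms are hypothetical.

```python
def train_linear_svm(feats, signs, lam, epochs=200, lr=0.1):
    """Minimize lam/2 * ||w||^2 + mean hinge loss by batch subgradient descent."""
    dim, n = len(feats[0]), len(feats)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(feats, signs):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:  # example violates the margin
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def train_one_vs_all(feats, labels, categories, lam):
    """One binary classifier per category: that category (+1) against the rest (-1)."""
    models = {}
    for c in categories:
        signs = [1.0 if label == c else -1.0 for label in labels]
        models[c] = train_linear_svm(feats, signs, lam)
    return models


def predict_one_vs_all(models, x):
    """Pick the category whose classifier gives the highest score w.x + b."""
    def score(c):
        w, b = models[c]
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(models, key=score)


# Toy histograms (hypothetical): one training example per category.
categories = ["Forest", "Suburb", "Office"]
feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
labels = ["Forest", "Suburb", "Office"]
models = train_one_vs_all(feats, labels, categories, lam=1e-4)
train_preds = [predict_one_vs_all(models, x) for x in feats]
query_pred = predict_one_vs_all(models, [0.8, 0.1, 0.1])
```

Under this assumed objective, multiplying every feature by a constant s has the same effect as dividing lambda by s squared, which is consistent with the suggestion that the best lambda is tied to the scale of the features.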