Project 4 / Scene Recognition with Bag of Words

The goal of this project is to classify scenes into one of 15 categories by training and testing on the 15 scene database, using different image representations and classifiers:

  1. Tiny images representation and nearest neighbor classifier
  2. Bag of SIFT representation and nearest neighbor classifier
  3. Bag of SIFT representation and linear SVM classifier
  4. Bag of SIFT representation (Spatial Pyramid) and linear SVM classifier
  5. SIFT representation and naive bayes nearest neighbor classifier
  6. Fisher encoding for SIFT features and linear SVM classifier

Tiny images representation and nearest neighbor classifier

The "tiny image" feature is one of the simplest possible image representations. I resize each image to 16x16, then normalize it to zero mean and unit length and use the result as the representation.

Next, calculate the L2 distance between every test image and every train image, find the nearest train image, and assign its label to the test image.
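
The two steps above can be sketched in a few lines of NumPy. This is only an illustration, not the code used for the reported numbers: the function names are hypothetical, and the resize is a plain nearest-pixel sampling that stands in for whatever image resize routine was actually used.

```python
import numpy as np

def tiny_image(img, size=16):
    """Turn a 2-D grayscale image into a zero-mean, unit-length tiny image."""
    # Assumption: nearest-pixel sampling as a stand-in for a real image resize.
    rows = np.linspace(0, img.shape[0] - 1, size).round().astype(int)
    cols = np.linspace(0, img.shape[1] - 1, size).round().astype(int)
    feat = img[np.ix_(rows, cols)].astype(float).ravel()
    feat -= feat.mean()                      # zero mean
    norm = np.linalg.norm(feat)
    return feat / norm if norm > 0 else feat  # unit length

def nearest_neighbor_classify(train_feats, train_labels, test_feats):
    """Give each test feature the label of its closest train feature (L2)."""
    # Squared L2 distances via ||a||^2 - 2 a.b + ||b||^2, shape (n_test, n_train).
    d2 = (np.sum(test_feats ** 2, axis=1)[:, None]
          - 2.0 * test_feats @ train_feats.T
          + np.sum(train_feats ** 2, axis=1)[None, :])
    nearest = np.argmin(d2, axis=1)
    return [train_labels[i] for i in nearest]
```

Ranking by squared distance gives the same nearest neighbor as ranking by the L2 distance itself, so the square root is skipped.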

Accuracy is 0.224

Bag of SIFT and 1-NN

Build a vocabulary of visual words by extracting SIFT features from the training set with step size=15, then clustering them into 200 visual words with k-means.
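
As a rough illustration of the clustering step, the sketch below runs plain Lloyd's k-means on an array of descriptors. It assumes the SIFT descriptors have already been extracted into an (N, D) array; `build_vocabulary`, `n_iters`, and `seed` are hypothetical names and settings, and a library k-means would normally be used instead.

```python
import numpy as np

def build_vocabulary(descriptors, vocab_size, n_iters=20, seed=0):
    """Cluster (N, D) descriptors into vocab_size centroids (visual words)."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize centroids from randomly chosen descriptors.
    picks = rng.choice(len(descriptors), size=vocab_size, replace=False)
    centers = descriptors[picks].astype(float)
    for _ in range(n_iters):
        # Squared distance from every descriptor to every centroid: (N, K).
        # This builds an (N, K, D) array, which is fine only for small inputs.
        d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for k in range(vocab_size):
            members = descriptors[assign == k]
            if len(members) > 0:             # keep the old centroid if empty
                centers[k] = members.mean(axis=0)
    return centers
```

The returned array has one row per visual word, so a 200-word vocabulary over 128-dimensional SIFT descriptors would have shape (200, 128).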

Represent training and testing images as histograms of visual words. For each image we will densely sample many SIFT descriptors with step size=5. Instead of storing hundreds of SIFT descriptors, we simply count how many SIFT descriptors fall into each cluster in our visual word vocabulary. This is done by finding the nearest neighbor kmeans centroid for every SIFT feature.
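
The counting step can be sketched as follows, assuming the descriptors of one image and the vocabulary centroids are NumPy arrays. The function name is hypothetical, and normalizing the histogram to sum to 1 is an assumption that the write-up does not state.

```python
import numpy as np

def bag_of_words_histogram(descriptors, vocab):
    """Histogram of nearest visual words for one image.

    descriptors: (N, D) local features of the image
    vocab:       (K, D) k-means centroids
    """
    # Squared distance from every descriptor to every centroid: (N, K).
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)                # nearest centroid per descriptor
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    # Assumption: normalize so images with different descriptor counts compare.
    return hist / max(hist.sum(), 1.0)
```

Each image is then described by a single vector of length K, regardless of how many descriptors were sampled from it.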

Use 1-NN to classify each test image

Accuracy is 0.514

Bag of SIFT and Linear SVM

Build a vocabulary of visual words by extracting SIFT features from the training set with step size=15, then clustering them into n visual words with k-means, where n is the vocabulary size.

Represent training and testing images as histograms of visual words, as in the 1-NN pipeline: densely sample SIFT descriptors from each image with step size=5, assign each descriptor to its nearest k-means centroid, and count how many descriptors fall into each cluster.

Use Linear SVM to classify each test image (Lambda=1e-6)
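
For readers unfamiliar with the classifier, here is a minimal one-vs-rest linear SVM sketch. It assumes Lambda is the weight of the L2 regularizer in the objective lam/2 * ||w||^2 + mean hinge loss, and it uses plain batch subgradient descent; the solver, function names, learning rate, and epoch count are all illustrative assumptions, not the setup used for the reported results.

```python
import numpy as np

def train_linear_svm(x, y, lam=1e-6, n_epochs=200, lr=0.1):
    """Binary linear SVM; y must contain +1 / -1."""
    n, d = x.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_epochs):
        margins = y * (x @ w + b)
        active = margins < 1                 # samples violating the margin
        grad_w = lam * w - (y[active, None] * x[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def train_one_vs_rest(x, labels, n_classes, **kwargs):
    """Train one binary SVM per class (that class vs. all others)."""
    models = [train_linear_svm(x, np.where(labels == c, 1.0, -1.0), **kwargs)
              for c in range(n_classes)]
    weights = np.stack([m[0] for m in models])
    biases = np.array([m[1] for m in models])
    return weights, biases

def predict(x, weights, biases):
    """Pick the class whose SVM gives the highest score."""
    return np.argmax(x @ weights.T + biases, axis=1)
```

With 15 scene categories this trains 15 binary classifiers, and a test histogram is assigned to the category with the most confident score.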

Vocab_size   10     20     50     100    200    400    1000
Accuracy     .394   .499   .598   .655   .664   .699   .713

Best accuracy is 0.713 (vocabulary size 1000)

Bag of SIFT representation (Single Spatial Level) and linear SVM classifier

Partition the image evenly into patches, build one histogram for each patch, then concatenate the histograms.

Vocab_size           100    200    400    500    1000
Accuracy(2 levels)   .709   .736   .743   .749   .749
Accuracy(3 levels)   .733   .749   .744   .751   .745

Bag of SIFT representation (Spatial Pyramid) and linear SVM classifier

Instead of building one histogram for each image, partition the image at each pyramid level and build 1+4+16+... histograms per image (one per cell at every level), then concatenate them.
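
A sketch of the concatenation, under stated assumptions: each descriptor comes with its (x, y) position and its visual word index, level l splits the image into a 2^l x 2^l grid, and all levels are weighted equally. The write-up does not say whether per-level weights were used or exactly how levels are counted, so `spatial_pyramid_histogram` and its parameters are illustrative only.

```python
import numpy as np

def spatial_pyramid_histogram(xy, words, width, height, vocab_size, n_levels=3):
    """Concatenate per-cell word histograms over pyramid levels 0..n_levels-1.

    xy:    (N, 2) descriptor positions as (x, y) in pixels
    words: (N,)   visual word index of each descriptor
    """
    hists = []
    for level in range(n_levels):
        cells = 2 ** level                   # grid is cells x cells
        cx = np.minimum((xy[:, 0] * cells / width).astype(int), cells - 1)
        cy = np.minimum((xy[:, 1] * cells / height).astype(int), cells - 1)
        cell_id = cy * cells + cx
        # One bin per (cell, word) pair at this level.
        flat = cell_id * vocab_size + words
        hists.append(np.bincount(flat, minlength=cells * cells * vocab_size))
    feat = np.concatenate(hists).astype(float)
    # Assumption: normalize the whole vector to sum to 1.
    return feat / max(feat.sum(), 1.0)
```

With three levels the feature has vocab_size * (1 + 4 + 16) entries, which is why larger vocabularies become expensive quickly.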

Vocab_size           100    200    400    500    1000
Accuracy(2 levels)   .741   .751   .763   .771   .762
Accuracy(3 levels)   .751   .757   .757   .749   .741

Best accuracy is 0.771 (vocabulary size 500, 2 levels)

Fisher encoding for SIFT features and linear SVM classifier

Use Gaussian mixture model (GMM) clustering instead of k-means clustering, and Fisher encoding to represent the images.
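
To make the encoding concrete, the sketch below computes a Fisher vector from the descriptors of one image, given an already fitted diagonal-covariance GMM. It follows the commonly used formulation (first- and second-order deviations from each component, weighted by the posteriors); the function name is hypothetical, and the final signed square root and L2 normalization are assumptions, since the write-up does not say which normalization was applied.

```python
import numpy as np

def fisher_vector(x, weights, means, variances):
    """Fisher encoding of descriptors x (N, D) under a diagonal GMM.

    weights: (K,) mixture priors; means, variances: (K, D).
    Returns a vector of length 2 * K * D.
    """
    n = x.shape[0]
    sigma = np.sqrt(variances)
    # Standardized deviation of each descriptor from each component: (N, K, D).
    diff = (x[:, None, :] - means[None, :, :]) / sigma[None, :, :]
    # Log posterior of each component, up to a per-descriptor constant.
    log_p = -0.5 * ((diff ** 2).sum(axis=2)
                    + np.log(2 * np.pi * variances).sum(axis=1)[None, :])
    log_post = np.log(weights)[None, :] + log_p
    log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
    q = np.exp(log_post)
    q /= q.sum(axis=1, keepdims=True)                 # posteriors, (N, K)
    # First- and second-order statistics per component.
    u = np.einsum("nk,nkd->kd", q, diff) / (n * np.sqrt(weights)[:, None])
    v = np.einsum("nk,nkd->kd", q, diff ** 2 - 1) / (n * np.sqrt(2 * weights)[:, None])
    fv = np.concatenate([u.ravel(), v.ravel()])
    # Assumption: signed square root followed by L2 normalization.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv
```

Unlike a bag-of-words histogram, which only counts assignments, this vector also records how the descriptors deviate from each component, which is one plausible reason it does better with the same linear SVM.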

Accuracy is 0.821

SIFT representation and naive bayes nearest neighbor classifier

Because the algorithm runs too slowly, I had to sample SIFT features more sparsely, with stepSize 35, 40, and 50.

stepSize                         35     40     50
Accuracy (bag of SIFT + SVM)     .447   .401   .348
Accuracy (SIFT + NBNN)           .518   .514   .383

Best accuracy is 0.518 (NBNN, stepSize 35)

NBNN performs slightly better than bag of SIFT+SVM at the same stepSize. If I could sample features more densely, I think NBNN would give a much better result.
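
For reference, the NBNN decision rule as it is usually described can be sketched like this: pool all training descriptors of each class, and for a test image sum, over its descriptors, the squared distance to the nearest descriptor in each class; the class with the smallest total wins. The function name and the brute-force search are illustrative assumptions, and the brute-force search is exactly what makes the method slow with dense sampling.

```python
import numpy as np

def nbnn_classify(test_descs, class_descs):
    """Naive Bayes nearest neighbor classification of one image.

    test_descs:  (N, D) descriptors of the test image
    class_descs: list with one (M_c, D) array of pooled training
                 descriptors per class
    Returns the index of the predicted class.
    """
    totals = []
    for descs in class_descs:
        # Squared distance from each test descriptor to each class descriptor.
        d2 = ((test_descs[:, None, :] - descs[None, :, :]) ** 2).sum(axis=2)
        # Distance to the nearest neighbor in this class, summed over the image.
        totals.append(d2.min(axis=1).sum())
    return int(np.argmin(totals))
```

No vocabulary is built and no descriptor is quantized, so the cost grows with the number of test descriptors times the number of training descriptors.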

Results visualization for best performing recognition pipeline.

Scene classification results visualization


Accuracy (mean of diagonal of confusion matrix) is 0.821
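
The reported number is the mean of the diagonal of the row-normalized confusion matrix, which is the average of the per-category accuracies. A minimal sketch, assuming integer class labels and a hypothetical function name:

```python
import numpy as np

def mean_per_class_accuracy(true_labels, pred_labels, n_classes):
    """Row-normalized confusion matrix and the mean of its diagonal."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    confusion = np.zeros((n_classes, n_classes))
    # Rows are true classes, columns are predicted classes.
    np.add.at(confusion, (true_labels, pred_labels), 1)
    row_sums = confusion.sum(axis=1, keepdims=True)
    confusion = confusion / np.maximum(row_sums, 1)   # avoid division by zero
    return confusion, float(np.mean(np.diag(confusion)))
```

Averaging the 15 per-category accuracies listed in the table gives the same 0.821.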

Per-category results. For each category, the table gives the true labels of the two sample false positives and the wrongly predicted labels of the two sample false negatives (the sample training images and sample true positives are images only).

Category name   Accuracy   False positives with true label   False negatives with wrong predicted label
Kitchen         0.730      Bedroom, LivingRoom               Office, Industrial
Store           0.830      Industrial, Industrial            InsideCity, InsideCity
Bedroom         0.590      LivingRoom, LivingRoom            Industrial, TallBuilding
LivingRoom      0.570      Bedroom, Store                    Bedroom, Bedroom
Office          0.980      Kitchen, LivingRoom               Kitchen, Bedroom
Industrial      0.820      Bedroom, Bedroom                  Suburb, Store
Suburb          1.000      Store, OpenCountry                (none)
InsideCity      0.930      TallBuilding, Street              TallBuilding, Street
TallBuilding    0.870      Industrial, Bedroom               Forest, Mountain
Street          0.810      TallBuilding, TallBuilding        Highway, InsideCity
Highway         0.850      Street, Street                    Coast, OpenCountry
OpenCountry     0.590      Coast, Mountain                   Coast, Coast
Coast           0.890      Highway, OpenCountry              OpenCountry, Mountain
Mountain        0.900      OpenCountry, OpenCountry          OpenCountry, Forest
Forest          0.960      OpenCountry, OpenCountry          Mountain, Mountain