Project 4: Scene Recognition with Bag of Words

This project classifies images into different scene categories. Several combinations of feature descriptors and classification algorithms are implemented.

Tiny images representation and nearest neighbor classifier

Each tiny image is normalized by subtracting its mean and dividing by the norm of the 16*16 vector, and a 7-nearest neighbor classifier is used. The accuracy (mean of the diagonal of the confusion matrix) is 24.7%.
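The normalization and 7-nearest neighbor vote described above can be sketched as follows. This is a minimal illustration in plain Python rather than the project's MATLAB code; the function names `normalize` and `knn_predict`, the Euclidean distance metric, and the zero-norm guard are assumptions, not details taken from the original implementation.

```python
import math
from collections import Counter

def normalize(vec):
    """Subtract the mean, then divide by the Euclidean norm."""
    mean = sum(vec) / len(vec)
    centered = [v - mean for v in vec]
    norm = math.sqrt(sum(c * c for c in centered))
    # Assumption: a constant image (zero norm) is returned as all zeros.
    return [c / norm for c in centered] if norm > 0 else centered

def knn_predict(train_feats, train_labels, query, k=7):
    """Majority vote among the k training features closest to the query."""
    # Assumption: Euclidean distance; the report does not name the metric.
    ranked = sorted(
        (math.dist(feat, query), label)
        for feat, label in zip(train_feats, train_labels)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Here a flattened 16*16 tiny image would be passed to `normalize` as a list of 256 values, and the normalized training vectors would be passed to `knn_predict` together with their category labels.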

Bag of SIFT and nearest neighbor classifier

First, we try the basic bag of SIFT features (i.e. hard assignment to the nearest vocabulary word, without features from different scales and without a spatial pyramid). We choose a SIFT bin size of 4px (2*2 bins), build the vocabulary (200 words) with a step size of 32px, and scan the train and test images with a step size of 16px. With a 7-nearest neighbor classifier, the accuracy is 47.9%.
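As a rough illustration of hard assignment, the sketch below counts, for each vocabulary word, how many descriptors have it as their nearest cluster center. It is written in plain Python, not the project's MATLAB; the name `hard_assignment_histogram` and the final normalization of the histogram are assumptions.

```python
import math

def hard_assignment_histogram(descriptors, vocab):
    """Count the descriptors whose nearest vocabulary word is each word."""
    hist = [0.0] * len(vocab)
    for d in descriptors:
        # Index of the closest cluster center in Euclidean distance.
        nearest = min(range(len(vocab)), key=lambda i: math.dist(d, vocab[i]))
        hist[nearest] += 1.0
    total = sum(hist)
    # Assumption: normalize so images with different numbers of
    # descriptors give comparable histograms.
    return [h / total for h in hist] if total > 0 else hist
```

With a 200-word vocabulary, each image would be represented by a 200-dimensional histogram of this kind.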

Next we add features from different scales. We build a 3-level Gaussian pyramid by first applying a 5*5 Gaussian filter with standard deviation 1 and then downsampling the image by 2. With a 7-nearest neighbor classifier, the accuracy increases to 50.9%.
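The pyramid construction can be sketched in plain Python as below. The image is assumed to be a list of rows of floats; treating the original image as the first level, replicating border pixels, and the function names are all assumptions made for the sketch. The 5*5 filter is applied as two passes of a 1D kernel, which is equivalent because a Gaussian is separable.

```python
import math

def gaussian_kernel_1d(size=5, sigma=1.0):
    """Normalized 1D Gaussian; its outer product is the 5*5 filter."""
    half = size // 2
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def clamp(i, n):
    """Replicate border pixels (border handling is an assumption)."""
    return min(max(i, 0), n - 1)

def blur(image, kernel):
    """Separable blur: filter along rows, then along columns."""
    h, w = len(image), len(image[0])
    half = len(kernel) // 2
    rows = [[sum(k * image[y][clamp(x + j - half, w)] for j, k in enumerate(kernel))
             for x in range(w)] for y in range(h)]
    return [[sum(k * rows[clamp(y + j - half, h)][x] for j, k in enumerate(kernel))
             for x in range(w)] for y in range(h)]

def gaussian_pyramid(image, levels=3):
    """Blur with the 5*5, sigma=1 filter, then keep every second pixel."""
    kernel = gaussian_kernel_1d(5, 1.0)
    pyramid = [image]  # assumption: level 0 is the original image
    for _ in range(levels - 1):
        blurred = blur(pyramid[-1], kernel)
        pyramid.append([row[::2] for row in blurred[::2]])
    return pyramid
```

SIFT features would then be sampled from every level and pooled into the same histogram.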

Bag of SIFT and 1 vs all linear SVM

We further add spatial information by building a spatial pyramid of 3 levels and concatenating the features of the 3 levels. We also use "soft assignment" (kernel codebook encoding) instead of "hard assignment" to assign visual words to histogram bins, so that each SIFT feature casts a distance-weighted vote to multiple bins. The weight function is exp(-0.5*dist(x,vocab)^2/sigma^2), where dist(x,vocab) is the Euclidean distance from the SIFT feature to a vocabulary word (cluster center) and sigma describes the width of the kernel function. After several trials, we choose sigma^2=9000. Using a 1 vs all linear SVM instead of k-NN, the accuracy increases to 70.9%.
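A minimal sketch of the soft-assigned spatial pyramid feature is given below, in plain Python rather than the project's MATLAB. The weight formula and sigma^2=9000 come from the text above. The grid layout (1*1, 2*2 and 4*4 cells), the per-feature normalization of the weights, the absence of per-level weighting, and the function names are assumptions.

```python
import math

def soft_weights(descriptor, vocab, sigma_sq=9000.0):
    """Kernel weights exp(-0.5 * dist^2 / sigma^2) for every vocabulary word."""
    weights = [math.exp(-0.5 * math.dist(descriptor, word) ** 2 / sigma_sq)
               for word in vocab]
    total = sum(weights)
    # Assumption: each feature's votes are normalized to sum to 1.
    return [w / total for w in weights] if total > 0 else weights

def spatial_pyramid_feature(points, descriptors, vocab, width, height,
                            levels=3, sigma_sq=9000.0):
    """Concatenate soft-assigned histograms over 1*1, 2*2 and 4*4 grids."""
    votes = [soft_weights(d, vocab, sigma_sq) for d in descriptors]
    feature = []
    for level in range(levels):
        cells = 2 ** level  # assumption: 2^level cells per side
        hists = [[0.0] * len(vocab) for _ in range(cells * cells)]
        for (x, y), vote in zip(points, votes):
            # Grid cell containing the feature location (x, y).
            cx = min(int(x * cells / width), cells - 1)
            cy = min(int(y * cells / height), cells - 1)
            cell = hists[cy * cells + cx]
            for i, w in enumerate(vote):
                cell[i] += w
        for hist in hists:
            feature.extend(hist)
    return feature
```

Under these assumptions, a 200-word vocabulary gives a feature of 200 * (1 + 4 + 16) = 4200 dimensions per image, which would then be passed to the linear SVMs.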

Extra credit:

(up to 3pts) Feature representation: Experiment with features at multiple scales (sampling features from different levels of a Gaussian pyramid).

(up to 3pts) Feature quantization and bag of words: Use "soft assignment" to assign visual words to histogram bins.

(up to 4pts) Spatial Pyramid representation and classifier: Add spatial information to your features by implementing the spatial pyramid feature representation.

(up to 3pts) Experimental design: Use cross-validation to measure performance. The script get_image_paths_cross_validation.m is used for cross-validation. It randomly selects 50 images from the train folder and 50 images from the test folder as training images; the remaining images are used as test images. To use cross-validation, replace get_image_paths.m with get_image_paths_cross_validation.m and remember to rebuild the vocabulary.

(up to 3pts) Experimental design: Experiment with many different vocabulary sizes and report performance.

  1. Vocab size = 10, accuracy = 50.7%
  2. Vocab size = 20, accuracy = 58.3%
  3. Vocab size = 200, accuracy = 70.9%
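The random split used for cross-validation in the extra-credit items above could be sketched as follows. This is a hypothetical Python illustration, not the contents of get_image_paths_cross_validation.m; the function name, the `seed` parameter, and the flat lists of paths are assumptions, and the report does not say whether the 50 images are drawn per category or per folder.

```python
import random

def cross_validation_split(train_paths, test_paths, n_per_folder=50, seed=None):
    """Pick n images from each folder for training; the rest become test."""
    rng = random.Random(seed)
    new_train, new_test = [], []
    for paths in (train_paths, test_paths):
        # Sample without replacement from this folder's images.
        picked = set(rng.sample(range(len(paths)), n_per_folder))
        for i, path in enumerate(paths):
            (new_train if i in picked else new_test).append(path)
    return new_train, new_test
```

Because the training set changes with every split, the vocabulary has to be rebuilt from the newly selected training images, as noted above.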