Project 4: Scene Recognition with Bag of Words

Example Confusion Matrix

In this project we perform scene recognition on a set of test images. We implement two feature extraction methods, tiny images and bag of SIFT features, and two classification methods, nearest neighbor and a 1-vs-all linear SVM. The overall flow is simple: we extract features from every image, training and test alike, and then classify each test image based on some metric computed against the training images as reference.

Tiny Images

Tiny images is a simple, low-quality way to extract features. It resizes each image down to 16x16 pixels and then flattens that matrix into a 256-element vector, which becomes the 'feature' representing the image. Because this representation discards almost all spatial and high-frequency detail, it does not produce very accurate results.
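
A minimal sketch of the idea, assuming grayscale scene images (the variable names here are illustrative, not the project's exact code):

  % Tiny-image sketch: resize each (grayscale) image to 16x16 and flatten
  image_feats = zeros(size(image_paths, 1), 256);
  for i = 1:size(image_paths, 1)
    img  = im2double(imread(image_paths{i}));
    tiny = imresize(img, [16 16]);              % 16x16 thumbnail
    image_feats(i, :) = reshape(tiny, 1, 256);  % flatten to a 256-element vector
  end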

Nearest Neighbor

Nearest neighbor is the simplest type of classifier. We compute the distance from each test image's feature to every training image's feature; for every test image we then find the closest training image and assign that training image's label to the test image.
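
A minimal 1-NN sketch using VLFeat's vl_alldist2, assuming features are stored one per row (hence the transposes) and using the L1 metric from the results table:

  % 1-NN sketch: dists(i,j) = L1 distance from test feature i to train feature j
  dists = vl_alldist2(test_image_feats', train_image_feats', 'l1');
  [~, nearest] = min(dists, [], 2);              % closest training image per test image
  predicted_categories = train_labels(nearest);  % inherit that training image's label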

K-NN Extension

To extend this to K-NN, each of the closest neighbors votes on what the label should be. The parameter k determines how many of the closest neighbors get a vote, and the label with the most votes amongst the k closest neighbors is assigned to the test image.
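
A rough sketch of the voting step (an illustration, not the exact implementation):

  % K-NN sketch: the k closest training images vote on the label
  k = 10;
  dists = vl_alldist2(test_image_feats', train_image_feats', 'l1');
  [~, order] = sort(dists, 2, 'ascend');          % order(i,:) = train indices, nearest first
  predicted_categories = cell(size(dists, 1), 1);
  for i = 1:size(dists, 1)
    votes = categorical(train_labels(order(i, 1:k)));
    predicted_categories{i} = char(mode(votes));  % most common label among the k neighbors
  end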

Bag of Sift

Build Vocabulary

In order to extract features using bag of SIFT, we must first create a vocabulary of visual "words" from our training images. To do this, we sample a certain number of SIFT features across all the training images. I decided to take an equal number of features from every training image: for each image, I compute a set of dense SIFT features and randomly sample from that set. Once I've gathered tens of thousands of SIFT features, I cluster them into a vocabulary with k-means using VLFeat's vl_kmeans function.


  % Read each image, get dense SIFT features, and keep a random sample
  sampledSiftFeatures = [];
  for i=1:size(image_paths,1)
    image = im2single(imread(image_paths{i}));
    [~, siftFeatures] = vl_dsift(image, 'step', siftStepSize);
    % datasample picks siftFeaturesPerImage random columns (features)
    sampledSiftFeatures = [sampledSiftFeatures datasample(siftFeatures, siftFeaturesPerImage, 2)];
  end

  % Cluster the sampled features into vocab_size visual words
  [vocab, ~] = vl_kmeans(single(sampledSiftFeatures), vocab_size);
          

Feature Extraction

To extract features, we describe each image as a bag of words using the pre-built vocabulary. For every image, we compute its dense SIFT features and then find the distance from every SIFT feature to every word in the vocabulary; a word and a SIFT feature have the same dimensionality since the vocabulary was built from SIFT features. Each SIFT feature is assigned to its closest word, and the histogram of word assignments becomes the feature descriptor for the image.


  % Generate dense SIFT features for this image
  [~, siftFeatures] = vl_dsift(image, 'step', siftStepSize);

  % siftDistances(i,j) = local feature i distance to vocab word j
  siftDistances = vl_alldist2(single(siftFeatures), vocab);

  % Find the closest vocab word for each SIFT feature
  [~, indices] = min(siftDistances, [], 2);

  % Create histogram of words based on local features
  siftHistogram = zeros(1, vocab_size);
  for j=1:size(indices, 1)
    wordIndex = indices(j, 1);
    siftHistogram(1, wordIndex) = siftHistogram(1, wordIndex) + 1;
  end
          

SVM Classifier

The final part was to create a linear Support Vector Machine (SVM) classifier. Since we have 15 different categories, our method uses a 1-vs-all approach: for every category, we train an SVM and then evaluate each test image with it. At the end we have a matrix of confidence values, one for each test image under each category's SVM, and every test image is assigned the category whose SVM produced the most positive value.


  % Inside a loop over the i-th category (1-vs-all):
  % build binary labels from the train labels, changing 0s to -1s
  matchingIndices = double(strcmp(currentCategory, train_labels));
  matchingIndices(matchingIndices==0) = -1;

  % Train a linear classifier for this category
  [W, B] = vl_svmtrain(train_image_feats', matchingIndices, LAMBDA);

  % Evaluate each test feature with this classifier
  distances(:,i) = W'*test_image_feats' + B;
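
The final assignment step described above could then look roughly like this (a sketch; categoryNames is an assumed cell array of the 15 category names in training order):

  % For each test image, pick the category whose SVM responded most positively
  [~, bestCategory] = max(distances, [], 2);
  predicted_categories = categoryNames(bestCategory);   % categoryNames is assumed
  predicted_categories = predicted_categories(:);       % one label per test image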
          

Results (under 10 minutes)

Feature | Classifier | Feature Params | Classify Params | Accuracy
Tiny Images | K-NN | 16x16 | k=10, L1 | 0.220
Bag of SIFT | K-NN | step size=8, fast enabled, vocab size=400, vocab samples=50,000, vocab step size=1 | k=10, L1 | 0.515
Bag of SIFT | SVM | step size=8, fast enabled, vocab size=400, vocab samples=50,000, vocab step size=1 | Lambda=0.001 | 0.630

Other Results

Feature | Classifier | Feature Params | Classify Params | Accuracy
Bag of SIFT | SVM | step size=5, vocab size=400, vocab samples=50,000, vocab step size=1 | Lambda=0.00001 | 0.691
Bag of SIFT | SVM | step size=5, vocab size=10, vocab samples=50,000, vocab step size=2 | Lambda=0.001 | 0.38867
Bag of SIFT | SVM | step size=5, vocab size=25, vocab samples=50,000, vocab step size=2 | Lambda=0.001 | 0.52667
Bag of SIFT | SVM | step size=5, vocab size=50, vocab samples=50,000, vocab step size=2 | Lambda=0.001 | 0.57867
Bag of SIFT | SVM | step size=5, vocab size=100, vocab samples=50,000, vocab step size=2 | Lambda=0.001 | 0.58733
Bag of SIFT | SVM | step size=5, vocab size=250, vocab samples=50,000, vocab step size=2 | Lambda=0.001 | 0.62133
Bag of SIFT | SVM | step size=5, vocab size=500, vocab samples=50,000, vocab step size=2 | Lambda=0.001 | 0.65333
Bag of SIFT | SVM | step size=5, vocab size=1000, vocab samples=50,000, vocab step size=2 | Lambda=0.001 | 0.64867
Bag of SIFT | SVM | step size=5, vocab size=2500, vocab samples=50,000, vocab step size=2 | Lambda=0.001 | 0.65467
Bag of SIFT | SVM | step size=5, vocab size=5000, vocab samples=50,000, vocab step size=2 | Lambda=0.001 | 0.64333

Variable Vocabulary Results

I also implemented a file named "multi_vocab.m" that creates vocabularies of varying sizes and then runs bag of SIFT and SVM with each of those vocabularies. Increasing the vocabulary size improves classification accuracy, but the gains level off around a vocabulary size of 1000. A graph of the response is shown below, as well as a gif of all the confusion matrices generated.
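
A rough sketch of what such a script might look like; the helper function names (build_vocabulary, get_bags_of_sifts, svm_classify) are assumptions that mirror the pipeline described above, not necessarily the project's exact interfaces:

  % multi_vocab sketch: build a vocabulary per size, then run bag of SIFT + SVM
  vocabSizes = [10 25 50 100 250 500 1000 2500 5000];
  accuracies = zeros(size(vocabSizes));
  for v = 1:length(vocabSizes)
    vocab       = build_vocabulary(train_image_paths, vocabSizes(v));   % assumed helper
    train_feats = get_bags_of_sifts(train_image_paths, vocab);          % assumed helper
    test_feats  = get_bags_of_sifts(test_image_paths, vocab);           % assumed helper
    predicted   = svm_classify(train_feats, train_labels, test_feats);  % assumed helper
    accuracies(v) = mean(strcmp(predicted, test_labels));
  end
  plot(vocabSizes, accuracies);   % accuracy vs vocabulary size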

Confusion Matrices

Accuracy vs Vocab Size

Best Results

Accuracy (mean of diagonal of confusion matrix) is 0.691
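
Equivalently, given a row-normalized confusion matrix (confusion_matrix is an assumed variable name), this is just:

  % Accuracy = mean of the diagonal of the (row-normalized) confusion matrix
  accuracy = mean(diag(confusion_matrix));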

Category name | Accuracy | False positives (true label) | False negatives (wrong predicted label)
Kitchen | 0.650 | Bedroom, Office | Store, Suburb
Store | 0.530 | Kitchen, Street | InsideCity, Industrial
Bedroom | 0.470 | LivingRoom, Store | LivingRoom, LivingRoom
LivingRoom | 0.290 | Kitchen, Street | Bedroom, Office
Office | 0.960 | Bedroom, LivingRoom | Bedroom, Kitchen
Industrial | 0.690 | Bedroom, LivingRoom | TallBuilding, Highway
Suburb | 1.000 | Mountain, Bedroom | (none)
InsideCity | 0.610 | Highway, Street | Store, Industrial
TallBuilding | 0.720 | Industrial, InsideCity | InsideCity, Industrial
Street | 0.540 | Highway, TallBuilding | InsideCity, TallBuilding
Highway | 0.830 | Street, Coast | Suburb, Coast
OpenCountry | 0.510 | Mountain, Mountain | Suburb, Forest
Coast | 0.770 | OpenCountry, OpenCountry | OpenCountry, OpenCountry
Mountain | 0.860 | Highway, OpenCountry | Coast, Forest
Forest | 0.930 | OpenCountry, OpenCountry | Street, Mountain