Project 4 / Scene Recognition with Bag of Words

For this project, 2 different image representations as well as 2 different classifications are implemented. The 2 image representations are tiny images and bags of SIFT features. The 2 classifications are nearest neighbor and SVM. The following 3 combinations of image representations and classfications are completed and compared for results:

Summary:

  1. Tiny Images
  2. Bags of SIFT features
  3. Nearest Neighbor Classifier
  4. SVM Classifier
  5. Results
  6. Best Results

Tiny Images

Tiny images are obtained literally by resizing each original image into 16x16 squares. The code is shown below:


	size = 16;
	image_feats = zeros([length(image_paths), size*size]);

	for i=1:length(image_paths)
	    img = imread(image_paths{i});
	    img = imresize(img, [size, size], 'bicubic');
	    image_feats(i, :) =  img(:);
	end
	

Bags of SIFT features

Bags of SIFT features are obtained from building a histogram of existing features(vocab). For each image, the sift features are obtained by using vl_dsift. The distance matrix of SIFT features and vocab is calculated and sorted that the lowest distance for any feature to a vocab is said to correspond to that particular vocab. A histogram counts how many vocab is represented. The code is shown below:


	load('vocab.mat')
	vocab_size = size(vocab, 2);
	N = length(image_paths);
	image_feats = zeros([N, vocab_size]);

	for i=1:N
	    image = imread(image_paths{i});
	    
	    [~, features] = vl_dsift(single(image), 'step', 5);
	    D = vl_alldist2(single(features), vocab);
	    [~, index] = sort(D, 2);
	    
	    hist = zeros([1, vocab_size]);
	    for j=1:size(index, 1)
	        hist(index(j, 1)) = hist(index(j, 1)) + 1;
	    end
	    hist = hist ./ norm(hist);
	    image_feats(i, :) = hist;
	end
	

A step size of 5 for vl_dsift is chosen for performance/speed trade off.

Nearest Neighbor Classifier

Nearest neighbor classifier works by calculating distances between test image features and train image features. The distance matrix is sorted so that the lowest distance index is on the first column. The index is then turned into labels for each image. The code is shown below:

		
	D = vl_alldist2(test_image_feats', train_image_feats');
	[~, index] = sort(D, 2);

	predicted_categories = train_labels(index(:,1), :);
		
	

SVM Classifier

15 classifiers for built. One for each unique category in the labels. For each category, the labels for all training images are turned into -1 and 1 for positive and negative. A SVM classifier is trained using the new labels. Using the classifier, test images obtain either positive or negative values where positive value indicates a potential true and negative value indicates a potential false. The magnitude of the values indicate the confidence of the classifier. The results for all images are stored and sorted so that the highest value index is converted into a text label that represents the category for images. The code is shown below:

		
	categories = unique(train_labels); 
	num_categories = length(categories);
	M = size(test_image_feats, 1);
	LAMBDA = 0.000049;
	results = zeros([num_categories, M]); 
	for i=1:num_categories
	     matching_indices = strcmp(categories{i}, train_labels);
	     labels = zeros(length(train_labels), 1);
	     for j=1:length(matching_indices)
	        if(matching_indices(j) == 1)
	            labels(j) = 1;
	        else
	            labels(j) = -1;
	        end
	     end
	    [W, B] = vl_svmtrain(train_image_feats', labels, LAMBDA);
	    results(i, :) = W' * test_image_feats' + B;
	end

	[~, index] = sort(results, 1, 'descend');
	predicted_categories = categories(index(1, :));
		
	

Different lambda between 0.01 and 0.00001 are tried. It is obvious that some value in between the two will produce a best performing lambda, thus a binary search is used with everything else constant.

Results without time constraints

Results with time constraints

Best Result

Best result comes from the implementation of

Scene classification results visualization


Accuracy (mean of diagonal of confusion matrix) is 0.693

Category name Accuracy Sample training images Sample true positives False positives with true label False negatives with wrong predicted label
Kitchen 0.580
Store

LivingRoom

LivingRoom

Industrial
Store 0.530
InsideCity

Forest

Street

Forest
Bedroom 0.510
Kitchen

Kitchen

Store

LivingRoom
LivingRoom 0.450
Industrial

Bedroom

Kitchen

Office
Office 0.910
Bedroom

LivingRoom

Bedroom

Street
Industrial 0.560
InsideCity

Store

OpenCountry

LivingRoom
Suburb 0.930
OpenCountry

Mountain

Industrial

InsideCity
InsideCity 0.570
Industrial

LivingRoom

Street

Industrial
TallBuilding 0.760
InsideCity

LivingRoom

Industrial

Industrial
Street 0.740
TallBuilding

Industrial

Highway

InsideCity
Highway 0.810
Coast

Street

Mountain

Coast
OpenCountry 0.460
Highway

Industrial

Highway

Coast
Coast 0.820
OpenCountry

Highway

OpenCountry

OpenCountry
Mountain 0.810
Forest

Industrial

Suburb

Highway
Forest 0.950
Store

Store

Mountain

Mountain
Category name Accuracy Sample training images Sample true positives False positives with true label False negatives with wrong predicted label

It can be seen that this SVM and bags of SIFT features combinations works the best with forest and mountain categories. It is undestood since those landscapes usually possess lots of unqiue identifing combinations of features.