Project 4 / Scene Recognition with Bag of Words

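This project involved building a pipeline that classifies images into one of 15 scene categories. Given 1500 training images, 100 per category, I implemented and compared three approaches: tiny image features with a nearest neighbor classifier, bag of SIFT words features with a nearest neighbor classifier, and bag of SIFT words features with one-vs-all linear SVMs.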

The scenes were: Kitchen, Store, Bedroom, LivingRoom, Office, Industrial, Suburb, InsideCity, TallBuilding, Street, Highway, OpenCountry, Coast, Mountain, and Forest. One difficulty with this set of classes is that some of them share similar qualities. For instance, Kitchen, Bedroom, and LivingRoom are all residential spaces, InsideCity and TallBuilding both show urban architecture, and Street and Highway both show roadways.

Tiny Images + NN

I implemented the tiny images feature to provide a baseline that performs better than random guessing (about 6.7% for 15 classes) at classifying scenes. I used 16x16 as the resolution of the tiny images, and normalized each image vector to zero mean and unit length. The functional part of the code for this piece is below.

		
imsize = 16;  % side length of the tiny image, in pixels
image_feats = zeros(size(image_paths,1), imsize*imsize);
for i = 1:size(image_paths,1)
    original = imread(image_paths{i});
    % shrink the image (assumed grayscale) to imsize x imsize
    tiny = imresize(original, [imsize, imsize]);
    % turn it into a 1D row vector
    tinyvec = double(reshape(tiny, 1, imsize*imsize));
    % normalize to zero mean and unit length
    tinyvec = tinyvec - mean(tinyvec);
    tinyvec = tinyvec ./ norm(tinyvec);
    image_feats(i,:) = tinyvec;
end
	

I also implemented the nearest neighbor classifier, which was reused later for the bag of words feature representation. The code below computes the matrix of distances between the training and test image feature sets, then, for each test image, finds the training image at the smallest distance in feature space and assigns its label to the test image. Neither the tiny image feature representation nor the nearest neighbor classifier has many hyperparameters to tweak in this project, although in general nearest neighbor can be extended to vote over the k nearest neighbors rather than using only the single nearest one. The tiny image representation can be tweaked by changing the size of the tiny image, but I did not explore this, as I only intended to use it as a baseline.

	
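% D_matrix(r, c) is the distance between training image r and test image c
D_matrix = vl_alldist2(train_image_feats', test_image_feats');
num_test = size(test_image_feats, 1);
predicted_categories = cell(num_test, 1);
for i = 1:num_test
    % find the nearest training image for this test image
    dists = D_matrix(:,i);
    [~, best_ind] = min(dists);
    % train_labels is assumed to be a cell array of strings;
    % brace indexing returns the label itself rather than a 1x1 cell
    predicted_categories{i} = train_labels{best_ind};
end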
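
The k nearest neighbors extension mentioned above was not part of this project, but a minimal sketch of it may make the idea concrete. It assumes a distance matrix laid out like D_matrix (training images along rows, test images along columns); the toy distances, labels, and the choice of k = 3 are hypothetical values for illustration.

```matlab
% Toy inputs (hypothetical values): 5 training images, 2 test images.
% D_matrix(r, c) is the distance from training image r to test image c.
D_matrix = [0.1 0.9;
            0.2 0.8;
            0.7 0.3;
            0.3 0.1;
            0.9 0.2];
train_labels = {'Coast'; 'Coast'; 'Forest'; 'Forest'; 'Forest'};
k = 3;  % number of neighbors that vote

% Map each label string to an integer id so the votes can be counted.
[categories, ~, label_ids] = unique(train_labels);

num_test = size(D_matrix, 2);
predicted_categories = cell(num_test, 1);
for i = 1:num_test
    % Sort training images by distance to test image i and keep the k closest.
    [~, order] = sort(D_matrix(:, i), 'ascend');
    neighbor_ids = label_ids(order(1:k));
    % Majority vote; on a tie, mode returns the smallest label id.
    predicted_categories{i} = categories{mode(neighbor_ids)};
end
```

With k = 1 this reduces to the single nearest neighbor rule used in the project.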

Bag of SIFT words representation

Next came building a better feature representation, which is where most of the tunable hyperparameters come into play. First, I built a vocabulary of visual words by sampling up to 100,000 SIFT features at random from the entire training set and clustering them with k-means. In this step, the number of sampled features, the number of clusters (the vocabulary size), and SIFT parameters such as the bin size can be tuned for different results. To start, I sampled 30,000 features, clustered them into 50 visual words, and used a SIFT bin size of 25.
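
The vocabulary code itself is not shown in this write-up. As an illustration of the clustering step only, here is a small toolbox-free sketch of Lloyd's k-means run on random stand-in descriptors. It is not the project's implementation: the random data, the toy sizes, the fixed iteration count, and all variable names are assumptions, and a library k-means would normally be used instead.

```matlab
% Stand-in data (hypothetical): 500 random 128-D descriptors, one per column,
% in place of the SIFT features sampled from the training set.
num_samples = 500;
vocab_size = 8;    % number of visual words (clusters)
num_iters = 10;    % fixed number of Lloyd iterations, for simplicity
features = rand(128, num_samples);

% Initialize the cluster centers with randomly chosen descriptors.
perm = randperm(num_samples);
vocab = features(:, perm(1:vocab_size));   % 128 x vocab_size

for it = 1:num_iters
    % Squared Euclidean distance from every descriptor to every center,
    % using ||x - c||^2 = ||x||^2 - 2*x'*c + ||c||^2.
    D = bsxfun(@plus, sum(features.^2, 1)', sum(vocab.^2, 1)) ...
        - 2 * (features' * vocab);         % num_samples x vocab_size
    % Assign each descriptor to its nearest center.
    [~, assignment] = min(D, [], 2);
    % Move each center to the mean of the descriptors assigned to it;
    % a center with no members is left where it is.
    for c = 1:vocab_size
        members = features(:, assignment == c);
        if ~isempty(members)
            vocab(:, c) = mean(members, 2);
        end
    end
end
```

The result holds one visual word per column. Raising vocab_size gives a finer vocabulary at the cost of more distance computations per image.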

After constructing the vocabulary, I built the bag of words features by computing a histogram of visual words for each training and test image. Each SIFT feature in an image adds one count to the histogram bin of its nearest cluster center. Images that contain similar counts of each visual word therefore end up close together in feature space. The code below builds the histogram feature by finding the nearest cluster for each feature in an image.

	
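image_feats = zeros(size(image_paths,1), vocab_size);
for i = 1:size(image_paths,1)
    % vl_dsift expects a single-precision grayscale image
    img = single(imread(image_paths{i}));
    word_hist = zeros(1,vocab_size);  % named to avoid shadowing the built-in hist
    % get a bunch of SIFT features for this image (one 128-D descriptor per column)
    [~, SIFT_features] = vl_dsift(img, 'size', bin_size, 'fast');
    % D(j,k) is the distance from feature j to visual word k;
    % vocab is assumed to be a 128 x vocab_size double matrix
    D = vl_alldist2(double(SIFT_features), vocab);
    % find the visual word of minimum distance (nearest neighbor) for each SIFT feature
    for j = 1:size(SIFT_features,2)
        [~, best_ind] = min(D(j,:));
        word_hist(best_ind) = word_hist(best_ind) + 1;
    end
    image_feats(i,:) = word_hist;
end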
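
The inner loop over features can also be written without a loop. The sketch below uses a toy distance matrix in place of D (rows are SIFT features, columns are visual words, as above); the values are hypothetical. The final L1 normalization is an optional addition that was not part of this project; it would make histograms comparable across images that yield different numbers of SIFT features.

```matlab
% Toy distance matrix (hypothetical values): 4 SIFT features x 3 visual words.
D = [0.2 0.9 0.5;
     0.8 0.1 0.7;
     0.3 0.6 0.4;
     0.9 0.8 0.1];
vocab_size = size(D, 2);

% Nearest visual word for every feature at once (minimum along each row).
[~, best_ind] = min(D, [], 2);

% Count how many features were assigned to each visual word.
word_hist = accumarray(best_ind, 1, [vocab_size, 1])';   % 1 x vocab_size

% Optional (not done in the project): L1-normalize so the histogram sums to 1.
word_hist_normalized = word_hist / sum(word_hist);
```

Passing the output size to accumarray keeps the histogram at vocab_size bins even when the last visual words receive no features.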

SVM classifier

Finally, I used a linear SVM classifier to try to improve accuracy over the 1-NN classifier. This introduced another hyperparameter, lambda, the regularization constant, which guards against overfitting. Since SVMs are normally binary classifiers, I trained 15 one-vs-all binary classifiers, one per class. The code snippet below shows how the class is chosen: each test image is assigned to the category whose classifier predicts it most confidently.

		
predicted_categories = cell(size(test_image_feats,1), 1);
for i = 1:size(test_image_feats,1)
    % get confidence from each classifier
    confidences = zeros(1,num_categories);
    xtest = test_image_feats(i,:)';
    for j = 1:num_categories
        W = W_all(:,j);  % weight vector of the j-th one-vs-all classifier
        B = B_all(j);    % bias of the j-th classifier
        % decision value: larger means classifier j is more confident
        confidences(j) = W'*xtest + B;
    end
    % select category with best confidence
    [~, best_ind] = max(confidences);
    predicted_categories{i} = categories{best_ind};
end
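
The training side of the one-vs-all scheme is not shown above, so the sketch below illustrates only the two pieces that surround it: building the +1/-1 target vector for each binary classifier, and scoring all test images against all classifiers in a single matrix product. The category subset, labels, weights, and biases are made-up toy values; in the real pipeline each column of Y would be handed to a binary linear SVM trainer together with lambda.

```matlab
% Toy setup (hypothetical values): 3 categories, 2-D features.
categories = {'Kitchen', 'Store', 'Bedroom'};
train_labels = {'Store'; 'Kitchen'; 'Bedroom'; 'Store'};
num_categories = numel(categories);

% One-vs-all targets: Y(i, j) is +1 if training image i belongs to
% category j and -1 otherwise. Column j is the label vector for the
% j-th binary SVM.
Y = -ones(numel(train_labels), num_categories);
for j = 1:num_categories
    Y(strcmp(categories{j}, train_labels), j) = 1;
end

% Pretend these came out of training: one weight column and one bias per
% category (made-up numbers, for illustration only).
W_all = [1 -1  0;
         0  1 -1];           % feature_dim x num_categories
B_all = [0 0.5 -0.5];        % 1 x num_categories
test_image_feats = [2 0;     % one test image per row
                    0 3];

% Score every test image against every classifier in one product,
% then pick the most confident classifier for each image.
confidences = bsxfun(@plus, test_image_feats * W_all, B_all);
[~, best_ind] = max(confidences, [], 2);
predicted_categories = categories(best_ind)';
```

The vectorized scoring computes the same W'*xtest + B values as the double loop, just for all images and categories at once.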
	

Detailed results

After testing multiple values of the vocabulary size, SIFT bin size, number of samples used to build the vocabulary, and SVM regularization, my results for each method are below:

Feature type    Classifier          Accuracy
tiny image      nearest neighbor    22.5%
bag of words    nearest neighbor    46.3%
bag of words    1-vs-all SVMs       54.1%

These results were all obtained using 100,000 SIFT feature samples to build the vocabulary, a vocabulary size of 200, and a SIFT bin size of 10, where applicable. Interestingly, in my implementation a seemingly very high lambda of 1e6 gave the highest accuracy for any given vocabulary, after testing values from 1e-4 to 1e8. Lower values seemed to cause the model to overfit, with a bias toward classifying instances as Bedroom, and higher values lowered the accuracy by underfitting. One possible explanation for the large optimum is that the histograms hold raw, unnormalized counts, so the feature magnitudes are large and a correspondingly larger lambda may be needed before the regularizer has any effect. The combination of bag of SIFT features and SVM classifier gave the best accuracy; the per-category breakdown follows.


Accuracy (mean of diagonal of confusion matrix) is 0.541
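
For reference, the metric quoted above can be computed as in the following sketch, which builds a row-normalized confusion matrix and averages its diagonal. The labels are hypothetical toy values, not the project's data, and the variable names are assumptions.

```matlab
% Toy labels (hypothetical values): 6 test images, 3 categories.
categories = {'Coast', 'Forest', 'Highway'};
true_labels      = {'Coast'; 'Coast'; 'Forest'; 'Forest'; 'Highway'; 'Highway'};
predicted_labels = {'Coast'; 'Highway'; 'Forest'; 'Forest'; 'Coast'; 'Highway'};

num_categories = numel(categories);
confusion = zeros(num_categories);
for i = 1:numel(true_labels)
    row = find(strcmp(true_labels{i}, categories));       % true class
    col = find(strcmp(predicted_labels{i}, categories));  % predicted class
    confusion(row, col) = confusion(row, col) + 1;
end

% Normalize each row so entry (r, c) is the fraction of class r predicted
% as class c. This assumes every class has at least one test image.
confusion = bsxfun(@rdivide, confusion, sum(confusion, 2));

per_class_accuracy = diag(confusion);
accuracy = mean(per_class_accuracy);
```

Because each row is normalized separately, every category contributes equally to the mean regardless of how many test images it has.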

Category name   Accuracy   False positives (true label)   False negatives (wrong predicted label)
Kitchen         0.560      Industrial, Office             InsideCity, LivingRoom
Store           0.640      LivingRoom, Street             Street, Industrial
Bedroom         0.370      TallBuilding, LivingRoom       Industrial, Industrial
LivingRoom      0.470      Industrial, Kitchen            Kitchen, Store
Office          0.790      Industrial, Industrial         Kitchen, LivingRoom
Industrial      0.320      Store, Street                  OpenCountry, LivingRoom
Suburb          0.710      Industrial, Industrial         Store, Bedroom
InsideCity      0.460      Kitchen, Bedroom               Store, LivingRoom
TallBuilding    0.540      Street, Industrial             Street, LivingRoom
Street          0.380      Mountain, Mountain             LivingRoom, Store
Highway         0.840      OpenCountry, Coast             OpenCountry, Store
OpenCountry     0.510      Mountain, Forest               Coast, Highway
Coast           0.610      OpenCountry, OpenCountry       OpenCountry, Highway
Mountain        0.580      Kitchen, Forest                Highway, Highway
Forest          0.330      OpenCountry                    Store, Mountain

The sample training images and sample true positives columns contained image thumbnails only; the labels above are those shown with each misclassified thumbnail.

One conclusion from these results is that the more "overlapping" categories, such as Street (0.380) and Industrial (0.320), have lower accuracy than more distinct categories like Suburb (0.710) and Highway (0.840). Keeping the categories more distinct could improve accuracy. In at least some cases, the false positives and false negatives are also reasonable. For example, LivingRoom, Kitchen, and Bedroom are all rooms in a home and are misclassified as one another. Another example is the false positives for Highway, which have the distinct horizontal lines across the entire image that are associated with images of highways.