Project 4 / Scene Recognition with Bag of Words

Sample images from categories

One of the fundamental problems in computer vision is recognizing objects and scenes in images. The idea is simple: given an image of a scene, return a meaningful label for what that scene represents. Because this problem is widely researched, many useful algorithms have been developed for the task. This project looks at some of the methods used over the years to classify images from the 15 scene database described in Lazebnik et al. 2006. This database consists of 15 different scene categories, including forest, kitchen, industrial, mountain, coast, and others. Each category has 100 images, each with a correct label. The goal is to achieve 70% accuracy over all 1500 images.

Approaches

This project attempts 3 different approaches to identify a scene within an image.

  1. Tiny Images with Nearest Neighbor Classifier
  2. Bag of SIFTs with Nearest Neighbor Classifier
  3. Bag of SIFTs with SVM Classifier

Feature Representation

Tiny Images

Tiny Images is the simplest form of image representation: take an image and output the same image scaled down to a small fixed size (16x16 here, flattened into a 256-element vector). A lot of information is lost in the process, and the representation is sensitive to changes in color or brightness, but it is simple to implement, as shown here:


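    % one row of 16x16 = 256 pixel values per image
    result = zeros(size(image_paths, 1), 256);

    for i=1:size(image_paths,1)
        img = imread(image_paths{i});        % images are assumed to be grayscale
        resized = imresize(img, [16 16]);    % shrink to a fixed 16x16 size
        result(i, :) = resized(:);           % flatten into one feature row
    end

    image_feats = result;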

Tiny Images with the Nearest Neighbor classifier achieved an accuracy of 20.5%.

Bag of SIFT

As in project 2, SIFT features describe local image patches using histograms of gradients, and they are not overly complicated. The Bag of SIFT representation takes them one step further. Using the training images, a vocabulary of SIFT descriptors is built. Then, for each image, SIFT features are extracted and each one is matched to its nearest SIFT descriptor in the vocabulary. Finally, much like in the SIFT feature itself, a histogram is built that counts how often each vocabulary entry was matched in the image. Let's look at what one of these histograms can actually look like.

Looking at the histogram, the most common vocabulary entries were found at around SIFT #75, SIFT #140, and SIFT #200. Images with similar histograms would be classified as the same scene.
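The vocabulary itself is not shown in this write-up. A minimal sketch of how it could be built is given here, assuming VLFeat's vl_dsift and vl_kmeans, the step size of 25 mentioned under Step Size, and a hypothetical vocab_size of 200. The clustering method, the vocabulary size, and the variable names are assumptions for illustration, not the submitted code. It expects image_paths to be a cell array of grayscale image file names.

```matlab
% Sketch only: cluster sparsely sampled SIFT descriptors into a vocabulary.
% Assumes VLFeat is on the path and image_paths is a cell array of file names.
VOCAB_STEP_SIZE = 25;   % sparse sampling, as described under Step Size
BIN_SIZE = 8;
vocab_size = 200;       % assumed value, not stated in the report

descriptors = cell(1, numel(image_paths));
for i = 1:numel(image_paths)
    img = single(imread(image_paths{i}));
    [~, d] = vl_dsift(img, 'fast', 'step', VOCAB_STEP_SIZE, 'size', BIN_SIZE);
    descriptors{i} = single(d);            % 128 x (number of sampled locations)
end

all_descriptors = [descriptors{:}];        % 128 x (total number of descriptors)

% k-means cluster centers become the visual words (one per column).
centers = vl_kmeans(all_descriptors, vocab_size);

% Store one word per row so that vocab' is 128 x vocab_size.
vocab = double(centers');
```

Storing one word per row matches how vocab is transposed before the distance computation in the feature code.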

With that picture of the features in mind, the code that creates them looks like this:


    BAGS_STEP_SIZE = 5;
    BIN_SIZE = 8;

    result = zeros(size(image_paths, 1), vocab_size);

    for i=1:size(image_paths, 1)
        % compute sift features
        img = imread(image_paths{i});
        [locations, SIFT] = vl_dsift(single(img),...
            'fast',...
            'step', BAGS_STEP_SIZE,...
            'size', BIN_SIZE);
    
        % match to closest cluster
        distances = vl_alldist2(double(SIFT), vocab');
        [matches, indicies] = min(distances');
    
        % compute histogram
        histogram = zeros(1, vocab_size);
        for j=1:size(indicies,2)
            histogram(indicies(j)) = histogram(indicies(j)) + 1;
        end
    
        % assign histogram to image_feats(i, :)
        result(i, :) = normr(histogram);
    end

    image_feats = result;
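The counting loop can also be written without an explicit loop. The toy example below is a sketch, with made-up indices and a tiny vocabulary, that uses accumarray to count matches and a plain division by the vector norm in place of normr, which is not part of base MATLAB.

```matlab
% Toy input: nearest-word index for five descriptors (hypothetical values).
vocab_size = 4;
indicies = [1 3 3 2 3];

% Count how often each visual word was matched; unmatched words stay at zero.
counts = accumarray(indicies(:), 1, [vocab_size 1])';

% L2-normalize so images with different descriptor counts are comparable.
histogram = counts / norm(counts);

disp(counts);
disp(histogram);
```

Here word 3 is matched three times, words 1 and 2 once each, and word 4 never. An image with no descriptors at all would need a guard against dividing by zero.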

Classifier

Nearest Neighbor

Nearest Neighbor matching is a fast and easy method of classification that produces decent accuracies. The idea is to calculate the distance from each test image's feature to every training image's feature, and simply assign the label of the closest training image. Code for this method can be seen here:


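    % distances(i, j): chi-squared distance between test image i and training image j
    distances = vl_alldist2(test_image_feats', train_image_feats', 'chi2');
    % for each test image (row), find the closest training image
    [mins, indicies] = min(distances, [], 2);
    predicted_categories = train_labels(indicies);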

Bag of SIFT with Nearest Neighbor classification achieved an accuracy of 52.0%.
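To make the nearest neighbor rule concrete, here is a small self-contained sketch on made-up 2-D features. It swaps in squared Euclidean distance for the chi-squared distance so that it runs without VLFeat; all feature values and labels are hypothetical.

```matlab
% Toy data: each row is one image feature (hypothetical values).
train_image_feats = [0 0; 0 1; 5 5; 6 5];
train_labels = {'Forest'; 'Forest'; 'Kitchen'; 'Kitchen'};
test_image_feats = [0.2 0.4; 5.8 4.9];

num_train = size(train_image_feats, 1);
num_test = size(test_image_feats, 1);
predicted_categories = cell(num_test, 1);

for i = 1:num_test
    % Squared Euclidean distance from test image i to every training image.
    diffs = train_image_feats - repmat(test_image_feats(i, :), num_train, 1);
    sq_dists = sum(diffs .^ 2, 2);

    % The label of the closest training image becomes the prediction.
    [~, nearest] = min(sq_dists);
    predicted_categories{i} = train_labels{nearest};
end

disp(predicted_categories);
```

The first test point lies next to the two Forest examples and the second next to the Kitchen examples, so they receive those labels.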

SVM

Nearest neighbor works decently, but accuracy can be improved by using a support vector machine (SVM). A linear SVM, however, makes binary decisions: it can be trained to classify an image as either a forest scene or not a forest scene. Decisions need to be made for 15 scene categories, so 15 different SVMs are trained, one per category. Each image is then run through every classifier, and whichever one outputs the highest confidence determines the final scene decision. The code for SVM classification can be seen here:


    categories = unique(train_labels); 
    num_categories = length(categories);
    LAMBDA = 0.0001;   % value chosen in the Lambda section

    % one linear classifier (weight vector and bias) per category
    ws = zeros(num_categories, size(train_image_feats, 2));
    bs = zeros(num_categories, 1);

    for i=1:num_categories
        % +1 for images of this category, -1 for all others
        binary = double(strcmp(categories{i}, train_labels));
        binary(binary == 0) = -1;
        [W, B] = vl_svmtrain(train_image_feats', binary, LAMBDA);
        ws(i, :) = W;
        bs(i) = B;
    end

    result = cell(size(test_image_feats, 1), 1);

    for i=1:size(test_image_feats, 1)
        % confidence of every classifier for test image i: w . x + b
        confidences = ws * test_image_feats(i, :)' + bs;
        [match, index] = max(confidences);
        result(i) = categories(index);
    end

    predicted_categories = result;

Bag of SIFT features with an SVM classifier achieved an accuracy of 65.97%.
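The one-vs-all decision can be illustrated without training anything. The sketch below uses three hand-picked, hypothetical linear classifiers over 2-D features and scores all test images at once. The weights, biases, and category subset are made up purely to show the "highest confidence wins" rule.

```matlab
% Hypothetical linear classifiers for three categories over 2-D features.
categories = {'Coast'; 'Forest'; 'Street'};
ws = [ 1  0;     % Coast:  favours a large first feature
       0  1;     % Forest: favours a large second feature
      -1 -1];    % Street: favours small values in both features
bs = [0; 0; 0.5];

% Each row is one test image feature (hypothetical values).
test_image_feats = [0.9 0.1; 0.2 0.7; 0.1 0.1];
num_test = size(test_image_feats, 1);

% confidences(c, i) = dot(ws(c, :), x_i) + bs(c) for category c and image i.
confidences = ws * test_image_feats' + repmat(bs, 1, num_test);

% For each image (column), pick the classifier with the highest confidence.
[~, best] = max(confidences, [], 1);
predicted_categories = categories(best);

disp(predicted_categories);
```

Scoring every image with a single matrix product gives the same confidences as looping over images one at a time.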

Parameter Values

Step Size

The step size is an important parameter in two places. When building the vocabulary, a step size of 25 was chosen because it is fairly large and gives a sparse sampling of features from every training image. When creating the bag of SIFT features, a step size of 5 was chosen because a denser sampling of each image produces a more accurate histogram.
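A rough calculation shows how much the two step sizes differ in practice. The sketch below assumes a hypothetical 256x256 image and ignores the margin that vl_dsift leaves at the image border, so the counts are estimates only.

```matlab
% Hypothetical image size; the real images vary.
img_height = 256;
img_width = 256;

VOCAB_STEP_SIZE = 25;   % vocabulary building
BAGS_STEP_SIZE = 5;     % bag of SIFT features

% Approximate number of grid locations sampled at each step size.
vocab_samples = floor(img_height / VOCAB_STEP_SIZE) * floor(img_width / VOCAB_STEP_SIZE);
bags_samples = floor(img_height / BAGS_STEP_SIZE) * floor(img_width / BAGS_STEP_SIZE);

fprintf('step %d: about %d descriptors\n', VOCAB_STEP_SIZE, vocab_samples);
fprintf('step %d: about %d descriptors\n', BAGS_STEP_SIZE, bags_samples);
```

Under these assumptions the dense setting samples roughly 26 times as many descriptors per image, which is why it is reserved for the per-image histograms rather than the vocabulary.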

Lambda

Lambda is a very important parameter when training with vl_svmtrain. Because accuracies can vary between runs, each lambda value was run 10 times and the average taken. The lambda with the best average accuracy was then used throughout all other runs. The resulting averages:

Given the results, a lambda value of 0.0001 was used throughout the project.
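The selection procedure can be sketched as a simple sweep. In the code below, run_pipeline is a hypothetical placeholder for "train the SVMs with this lambda and return the test accuracy"; its formula is made up so the sketch runs on its own and does not reflect any measured result. The candidate grid is also an assumption.

```matlab
% Placeholder for the real train-and-evaluate step (NOT measured data).
run_pipeline = @(lambda) 0.6 - 0.02 * abs(log10(lambda) + 4);

lambdas = [1e-6 1e-5 1e-4 1e-3 1e-2 1e-1 1 10];   % assumed candidate grid
NUM_RUNS = 10;                                     % runs averaged per lambda

mean_accuracy = zeros(size(lambdas));
for k = 1:numel(lambdas)
    accuracies = zeros(1, NUM_RUNS);
    for r = 1:NUM_RUNS
        accuracies(r) = run_pipeline(lambdas(k));
    end
    mean_accuracy(k) = mean(accuracies);
end

% Keep the lambda with the best average accuracy.
[~, best] = max(mean_accuracy);
best_lambda = lambdas(best);
fprintf('best lambda: %g\n', best_lambda);
```

With a real pipeline the runs would differ from one another, which is the reason for averaging before picking the best value.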

Extra Credit

Spatial Pyramid Features

Bag of SIFT features performed well, but like normal SIFT features, they discard all of an image's spatial information. Some of this information can be preserved by splitting an image into a 4x4 grid and computing the histogram of SIFT features for each cell in the grid. This can be taken even further by splitting each image into a 16x16 grid and repeating the process. The finer the grid, though, the sparser each cell's histogram will be, which limits its effectiveness.

While keeping track of some of the spatial information is great, these features can be even more complete when the histograms from every layer are concatenated into a single feature. This has the obvious disadvantage of using more memory and running more slowly, but it can offer significant increases in accuracy. The code for creating these spatial pyramid histograms is shown here:


    % cell sizes for the 4x4 and 16x16 grids
    height_fourth = size(img, 1) / 4;
    width_fourth = size(img, 2) / 4;

    height_sixteen = size(img, 1) / 16;
    width_sixteen = size(img, 2) / 16;

    % one histogram for the whole image, plus one per grid cell
    histogram_main = zeros(1, vocab_size);
    histogram_4 = zeros(4, 4, vocab_size);
    histogram_16 = zeros(16, 16, vocab_size);

    for j=1:size(indicies,2)
        % vl_dsift frames are [x; y], so location(2) is the row coordinate
        location = locations(:, j);

        row4 = ceil(location(2) / height_fourth);
        col4 = ceil(location(1) / width_fourth);

        row16 = ceil(location(2) / height_sixteen);
        col16 = ceil(location(1) / width_sixteen);

        histogram_main(indicies(j)) = histogram_main(indicies(j)) + 1;
        histogram_4(row4, col4, indicies(j)) = histogram_4(row4, col4, indicies(j)) + 1;
        histogram_16(row16, col16, indicies(j)) = histogram_16(row16, col16, indicies(j)) + 1;
    end

    % concatenate all layers into a single feature vector
    histogram = [histogram_main histogram_4(:)' histogram_16(:)'];
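As a worked example of the cell lookup and of the memory cost, the sketch below maps one hypothetical keypoint to its grid cells and computes the length of the concatenated feature. The image size, keypoint position, and vocab_size of 200 are assumptions for illustration.

```matlab
% Hypothetical image size and keypoint centre (x = column, y = row).
img_height = 200;
img_width = 300;
x = 130;
y = 40;

% Cell indices in the 4x4 grid: cells are 50 pixels tall and 75 pixels wide.
row4 = ceil(y / (img_height / 4));
col4 = ceil(x / (img_width / 4));

% Cell indices in the 16x16 grid: cells are 12.5 pixels tall and 18.75 wide.
row16 = ceil(y / (img_height / 16));
col16 = ceil(x / (img_width / 16));

% One histogram for the whole image, 16 for the 4x4 grid, 256 for the 16x16 grid.
vocab_size = 200;   % assumed value
feature_length = vocab_size * (1 + 4 * 4 + 16 * 16);

fprintf('4x4 cell: (%d, %d), 16x16 cell: (%d, %d)\n', row4, col4, row16, col16);
fprintf('feature length: %d\n', feature_length);
```

Under these assumptions each image is described by 273 histograms instead of one, which accounts for the extra memory and running time.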

A 2 layer spatial pyramid can achieve an accuracy of 72.5%.

A 3 layer spatial pyramid can achieve an accuracy of 72.9%.

Overall, using a spatial pyramid for features improved accuracy by about 6.5 percentage points (from 65.97% to 72.5% with 2 layers). However, adding a third layer gained only another 0.4 points and took significantly longer to compute.

Final Results

Submitted Results

The submitted code had a couple of changes to reduce running time. The bag of SIFT step size was increased to 7 and all extra credit was removed. Here are the resulting accuracies:

| Feature | Classification | Accuracy |
| --- | --- | --- |
| Tiny Images | Nearest Neighbor | 19.2% |
| Bag of SIFT | Nearest Neighbor | 50.5% |
| Bag of SIFT | SVM | 66.5% |

Spatial Pyramid Bag of SIFT with SVM Classifier

Scene classification results visualization


Accuracy (mean of diagonal of confusion matrix) is 0.735
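For reference, the "mean of diagonal of confusion matrix" figure can be computed as in the sketch below. The three categories and the labels are toy values chosen for illustration, not the project's results.

```matlab
% Toy ground truth and predictions (hypothetical values).
categories = {'Coast'; 'Forest'; 'Street'};
test_labels = {'Coast'; 'Coast'; 'Forest'; 'Forest'; 'Street'; 'Street'};
predicted_categories = {'Coast'; 'Forest'; 'Forest'; 'Forest'; 'Street'; 'Coast'};

num_categories = numel(categories);
confusion = zeros(num_categories);

% Rows are true categories, columns are predicted categories.
for i = 1:numel(test_labels)
    row = find(strcmp(test_labels{i}, categories));
    col = find(strcmp(predicted_categories{i}, categories));
    confusion(row, col) = confusion(row, col) + 1;
end

% Turn counts into per-category rates so each row sums to 1.
confusion = confusion ./ repmat(sum(confusion, 2), 1, num_categories);

% The diagonal holds each category's accuracy; their mean is the overall score.
accuracy = mean(diag(confusion));
fprintf('accuracy: %.3f\n', accuracy);
```

In this toy case the per-category accuracies are 0.5, 1.0, and 0.5, so the mean is about 0.667.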

| Category name | Accuracy | False positives with true label | False negatives with wrong predicted label |
| --- | --- | --- | --- |
| Kitchen | 0.600 | InsideCity, Store | Office, Store |
| Store | 0.680 | Kitchen, LivingRoom | Kitchen, Kitchen |
| Bedroom | 0.650 | Kitchen, LivingRoom | Industrial, Kitchen |
| LivingRoom | 0.570 | InsideCity, Industrial | Bedroom, Industrial |
| Office | 0.890 | Store, Kitchen | LivingRoom, LivingRoom |
| Industrial | 0.510 | Office, Bedroom | Kitchen, Forest |
| Suburb | 0.920 | Highway, Store | LivingRoom, InsideCity |
| InsideCity | 0.540 | Coast, LivingRoom | Street, Forest |
| TallBuilding | 0.830 | Mountain, InsideCity | Industrial, Industrial |
| Street | 0.860 | Office, InsideCity | InsideCity, Highway |
| Highway | 0.790 | Industrial, OpenCountry | Coast, Industrial |
| OpenCountry | 0.600 | InsideCity, Coast | Coast, Coast |
| Coast | 0.840 | OpenCountry, Highway | OpenCountry, Highway |
| Mountain | 0.810 | Forest, OpenCountry | Coast, Street |
| Forest | 0.940 | Store, Store | Mountain, Mountain |