Project 4 / Scene Recognition with Bag of Words

This figure shows the confusion matrix for the most accurate configuration: bag of SIFT features with one-vs-all linear SVMs.

This project required me to develop algorithms for scene recognition. For example, different photos of kitchens should all be categorized as a kitchen because they contain a stove, an oven, a fridge, and so on. Unfortunately, not all kitchens look alike: different colors, different placements of appliances, and different viewpoints make it very difficult for a computer to recognize a kitchen as reliably as a human does. We work up from simple scene recognition methods, tiny images with nearest neighbor classification, to more advanced methods: bags of quantized local features and linear classifiers learned by support vector machines.

Results

1. placeholder - random guessing

Accuracy (mean of diagonal of confusion matrix) is 0.067. This was achieved by running the starter project without any changes. The accuracy is so low because categories are guessed at random: with no real algorithm behind the guesses, the classifier gets roughly 1 of the 15 categories right on average, which is about 0.067.

2. tiny images + nearest neighbor

Accuracy (mean of diagonal of confusion matrix) is 0.201. This is obtained from the implementation of get_tiny_images and nearest_neighbor_classify. The tiny images function takes each input image and resizes it to a 16 by 16 resolution; I used MATLAB's imresize function, which makes this very easy to implement. I then reshaped the result into a single row of image_feats, so that after looping over the entire list of image paths, image_feats is completely filled in. Below are the two main lines of code I used for the tiny image algorithm.

% Downsample the image to 16x16, then flatten it into a 1x256 feature row.
imageResized = imresize(image, [16 16]);
image_feats(index, :) = reshape(imageResized, 1, 256);
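
For context, here is a minimal sketch of how these two lines might sit inside the full get_tiny_images loop; the function signature follows the starter code, but the loop details and variable names here are my assumptions:

function image_feats = get_tiny_images(image_paths)
% Sketch: build a 16x16 "tiny image" feature for every input image.
image_feats = zeros(length(image_paths), 256);
for index = 1:length(image_paths)
    image = imread(image_paths{index});         % dataset images are grayscale
    imageResized = imresize(image, [16 16]);    % downsample to 16x16
    image_feats(index, :) = reshape(imageResized, 1, 256);   % flatten to 1x256
end
end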

In the nearest_neighbor_classify method, I needed to guess which category each test image most aligned with based on the training images. I began by using the recommended vl_alldist2 function to compute the pairwise distance matrix between the test and training features. Then, for each test image, I looped over the training examples to find the index with the smallest value in the distance matrix, i.e. the actual nearest neighbor, and stored the corresponding category in a predicted_categories array. Below is the smallest-index search I wrote.

% Find the training example closest to test image "index".
smallest = inf;
smallestIndex = 0;
for current = 1:size(train_image_feats, 1)
    val = distance(index, current);
    if smallest > val
        smallest = val;
        smallestIndex = current;
    end
end
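
For reference, here is a rough sketch of how that loop connects to the vl_alldist2 call and the final prediction; the orientation of the distance matrix (test rows, training columns) and the cell-array indexing are my assumptions:

% vl_alldist2 expects one data point per column, so both feature matrices
% are transposed; distance(i, j) is then the distance from test image i
% to training image j.
distance = vl_alldist2(transpose(test_image_feats), transpose(train_image_feats));

% After the search loop, the nearest training example's label becomes the
% prediction for this test image. The loop itself could also be replaced by:
% [smallest, smallestIndex] = min(distance(index, :));
predicted_categories{index, 1} = train_labels{smallestIndex};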

3. bag of SIFT + nearest neighbor

Accuracy (mean of diagonal of confusion matrix) is 0.478. This is obtained from the implementation of get_bags_of_sifts, nearest_neighbor_classify, and build_vocabulary. This method uses the same nearest_neighbor_classify as above, but the image representation is now a bag of SIFT features, which is far more descriptive than tiny images. Instead of raw pixels, each image is described by a histogram of quantized SIFT descriptors, i.e. local features assigned to clusters of a visual vocabulary. In get_bags_of_sifts I iterate through the image paths and call vl_dsift, a recommended built-in function that extracts a set of dense SIFT descriptors from the image. For each descriptor I build a feature vector, compute its distance to every word in the vocabulary with vl_alldist2, and take the minimum distance to find the nearest word, whose histogram bin is then incremented for that image. Below is a snippet of the code I used to gather the descriptor values and compute the distances to the vocabulary.


% Copy the "current" SIFT descriptor (one column of the vl_dsift output,
% here called secondParam) into a column vector.
for curVal = 1:size(secondParam, 1)
     featureValues(curVal, 1) = secondParam(curVal, current);
end

% Distance from this descriptor to every word (row) of the vocabulary.
for cur = 1:size(vocab, 1)
    clusterValues = double(vocab(cur, :));
    allDistValues(cur) = vl_alldist2(featureValues, transpose(clusterValues));
end
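
To complete the picture, here is a hedged sketch of what happens right after these loops in my reading of the pipeline: the descriptor is assigned to its nearest vocabulary word and that word's histogram bin is incremented (imageIndex and the use of image_feats as the histogram are my naming assumptions):

% Assign this SIFT descriptor to the closest vocabulary word and count it
% in the bag-of-words histogram for the current image.
[~, nearestWord] = min(allDistValues);
image_feats(imageIndex, nearestWord) = image_feats(imageIndex, nearestWord) + 1;

The finished histograms are typically normalized afterwards so that the number of descriptors per image does not bias the classifier.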

The build_vocabulary function used in this pipeline samples SIFT descriptors from the training images, clusters them with k-means, and returns the cluster centers as the vocabulary. I began by iterating through the image paths. For each image I used the recommended vl_dsift with the bin size option set to 10, which per the VLFeat documentation means each spatial bin of the descriptor covers 10 pixels, a larger bin than the default. Once the iteration is finished, the collected descriptors are clustered with k-means. Below is the important line of code that clusters the values obtained from the iteration before it.


% Cluster the sampled SIFT descriptors; vl_kmeans expects one descriptor
% per column and returns one centroid per column.
centroids = vl_kmeans(single(clusterValues), vocab_size);
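
For context, here is a minimal sketch of the descriptor-collection loop that feeds this clustering step, assuming the bin size option of 10 mentioned above; the variable names and the final transpose into a vocab_size x 128 vocab are my assumptions:

function vocab = build_vocabulary(image_paths, vocab_size)
% Sketch: sample dense SIFT descriptors from each training image, then
% cluster them with k-means to form the visual vocabulary.
allDescriptors = [];
for i = 1:length(image_paths)
    img = single(imread(image_paths{i}));
    [~, descriptors] = vl_dsift(img, 'size', 10);       % dense SIFT, larger bins
    allDescriptors = [allDescriptors, descriptors];     % 128 x N, one column each
end
centroids = vl_kmeans(single(allDescriptors), vocab_size);   % 128 x vocab_size
vocab = transpose(centroids);                                % vocab_size x 128
end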

4. bag of SIFT + one-vs-all linear SVM

Accuracy (mean of diagonal of confusion matrix) is 0.618. This is obtained from the implementation of get_bags_of_sifts, svm_classify, and build_vocabulary. This final combination is the most accurate: bag of SIFT features with linear SVMs produced the best results because one linear SVM is trained per category and those classifiers are then used to predict the category of each test image, which is considerably more accurate than nearest neighbor. In svm_classify we begin by iterating through all of the unique train labels passed into the function. For each category we compare it against the train labels to form a +1/-1 label vector, then pass those labels, the train_image_feats, and a lambda of 0.0001 into the recommended vl_svmtrain, which returns the weights W and bias B of a linear classifier for that category. After testing lambda values from 0.01 to 0.00001, I found that 0.0001 gave me the highest accuracy. Once the iteration is finished, we collect the B values from vl_svmtrain into an array, multiply the W values with test_image_feats, and add the two together to get a confidence for every category; the most confident category becomes the prediction. Below is a snippet of code that shows how the algorithm was formed.

...
% Train one linear SVM per category (one-vs-all).
for index=1:length(unique(train_labels))
    ...
    %[W B] = vl_svmtrain(features, labels, LAMBDA)
    [W, B] = vl_svmtrain(transpose(train_image_feats), labelValues, 0.0001);
    allWValues(:, index) = W;    % weight vector for this category
    allBValues(1, index) = B;    % bias for this category
end

% Replicate each category's bias across all test images.
values = zeros(size(test_image_feats, 1), length(uniqueCategories));
for current=1:size(test_image_feats, 1)
    for currentTwo=1:length(uniqueCategories)
        values(current,currentTwo) = allBValues(1,currentTwo);
    end
end

...

% Confidence of every category for every test image: W'x + B.
% (test_image_feats is N x D and allWValues is D x numCategories,
%  so the product is N x numCategories.)
confidenceValues = test_image_feats * allWValues + values;
...
% secondParam holds the column index of the largest confidence in each row,
% i.e. the most confident category for each test image.
predicted_categories = uniqueCategories(secondParam);
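
One piece the snippet above elides is how labelValues is built for each category; here is a hedged sketch of the one-vs-all labeling described in the text (uniqueCategories and the strcmp-based matching are my assumptions):

% uniqueCategories = unique(train_labels);   (computed once before the loop)
% +1 for training images whose label matches the current category, -1 otherwise.
matches = strcmp(train_labels, uniqueCategories{index});
labelValues = 2 * double(matches) - 1;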

Per-category results for bag of SIFT + one-vs-all linear SVM. The sample training image and sample true positive columns contained only thumbnails and are omitted here; the remaining columns list the true labels of the false positive examples and the wrongly predicted labels of the false negative examples.

Category name    Accuracy   False positives with true label   False negatives with wrong predicted label
Kitchen          0.460      Bedroom, LivingRoom               Office, Store
Store            0.700      InsideCity, Kitchen               Mountain, Street
Bedroom          0.480      LivingRoom, LivingRoom            Store, Store
LivingRoom       0.140      Bedroom, Bedroom                  Bedroom, Street
Office           0.820      LivingRoom, Street                Suburb, Kitchen
Industrial       0.330      Bedroom, TallBuilding             TallBuilding, Mountain
Suburb           0.920      Mountain, Mountain                Street, OpenCountry
InsideCity       0.530      Industrial, Suburb                Kitchen, Store
TallBuilding     0.700      Industrial, Street                Street, Store
Street           0.630      LivingRoom, Mountain              Office, Store
Highway          0.710      OpenCountry, OpenCountry          Street, OpenCountry
OpenCountry      0.360      Coast, Coast                      Bedroom, Coast
Coast            0.740      Highway, Highway                  Suburb, OpenCountry
Mountain         0.800      Street, Coast                     Coast, Coast
Forest           0.950      OpenCountry, TallBuilding         Mountain, Mountain

This project really helped my understanding of image recognition algorithms. The best result was, unsurprisingly, the bag of SIFT features with the linear SVM classifier, which gave me 61.8% accuracy. Overall, this was a great project for learning this subject.