One of the hallmarks of human vision is our ability to recognize a large number of objects and scenes under wide number of conditions. The project will explore several computational approaches to this problem. There are a few themes to keep in mind when evaluating the following methods. How are the images represented? Is this representation lossy? Given an image representation, how is this used to actually classify this image? How long do these methods take? Once these factors are understood, choosing a method for an actual application becomes easier and more methodic.
The dataset that will be used for this project will consist of 3000 images of 15 difference scenes. It should be noted that for modern methods this data set is tiny tiny tiny, but for the context of the methods we'll be using it provides suffient data for training and testing.
When deciding on features, we want to choose a space that is expressive enough to differentiate between dissimilar scenes but also not too expressive to the point of being attune to irrelevant details. One such approach is to simply resize an image to some small, standardized size, throwing away the high frequencies in the image and just keeping the most general shapes. To account for differences in lighting and scale, the small image can then be zero centered and normalized. In my implementation, I resize every image to a 24x24 blob, centering the image around zero and setting the variance to one.
The pros of this approach are clear: it's easy to understand and implement, and it's fast. The disadvantage is that we're throwing away a large amount of information in the form of medium and high frequencies, and it's not hard to visualize why dealing with only 24x24 images would make classifying scenes difficult, even to the human eye. Still, as we will see, this approach provides a decent baseline, exceeding random chance and coming somewhat close to more involved representations.
The bag of words approach is a bit more involved than using tiny images. The basic idea is that by sampling features throughout an image, we can build a so called dictionary of visual words, allowing us to represent an image by the visual words it contains. This provides a compact representation without being too low level. The questions that have to be answered in this approach are:
Several methods exist to extract features, such as SIFT, gist, and MSER. For this project, I used 128 dimensional SIFT features. Further modifications can be made to retain some sense of spatial layout, but for the sake of time I only look at the SIFT features of the entire image.
To build the visual dictionary, I sample SIFT features from many different images. I then cluster these features, using the centroids as my visual words. One thing I do to increase accuracy is sample features at different scales of a guassian pyramid. Another approach I use is to sample only 80% of features from a specific image to save time. Perhaps the most important decision in this process is the number of visual words. To determine the best, I used the entire pipeline with increasing number of visual words, starting at 10 and ending at 1200. I discuss the results later.
Once the visual words have been created, we then have to decide how to assign visual words to an image. To do this, I extract the SIFT features of the image and build a histogram detailing the frequency of the words found. I made two minor modifications to the typical binning approach to increase the accuracy and speed. First, instead of just finding the closest visual word, I find the five closest visual words and weight their votes accordingly. Specifically, the closest word gets a full vote, the second a half, the third a half of a half, and so on. I also find the shortest distances using a KD-tree instead of through brute force, which speeds up the process slightly (normally a minute or two depending on the vocab size).
Now that we've discussed how to represent our data set, we have to decide how to go about classifying a specific image. Perhaps the most intuitive approach is to find the closest match of the test image within the training image, giving the test image the same label. To match images, I take the features of an image (whether that be the tiny image approach or a bag of words) and use the euclidean distance metric to find its closest hombre. The advantage of this is that it's fast (no training time essentially) and can express any arbitrary complexity of the data (given an infinite amount of data mind you). The downside is that it's prone to outliers. One way to address this is to use the K neighest neighbors instead of 1 and let them vote on the scene. The scene with the most votes wins.
While the nearest neighbor approach gives okay results, SVM's in practice greatly outperform them for classification tasks such as scene recognition (in general). Going into the workings of SVM is beyond the scope of this report (yay for dual optimization and kernel tricks), it suffices to say they maximize the margins between the different classes. SVM's are inherently binary, so to deal with the fifteen total labels, we use the one-v-all approach. This means I train 15 different classifiers and choose the result that gives the greatest accuracy.
As expected, the bag of features approach coupled with a linear kernel support vector machine provided by the best accuracy. The tiny images with nearest neighbors gave a modest bump from random, but the bag of features gave a major boost, even with the NN. I show the confusion matrix of the linear SVM and some examples of the decision made below, and the results appear pretty intuitive. Distinctive scenes like highways and forests are easy to classify, but something like a bedroom or living room got relatively low accuracy.
A note about hyperparameters. I dedicate a section to this after the results section, but to touch on the difference between the optimal results shown below and the ones shown that can be obtained in less than 10 minutes, the only difference is the vocab size. I used a vocab size of 50 for the fast results and a vocab size of 400 for the fast.
Random | Tiny images Representation, NN Classifer | Bag of Features, NN Classifer | Bag of Features, Linear SVM |
8 % | 22.7 % | 53.8 % | 71.2 % |
Random | Tiny images Representation, NN Classifer | Bag of Features, NN Classifer | Bag of Features, Linear SVM |
8 % | 22.7 % | 48.7 % | 60.3 % |
Category name | Accuracy | Sample training images | Sample true positives | False positives with true label | False negatives with wrong predicted label | ||||
---|---|---|---|---|---|---|---|---|---|
Kitchen | 0.570 | Bedroom |
LivingRoom |
Store |
Store |
||||
Store | 0.630 | LivingRoom |
LivingRoom |
Kitchen |
Street |
||||
Bedroom | 0.540 | Store |
Kitchen |
Kitchen |
Industrial |
||||
LivingRoom | 0.390 | InsideCity |
Bedroom |
Bedroom |
Store |
||||
Office | 0.890 | Kitchen |
Bedroom |
Suburb |
LivingRoom |
||||
Industrial | 0.550 | Highway |
Bedroom |
InsideCity |
TallBuilding |
||||
Suburb | 0.960 | Mountain |
Industrial |
InsideCity |
Store |
||||
InsideCity | 0.650 | TallBuilding |
Kitchen |
Store |
Store |
||||
TallBuilding | 0.740 | LivingRoom |
Mountain |
Industrial |
Office |
||||
Street | 0.800 | Industrial |
OpenCountry |
Highway |
Industrial |
||||
Highway | 0.820 | Store |
Street |
Mountain |
Coast |
||||
OpenCountry | 0.560 | Coast |
Highway |
Coast |
Coast |
||||
Coast | 0.790 | Mountain |
Highway |
Highway |
OpenCountry |
||||
Mountain | 0.850 | OpenCountry |
Forest |
Coast |
LivingRoom |
||||
Forest | 0.940 | Mountain |
Store |
Mountain |
Store |
The trend of the vocab is fairly intuitive. As it increases, the classification accuracy increases as well until around 400, where it levels out. The time increases as well, being especially prohibitive at around 600 words. Going with something around 200 gives the best tradeoff between time and accuracy.
Vocabulary Size | 10 | 50 | 100 | 200 | 400 | 600 | 1200 |
Classification Accuracy | 38% | 61% | 66% | 68% | 71% | 71% | 72% |
Time | 8 min | 9 min | 10 min | 11 min | 14 min | 20 min | 30 Min |
The value of K has some effect on the accuracy, giving a boost of a few points. But it's nothing particulary significant.
K with 400 Vocab Size | 1 | 5 | 15 | 30 | 50 |
Classification Accuracy | 51% | 53% | 53% | 50% | 49% |