Project 4 / Scene Recognition with Bag of Words

Project 4 is about labeling images based on other images whose labels are already known. In this case, the label is the scene shown in the image (e.g. Forest, Highway, Coast). This can be a daunting task because images vary so widely, and the same general label can be correct for images that look very different from one another.

There are two steps to this process. The first step is coming up with a feature to describe the image. We have learned in previous projects that a feature is a way of describing a point in an image so that it can hopefully be matched to corresponding points in similar images. In this project, we need a feature for an entire image so that it can easily be compared to other images. I implemented two such image features, which I describe below.

The second step is training a model to classify images. Given the features of some images and their known labels (the training set), we want to predict which label applies to other images (the testing set, whose features we can calculate). There are also two implementations of this step in this project, which I describe in more detail below.

Image Representation Techniques

The first way to create a feature from an image is to turn it into a "tiny image" by resizing it to be much smaller than the original (16x16 in my implementation). The hope is that you lose information that is too specific to that particular image but keep more general information that applies to all images of that type of scene. It doesn't work amazingly well in practice, but it is certainly better than random guessing.
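To make the tiny image idea concrete, here is a minimal sketch in Python. It is not the code used for this project: the function name tiny_image, the choice of block averaging as the resizing method, and the list-of-rows grayscale input are all assumptions made for illustration.

```python
def tiny_image(pixels, size=16):
    """Shrink a grayscale image (a list of rows) to size x size values.

    Assumption: each output value is the average of the source block
    that maps onto it. A real resize routine may interpolate differently.
    """
    height, width = len(pixels), len(pixels[0])
    feature = []
    for i in range(size):
        # Rows of the source image covered by output row i.
        r0 = i * height // size
        r1 = max((i + 1) * height // size, r0 + 1)
        for j in range(size):
            # Columns of the source image covered by output column j.
            c0 = j * width // size
            c1 = max((j + 1) * width // size, c0 + 1)
            block = [pixels[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            feature.append(sum(block) / len(block))
    # The flattened size*size list is the feature for the whole image.
    return feature
```

With size=16 the result is a 256-value feature no matter how large the original image was, which is what makes two images directly comparable.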

The second way is similar to the method we used for features in project 2: SIFT features. We can sample SIFT features from an image, build a histogram of how often each kind of feature occurs, and use the histogram as our overall image feature. One important note is that you need a way to group the SIFT features, as no exact SIFT feature is realistically going to appear more than once. We do this by creating a "vocabulary" of SIFT features, which is essentially a set of clusters of features taken from known images, and then placing each feature in an image into a bucket based on which "word" in the vocabulary it is closest to. For my vocabulary, I used a size of 200. I attempted to build a larger vocabulary (more words give more buckets and therefore generally better accuracy), but my computer would lock up when running the algorithm with the larger size. My build_vocabulary algorithm uses a step size of 3 (so sampling every 3rd SIFT feature) while my get_bags_of_sifts algorithm uses a step size of 8 (sampling every 8th SIFT feature).
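The bucketing step can be sketched as follows, assuming the vocabulary has already been built and that descriptors are plain lists of numbers. This is an illustration in Python rather than the project's own get_bags_of_sifts code. The helper names are hypothetical, and dividing the counts by their total is an added assumption so that images with different numbers of sampled features stay comparable.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bag_of_words_histogram(descriptors, vocabulary):
    """Count how many descriptors fall closest to each vocabulary word."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        # Index of the vocabulary word nearest to this descriptor.
        nearest = min(
            range(len(vocabulary)),
            key=lambda k: squared_distance(descriptor, vocabulary[k]),
        )
        counts[nearest] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(vocabulary)
    # Assumption: normalize so the histogram sums to 1.
    return [count / total for count in counts]
```

For a vocabulary of 200 words, the returned histogram has 200 entries, one per word, and that histogram is the feature for the whole image.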

Classification Techniques

The first classification technique is again similar to what we have done before: the nearest neighbor algorithm. This model is extremely simple. You loop through all of the features computed from the training set, compare each one to the feature of the test image you are trying to classify, and determine which training feature is the "closest". Then, you assign that training image's label to the test image. This can cause problems if the features aren't well defined or if there are outlier images. An alternative is to take the mode of the labels of the k closest neighbors and assign that instead of the single closest label. For this project, I stuck to 1NN, which assigns the closest training image's label to the test image.
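A minimal Python sketch of this classifier is shown here, with k as a parameter so that both 1NN and the k-neighbor vote are covered. It is an illustration only, not the project's code. The function names and the use of squared Euclidean distance as the measure of "closest" are assumptions.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_classify(test_feature, train_features, train_labels, k=1):
    """Label a test feature by majority vote among its k nearest training features."""
    # Training indices sorted from nearest to farthest.
    order = sorted(
        range(len(train_features)),
        key=lambda i: squared_distance(test_feature, train_features[i]),
    )
    # With k=1 this is simply the label of the single closest training image.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

The outlier problem shows up even on a toy example: with training features [[0, 0], [0, 1], [5, 5]] labeled Forest, Forest, Coast, the test feature [4, 4] is labeled Coast with k=1 but Forest with k=3, because the vote lets the two Forest neighbors override the single closest image.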

The second classification technique is a linear support vector machine. This technique essentially involves building a linear model (think y = wx + b) where x is an image feature and y is a label. By fitting this model as best we can to the training set (calculating w and b), we can input a test image's feature and hopefully receive the correct label as output. For this project, I created one such model for each category in the training set, using a binary labeling (1 for matching and -1 for not) for each category. This meant that we then had to take the outputs of all of the models and pick the most positive one to decide how to label the test image. Building each individual model involves a tunable parameter called lambda. In my case, I found the best lambda to be 0.001. I tried other lambdas in either direction (0.01, 0.0001), but these were not as successful as the lambda that I eventually settled on.
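The one-model-per-category scheme can be sketched in Python as follows. The sketch covers only the binary relabeling and the final "most positive output" decision. It assumes each model's w and b have already been trained elsewhere (that is where lambda comes in), and the function names and the dictionary layout are hypothetical.

```python
def binary_labels(train_labels, category):
    """Relabel the training set for one category: 1 if it matches, -1 otherwise."""
    return [1 if label == category else -1 for label in train_labels]


def svm_score(w, b, x):
    """Output of one linear model, y = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict_one_vs_all(models, x):
    """Pick the category whose model gives the most positive output.

    models maps each category name to its trained (w, b) pair.
    """
    return max(models, key=lambda category: svm_score(*models[category], x))
```

Taking the maximum score rather than requiring a positive score means every test image receives a label, even when every model outputs a negative value.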


For this project, we were asked to report the results of our algorithms on a training and testing set of images for three pairings of the techniques from the two steps required for scene recognition. You can see my results page for each of these groups by clicking on the appropriate link below.

As you can see, the accuracy gets better the more detailed we become. There is a massive jump from the first to the second group because, as I said before, tiny images lose too much information that SIFT allows us to keep. You can see that several scenes that don't look like each other at all are getting matched in the results, simply because they have similar color patterns in a general sense. With SIFT, at least some of the mismatches are in the same ballpark as the label they are assigned (for example, the "Office" label being assigned to a scene that is actually a "Bedroom"). That being said, the SIFT algorithm takes a lot longer to run (several minutes instead of just a few seconds), so that may be worth considering as well. I also feel like my SIFT features would have performed better if I had been able to use a larger vocabulary size. Right now, we may just not have enough detail to accurately label some of the scene types, and that would have improved given more "words" to sort each image's SIFT features into.