An example of a typical bag of words classification pipeline. Figure by Chatfield et al.

Project 4: Scene recognition with bag of words
CS 6476: Computer Vision



The goal of this project is to introduce you to image recognition. Specifically, we will examine the task of scene recognition starting with very simple methods -- tiny images and nearest neighbor classification -- and then move on to more advanced methods -- bags of quantized local features and linear classifiers learned by support vector machines.

Bag of words models are a popular technique for image classification inspired by models used in natural language processing. The model ignores or downplays word arrangement (spatial information in the image) and classifies based on a histogram of the frequency of visual words. The visual word "vocabulary" is established by clustering a large corpus of local features. See Szeliski chapter 14.4.1 for more details on category recognition with quantized features. In addition, 14.3.2 discusses vocabulary creation and 14.1 covers classification techniques.

For this project you will be implementing a basic bag of words model with many opportunities for extra credit. You will classify scenes into one of 15 categories by training and testing on the 15 scene database (introduced in Lazebnik et al. 2006, although built on top of previously published datasets). Lazebnik et al. 2006 is a great paper to read, although we will be implementing the baseline method the paper discusses (equivalent to the zero level pyramid) and not the more sophisticated spatial pyramid (which is extra credit). For an excellent survey of pre-deep-learning feature encoding methods for bag of words models see Chatfield et al, 2011.

Example scenes from of each category in the 15 scene dataset. Figure from Lazebnik et al. 2006.


(We will create a new environment named cs6476p4 with a lot more dependencies compared to previous projects!)

  1. Install Miniconda. It doesn't matter whether you use 2.7 or 3.6 because we will create our own environment anyways.
  2. Create a conda environment using the appropriate command. On Windows, open the installed "Conda prompt" to run this command. On MacOS or Linux, you can just use a terminal window to run the command.
    conda env create -f environment_<OS>.yml
  3. This should create an environment named cs6476p4. Activate it using the follow Windows command:
    activate cs6476p4
    or the following MacOS/Linux command:
    source activate cs6476p4
  4. Run the notebook using:
    jupyter notebook ./code/proj4.ipynb
  5. Generate the submission once you've finished the project using:

Details and Starter Code

The primary script for this project is The ipython notebook proj4.ipynb runs through the function calls required to implement the project. Your task is to implement the methods within the primary script that will allow you to achieve the desired accuracies on the notebook.

You are required to implement 2 different image representations -- tiny images and bags of SIFT features -- and 2 different classification techniques -- nearest neighbor and linear SVM. In the writeup, you are specifically asked to report performance for the following combinations, and it is also highly recommended that you implement them in this order:

You will start by implementing the tiny image representation and the nearest neighbor classifier. They are easy to understand, easy to implement, and runs very quickly for our experimental setup (less than 10 seconds).

The "tiny image" feature, inspired by the work of the same name by Torralba, Fergus, and Freeman, is one of the simplest possible image representations. One simply resizes each image to a small, fixed resolution (we recommend 16x16). It works slightly better if the tiny image is made to have zero mean and unit length. This is not a particularly good representation, because it discards all of the high frequency image content and is not especially invariant to spatial or brightness shifts. Torralba, Fergus, and Freeman propose several alignment methods to alleviate the latter drawback, but we will not worry about alignment for this project. We are using tiny images simply as a baseline. See get_tiny_images() in the starter code for more details.

The nearest neighbor classifier is equally simple to understand. When tasked with classifying a test feature into a particular category, one simply finds the "nearest" training example (L2 distance is a sufficient metric) and assigns the test case the label of that nearest training example. The nearest neighbor classifier has many desirable features -- it requires no training, it can learn arbitrarily complex decision boundaries, and it trivially supports multiclass problems. It is quite vulnerable to training noise, though, which can be alleviated by voting based on the K nearest neighbors (but you are not required to do so). Nearest neighbor classifiers also suffer as the feature dimensionality increases, because the classifier has no mechanism to learn which dimensions are irrelevant for the decision. The nearest neighbor computation also becomes slow for high dimensional data and many training examples. See nearest_neighbor_classify() for more details.

Together, the tiny image representation and nearest neighbor classifier will get about 15% to 25% accuracy on the 15 scene database. For comparison, chance performance is ~7%.

After you have implemented a baseline scene recognition pipeline it is time to move on to a more sophisticated image representation -- bags of quantized SIFT features. Before we can represent our training and testing images as bag of feature histograms, we first need to establish a vocabulary of visual words. We will form this vocabulary by sampling many local features from our training set (10's or 100's of thousands) and then clustering them with kmeans. The number of kmeans clusters is the size of our vocabulary and the size of our features. For example, you might start by clustering many SIFT descriptors into k=50 clusters. This partitions the continuous, 128 dimensional SIFT feature space into 50 regions. For any new SIFT feature we observe, we can figure out which region it belongs to as long as we save the centroids of our original clusters. Those centroids are our visual word vocabulary. Because it can be slow to sample and cluster many local features, the starter code saves the cluster centroids and avoids recomputing them on future runs. See build_vocabulary() for more details.

Now we are ready to represent our training and testing images as histograms of visual words. For each image we will densely sample many SIFT descriptors. Instead of storing hundreds of SIFT descriptors, we simply count how many SIFT descriptors fall into each cluster in our visual word vocabulary. This is done by finding the nearest neighbor kmeans centroid for every SIFT feature. Thus, if we have a vocabulary of 50 visual words, and we detect 220 SIFT features in an image, our bag of SIFT representation will be a histogram of 50 dimensions where each bin counts how many times a SIFT descriptor was assigned to that cluster and sums to 220. The histogram should be normalized so that image size does not dramatically change the bag of feature magnitude. See get_bags_of_sifts() for more details.

You should now measure how well your bag of SIFT representation works when paired with a nearest neighbor classifier. There are many design decisions and free parameters for the bag of SIFT representation (number of clusters, sampling density, sampling scales, SIFT parameters, etc.) so accuracy might vary from 50% to 60%.

The last task is to train 1-vs-all linear SVMS to operate in the bag of SIFT feature space. Linear classifiers are one of the simplest possible learning models. The feature space is partitioned by a learned hyperplane and test cases are categorized based on which side of that hyperplane they fall on. Despite this model being far less expressive than the nearest neighbor classifier, it will often perform better. For example, maybe in our bag of SIFT representation 40 of the 50 visual words are uninformative. They simply don't help us make a decision about whether an image is a 'forest' or a 'bedroom'. Perhaps they represent smooth patches, gradients, or step edges which occur in all types of scenes. The prediction from a nearest neighbor classifier will still be heavily influenced by these frequent visual words, whereas a linear classifier can learn that those dimensions of the feature vector are less relevant and thus downweight them when making a decision. There are numerous methods to learn linear classifiers but we will find linear decision boundaries with a support vector machine. You do not have to implement the support vector machine. However, linear classifiers are inherently binary and we have a 15-way classification problem. To decide which of 15 categories a test case belongs to, you will train 15 binary, 1-vs-all SVMs. 1-vs-all means that each classifier will be trained to recognize 'forest' vs 'non-forest', 'kitchen' vs 'non-kitchen', etc. All 15 classifiers will be evaluated on each test case and the classifier which is most confidently positive "wins". E.g. if the 'kitchen' classifier returns a score of -0.2 (where 0 is on the decision boundary), and the 'forest' classifier returns a score of -0.3, and all of the other classifiers are even more negative, the test case would be classified as a kitchen even though none of the classifiers put the test case on the positive side of the decision boundary. When learning an SVM, you have a free parameter 'lambda' which controls how strongly regularized the model is. Your accuracy will be very sensitive to lambda, so be sure to test many values. See svm_classify() for more details.

Now you can evaluate the bag of SIFT representation paired with 1-vs-all linear SVMs. Accuracy should be from 60% to 70% depending on the parameters. You can do better still if you implement extra credit suggestions below.

The starter code, starting from contains more concrete guidance on the inputs, outputs, and suggested strategies for the five functions you will implement: get_tiny_images(), nearest_neighbor_classify(), build_vocabulary(), get_bags_of_sifts(), and svm_classify(). The utils folder contains code required to visualize the results on the notebook, which you are not required to edit.

Experimental Design

An important aspect of machine learning is to estimate "good" hyper-parameters. As part of this project, you will perform the following three experiments and analyze the results in your writeup:

Evaluation and Visualization

The starter code builds a confusion matrix and visualizes your classification decisions by producing a confusion matrix between the ground truth test labels and the predicted test labels within the notebook, each time you run


The starter codes trains and tests on 100 images from each category (i.e. 1500 training examples total and 1500 test cases total). Unzip the provided and place the 'data/' folder alongside 'code/' and 'html/'. In a real research paper, one would be expected to test performance on random splits of the data into training and test sets, but the starter code does not do this to ease debugging.

Useful Functions

The starter code contains more complete discussions of useful functions from Scikit-Learn utilities to VLFeat commands.

Write up

For this project, and all other projects, you must do a project report in HTML. In the report you will describe your algorithm and any decisions you made to write your algorithm a particular way. Use the experimental design results to aid your analysis. Analysis must be more than just a presentation of results or narration of steps, and can include things like explanations for why design choices affect results in certain ways. Discuss any extra credit you did and show what contribution it had on the results (e.g. performance with and without each extra credit component).

You are required to report the accuracy you achieved for the three recognition pipelines above (tiny images + nearest neighbor, bag of SIFT + nearest neighbor, and bag of SIFT + 1 vs all linear SVM). The accuracy number reported by the starter code -- the average of the diagonal of the confusion matrix -- will suffice. However, for your best performing recognition setup you should include the full confusion matrix. Simply copy the html and images into your writeup. Do not include code or your GTID in your report.

Extra Credit

For all extra credit, be sure to include quantitative analysis showing the impact of the particular method you've implemented. Each item is "up to" some amount of points because trivial implementations may not be worthy of full extra credit. Most of the extra credit focuses on the final bag of words + SVM pipeline of the project, not the baseline tiny image and nearest neighbor methods.

Feature representation extra credit:

Feature quantization and bag of words extra credit:

Classifier extra credit:

Spatial Pyramid representation and classifier:

Experimental design extra credit:

Deep learning. We don't recommend exploring deep learning methods as extra credit because that will be the focus of project 6.

Finally, there will be extra credit and recognition for the student who achieves the highest recognition rate. Deep feature representations are excluded from this competition.

Web-Publishing Results

All the results for each project will be put on the course website so that the students can see each other's results. In class we will highlight the best projects as determined by the professor and TAs. If you do not want your results published to the web, you can choose to opt out.

Handing in

This is very important as you will lose points if you do not follow instructions. Every time you do not follow instructions, you will lose 5 points. The folder you hand in must contain the following:

Hand in your project as a zip file through Canvas. This zip must be less than 5MB. If images are taking up too much space, you can use things like imagemagick to shrink them. Please Note that when grading, we will not be using your notebook, so make sure extra credit is thoroughly documented in the README, and the function signatures in are unchanged.


Final Advice

Extracting features, clustering to build a universal dictionary, and building histograms from features can be slow. A good implementation can run the entire pipeline in less than 10 minutes, but this may be at the expense of accuracy (e.g. too small a vocabulary of visual words or too sparse a sampling rate).


This project is based on a project from Aaron Bobick's offering of 4495 and has been expanded and edited by Sainandan Ramakrishnan, Vijay Upadhya, Samarth Brahmbhatt and James Hays.