Project 4 / Scene Recognition with Bag of Words



Motivation:

Our brains make vision easy. It takes minimal effort for us humans to distinguish between a leopard and a jaguar. We can read words on signs such as "yield" and "stop" and intuitively understand their meaning. We can recognize another human as a human and can classify categories of animals. We are able to understand images. Computers are not able to do what our brains do for us: for a computer to recognize an image, we will need to teach it. There are many methods to do this and we will explore two of them:

At the end of our project, we will be able to classify scenes into 1 of 15 categories by training and testing on the 15 scene database.

Project Outline:

Confusion Matrix:

We can generate a confusion matrix to visualize how often a test case from category $y$ was predicted to be category $x$. A confusion matrix looks as follows:

The matrix can take on values from 0 to 1, which can be identified through the colors in the legend. In our project we will use the "jet" colormap, whose spectrum of colors ranges from dark blue through light blue, green, yellow, and orange to dark red. In the chart above we see that the diagonal is where we have correctly predicted values, i.e. we predict zero to be zero, one to be one, ..., nine to be nine. By examining the legend, we see that the prediction of six as six was highest, as it has a dark shade of red and a value of 0.71. In our case, we will be predicting 15 categories ranging from kitchens to forests. If we take the mean of the diagonal, we obtain an overall accuracy across all categories in our classification. The higher the overall accuracy, the better our predictions. Ideally, we want the diagonal to be dark red while the other squares in the confusion matrix are dark blue, which indicates that we have performed good classification. We will see this as we progress from random guessing to k-nearest-neighbors to SVM.
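As a concrete illustration, a matrix like the one above can be produced with a few lines of MATLAB. The sketch below is not the starter code; it assumes true_labels and predicted_labels are cell arrays of category names and categories is the list of the 15 category names.

```matlab
% Minimal sketch: build a row-normalized confusion matrix and display it
% with the "jet" colormap. true_labels, predicted_labels, and categories
% are assumed variables, not the starter code's.
num_categories = numel(categories);
confusion = zeros(num_categories);
for i = 1:numel(true_labels)
    row = find(strcmp(true_labels{i},      categories));
    col = find(strcmp(predicted_labels{i}, categories));
    confusion(row, col) = confusion(row, col) + 1;
end
% Normalize each row so entries lie between 0 and 1.
confusion = bsxfun(@rdivide, confusion, sum(confusion, 2));
accuracy  = mean(diag(confusion));   % mean of the diagonal = overall accuracy
imagesc(confusion); colormap(jet); colorbar;
title(sprintf('Confusion matrix (accuracy = %.3f)', accuracy));
```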

We can visualize this even further by creating a table that lists, for each category, its name, accuracy, sample training images, sample true positives, false positives with their true labels, and false negatives with their wrongly predicted labels. This gives us further intuition about what is going on in our program. A sample of this looks as follows:

No Implementation:

Without implementing anything in our project, we will randomly guess the category of each test image. There are $15$ categories, so we have a $\frac{1}{15}$, or $\sim 6.67\%$, chance of classifying the correct scene. Clicking the button below will pull up the results.

Guessing randomly, we see that we have an overall accuracy of $7.0\%$, which is expected based on the probability described above. Nothing has been implemented in this scenario; let us move forward and make some implementations to see how accuracy progresses in our classification.

Part 1:

Tiny Images: Tiny features were inspired by the simple images used as features in the 80 million tiny images dataset for non-parametric object and scene recognition, as shown below:

Tiny images are the baseline for our classification system. As they are trivial to implement, they help give us an intuition for the difficulty of the task and types of things that features need to latch onto to make scene classification decisions.

To create our tiny features we want to resize the original image to a very small square resolution such as 16 x 16.
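A minimal sketch of this feature is shown below; the function name and interface (a cell array of image paths in, an N x 256 feature matrix out) are assumptions meant to mirror the project's style, and the normalization is the zero-mean, unit-length variant discussed next.

```matlab
function image_feats = get_tiny_images(image_paths)
% Minimal sketch (assumed interface): resize each image to 16 x 16,
% flatten it to a 256-dimensional row vector, and normalize it.
dim = 16;
image_feats = zeros(numel(image_paths), dim * dim);
for i = 1:numel(image_paths)
    img = imread(image_paths{i});
    if size(img, 3) == 3, img = rgb2gray(img); end   % 15-scene images are grayscale
    tiny = imresize(single(img), [dim dim]);
    feat = tiny(:)';                     % flatten to 1 x 256
    feat = feat - mean(feat);            % zero mean
    feat = feat / (norm(feat) + eps);    % unit length (guard against all-zero images)
    image_feats(i, :) = feat;
end
end
```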

Let us look at the output if we let our feature be tiny features with and without normalization:

Style Accuracy
Not Normalized 6.9%
Normalized 8.4%

We see that normalization gives us a slight boost in performance; however, we are still barely above random guessing. Normalization should continue to help marginally as we progress from random guesses to k-nearest-neighbors to SVM, so we will move forward with normalized tiny features. Let us now move on to implementing the nearest neighbors classifier.

Nearest Neighbor Classifier: K-nearest neighbors is a non-parametric method used for classification as well as regression. For our purposes we will use the kNN algorithm for classification, where the output is a class membership: an object is classified by a majority vote of its neighbors and assigned to the class most common among its k nearest neighbors. K-nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure, or distance function. Some common distance measures are the Euclidean distance, the Manhattan distance, and the Minkowski distance. The nearest neighbor algorithm requires no training, can learn arbitrarily complex decision boundaries, and trivially supports multiclass problems. It is quite vulnerable to training noise, though, which can be alleviated by voting based on the k nearest neighbors. Nearest neighbor classifiers also suffer as the feature dimensionality increases, because the classifier has no mechanism to learn which dimensions are irrelevant to the decision. The algorithm is as follows: for each test example, compute its distance to every training example, find the k training examples with the smallest distances, and assign the test example the category that appears most often among those k neighbors.

We can depict this graphically as follows:

Let us now implement knn in nearest_neighbor_classify.m:
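A minimal sketch along these lines is shown below, assuming the feature matrices are N x d with one image per row and train_labels is a cell array of category names; the CHI2 metric and k = 3 reflect the choice we settle on further down.

```matlab
function predicted_categories = nearest_neighbor_classify( ...
        train_image_feats, train_labels, test_image_feats)
% Minimal sketch of k-nearest-neighbor classification (assumed interface).
k = 3;
% vl_alldist2 expects d x N single matrices, so transpose the features.
dists = vl_alldist2(single(train_image_feats'), ...
                    single(test_image_feats'), 'CHI2');   % Ntrain x Ntest
[~, order] = sort(dists, 1, 'ascend');   % nearest training examples per test image
num_test = size(test_image_feats, 1);
predicted_categories = cell(num_test, 1);
for i = 1:num_test
    neighbor_labels = train_labels(order(1:k, i));
    % Majority vote among the k nearest neighbors (ties broken arbitrarily).
    [unique_labels, ~, idx] = unique(neighbor_labels);
    [~, winner] = max(accumarray(idx(:), 1));
    predicted_categories{i} = unique_labels{winner};
end
end
```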

Now that we have implemented k-nearest neighbors, the next natural question is which distance measure and which k value give us the highest accuracy. There are sophisticated methods to discover the optimal k value, such as cross validation; however, we can run a simple test of a few different distance metrics against a range of k values and examine the accuracy. We will choose the distance measure that has the best accuracy at its best k value. We want to look for an elbow in the graphs we generate, choosing the k value beyond which performance stops improving. vl_alldist2 allows the following distance measures:
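For reference, the standard definitions of these measures for vectors $x$ and $y$ are roughly the following (the exact constants may differ slightly from vl_alldist2's conventions):

$$
\begin{aligned}
d_{\mathrm{LINF}}(x,y) &= \max |x - y|, &
d_{\mathrm{L2}}(x,y) &= \sum (x - y)^2, &
d_{\mathrm{L1}}(x,y) &= \sum |x - y|, \\
d_{\mathrm{L0}}(x,y) &= \sum \mathbf{1}[x \neq y], &
d_{\mathrm{CHI2}}(x,y) &= \sum \frac{(x - y)^2}{x + y}, &
d_{\mathrm{HELL}}(x,y) &= \sum \left(\sqrt{x} - \sqrt{y}\right)^2,
\end{aligned}
$$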

where the exponentiation and division are element-wise and the sums and max run over the vector components.

The accuracies corresponding to each k value with k-nearest neighbors and tiny images are as follows:

[Plots of accuracy versus k for each distance measure: LINF, L2, L1, L0, CHI2, and HELL.]

Examining the above plots, we see that we get the best accuracy with the CHI2 distance metric and a k value of 3 on the tiny images, giving an accuracy of 22.8%. The best k for L2, L1, L0, and HELL gives slightly smaller accuracies, around 21%, while LINF gives the worst accuracy. We will proceed with the CHI2 distance metric and a k value of 3.

The confusion matrix and informative table can be viewed by clicking the button below:

Our baseline scene recognition pipeline, tiny images and nearest neighbors, is much better than the 7% accuracy obtained through random guessing. However, we can move on to a more sophisticated image representation, bags of quantized SIFT features. Let us do this next.

Part 2:

Vocabulary of Words: The bag of words model is a simplifying representation used in natural language processing and information retrieval. There, a text is represented as the bag of its words, disregarding grammar and word order but keeping multiplicity. This can be used for image classification in computer vision by treating image features as words: the bag of visual words is a vector of occurrence counts over a vocabulary of local image features. Building this vocabulary of visual words is the first step we need before we can represent our training and testing images as bag-of-feature histograms. To establish the vocabulary, we sample many local features from our training set, cluster them with k-means, and return the cluster centers.

Our implementation of build_vocabulary is as follows:
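A minimal sketch using vl_dsift and vl_kmeans is given below; the step size of 3 matches the setting summarized later on this page, and the remaining details are assumptions rather than the exact code.

```matlab
function vocab = build_vocabulary(image_paths, vocab_size)
% Minimal sketch: pool dense SIFT descriptors from every training image,
% cluster them with k-means, and return the cluster centers as the vocabulary.
% Assumes grayscale images, as in the 15-scene dataset.
sampled_feats = [];
for i = 1:numel(image_paths)
    img = single(imread(image_paths{i}));
    [~, descriptors] = vl_dsift(img, 'fast', 'step', 3);
    % (In practice one might keep only a subset of descriptors per image
    %  to reduce memory use and clustering time.)
    sampled_feats = [sampled_feats, single(descriptors)]; %#ok<AGROW>
end
% vl_kmeans returns the centers as a 128 x vocab_size matrix ...
centers = vl_kmeans(sampled_feats, vocab_size);
% ... which we transpose so that the function returns vocab_size x 128.
vocab = centers';
end
```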

  • Transpose: We will now get a 128 x vocab_size matrix and we perform a transpose so that we can return a vocab_size x 128 matrix.
  • Note that the above process takes 62.44566465 minutes to run. We can decrease the time taken by:

However, once we have computed the vocabulary, we store it in vocabulary.mat so we do not need to recompute it each time we run our project.

    Bag of SIFTS:

    Let us now proceed to implementing get_bag_of_sifts.m:
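A minimal sketch of the idea follows. It assumes the vocabulary built above loads as a vocab_size x 128 matrix named vocab from vocab.mat (the file and variable names are assumptions); the 'fast' flag and step size of 8 reflect the setting we end up preferring below.

```matlab
function image_feats = get_bag_of_sifts(image_paths)
% Minimal sketch: represent each image as a normalized histogram of
% visual-word occurrences using the precomputed vocabulary.
data  = load('vocab.mat');         % assumed file providing the vocabulary
vocab = single(data.vocab);        % vocab_size x 128
vocab_size  = size(vocab, 1);
image_feats = zeros(numel(image_paths), vocab_size);
for i = 1:numel(image_paths)
    img = single(imread(image_paths{i}));
    [~, descriptors] = vl_dsift(img, 'fast', 'step', 8);
    % Distance from every descriptor to every visual word (N x vocab_size).
    dists = vl_alldist2(single(descriptors), vocab');
    [~, assignments] = min(dists, [], 2);          % nearest word per descriptor
    counts = accumarray(assignments, 1, [vocab_size 1]);
    image_feats(i, :) = counts' / sum(counts);     % normalized histogram
end
end
```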

Let us now compare our accuracy with and without the 'fast' parameter, across step sizes, along with the time taken to run the program (note that vocab_size = 200 here):

    'fast' Step Size Accuracy Time
    Yes 10 53.6% 124.886461 s
    Yes 8 56.4% 178.570186 s
    Yes 5 59.2% 438.831639 s
    No 10 54.3% 267.982072 s
    No 8 56.7% 322.848910 s
    No 5 59.2% 591.709452 s

Looking at the table above, we see that we achieve a top accuracy of 59.2% both with and without the 'fast' parameter. However, the computational expense of a step size of 5 is very high. The graphs for our top run, without the 'fast' parameter and with a step size of 5, can be displayed through the button below:

    However, for slightly less accuracy and much less computational expense, we can use the 'fast' parameter and a step size of 8. The charts for this are shown below:

Using a step size of 3, with or without the 'fast' parameter, would give us great accuracy; however, the computational expense is very high. Thus, at a reasonable speed, we get the best accuracy with the 'fast' parameter and a step size of 8, giving us 56.4% accuracy. This is within our target range of 50-60% accuracy, and we will use this setting going forward.

    Let us now move onto part 3: Linear SVM Classifier.

    Part 3:

Linear SVM Classifier: A support vector machine (SVM) is a supervised machine learning algorithm which can be used for classification. With this algorithm, we plot each data point as a point in a d-dimensional space, where d is the number of features we possess. We can then perform classification by finding the hyper-plane that best separates the two classes. However, how do we decide which hyper-plane is best? Let us look at a few different scenarios:

    Scenario 1:

In Scenario 1, there are three hyper-planes: A, B, and C. We want to separate stars from circles, and it is easy to see that B does this best.

    Scenario 2:

Here, all three hyper-planes separate the two classes well. But which one is best? We need to maximize the distance (called the margin) between the hyper-plane and the nearest data point of either class. We see that the nearest star is close to B and a little further from C, and the nearest circle is close to A; C therefore has a high margin compared to A and B, so we select hyper-plane C. Selecting a hyper-plane with a low margin gives us a higher chance of misclassification.

    Scenario 3:

    Here we see that there is an outlier. Will SVM be able to separate the classes? Yes, SVM can ignore outliers.

All in all, we are given a training dataset of n data points $(x_1,y_1),...,(x_n,y_n)$, where each $y_i$ is 1 or -1, indicating the class to which the p-dimensional vector $x_i$ belongs. We want to find the maximum-margin hyperplane that divides the group of points $x_i$ for which $y_i=1$ from the group of points for which $y_i=-1$; it is defined so that the distance between the hyperplane and the nearest point $x_i$ from either group is maximized. We can write any hyper-plane as the set of points $x$ satisfying $w \cdot x - b = 0$, where $w$ is the normal vector to the hyper-plane.

Let us now implement the linear SVM classifier in svm_classify.m. We want a function that trains a linear SVM for every category, one-vs-all, and then uses the learned classifiers to predict the category of every test image. Every test feature will be evaluated with all 15 SVMs and the most confident SVM will win. We let the confidence, or distance from the margin, be $W \cdot X + B$, where $W$ and $B$ are the learned hyperplane parameters. Our implementation is as follows:
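A minimal sketch of this one-vs-all scheme around vl_svmtrain is shown below; the LAMBDA value and the HINGE2 loss match the settings reported on this page, while the rest of the details are assumptions rather than the exact code.

```matlab
function predicted_categories = svm_classify(train_image_feats, train_labels, ...
                                              test_image_feats)
% Minimal sketch: train one linear SVM per category (one-vs-all) and label
% each test image with the most confident SVM (assumed interface).
categories     = unique(train_labels);
num_categories = numel(categories);
LAMBDA  = 0.0003;
X_train = single(train_image_feats');     % vl_svmtrain expects d x N
X_test  = single(test_image_feats');
confidences = zeros(num_categories, size(X_test, 2));
for c = 1:num_categories
    % +1 for images of this category, -1 for everything else.
    binary_labels = double(strcmp(categories{c}, train_labels));
    binary_labels = 2 * binary_labels(:)' - 1;
    [W, B] = vl_svmtrain(X_train, binary_labels, LAMBDA, 'Loss', 'HINGE2');
    confidences(c, :) = W' * X_test + B;   % W*X + B = distance from the margin
end
[~, winners] = max(confidences, [], 1);    % most confident SVM wins
predicted_categories = categories(winners);
end
```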

Using LAMBDA = 0.0003 we obtain an accuracy of 63.4% (note that upon re-running the program there might be a slight deviation of a few hundredths, but it should be very close to this number). The charts are as follows:

We note that LAMBDA has a strong effect on our accuracy. Let us take a look at the accuracies corresponding to a range of lambdas in the table below:

    LAMBDA Accuracy
    0.00001 58.87%
    0.0001 61.73%
    0.001 62.80%
    0.01 50.87%
    0.1 38.80%
    1 44.47%
    10 42.27%

Examining the table above, we can see how important LAMBDA is to vl_svmtrain(). Using a LAMBDA between 0.0001 and 0.0003 gave higher accuracy, with 0.0003 giving the highest accuracy of 63.4%.

    So Far:

    Let us see how we progressed in terms of accuracy so far:

    No Implementation

By randomly guessing we achieved an overall accuracy of 6.8%.

    Tiny Images & Nearest Neighbors

    With normalized tiny features, K=3, and a distance metric of CHI2, we achieved an overall accuracy of 22.8%.

    Bag of SIFT representation & Nearest Neighbor

Using vocab_size = 200, a step size of 3 in build_vocabulary, and the 'fast' parameter with a step size of 8 for vl_dsift in get_bag_of_sifts, we achieved an overall accuracy of 56.4%.

    Bag of SIFT representation and linear SVM classifier

With the 'fast' parameter and a step size of 8 in get_bags_of_sifts, LAMBDA = 0.0003, and Loss = 'HINGE2' in svm_classify, we achieved an overall accuracy of 63.4%.

    Graduate Extra Credit:

Vocabulary Size: Previously we used a vocabulary size of 200. We obtained the highest accuracy with the bag of SIFT representation and linear SVM classifier, so let us test different vocabulary sizes on this pipeline with LAMBDA = 0.0003, a step size of 3 in build_vocabulary, and the 'fast' parameter.

vocab_size   Accuracy
10           43.9%
20           55.5%
50           60.7%
100          63.4%
200          65.1%
400          67.1%
1000         67.6%

We see that the larger the vocab_size, the better our results and the more computational expense we incur; for example, with a vocabulary size of 400 it took 5326.331882 seconds. The 0.5 percentage-point increase in accuracy from 400 to 1000 is not worth the additional computational expense in my opinion. Thus, a vocab_size between 200 and 400 would be ideal in terms of accuracy and computational expense.

Cross Validation: We can use cross-validation to measure performance rather than the fixed test/train split provided in our starter code. What we want to do is randomly pick 100 training and 100 testing images for each iteration and report the average performance and standard deviation. To implement this, we randomly generate 100 numbers (sampling with replacement) and use these as indices into our image_paths instead of iterating over all image_paths. We wrap our proj4.m code in a loop and run it 10 times, obtaining the following accuracies: .632, .628, .631, .620, .618, .631, .622, .631, .630, .628. The average of the accuracies is 0.6271 and the standard deviation is 0.00515213. We can look at a histogram of these accuracies below:

    We see that this is very close to the accuracy we obtained without randomly choosing 100 training and 100 testing images for each iteration.
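A minimal sketch of this sampling-and-averaging loop is given below; run_pipeline is a hypothetical wrapper around the proj4.m pipeline that takes the sampled path lists and returns the overall accuracy.

```matlab
% Minimal cross-validation sketch. run_pipeline is a hypothetical wrapper
% around the proj4.m pipeline (build features, classify, score accuracy).
num_iters   = 10;
num_samples = 100;
accuracies  = zeros(1, num_iters);
for iter = 1:num_iters
    % Sample 100 indices with replacement from each path list.
    train_idx = randi(numel(train_image_paths), num_samples, 1);
    test_idx  = randi(numel(test_image_paths),  num_samples, 1);
    accuracies(iter) = run_pipeline(train_image_paths(train_idx), ...
                                    test_image_paths(test_idx));
end
fprintf('mean accuracy = %.4f, std = %.6f\n', mean(accuracies), std(accuracies));
hist(accuracies);   % histogram of the per-iteration accuracies
```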

SUN database: The SUN database at its core has entries that correspond to names of scenes, places, and environments (any concrete noun which could reasonably complete the phrase "I am in a place" or "let's go to the place"), drawn from the WordNet English dictionary. There are 397 scene categories contained in a 37 GB file. If we use this data in our data directory and tweak our code to report the accuracy from the confusion matrix (not plotted, since it is extremely large in width and height), we get an accuracy of 43.2%. This seems about right for the bag of SIFT representation and linear SVM classifier with the settings from the section above, given the sheer number of categories. Even with extremely finely tuned parameters, I would find it difficult to reach above 55% accuracy. I omitted the confusion matrix as it is difficult to read: there are 157,609 cells to observe at one time. I believe that with deep learning we would be able to get upwards of 60-70%. We note that running this dataset is a huge computational expense at every step in our pipeline, and it took an enormous amount of time to run on a well-equipped machine. The data from this experiment can be downloaded at this link.

Fisher:

We could use the more sophisticated Fisher encoding scheme analyzed in the comparative study of Chatfield et al. The Fisher vector is an image representation obtained by pooling local image features. Let us describe the process. Let $I = (x_1,x_2,...,x_n)$ be a set of $D$-dimensional SIFT descriptors extracted from an image, and let $\theta = (\mu_k,\Sigma_k,\pi_k: k = 1,...,K)$ be the parameters of a Gaussian Mixture Model fit to the distribution of descriptors. We can get the posterior probability:

$$q_{ik} = \frac{\exp\left(-\frac{1}{2}(x_i-\mu_k)^T\Sigma_k^{-1}(x_i-\mu_k)\right)}{\sum_{t=1}^K \exp\left(-\frac{1}{2}(x_i-\mu_t)^T\Sigma_t^{-1}(x_i-\mu_t)\right)}$$

If we look at each mode represented by $k$, we can see that the mean deviation vector is:

$$u_{jk} = \frac{1}{n\sqrt{\pi_k}}\sum_{i=1}^n q_{ik}\,\frac{x_{ji}-\mu_{jk}}{\sigma_{jk}}$$

    and we see that the covariance vector is:

    $$v_{jk} = \frac{1}{n\sqrt{2\pi_k}}\sum_{i=1}^n q_{ik} \left(\left(\frac{x_{ji} - \mu_{jk}}{\sigma_{jk}}\right)^2 - 1 \right)$$

where $j=1,...,D$ spans the vector dimensions. The Fisher vector is then $\Phi(I) = [\,...,u_k,...,v_k,...\,]^T$.

The above derivation looks complicated, but we can make use of the VLFeat package to help us with our implementation.

We create a function called fisher which takes in train_image_paths and test_image_paths and returns train_image_feats and test_image_feats. We first iterate over the training paths, read each image in as type single, and call vl_dsift with the 'fast' parameter and a step size of 5. From this we get sift_feats, and we grow our pooled features as [features s_sift_feats], where s_sift_feats is a single cast of sift_feats. We then run vl_gmm on these pooled features with 30 clusters to get the means, covariances, and priors. Next we create two empty arrays, train_image_feats and test_image_feats, iterate over the training and test image paths, and call vl_dsift on each image to get train_sift_feats and test_sift_feats, again with the 'fast' parameter and a step size of 5. We then call vl_fisher on both of these with the means, covariances, and priors, transpose the results, and call them enc1 and enc2. train_image_feats then becomes [train_image_feats; enc1] and test_image_feats becomes [test_image_feats; enc2]. We return these two variables and can proceed to run linear SVM classification.
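A minimal sketch of this function is shown below. It follows the description above (30 GMM modes, 'fast' dense SIFT with a step size of 5, and the 'Normalized' option that gives the slightly higher accuracy reported next); treat the remaining details as assumptions rather than the exact code.

```matlab
function [train_image_feats, test_image_feats] = fisher(train_image_paths, ...
                                                        test_image_paths)
% Minimal sketch of the Fisher-vector representation described above.
% Pass 1: pool dense SIFT descriptors from the training images and fit a GMM.
features = [];
for i = 1:numel(train_image_paths)
    img = single(imread(train_image_paths{i}));
    [~, sift_feats] = vl_dsift(img, 'fast', 'step', 5);
    features = [features, single(sift_feats)]; %#ok<AGROW>
end
[means, covariances, priors] = vl_gmm(features, 30);   % 30 Gaussian modes

% Pass 2: encode every training and test image with vl_fisher.
% (Assumes equally sized train and test splits, as in the 15-scene setup.)
train_image_feats = [];
test_image_feats  = [];
for i = 1:numel(train_image_paths)
    train_img = single(imread(train_image_paths{i}));
    test_img  = single(imread(test_image_paths{i}));
    [~, train_sift_feats] = vl_dsift(train_img, 'fast', 'step', 5);
    [~, test_sift_feats]  = vl_dsift(test_img,  'fast', 'step', 5);
    % 'Normalized' is the variant that gave the slightly higher accuracy below.
    enc1 = vl_fisher(single(train_sift_feats), means, covariances, priors, 'Normalized')';
    enc2 = vl_fisher(single(test_sift_feats),  means, covariances, priors, 'Normalized')';
    train_image_feats = [train_image_feats; enc1]; %#ok<AGROW>
    test_image_feats  = [test_image_feats;  enc2]; %#ok<AGROW>
end
end
```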

    We use a step size of 5 because it is fairly dense and runs in a reasonable amount of time. Decreasing the step size will increase the accuracy; however, it will increase the program's run time. After experimenting with different step sizes, I found 5 to be a nice fit between accuracy and computational expense.

Let us now run Fisher encoding with linear SVM classification, with LAMBDA = 0.00003 and a step size of 5. We obtain an accuracy of 71.7%. However, if we add the 'Normalized' parameter to enc1 and enc2, we get an accuracy of 72.2%. The charts are found below:

We chose LAMBDA = 0.00003 after experimenting with different values. Previously, in Part 3, we used 0.0003; however, that value gave poorer results here. The LAMBDA above has given us the best accuracy.

    The above was run with a vocab_size = 200. Let us run this with a vocab_size = 400:

    We arrive at an accuracy of 72.7%. The charts can be displayed below:

    Conclusion:

In conclusion, we were able to obtain a maximum accuracy of 72.7% after implementing Fisher encoding and running it with the linear SVM classifier, using a vocab_size of 400, 30 clusters in vl_gmm, a step size of 5 in fisher.m, and a lambda of 0.00003 in svm_classify. Overall, we saw that our accuracy increased as we advanced from Part 1 to Part 3.

    Sources: