Recognition with Bag of Words

Project 4 / Scene Recognition with Bag of Words

My results for the three required methods are on separate pages: tiny images with a single nearest neighbor model, SIFT features with a single nearest neighbor model, and SIFT features with a support vector machine. Their accuracies, as shown on each individual page, were 19.1%, 54.2%, and 67.0%, respectively. Each is within or above the expected range for its version. All three run within the required 10 minutes, and below I discuss the choices in my third implementation that allowed it to perform so well.

The first version I implemented was the basic tiny images pipeline with a k-NN classifier. This is a relatively simple pipeline (the entire implementation takes only 6 lines of code): it shrinks each image and then finds the nearest neighbor of each shrunken test image among the shrunken training images. I left this as 1-NN, since larger values of k did not seem to have any notable effect on accuracy.
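The write-up does not include its code, so the following is only a minimal Python sketch of the tiny images + 1-NN idea, not the implementation described above. The function names, the block-averaging resize, and the default size of 16 are assumptions made for illustration.

```python
def tiny_image(image, size=16):
    """Shrink a grayscale image (list of equal-length rows) to size*size values.

    Each output value is the mean of one block of input pixels. Assumes the
    image is at least `size` pixels in each dimension. (Hypothetical helper;
    the resize method and default size are assumptions.)
    """
    height, width = len(image), len(image[0])
    feature = []
    for i in range(size):
        r0, r1 = i * height // size, (i + 1) * height // size
        for j in range(size):
            c0, c1 = j * width // size, (j + 1) * width // size
            block = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            feature.append(sum(block) / len(block))
    return feature


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_label(query, train_features, train_labels):
    """Return the label of the training feature closest to `query` (1-NN)."""
    best = min(range(len(train_features)),
               key=lambda i: squared_distance(query, train_features[i]))
    return train_labels[best]
```

In use, every training and test image would be passed through `tiny_image`, and each test feature would then be labeled with `nearest_neighbor_label`.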

The next step was implementing SIFT features. This had two parts: first, creating a vocabulary of features, and second, computing the 'bag of SIFTs' for each image and comparing them. The first attempt compared them, once again, using 1-NN. My vocabulary of SIFT features was constructed with a step size of 25, using the 'fast' version. This led to a quick feature creation process, an order of magnitude faster than in later experiments with 'fast' disabled. My first run got around 55% accuracy, and that seemed to be a fairly stable number with these values.
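The two parts described above can be sketched roughly as follows. This is an illustrative Python sketch, not the code behind the reported numbers: it assumes the SIFT descriptors have already been extracted into plain lists, uses a simple k-means with evenly spaced initial centers (a simplification), and all function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(descriptor, centers):
    """Index of the vocabulary word (cluster center) closest to a descriptor."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(descriptor, centers[i]))


def build_vocabulary(descriptors, vocab_size, iters=10):
    """Cluster descriptors with plain k-means; the centers are the vocabulary.

    Assumes len(descriptors) >= vocab_size. Initial centers are taken at
    evenly spaced positions in the input, which keeps the sketch deterministic.
    """
    step = len(descriptors) // vocab_size
    centers = [list(descriptors[i * step]) for i in range(vocab_size)]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for d in descriptors:
            clusters[nearest_center(d, centers)].append(d)
        for k, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def bag_of_words(descriptors, centers):
    """Normalized histogram of nearest vocabulary words for one image."""
    hist = [0.0] * len(centers)
    for d in descriptors:
        hist[nearest_center(d, centers)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

Each image's histogram then plays the same role as the tiny image feature did before, so it can be compared with 1-NN or passed to another classifier.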

It was at this point that I implemented an SVM. Once I had the implementation, I experimented with various values of lambda on tiny images with my SVM classifier. Of the values I tried, .00001 was the best, and it roughly doubled my accuracy, from around 10% to over 20%, on the tiny images + SVM combination. This was also the value that I used in my best-scoring implementation.
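To make the role of lambda concrete, here is a small Python sketch of a binary linear SVM in which lambda weights the regularization term, plus a helper that tries several lambda values. It is an assumption-laden illustration rather than the solver used for the results above: it uses batch subgradient descent, the learning rate and epoch count are arbitrary, and a multi-class problem such as scene recognition would need something like one-vs-all on top of it.

```python
def train_linear_svm(X, y, lam=1e-5, epochs=200, lr=0.1):
    """Minimize lam/2 * ||w||^2 + mean hinge loss by batch subgradient descent.

    X is a list of feature lists and y a list of +1/-1 labels. The optimizer
    settings (epochs, lr) are illustrative assumptions.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute to the hinge term
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Classify one feature vector as +1 or -1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


def accuracy(w, b, X, y):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)


def sweep_lambda(X_train, y_train, X_val, y_val, lambdas):
    """Train once per lambda and report held-out accuracy for each value."""
    results = {}
    for lam in lambdas:
        w, b = train_linear_svm(X_train, y_train, lam=lam)
        results[lam] = accuracy(w, b, X_val, y_val)
    return results
```

A sweep over a few powers of ten, evaluated on held-out images, is one simple way to arrive at a value such as .00001.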

On my SIFT+SVM implementation, I used the same settings as in previous examples ('fast', 25 and 5 for SIFT, .00001 lambda). This gave the previously mentioned accuracy of 67%. Other values of lambda could reduce this by almost 20 percentage points (to 48% accuracy), but none gave any improvement. However, removing the 'fast' setting did give a slight increase in performance, to 67.1%. I expect that the values of 25 and 5 worked well together, and I also expect that I got somewhat lucky: I seemed to be at a local optimum in my SIFT parameter sizes.

Joshua Morton
