Recognition with Bag of Words

The objective of this project is to perform image recognition. Solving this problem involves two steps: developing image descriptors and developing image classifiers. Two kinds of image descriptors are implemented, tiny images and bag of SIFT, and two kinds of classifiers, nearest neighbor and support vector machine (SVM). I also show how a spatial pyramid can further improve recognition accuracy and how vocabulary size influences the results. As a baseline for comparing the implementation results, Fig. 1 shows the confusion matrix for chance performance.

Implementation and Results

1. Tiny Image Representation and Nearest Neighbor Classifier

The tiny image feature is one of the simplest image representations. In this section, I resized each image to 16x16 by scaling it down so that its smaller dimension equals 16 and then cropping the other dimension to 16.
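The resize-then-crop step can be sketched as follows. This is a minimal illustration, not the code used for the results: the function names are hypothetical, the crop is assumed to be centered (the text only says the other dimension is cropped), and nearest-neighbor sampling stands in for whatever interpolation the original resize used.

```python
def tiny_image_geometry(height, width, size=16):
    """Return (scaled_h, scaled_w, top, left) for a resize-then-crop to size x size."""
    shorter = min(height, width)
    # Scale so that the shorter side becomes exactly `size`.
    scaled_h = round(height * size / shorter)
    scaled_w = round(width * size / shorter)
    # Assumption: the crop window is centered on the longer side.
    top = (scaled_h - size) // 2
    left = (scaled_w - size) // 2
    return scaled_h, scaled_w, top, left


def tiny_image(pixels, size=16):
    """Build a size x size tiny image from a 2-D list of gray values.

    Assumption: nearest-neighbor sampling replaces proper interpolation.
    """
    height, width = len(pixels), len(pixels[0])
    scaled_h, scaled_w, top, left = tiny_image_geometry(height, width, size)
    tiny = []
    for r in range(size):
        # Map the cropped, scaled row back to a row of the source image.
        src_r = min(height - 1, (r + top) * height // scaled_h)
        row = []
        for c in range(size):
            src_c = min(width - 1, (c + left) * width // scaled_w)
            row.append(pixels[src_r][src_c])
        tiny.append(row)
    return tiny
```

For a 200x300 image, the geometry function scales the image to 16x24 and crops 4 columns from each side.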

The nearest neighbor classifier assigns each test feature to a category by finding the "nearest" training feature and assigning the test feature to the category of that training feature.
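A minimal sketch of this 1-nearest-neighbor rule is given below. It assumes Euclidean distance as the meaning of "nearest", which the text does not state, and the function name is hypothetical.

```python
import math


def nearest_neighbor_classify(train_feats, train_labels, test_feats):
    """Label each test feature with the category of its closest training feature."""
    predictions = []
    for feature in test_feats:
        # Assumption: "nearest" is measured with Euclidean distance.
        best_index = min(
            range(len(train_feats)),
            key=lambda i: math.dist(feature, train_feats[i]),
        )
        predictions.append(train_labels[best_index])
    return predictions
```

The same function works unchanged for flattened tiny images and for bag of SIFT histograms, since both are plain feature vectors.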

The accuracy of this implementation is 19.7% as shown in Fig. 2.

Fig. 2 Accuracy (mean of diagonal of confusion matrix) is 19.7%

2. Bag of SIFT Representation and Nearest Neighbor Classifier

In this section, I used bag of SIFT as the feature descriptor and kept the nearest neighbor classifier. This technique has two steps: building a vocabulary from SIFT features and computing a bag of SIFT features for each image. The function vl_dsift from VLFeat is used to extract dense SIFT features.

In the first step, the step size for vl_dsift is 5. A smaller step size yields more features; a larger step size can also be used in this step. After extracting SIFT features from all training images, one hundred thousand features are randomly sampled. I then built a vocabulary of size 200 by k-means clustering, using the vl_kmeans function from VLFeat.
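To illustrate what the clustering step computes, here is a small sketch of plain Lloyd-style k-means in Python. It is not vl_kmeans and makes no claim about how that function is implemented; the function names, the random initialization from sampled features, and the fixed iteration count are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_vocabulary(features, vocab_size, iterations=20, seed=0):
    """Cluster feature vectors into `vocab_size` centers (the visual words)."""
    rng = random.Random(seed)
    # Assumption: initialize the centers with randomly chosen features.
    centers = [list(f) for f in rng.sample(features, vocab_size)]
    for _ in range(iterations):
        # Assignment step: attach every feature to its closest center.
        clusters = [[] for _ in range(vocab_size)]
        for f in features:
            nearest = min(
                range(vocab_size),
                key=lambda k: squared_distance(f, centers[k]),
            )
            clusters[nearest].append(f)
        # Update step: move each center to the mean of its members.
        for k, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return centers
```

In the project setting, `features` would be the sampled SIFT descriptors and `vocab_size` would be 200.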

In the second step, the step size for vl_dsift is again 5. Each training and test image is represented by a histogram over the vocabulary: each SIFT feature in the image is assigned to the "nearest" vocabulary bin, and the resulting histogram is normalized.
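The histogram construction can be sketched as below. The function name is hypothetical, and two details are assumptions because the text does not specify them: "nearest" is taken as squared Euclidean distance, and the histogram is normalized to sum to 1.

```python
def bag_of_words_histogram(features, vocabulary):
    """Count how many features fall into each vocabulary bin, then normalize."""
    counts = [0] * len(vocabulary)
    for f in features:
        # Assign the feature to the closest visual word.
        nearest = min(
            range(len(vocabulary)),
            key=lambda k: sum((x - y) ** 2 for x, y in zip(f, vocabulary[k])),
        )
        counts[nearest] += 1
    total = sum(counts)
    # Assumption: L1 normalization, so images with different numbers of
    # features produce comparable histograms.
    return [c / total for c in counts] if total else [0.0] * len(counts)
```

Normalizing matters here because a smaller step size produces more features per image, which would otherwise scale the raw counts.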

With the nearest neighbor classifier, the accuracy is 54.3%, as shown in Fig. 3(a), a large improvement over the tiny image representation. When the step size in the second step is changed to 15, the computation time is about 100 s and the accuracy is 46.9%, as shown in Fig. 3(b).

(a) (b)

Fig. 3 (a) Accuracy (mean of diagonal of confusion matrix) is 54.3%; (b) Accuracy (mean of diagonal of confusion matrix) is 46.9%.

3. Bag of SIFT Representation and Linear SVM Classifier

In this section, I replaced the nearest neighbor classifier with a linear support vector machine (SVM) classifier. A 1-vs-all scheme is used, so 15 linear classifiers in total are trained on the training data. Each classifier was trained using vl_svmtrain from VLFeat with lambda = 0.0001.

Then, each test descriptor was evaluated by all the trained classifiers. The most confident classifier, the one with the largest decision score, won and the corresponding category was assigned.
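The winner-takes-all decision can be sketched as follows. The sketch assumes each trained linear classifier is stored as a weight vector and a bias, and that its confidence is the linear score w·x + b; the function name and the dictionary layout are hypothetical.

```python
def predict_one_vs_all(classifiers, feature):
    """Pick the category whose linear classifier gives the highest score.

    classifiers: dict mapping category name -> (weights, bias).
    """
    def score(params):
        weights, bias = params
        # Linear decision score w . x + b for one 1-vs-all classifier.
        return sum(w * x for w, x in zip(weights, feature)) + bias

    return max(classifiers, key=lambda category: score(classifiers[category]))
```

With 15 categories, the dictionary would hold 15 (weights, bias) pairs, one per trained classifier.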

With this algorithm, the accuracy is further improved to 68.7%, as shown in Fig. 4(a). When the step size used to derive the bags of SIFT features is changed to 15, the computation time is about 100 s and the accuracy is 57.5%, as shown in Fig. 4(b).

(a) (b)

Fig. 4 (a) Accuracy (mean of diagonal of confusion matrix) is 68.7%; (b) Accuracy (mean of diagonal of confusion matrix) is 57.5%.

Extra

1. Spatial Pyramid

Adding spatial features preserves the spatial information of the image in the bag of SIFT representation and can improve recognition performance. I tested both classifiers, nearest neighbor and linear SVM, and kept all other parameters the same. Adding one layer (4 bins) to form a 2-layer pyramid (5 bins) improves the accuracy of the nearest neighbor classifier and the SVM to 58.0% and 71.2% (Figs. 5(a) and 5(b)), respectively. Adding two layers (20 bins) to form a 3-layer pyramid (21 bins) improves the accuracy of the nearest neighbor classifier and the SVM to 59.7% and 74.1% (Figs. 5(c) and 5(d)), respectively.

(a) (b)

(c) (d)

Fig. 5 (a) Accuracy (mean of diagonal of confusion matrix) is 58.0%; (b) Accuracy (mean of diagonal of confusion matrix) is 71.2%; (c) Accuracy (mean of diagonal of confusion matrix) is 59.7%; (d) Accuracy (mean of diagonal of confusion matrix) is 74.1%.
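The bin counts quoted for the spatial pyramid (5 for two layers, 21 for three) follow from splitting the image into 1, 4, and 16 cells at successive layers. The sketch below shows that count and one way to build the concatenated histogram. The function names are hypothetical, and the unweighted concatenation with a single overall normalization is an assumption, since the text does not describe how the layers are weighted.

```python
def pyramid_cell_count(layers):
    """Total number of spatial cells: 1 + 4 + 16 + ... over the given layers."""
    return sum(4 ** level for level in range(layers))


def spatial_pyramid_histogram(words, width, height, vocab_size, layers):
    """Concatenate one bag-of-words histogram per spatial cell.

    words: list of (x, y, word_index) tuples for a single image.
    """
    histogram = []
    for level in range(layers):
        cells = 2 ** level  # number of cells along each side at this layer
        level_counts = [0] * (cells * cells * vocab_size)
        for x, y, word in words:
            col = min(cells - 1, int(x * cells / width))
            row = min(cells - 1, int(y * cells / height))
            level_counts[(row * cells + col) * vocab_size + word] += 1
        histogram.extend(level_counts)
    total = sum(histogram)
    # Assumption: no per-layer weights, one normalization over everything.
    return [c / total for c in histogram] if total else histogram
```

With a vocabulary of 200 words, a 3-layer pyramid built this way would give a feature vector of 21 x 200 = 4200 entries per image.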

2. Vocabulary Size

Different vocabulary sizes were tried: 10, 20, 50, 100, 200, 500, 1000, and 10000. I ran this experiment with the combination of bag of SIFT and SVM, using a spatial pyramid with only 1 layer. The variation of accuracy over the first seven vocabulary sizes is shown in Fig. 6. The confusion matrix and the per-scene performance for a vocabulary size of 1000 are shown in Figs. 7 and 8, respectively. Computation time increases with vocabulary size. Although accuracy improves with larger vocabularies, the improvement levels off. From Fig. 6, a size of 200 is a good choice because it gives relatively high accuracy with a short computation time.

Fig. 6 Accuracy under different vocabulary sizes.

Fig. 7 Accuracy (mean of diagonal of confusion matrix) is 71.4%.

Category name	Accuracy	Sample training images	Sample true positives	False positives with true label		False negatives with wrong predicted label
Kitchen	0.650			LivingRoom	LivingRoom	Bedroom	Bedroom
Store	0.670			LivingRoom	Industrial	InsideCity	InsideCity
Bedroom	0.420			LivingRoom	InsideCity	TallBuilding	Office
LivingRoom	0.440			Bedroom	Bedroom	Store	Office
Office	0.950			Kitchen	Industrial	Bedroom	Kitchen
Industrial	0.590			Street	Store	TallBuilding	TallBuilding
Suburb	0.960			Mountain	InsideCity	InsideCity	InsideCity
InsideCity	0.580			Industrial	Street	Highway	Street
TallBuilding	0.830			Industrial	InsideCity	Coast	Street
Street	0.710			InsideCity	Industrial	Highway	InsideCity
Highway	0.840			Coast	Street	OpenCountry	OpenCountry
OpenCountry	0.480			Mountain	Highway	Coast	Coast
Coast	0.840			Forest	Industrial	Highway	Highway
Mountain	0.800			Forest	Forest	Suburb	Coast
Forest	0.950			OpenCountry	Mountain	Mountain	Store

Fig. 8 Performance for different scenes in 15 categories.
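As a consistency check, the per-category accuracies in Fig. 8 can be averaged to recover the overall accuracy reported in Fig. 7. The short script below uses only the values listed in the table; the variable names are illustrative.

```python
# Per-category accuracies copied from the Fig. 8 table.
per_category_accuracy = {
    "Kitchen": 0.650, "Store": 0.670, "Bedroom": 0.420,
    "LivingRoom": 0.440, "Office": 0.950, "Industrial": 0.590,
    "Suburb": 0.960, "InsideCity": 0.580, "TallBuilding": 0.830,
    "Street": 0.710, "Highway": 0.840, "OpenCountry": 0.480,
    "Coast": 0.840, "Mountain": 0.800, "Forest": 0.950,
}

# The mean of the per-category accuracies equals the mean of the diagonal
# of the row-normalized confusion matrix.
mean_accuracy = sum(per_category_accuracy.values()) / len(per_category_accuracy)

best_category = max(per_category_accuracy, key=per_category_accuracy.get)
worst_category = min(per_category_accuracy, key=per_category_accuracy.get)

print(f"mean accuracy: {mean_accuracy:.3f}")
print(f"best: {best_category}, worst: {worst_category}")
```

The mean comes out at 0.714, which matches the 71.4% in the Fig. 7 caption.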

Discussion and Conclusions

The scenes with the best and worst accuracy differ slightly across descriptors and classifiers. For the case shown in Fig. 8, the scenes with the best accuracy are, in order, Suburb (0.960), Office (0.950), and Forest (0.950), and the scenes with the worst accuracy are, in order, Bedroom (0.420), LivingRoom (0.440), and OpenCountry (0.480). Scenes with distinct textures, such as Office, Forest, and Mountain, usually have high accuracy. Indoor scenes such as Kitchen, LivingRoom, and Bedroom are difficult to tell apart because some objects in these images are very similar. The results show that the classifier improves when given additional information, such as the spatial layout added by the spatial pyramid. Many other features, such as self-similarity and gist, could be considered to further improve performance. A larger dataset, such as the SUN database, could also be used to train the classifiers.

Jun

Project 4 / Scene Recognition with Bag of Words