Project 4 / Scene Recognition with Bag of Words

In this project we examine some of the simplest methods for recognizing scenes. The specific task is to assign the correct category label to each scene image: "forest", "bedroom", etc.

The project consists of three parts, each demonstrating an approach slightly more sophisticated than the previous one. Across the three parts, accuracy improves from around 20% to 53% and finally to 68%.

Two components matter in image recognition: image representation and image classification. For representation, this project uses tiny images and bags of SIFT features; for classification, it uses nearest neighbor and linear SVMs. Each is described in more detail below.

Part 1: Tiny images representation and nearest neighbor classifier

Accuracy achieved: 22.5%

The tiny image representation simply resizes each image to a small, fixed resolution, 16x16 pixels in this case. This obviously loses information, particularly the high frequencies in the image. However, it is very simple and its performance is not terrible. The code also makes the tiny images zero mean and unit length, which makes the representation more robust to brightness and contrast shifts. Doing this yields a roughly 2% improvement in accuracy.
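
As a rough sketch of this step (the function name and the assumption of grayscale input are illustrative, not necessarily the exact starter-code interface):

```matlab
% Tiny image representation: resize to 16x16, then zero-mean and unit-normalize.
% image_paths is a cell array of image file names (assumed grayscale here).
function image_feats = get_tiny_images(image_paths)
    dim = 16;
    image_feats = zeros(length(image_paths), dim * dim);
    for i = 1:length(image_paths)
        img  = im2single(imread(image_paths{i}));
        tiny = imresize(img, [dim dim]);          % discards high-frequency detail
        feat = tiny(:)';
        feat = feat - mean(feat);                 % zero mean: robust to brightness shifts
        feat = feat / (norm(feat) + eps);         % unit length: robust to contrast changes
        image_feats(i, :) = feat;
    end
end
```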

For the nearest neighbor classifier, we simply find the nearest training example to each test image and assign it that example's label. This method requires no training, so there is no preprocessing cost. However, it is susceptible to noise: if a training image is mislabeled or atypical, every test image whose nearest neighbor it happens to be will receive the wrong label.
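
A minimal sketch of the 1-nearest-neighbor step, assuming row-per-image feature matrices and VLFeat's vl_alldist2 for the pairwise distances:

```matlab
% 1-NN classification: each test image gets the label of its closest training image.
function predicted_categories = nearest_neighbor_classify(train_feats, train_labels, test_feats)
    % vl_alldist2 expects one column per data point, so transpose the row-per-image matrices.
    D = vl_alldist2(train_feats', test_feats');   % num_train x num_test (squared L2 by default)
    [~, nn_idx] = min(D, [], 1);                  % index of the nearest training example
    predicted_categories = train_labels(nn_idx);  % copy its label
end
```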

Results


Accuracy (mean of diagonal of confusion matrix) is 0.225

Analysis: The algorithm runs very fast (it completes in under one minute). The accuracy, although low, is surprisingly good for such a simple and seemingly naive method.


Part 2: Bag of SIFT representation and nearest neighbor classifier

Accuracy achieved: 53.1%

In this approach we reuse the nearest neighbor classifier, but instead of tiny images we use a better representation: a bag of SIFT features.

To do this, we first need to build a vocabulary. We collect SIFT features from every training image and feed them into vl_kmeans(), which runs k-means on the set of features and returns the cluster centroids; these centroids form our vocabulary. The vocabulary size makes a big difference to the final accuracy; see the Parameters Tuning section for how this parameter affects the results.
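
A sketch of the vocabulary construction, assuming grayscale training images and VLFeat on the path (the dense SIFT step size here is just an example to keep the run time reasonable):

```matlab
% Build the visual vocabulary: sample dense SIFT descriptors from all training
% images and cluster them with k-means; the centroids are the visual words.
function vocab = build_vocabulary(image_paths, vocab_size)
    all_feats = [];
    for i = 1:length(image_paths)
        img = im2single(imread(image_paths{i}));
        [~, feats] = vl_dsift(img, 'step', 10, 'fast');   % 128 x N descriptors
        all_feats = [all_feats, single(feats)];           %#ok<AGROW>
    end
    vocab = vl_kmeans(all_feats, vocab_size);             % 128 x vocab_size centroids
end
```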

Once we have the vocabulary, we can represent each image as a histogram of visual words (the cluster centroids in our case). We again use vl_dsift() to extract SIFT features, then vl_alldist2() and min() to find the closest cluster centroid for each feature and increment its count. Finally, we normalize the histogram so that an image does not end up with a different histogram simply because it is larger and yields more SIFT features.
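
A sketch of the histogram construction under the same assumptions (vocab stored with one 128-dimensional centroid per column, matching the sketch above):

```matlab
% Bag-of-SIFT representation: assign each dense SIFT descriptor to its nearest
% visual word and accumulate a normalized histogram per image.
function image_feats = get_bags_of_sifts(image_paths, vocab)
    vocab_size  = size(vocab, 2);
    image_feats = zeros(length(image_paths), vocab_size);
    for i = 1:length(image_paths)
        img = im2single(imread(image_paths{i}));
        [~, feats] = vl_dsift(img, 'step', 5, 'fast');
        D = vl_alldist2(vocab, single(feats));        % vocab_size x num_descriptors
        [~, word_idx] = min(D, [], 1);                % nearest centroid for each descriptor
        counts = histcounts(word_idx, 1:vocab_size + 1);
        image_feats(i, :) = counts / (norm(counts) + eps);   % normalize away image size
    end
end
```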

Results


Accuracy (mean of diagonal of confusion matrix) is 0.531

Analysis: The algorithm runs a lot slower because of the training stage (building the vocabulary and extracting dense SIFT features). However, the accuracy is much higher, more than double that achieved with tiny images.


Part 3: Bag of SIFT representation and linear SVM classifier

Accuracy achieved: 68.0%

The nearest neighbor classifier has high variance and low bias, which shows in its sensitivity to noise. One way to combat this is to use linear classifiers, which sit at the opposite end of the spectrum with low variance and high bias. They have their own weaknesses, but in practice they fare much better here because they handle noise efficiently.

In this particular project, we implement 1-vs-all linear SVMs. We have 15 categories and train one linear SVM per category. All 15 SVMs are evaluated on each test image, and the most confidently positive classifier wins. This method has a free regularization parameter, lambda, which strongly affects the results. The implementation is made relatively simple by vl_svmtrain(): we only need to build a label vector of +1/-1 for each category, where +1 means a training image belongs to that category and -1 means it does not. After that, we simply use the weight vector W and the offset B returned by vl_svmtrain() to compute confidence scores; the maximum score indicates the category assigned to a test image.
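
A sketch of the 1-vs-all training and evaluation, with an illustrative lambda value and variable names:

```matlab
% 1-vs-all linear SVMs: train one classifier per category with vl_svmtrain,
% then let the most confidently positive classifier label each test image.
function predicted_categories = svm_classify(train_feats, train_labels, test_feats, categories)
    lambda = 0.0001;                                  % free regularization parameter
    num_categories = length(categories);
    W = zeros(size(train_feats, 2), num_categories);
    B = zeros(1, num_categories);
    for c = 1:num_categories
        % +1 for training images of this category, -1 for everything else
        y = 2 * double(strcmp(train_labels, categories{c})) - 1;
        [W(:, c), B(c)] = vl_svmtrain(train_feats', y(:)', lambda);
    end
    % Confidence of every classifier on every test image; the highest score wins.
    scores = test_feats * W + repmat(B, size(test_feats, 1), 1);
    [~, best] = max(scores, [], 2);
    predicted_categories = categories(best);
end
```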

Results


Accuracy (mean of diagonal of confusion matrix) is 0.680

Per-category results (the sample training and true positive thumbnails are image-only; the labels below are the captions of the false positive and false negative examples):

Category name | Accuracy | False positives (true label)  | False negatives (predicted label)
Kitchen       | 0.530    | LivingRoom, Store             | Store, Store
Store         | 0.560    | LivingRoom, Kitchen           | LivingRoom, LivingRoom
Bedroom       | 0.410    | LivingRoom, InsideCity        | LivingRoom, Store
LivingRoom    | 0.440    | Bedroom, Kitchen              | Store, Kitchen
Office        | 0.830    | Bedroom, Kitchen              | Bedroom, LivingRoom
Industrial    | 0.580    | Kitchen, Coast                | Coast, Forest
Suburb        | 0.950    | Street, Mountain              | Store, InsideCity
InsideCity    | 0.650    | Kitchen, Kitchen              | Kitchen, Store
TallBuilding  | 0.780    | Store, Bedroom                | Street, InsideCity
Street        | 0.610    | TallBuilding, Forest          | Industrial, Highway
Highway       | 0.810    | Store, Coast                  | OpenCountry, OpenCountry
OpenCountry   | 0.540    | Coast, Coast                  | Highway, Coast
Coast         | 0.790    | Office, OpenCountry           | OpenCountry, OpenCountry
Mountain      | 0.780    | OpenCountry, Street           | Coast, OpenCountry
Forest        | 0.940    | OpenCountry, Industrial       | Street, Mountain

Parameters tuning

lambda: Roughly speaking, the smaller the lambda, the higher the accuracy.

vocab size: If the vocabulary size is too small, we don't have enough "words" to describe an image. However, a larger vocabulary does not strictly mean higher accuracy: increasing the vocabulary size raises the variance while lowering the bias. Here a vocabulary of 1000 words works well, but that may not hold if we keep increasing the size. I didn't test larger values because the 1000-word vocabulary already takes about an hour to run.

vocab size | 10    | 20    | 50    | 100   | 200   | 400   | 1000
accuracy   | 0.443 | 0.530 | 0.607 | 0.625 | 0.624 | 0.617 | 0.680