This project covers scene recognition using two different image representations and two different classifiers. The fifteen scene categories include natural scenes such as mountain, forest, and coast; urban scenes such as highway and tall building; and indoor scenes such as kitchen and office. The accuracy of these recognition techniques ranges from approximately 20% using the simplest representation and classifier to almost 70% using the more advanced techniques.
The tiny image representation works by resizing each image to a very small (16 x 16) resolution. These resized images are used as training and test data. A collection of tiny training images, training labels, and tiny test images is passed to the nearest neighbor classifier. This classifier uses a pairwise distance calculation to find the training image whose features are most similar to each test image, and assigns the test image that training image's label. My implementation was actually most accurate with 1 nearest neighbor, but the code is written so that k can take other values.
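The two steps above can be sketched as follows. My implementation is in MATLAB; this is an equivalent NumPy sketch (function names are my own, and the resize uses simple block averaging rather than a library resampler):

```python
import numpy as np

def tiny_image_features(images, size=16):
    """Resize each grayscale image to size x size and flatten it."""
    feats = []
    for img in images:
        h, w = img.shape
        # crop to a multiple of `size`, then average each block of pixels
        img = img[: h - h % size, : w - w % size]
        bh, bw = img.shape[0] // size, img.shape[1] // size
        tiny = img.reshape(size, bh, size, bw).mean(axis=(1, 3))
        feats.append(tiny.ravel())
    return np.array(feats)

def nearest_neighbor_classify(train_feats, train_labels, test_feats):
    """Label each test image with the label of its closest training image (k = 1)."""
    # pairwise squared Euclidean distances, shape (n_test, n_train)
    d = ((test_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(axis=2)
    return [train_labels[i] for i in d.argmin(axis=1)]
```

Extending this to k > 1 would amount to taking the majority label among the k smallest distances per row.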
Accuracy (mean of diagonal of confusion matrix) is 0.201
Per-category results (sample image thumbnails omitted):

Category name | Accuracy | False positives (true label) | False negatives (predicted label)
---|---|---|---
Kitchen | 0.030 | Suburb, Suburb | OpenCountry, Coast
Store | 0.020 | Kitchen, Forest | OpenCountry, Coast
Bedroom | 0.090 | LivingRoom, Kitchen | Coast, Coast
LivingRoom | 0.070 | Store, Highway | Coast, Street
Office | 0.110 | Mountain, Forest | Forest, TallBuilding
Industrial | 0.090 | Coast, TallBuilding | OpenCountry, Highway
Suburb | 0.170 | OpenCountry, InsideCity | Bedroom, Kitchen
InsideCity | 0.030 | LivingRoom, Bedroom | Forest, Forest
TallBuilding | 0.130 | Kitchen, Office | Industrial, Highway
Street | 0.430 | LivingRoom, Suburb | Forest, Forest
Highway | 0.600 | Coast, Industrial | OpenCountry, OpenCountry
OpenCountry | 0.360 | Kitchen, Street | Highway, Forest
Coast | 0.440 | Kitchen, Mountain | Mountain, Forest
Mountain | 0.190 | LivingRoom, OpenCountry | Forest, Office
Forest | 0.260 | LivingRoom, Street | OpenCountry, OpenCountry
Since tiny images are a poor means of scene representation, the next step is to implement a more sophisticated technique called a bag of SIFT representation. SIFT stands for Scale-Invariant Feature Transform, and is an algorithm for detecting and describing local features in images. This required first creating a vocabulary of visual words that image features can be compared against. To build it, SIFT features were sampled from each training image (I chose 100 features per image), using a step size greater than 1 (I chose 10) to avoid huge computation time. Once features were sampled from all the images, the cluster centroids were found using k-means clustering. These centroids are the visual word vocabulary.
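The vocabulary-building step can be sketched as follows. My implementation uses MATLAB/VLFeat; this is a NumPy sketch with a minimal Lloyd's k-means (the function name and iteration count are my own, and `descriptors` stands for the SIFT descriptors already pooled from all training images):

```python
import numpy as np

def build_vocabulary(descriptors, vocab_size=100, iters=20, seed=0):
    """Cluster pooled SIFT descriptors with k-means; the centroids are the visual words.

    descriptors: (n, 128) array of SIFT descriptors sampled from the
    training images (in 2-D here for illustration; real SIFT is 128-D).
    """
    rng = np.random.default_rng(seed)
    # initialize centroids from randomly chosen distinct descriptors
    centers = descriptors[rng.choice(len(descriptors), vocab_size, replace=False)].astype(float)
    for _ in range(iters):
        # assign each descriptor to its nearest centroid
        d = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        # move each centroid to the mean of its assigned descriptors
        for k in range(vocab_size):
            pts = descriptors[assign == k]
            if len(pts):
                centers[k] = pts.mean(axis=0)
    return centers
```

In practice a library k-means (e.g. VLFeat's) is faster and handles empty clusters more carefully, but the logic is the same.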
My choice of 100 as a number of SIFT features to detect when building the vocabulary was somewhat arbitrary, but I also tested 220 (as suggested in the homework assignment) as a parameter. I found that 220 performed very slightly better using the nearest neighbor classifier (0.532 vs. 0.531) and slightly worse using the SVM classifier (0.687 vs. 0.693). This increase or decrease in performance was overall negligible and I chose to keep this parameter as 100.
The bag of SIFT representation algorithm goes through a set of images and generates a matrix of SIFT features for each. In this case, the step size is smaller (I chose 5) for finer-grained sampling. I built a histogram with each bin corresponding to a visual word from the pre-determined vocabulary. For each SIFT feature in the image, I found the closest visual word and incremented that bin in my histogram. The image features returned are a normalized histogram of visual words found in each image. (Note: the bag of SIFT representation takes several minutes to run.)
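The histogram-building step described above can be sketched like this (again a NumPy stand-in for my MATLAB code; the function name is my own, and `image_descriptors` stands for the per-image dense SIFT output):

```python
import numpy as np

def bags_of_sifts(image_descriptors, vocab):
    """Encode each image as a normalized histogram over the visual word vocabulary.

    image_descriptors: list of (n_i, 128) descriptor arrays, one per image;
    vocab: (V, 128) array of centroids from the vocabulary-building step.
    """
    feats = []
    for desc in image_descriptors:
        # nearest visual word for every descriptor in the image
        d = ((desc[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
        words = d.argmin(axis=1)
        # one bin per visual word; count occurrences
        hist = np.bincount(words, minlength=len(vocab)).astype(float)
        # L2-normalize so image size does not dominate the representation
        hist /= np.linalg.norm(hist) or 1.0
        feats.append(hist)
    return np.array(feats)
```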
Per-category results (sample image thumbnails omitted):

Category name | Accuracy | False positives (true label) | False negatives (predicted label)
---|---|---|---
Kitchen | 0.470 | Bedroom, Industrial | LivingRoom, LivingRoom
Store | 0.460 | TallBuilding, InsideCity | Kitchen, Street
Bedroom | 0.200 | LivingRoom, Store | Office, LivingRoom
LivingRoom | 0.400 | Bedroom, Industrial | Industrial, Store
Office | 0.850 | LivingRoom, Bedroom | LivingRoom, LivingRoom
Industrial | 0.330 | Store, Highway | Store, Store
Suburb | 0.870 | Highway, TallBuilding | Coast, OpenCountry
InsideCity | 0.400 | Kitchen, Store | Street, Street
TallBuilding | 0.400 | Street, Industrial | Store, Forest
Street | 0.560 | InsideCity, InsideCity | Kitchen, Forest
Highway | 0.650 | Suburb, InsideCity | Industrial, Mountain
OpenCountry | 0.400 | Highway, Coast | Coast, Industrial
Coast | 0.520 | OpenCountry, Highway | Mountain, OpenCountry
Mountain | 0.510 | Street, Suburb | Forest, Highway
Forest | 0.950 | Mountain, Mountain | OpenCountry, Mountain
To achieve even better accuracy, 1-vs-all linear SVMs can be used as the classifier. For each unique category (corresponding to the scene labels), a set of binary training labels is created by setting labels matching the current category to +1 and every other label to -1. An SVM is trained on the training data and modified labels (along with a parameter lambda, discussed in more detail below), and returns weights and a bias such that the score W' * X(:, i) + B has the sign of labels(i) for all i (from the MATLAB vl_svmtrain documentation). After performing this for each category and storing the weights and biases, the best category for each test image is found by computing the confidence score W*X + B for every category and returning the category with the highest score.
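The 1-vs-all scheme can be sketched as follows. My implementation calls VLFeat's vl_svmtrain; here the trainer is replaced by a minimal hinge-loss subgradient descent in NumPy (function names, learning rate, and iteration count are my own), so this shows the structure rather than the exact solver:

```python
import numpy as np

def train_svm(X, y, lam=1e-4, lr=0.01, iters=200):
    """Minimal linear SVM: subgradient descent on hinge loss + lam/2 * ||w||^2.

    X: (n, d) features; y: (n,) labels in {+1, -1}. Stand-in for vl_svmtrain.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        viol = y * (X @ w + b) < 1  # points inside the margin contribute
        gw = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / n
        gb = -y[viol].sum() / n
        w -= lr * gw
        b -= lr * gb
    return w, b

def svm_classify(train_feats, train_labels, test_feats, lam=1e-4):
    """1-vs-all: train one binary SVM per category, pick the most confident."""
    cats = sorted(set(train_labels))
    W, B = [], []
    for cat in cats:
        # +1 for the current category, -1 for everything else
        y = np.where(np.array(train_labels) == cat, 1.0, -1.0)
        w, b = train_svm(train_feats, y, lam)
        W.append(w)
        B.append(b)
    # confidence scores W*X + B for every category, shape (n_test, n_cats)
    scores = test_feats @ np.array(W).T + np.array(B)
    return [cats[i] for i in scores.argmax(axis=1)]
```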
The parameter lambda made a huge difference in the accuracy of the SVM classifier. I found that a very small value, 0.0001, performed the best, with an accuracy of 0.693. Making lambda smaller (0.00001) reduced accuracy to 0.657, and as lambda grew, accuracy generally decreased: 0.603 at lambda = 0.001, 0.469 at 0.01, 0.521 at 0.1, and 0.315 at 1.
Per-category results (sample image thumbnails omitted):

Category name | Accuracy | False positives (true label) | False negatives (predicted label)
---|---|---|---
Kitchen | 0.650 | InsideCity, Office | Office, Bedroom
Store | 0.560 | LivingRoom, Bedroom | TallBuilding, Bedroom
Bedroom | 0.430 | Kitchen, LivingRoom | Industrial, Kitchen
LivingRoom | 0.330 | Bedroom, Street | Bedroom, Office
Office | 0.920 | LivingRoom, Kitchen | Kitchen, Kitchen
Industrial | 0.510 | Highway, Bedroom | OpenCountry, InsideCity
Suburb | 0.950 | Mountain, Mountain | OpenCountry, Street
InsideCity | 0.520 | Store, Industrial | Office, Industrial
TallBuilding | 0.790 | Industrial, Industrial | LivingRoom, Street
Street | 0.790 | Bedroom, LivingRoom | Highway, TallBuilding
Highway | 0.800 | OpenCountry, Industrial | Coast, Coast
OpenCountry | 0.500 | TallBuilding, Highway | Coast, Mountain
Coast | 0.850 | OpenCountry, Highway | Bedroom, OpenCountry
Mountain | 0.850 | Bedroom, OpenCountry | Suburb, LivingRoom
Forest | 0.940 | Mountain, TallBuilding | Mountain, Mountain