This project involved building a pipeline that classifies images into one of 15 possible scene categories. Given 1500 training images, 100 of each category, I:
- Created a "tiny image" feature extractor as a baseline: each image was simply resized to 16x16 and flattened into a length-256 vector, which served as the feature.
- Created a vocabulary of "visual words," i.e., representative SIFT features. The vocabulary was built by sampling random SIFT features from random training images until a target number of samples was reached, then clustering those features into a chosen number of clusters, called the "vocabulary size."
- Generated a feature representation for each training and testing image using the vocabulary: extract the image's SIFT features, assign each one to its nearest visual word in the vocabulary, and count how many instances of each visual word occur in the image, yielding a histogram.
- Built two different classifiers:
- A Nearest Neighbor (1-NN) classifier, which assigns a test instance the label of its closest training example in feature space.
- Support Vector Machines (SVMs). I trained a "one-vs-all" SVM for each category and classified each test instance with the label whose SVM produced the most positive confidence.
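The tiny-image baseline above can be sketched as follows. This is a minimal NumPy version that assumes a grayscale image array and downsamples by sampling a uniform pixel grid (the actual code may have used a proper library resize with anti-aliasing); the zero-mean/unit-norm normalization at the end is a common optional refinement:

```python
import numpy as np

def tiny_image(img, size=16):
    # Assumes img is a 2-D grayscale array. Pick a uniform grid of pixel
    # indices (nearest-neighbor downsampling), flatten to size*size values.
    h, w = img.shape
    rows = np.linspace(0, h - 1, size).astype(int)
    cols = np.linspace(0, w - 1, size).astype(int)
    feat = img[np.ix_(rows, cols)].astype(float).ravel()
    # Zero-mean, unit-length normalization (helps slightly in practice).
    feat -= feat.mean()
    norm = np.linalg.norm(feat)
    return feat / norm if norm > 0 else feat
```

For a 16x16 target this produces the length-256 feature vector used as the baseline representation.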
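The vocabulary-building step amounts to running k-means over the sampled descriptors; the resulting cluster centers are the "visual words." A minimal sketch using plain-NumPy Lloyd iterations (the real pipeline presumably used a library k-means on actual 128-dimensional SIFT descriptors; here `descriptors` is just a generic (N, D) array):

```python
import numpy as np

def build_vocabulary(descriptors, vocab_size, iters=20, seed=0):
    # Lloyd's k-means: initialize centers from random descriptors, then
    # alternate nearest-center assignment and center re-estimation.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(descriptors), vocab_size, replace=False)
    centers = descriptors[idx].astype(float)
    for _ in range(iters):
        # Squared distances from every descriptor to every center: (N, K).
        d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for k in range(vocab_size):
            pts = descriptors[assign == k]
            if len(pts):  # leave empty clusters at their old position
                centers[k] = pts.mean(0)
    return centers  # the (vocab_size, D) visual-word vocabulary
```

The pairwise-distance broadcast is memory-hungry for large sample counts, which is one reason the sampling step caps the number of descriptors before clustering.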
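Given a vocabulary, the bag-of-words representation of one image reduces to nearest-word assignment plus counting. A sketch, assuming `descriptors` is the (N, D) array of that image's SIFT features and `vocab` the cluster centers from the previous step; normalizing the histogram (an assumption here, but standard) keeps images with different feature counts comparable:

```python
import numpy as np

def bag_of_words(descriptors, vocab):
    # Assign each descriptor to its nearest visual word.
    d2 = ((descriptors[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(1)
    # Count occurrences of each word, then normalize to sum to 1.
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()
```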
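The 1-NN classifier is a few lines once features are in hand; a sketch, assuming row-per-example feature matrices:

```python
import numpy as np

def nearest_neighbor_predict(train_feats, train_labels, test_feats):
    # Squared Euclidean distance from each test row to each training row.
    d2 = ((test_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(-1)
    # Each test instance takes the label of its single closest neighbor.
    return train_labels[d2.argmin(1)]
```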
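The one-vs-all SVM scheme can be sketched as below. The real pipeline most likely used a library SVM solver; this version trains each binary linear SVM by hinge-loss subgradient descent (the hyperparameters `lr`, `reg`, and `epochs` are illustrative, not the project's actual values). What matters for the scheme is the last line: each test instance goes to the class whose SVM scores it most positively.

```python
import numpy as np

def train_ova_svms(X, y, n_classes, lr=0.05, reg=1e-4, epochs=1000):
    # One linear SVM per class: label this class +1, all others -1,
    # and minimize L2-regularized hinge loss by subgradient descent.
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])  # append a constant bias feature
    W = np.zeros((n_classes, d + 1))
    for c in range(n_classes):
        t = np.where(y == c, 1.0, -1.0)
        w = np.zeros(d + 1)
        for _ in range(epochs):
            margins = t * (Xb @ w)
            mask = margins < 1  # examples currently violating the margin
            grad = reg * w - (t[mask, None] * Xb[mask]).sum(0) / n
            w -= lr * grad
        W[c] = w
    return W

def predict_ova(W, X):
    # Score every test instance under every class's SVM; the most
    # positive confidence wins.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ W.T).argmax(1)
```

Taking the argmax of raw decision values works because all the binary SVMs score the same feature space; no probability calibration is needed just to pick a label.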
The list of scenes was: Kitchen, Store, Bedroom, LivingRoom, Office, Industrial, Suburb, InsideCity, TallBuilding, Street, Highway, OpenCountry, Coast, Mountain, and Forest. One difficulty with this choice of classes is that several of them share similar qualities: Kitchen, Bedroom, and LivingRoom are all residential spaces; InsideCity and TallBuilding both refer to urban architecture; and Street and Highway both refer to roadways.