Project 5 / Face Detection with a Sliding Window

[Figure: HoG template at 50,000 samples]

For Project 5, we were required to implement a face detection system using a sliding window. This included processing training data to compute both positive and negative feature representations, training a linear SVM classifier to create a HoG template, and creating a detector that classifies millions of sliding windows.

Positive HoG Features from Training Examples

The first step in the face detection pipeline was to create a set of HoG features from the given training data (faces). This was supposed to be a fairly simple bit to write, though it caused me lots of trouble (remember to add your features to your returned variable, folks!). It involved iterating over the given images, computing a HoG feature for each using the VLFeat library, and reshaping the feature according to the given template dimensionality.


% Build one HoG feature row per positive training image
D = (feature_params.template_size / feature_params.hog_cell_size)^2 * 31;
features_pos = zeros(length(image_files), D);
for i = 1:length(image_files)
    image = single(imread(fullfile(train_path_pos, image_files(i).name)));
    hog = vl_hog(image, feature_params.hog_cell_size);
    features_pos(i,:) = reshape(hog, 1, D); % <- the culprit
end

Negative HoG Features

The next step was to create HoG features from the given set of non-face training data. This was essentially the same process as creating the positive HoG features, but with two levels of randomness. The first was to randomly select images from the given non-face set. The second was, within each image, to take random crops of a predetermined size (feature_params.template_size, or 36x36) and create a HoG feature from each. This section was also fairly straightforward, though it had some curveballs. The main design decision I ran into was how to get the correct total number of samples while also handling multiple samples per image.
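A minimal sketch of that sampling loop is below. The variable names follow the starter code, but the even per-image split, the grayscale conversion, and the file pattern are my own illustration:

% Sketch: sample random template-sized crops from the non-face scenes
image_files = dir(fullfile(non_face_scn_path, '*.jpg'));
D = (feature_params.template_size / feature_params.hog_cell_size)^2 * 31;
samples_per_image = ceil(num_samples / length(image_files));
features_neg = zeros(num_samples, D);
count = 0;
for i = 1:length(image_files)
    img = imread(fullfile(non_face_scn_path, image_files(i).name));
    if size(img,3) > 1, img = rgb2gray(img); end
    img = single(img);
    [h, w_img] = size(img);
    t = feature_params.template_size;
    for s = 1:samples_per_image
        if count >= num_samples, break; end
        r = randi(h - t + 1);      % random top-left corner of the crop
        c = randi(w_img - t + 1);
        crop = img(r:r+t-1, c:c+t-1);
        hog = vl_hog(crop, feature_params.hog_cell_size);
        count = count + 1;
        features_neg(count,:) = reshape(hog, 1, D);
    end
end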

I had to decide on a value of num_samples that was a good compromise between speed and accuracy. The initial value of 10,000 gave me good results once properly implemented, so when I went back to try to improve it I wasn't sure how much of a gain to expect. After trying 10,000, 15,000, 20,000, and 50,000, I decided that 15,000 gave a small improvement in performance (roughly 1% on average) and anything more hardly improved the scores beyond that. It also seemed that at extremely high counts (50,000+) I was getting worse results, which hints at a problem in my get_random_negative_features code that I wasn't able to track down. The exact numbers were difficult to pin down due to the randomness of the negative HoG implementation.

Training Linear SVM

The next step was to train a linear SVM to distinguish positive features from negative ones. This was done with a call to vl_svmtrain on our training data, along with an array of 1s and -1s labeling the positive and negative features respectively.

When testing different values of lambda for the SVM training function, I started with 0.0001 as the project code suggests. After trying 0.001, I noticed approximately a 1-2% increase in accuracy across multiple test runs. I then tried 0.01, which brought on a significant decrease in performance, so I settled on 0.001.


% my linear svm training code: examples as columns, labels +1 / -1
lambda = 0.001;
data = [features_pos', features_neg'];
labels = [ones(1, size(features_pos,1)), -ones(1, size(features_neg,1))];
[w, b] = vl_svmtrain(data, labels, lambda);

[Figures: HoG template at 15,000 samples and HoG template at 50,000 samples]

As you can see from the above images, the difference between the templates is pretty minimal while the time increase is significant. The only difference I can see visually is that the 50,000-sample template has slightly more visible or precise lines outlining the structures of the face. I was also pleasantly surprised after stumbling upon a Piazza post saying to squint when looking at these; I hadn't seen the resemblance to a face as clearly until I did that.
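For reference, one way to produce template visualizations like these is VLFeat's built-in HoG renderer. This is a sketch, assuming the standard 31-channel HoG layout used above:

% Render the learned SVM weights w as a HoG template image
cells = feature_params.template_size / feature_params.hog_cell_size;
template = single(reshape(w, [cells, cells, 31]));
imagesc(vl_hog('render', template)); colormap gray; axis image;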

Detector

The final step in piecing this pipeline together was to create a detector that evaluated a series of test images based on the output of the linear SVM training. The initial task was to analyze the test images at a single scale, which gave mediocre results. The steps were (a sketch follows the list):

1. Compute the HoG features of the test image.
2. Slide a template-sized window across the HoG cell grid and score each window with the learned w and b.
3. Keep the windows whose confidence exceeds the threshold.
4. Convert the surviving cell coordinates back into pixel bounding boxes and run non-maximum suppression.
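Here is a minimal sketch of that single-scale scan. The test-file iteration names (test_scn_path, test_files) are illustrative, and the bounding-box bookkeeping and non-max suppression are elided:

% Score every template-sized window in the HoG cell grid
img = single(imread(fullfile(test_scn_path, test_files(i).name)));
cell_size = feature_params.hog_cell_size;
n_cells = feature_params.template_size / cell_size;
hog = vl_hog(img, cell_size);
for r = 1:(size(hog,1) - n_cells + 1)
    for c = 1:(size(hog,2) - n_cells + 1)
        window = hog(r:r+n_cells-1, c:c+n_cells-1, :);
        score = reshape(window, 1, []) * w + b;  % SVM confidence
        if score > threshold
            % cell coords -> pixel bounding box [x_min y_min x_max y_max]
            bbox = [(c-1)*cell_size + 1, (r-1)*cell_size + 1, ...
                    (c-1)*cell_size + feature_params.template_size, ...
                    (r-1)*cell_size + feature_params.template_size];
            % ... record bbox and score for non-max suppression ...
        end
    end
end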

After implementing the single-scale method, the next task was to use the same system to analyze images at different scales, in order to catch faces whose size doesn't match the HoG template well. This was done by repeating the scanning process outlined above on each test image multiple times while progressively scaling the image down.
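Structurally, that amounts to a loop like the following sketch, where scale_ratio is the parameter discussed below; dividing each recovered box by the current scale maps it back to the original image:

% Repeat the single-scale scan at progressively smaller scales
scale = 1.0;
while scale * min(size(img,1), size(img,2)) >= feature_params.template_size
    scaled = imresize(img, scale);
    % ... run the single-scale scan above on `scaled`, then divide
    % each recovered bbox by `scale` to map it back to the original ...
    scale = scale * scale_ratio;
end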

Selecting parameters for this portion was particularly difficult because of the randomness introduced by the get_random_negative_features code, the need to strike a balance between false positives (red boxes on the image) and high precision, and keeping the code fast enough to maintain sanity. The primary parameters to tweak were the confidence threshold and the scale ratio (which controlled how many scales each image would be compared at). I tried multiple threshold values between 0.3 and 0.9 and decided on 0.58 as a balance that limited false positives while still maintaining high precision. For my scale ratio, I chose 0.85 after trial and error, as it gave good results while maintaining the speed of the code. Combined, these parameters give me roughly 83% average precision with less than 2 minutes of runtime. I also tested with a 4-pixel cell size and was able to reach 89% precision, but with a runtime of over 3 minutes.

More Results

[Figures: precision curves for single scale, multi scale, and multi scale with a 4-pixel cell size]


[Figure: Argentina test case using the multi-scale detector]