Project 5 / Face Detection with a Sliding Window

This computer vision project involved using sliding-window techniques to find faces within images. These techniques are largely based on Dalal and Triggs' 2005 paper, Histograms of Oriented Gradients for Human Detection.

Getting Positive Features

Getting positive features is relatively straightforward: the code takes all of the 32x32 face crops from the Caltech Faces dataset and converts them into HoG features with a given cell size (parameter tuning is discussed later in this writeup). These HoG features are then returned for SVM training.
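The submitted code is MATLAB (vl_hog), but this step can be sketched in Python with skimage's hog. The function name, the 9-orientation setting, the 1x1 block layout, and the cell size of 8 are my illustrative choices, not the submitted code:

```python
import numpy as np
from skimage.feature import hog

def get_positive_features(face_images, cell_size=8):
    """Compute one HoG descriptor per 32x32 face crop.

    face_images: iterable of 32x32 grayscale arrays (e.g. Caltech Faces crops).
    cell_size: HoG cell size in pixels; the actual tuned value is discussed
    later in the writeup.
    """
    feats = []
    for img in face_images:
        f = hog(img,
                orientations=9,
                pixels_per_cell=(cell_size, cell_size),
                cells_per_block=(1, 1),  # 1x1 blocks: each cell is its own feature
                feature_vector=True)
        feats.append(f)
    return np.vstack(feats)
```

With a cell size of 8, each 32x32 crop yields a 4x4 grid of cells and thus a 4 * 4 * 9 = 144-dimensional descriptor.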

Getting Negative Features

Getting negative features simply requires a dataset of images that don't contain faces (here, the SUN scene database): I randomly sample 32x32 patches from those images and convert them into HoG features with a given cell size. These features are then used in conjunction with the positive features to train a linear SVM that decides whether a given HoG feature is a face.
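As above, a rough Python sketch of the sampling step (the helper name, RNG handling, and cell size are my own illustration):

```python
import numpy as np
from skimage.feature import hog

def get_random_negative_features(scene_images, num_samples, cell_size=8,
                                 patch=32, rng=None):
    """Sample random 32x32 patches from non-face scenes and HoG-encode them.

    scene_images: list of 2D grayscale arrays with no faces in them.
    """
    rng = np.random.default_rng(rng)
    feats = []
    while len(feats) < num_samples:
        img = scene_images[rng.integers(len(scene_images))]
        h, w = img.shape
        if h < patch or w < patch:
            continue  # scene too small to hold a patch; try another
        y = rng.integers(h - patch + 1)
        x = rng.integers(w - patch + 1)
        feats.append(hog(img[y:y + patch, x:x + patch],
                         orientations=9,
                         pixels_per_cell=(cell_size, cell_size),
                         cells_per_block=(1, 1),
                         feature_vector=True))
    return np.vstack(feats)
```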

Classifier Training

Given a set of positive features and a set of negative features, a Support Vector Machine finds a hyperplane that separates the two. The SVM returns a weight vector W and offset B such that, for a given HoG feature, W*HoG+B is a real number: negative values imply the feature is likely not a face, positive values imply it probably is, and the distance from zero indicates the confidence. To sanity-check the SVM, we can evaluate it on the training data:

Initial classifier performance on train data:
accuracy:1.000
true positive rate:0.170
false positive rate:0.000
true negative rate:0.830
false negative rate:0.000

This perfect accuracy suggests some overfitting to the training data, but in practice (as seen below) the classifier generalizes relatively well. (Note that the rates above appear to be reported as fractions of all training examples rather than per-class rates, which is why the true positive and true negative rates sum to 1.)
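The training step itself can be sketched with scikit-learn's LinearSVC; the mapping from this writeup's lambda regularizer to sklearn's C parameter is an approximation on my part, not the submitted (VLFeat-based) code:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_classifier(pos_feats, neg_feats, lam=1e-4):
    """Fit a linear SVM; returns (w, b) so that w @ feat + b is the face confidence."""
    X = np.vstack([pos_feats, neg_feats])
    y = np.concatenate([np.ones(len(pos_feats)), -np.ones(len(neg_feats))])
    # sklearn penalizes C * sum(hinge losses); roughly the inverse of a
    # per-sample lambda regularizer (illustrative conversion)
    clf = LinearSVC(C=1.0 / (lam * len(y)))
    clf.fit(X, y)
    return clf.coef_.ravel(), float(clf.intercept_)
```

Once trained, the sign of `w @ feat + b` classifies a feature and its magnitude serves as the confidence.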

Face Detection

There are several steps to face detection, so I'll break it into parts. First I'll explain a simple, single-scale version of the face detector, and then I'll explain how I generalized it to multiple scales. Single-scale detection turns an image into a grid of HoG cells via a single call to vl_hog. Once I have these HoG cells, I iterate over the grid and check every (feature_width / cell_size) by (feature_width / cell_size) square of HoG cells to see whether it looks like a face: using the W and B found by the SVM as explained above, I compute the confidence that the current bounding box contains a face. I then threshold, keeping any bounding box above a certain confidence as a detected face.
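A Python sketch of this single-scale sweep (the skimage-based encoding and the coordinate bookkeeping are my illustrative choices; the original uses vl_hog, and the default cell size of 8 with a 4-cell template assumes a 32-pixel face window):

```python
import numpy as np
from skimage.feature import hog

def detect_single_scale(img, w, b, cell_size=8, template_cells=4, threshold=0.9):
    """Slide a template_cells x template_cells window over the HoG cell grid.

    Returns (x_min, y_min, x_max, y_max, confidence) boxes above threshold,
    in pixel coordinates of the input image.
    """
    n_orient = 9
    cells = hog(img, orientations=n_orient,
                pixels_per_cell=(cell_size, cell_size),
                cells_per_block=(1, 1),  # 1x1 blocks: one feature per cell
                feature_vector=False)
    # collapse the singleton block axes -> (rows, cols, orientations)
    grid = cells.reshape(cells.shape[0], cells.shape[1], n_orient)
    boxes = []
    rows, cols = grid.shape[:2]
    for r in range(rows - template_cells + 1):
        for c in range(cols - template_cells + 1):
            # flatten the window the same way the training features were flattened
            feat = grid[r:r + template_cells, c:c + template_cells].ravel()
            conf = float(feat @ w) + b
            if conf > threshold:
                boxes.append((c * cell_size, r * cell_size,
                              (c + template_cells) * cell_size,
                              (r + template_cells) * cell_size,
                              conf))
    return boxes
```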

Once I've found the faces at the default scale, I rescale the image and find faces in the scaled image using the exact pipeline explained above. I keep track of the current scale and, at the end, scale the resulting bounding boxes back up accordingly. Once I have all of the bounding boxes for every scale, I run non-maximum suppression to get rid of redundant detections.
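The multi-scale loop and a greedy non-maximum suppression can be sketched as follows, assuming a hypothetical single-scale detector detect_fn(image, w, b) that returns (x_min, y_min, x_max, y_max, confidence) tuples; the scale handling and IoU threshold are illustrative:

```python
import numpy as np
from skimage.transform import rescale

def detect_multiscale(img, w, b, detect_fn, scale_factor=0.7, min_size=32):
    """Run detect_fn over an image pyramid, mapping boxes back to original coords."""
    boxes = []
    scale = 1.0
    scaled = img
    while min(scaled.shape) >= min_size:  # stop once smaller than one template
        for (x0, y0, x1, y1, conf) in detect_fn(scaled, w, b):
            # divide by the scale to map back into the original image
            boxes.append((x0 / scale, y0 / scale, x1 / scale, y1 / scale, conf))
        scale *= scale_factor
        scaled = rescale(img, scale)  # always resample from the original
    return boxes

def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max, ...) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def non_max_suppression(boxes, iou_thresh=0.3):
    """Greedy NMS: keep the most confident box, drop boxes overlapping it."""
    boxes = sorted(boxes, key=lambda bx: bx[4], reverse=True)
    kept = []
    for cand in boxes:
        if all(iou(cand, k) < iou_thresh for k in kept):
            kept.append(cand)
    return kept
```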

Parameter Tuning:

I spent a long time tuning parameters for this project, because there were many to tune and the best value for a given parameter seemed to be dependent on the values of the others, so I had to find some kind of local optimum over all of them. I found that lambda = .0001, a confidence threshold of .9, 50,000 negative features, and a scale factor of .7 applied repeatedly at every scale possible for a given image (i.e. until the downscaled image was smaller than a single HoG template) led to the best combination of run time and accuracy. The results with these parameters are shown below.

Results:

Face template HoG visualization for my turned-in code. As you can see by squinting, the HoG visualization vaguely resembles a face.

Precision-recall curve for my code.

Results on some of the test images:

Analysis of results:

As you can see, the face detector was able to find faces effectively at many different scales, and did a good job avoiding duplicate detections thanks to non-maximum suppression. Unfortunately, there are still quite a few false positives and false negatives; how many of each depends on whether you choose to prioritize precision or recall. If we only consider the most confident matches (the top 5% or so), then precision is 100%. If, on the other hand, we want high recall (let's say 90%), precision drops to around 30%. Of course, it's easy to get either 100% precision or 100% recall (by returning no bounding boxes, or by returning every bounding box as a face, respectively), but that isn't interesting; it's important to look at the precision-recall curve as a whole. The average precision (here 81.8%) is the single number that most effectively sums up the results. Looking at the precision-recall curve, we can see that if we want, say, 70% recall, we can get 85% precision with this pipeline. Although this may not be great, the pipeline runs quickly and could be implemented in a handheld camera to decide focus points.
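The numbers quoted here come from the project's evaluation code; purely as an illustration, the same threshold tradeoff and average precision can be computed from detection confidences with scikit-learn (the tiny example in the sketch is synthetic, not my actual detections):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

def summarize_detections(confidences, is_true_face):
    """Precision-recall tradeoff over the confidence threshold, plus average precision.

    confidences: SVM confidence for each detection (higher = more face-like).
    is_true_face: 1 if the detection matched a ground-truth face, else 0.
    """
    precision, recall, thresholds = precision_recall_curve(is_true_face, confidences)
    ap = average_precision_score(is_true_face, confidences)
    return precision, recall, ap
```

Sweeping the threshold traces out the whole curve, which is why a single operating point (like "90% recall at 30% precision") tells only part of the story.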

One of the areas where the face detector had a lot of issues is grainy parts of the image: in a noisy background there is enough texture for the detector to sometimes think there is a face. This is difficult to solve, although using more training data seems to improve the situation, so this may be a place where training the SVM on a very large dataset is necessary. Fortunately, using more data to train the SVM doesn't make the detection part of the pipeline take any longer; it only lengthens the one-time training step.

Results on class images:

Below are the results of the pipeline on the images of our class. I did have to modify the initial scale to get these results (otherwise my pipeline would find faces in the patterns in the carpet), but this seems reasonable given that in real life you probably don't want to focus on faces that are very far away (and correspondingly small) anyway. I also tuned the parameters slightly to get the best qualitative results for this image, changing the confidence threshold from the .9 used above to 2.

It's interesting to me that the algorithm is convinced that the patch on the wall between the two projection screens is a face.