Project 5 / Face Detection with a Sliding Window

Bounding boxes for facial recognition displayed on top of the ground truth. Ground truth in yellow, true positives in green, false positives in red.

Facial recognition is rather difficult for bag of words algorithms. As it turns out, a powerful method for recognizing specific objects like faces is to use histograms of oriented gradients (HOG). Given good training data, this can produce a solid stencil describing what we want to find. However, even with this stencil, we still have the issue of matching it to faces in an image. The naive way to tackle this would be to try every possible placement of the stencil by sliding a window across the image and checking at each point. While this seems inefficient, one factor helps out: the ratio of image size to number of faces is usually quite large, so most of the sliding windows are going to end up rejected. Hence, if we can come up with a quick way to rule out a particular window, we still have a relatively efficient algorithm. One way of doing this employs the steps listed below:

  1. Collect HOGs defining the positive features
  2. Collect HOGs defining the negative features
  3. Use these HOGs to develop a linear SVM classifier
  4. Test the classifier on the training data to determine effectiveness
  5. Run detections on new test sets to verify accuracy

Linear SVM classifiers can reject windows quickly, so they are a good way of tackling the issue. Some of these steps don't require much elaboration, so we will leave them as is. Only the ones we coded in this project will be expanded on.

Collecting Positive Features

The first step is to train our algorithm to figure out what a face looks like. To do this, we extract features from a large number of pictures of faces, all the same size, which gives us a solid foundation to build on. One setback is that it is difficult to collect a large set of good facial images to train on. The way I mitigated this issue was by extracting features from both each image and its mirror. Its mirror is still a valid image of a face, so we get twice the information this way.
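A minimal sketch of this step is shown below. It is not the exact project code: it assumes the face crops live as .jpg files under a train_path_pos directory, uses the same feature_params struct as the later snippets, and relies on VLFeat's vl_hog for the features.

% Gather HOG features from every face crop and its horizontal mirror
image_files = dir(fullfile(train_path_pos, '*.jpg'));
num_images = length(image_files);
% Length of one flattened HOG template (31 values per cell with vl_hog's
% default variant)
D = (feature_params.template_size / feature_params.hog_cell_size)^2 * 31;
features_pos = zeros(2*num_images, D, 'single');
for i = 1:num_images
    path = fullfile(train_path_pos, image_files(i).name);
    image = single(imread(path))/255;
    if(size(image,3) > 1)
        image = rgb2gray(image);
    end
    % HOG of the original face
    HOG = vl_hog(image, feature_params.hog_cell_size);
    features_pos(2*i-1,:) = reshape(HOG, 1, D);
    % HOG of the mirrored face -- still a valid face, so the data doubles
    HOG = vl_hog(fliplr(image), feature_params.hog_cell_size);
    features_pos(2*i,:) = reshape(HOG, 1, D);
end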

Collecting Random Negative Features

Now that we know what a face looks like, we still need data to teach our algorithm what a face doesn't look like. Luckily, finding images without faces is easy, as they can be literally anything else. The more depth and variety in the non-face images we find, the better the algorithm will be at identifying true negatives. Hence, I expanded this set of images by mirroring again to collect more negative data. Furthermore, I was able to flip the images in all directions, since orientation is no longer an issue for non-faces. This makes memory and speed the only limiting factors. Even though there is a large amount of negative data available, we only sample from it so as not to lose speed or waste memory. For this implementation, I sampled 20,000 negative images.

% D is the length of one flattened HOG template (31 values per cell with
% vl_hog's default variant); preallocate storage for up to 2*num_samples
% negative features so the indexing into allNegs below is valid.
D = (feature_params.template_size / feature_params.hog_cell_size)^2 * 31;
N = 0;
allNegs = zeros(2*num_samples, D, 'single');

% The four flip combinations: none, horizontal, vertical, and both.
flippy = {[false,false],[false,true],[true,false],[true,true]};
for flipper = 1:4
    toFlip = flippy{flipper};
    for i = 1:num_images
        path = fullfile(non_face_scn_path, image_files(i).name);
        image = single(imread(path))/255;
        if(size(image,3) > 1)
            image = rgb2gray(image);
        end
        if(toFlip(1) == true)
            image = flipdim(image,1);
        end
        if(toFlip(2) == true)
            image = flipdim(image,2);
        end
        frames = im2col(image, [feature_params.template_size feature_params.template_size], 'distinct');
        for f = 1:size(frames,2)
            fCol = frames(:,f);
            frame = reshape(fCol, [feature_params.template_size feature_params.template_size]);
            HOG = vl_hog(frame,feature_params.hog_cell_size);
            row = reshape(HOG,1,D);
            N = N + 1;
            allNegs(N,:) = row;
            %avoid memory issues and keep sampling to different original
            %images as much as possible
            if(N >= 2*num_samples)
                break;
            end
        end
        if(N >= 2*num_samples)
            break;
        end
    end
    if(N >= 2*num_samples)
        break;
    end
end
chosen = randperm(N,num_samples);
features_neg = allNegs(chosen,:);

Linear SVM Classifier

Now that we have the data stored as HOG features, all we have to do is train a classifier. For this implementation, I chose a linear SVM. Essentially, the SVM learns weights so that the weighted sum of a face's features comes out near +1 and the weighted sum for everything else comes out near -1. This classifier managed to correctly identify 99.8% of the training set. The code for this is displayed below. Lambda is a regularization parameter that can also be played around with to improve accuracy, but I didn't adjust it much for this project; the value below works well enough.

pos = size(features_pos,1);
neg = size(features_neg,1);
% vl_svmtrain expects one example per column, so stack the transposed features
X = [features_pos', features_neg'];
% Label faces +1 and non-faces -1
Y = [ones(1,pos), -1 * ones(1,neg)];
lambda = 0.0001;
[w, b] = vl_svmtrain(X, Y, lambda);
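As a quick sanity check of the 99.8% figure, the training accuracy can be estimated by scoring the training matrix with the learned w and b and comparing the sign of each score with its label. A small sketch reusing the variables above:

% Score every training example with the learned linear model
confidences = w' * X + b;
% A prediction counts as correct when the sign of its score matches the label
accuracy = mean(sign(confidences) == Y);
fprintf('Training accuracy: %.1f%%\n', 100 * accuracy);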

Running Detector on Test Set

This is where we actually try locating faces in new images. These test images have had their faces prelabelled in advance, so it's a good way to test our algorithm for false positives and false negatives. A simple implementation slides a window the size of the template across the entire image and tests the HOG of each window with our linear SVM to see if the score is high enough to warrant the label of a face. Hence, we need a threshold to determine what counts as high enough; this is one variable which has a large impact on the accuracy of the algorithm. However, using a single scale isn't good enough. Many images have faces that are smaller in the background, or close to the camera and larger. The window size we're using could be too big or too small to properly detect a face, so we have to find some way to detect faces at different window sizes. The problem with changing the window size itself is that the features would no longer match the dimensions of the linear classifier. Instead, I scaled down the image after every pass, ran the detector again, and then rescaled the bounding boxes of any positive hits back to the original image coordinates. This scaling factor is also something I played around with to try and improve accuracy.
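The sketch below illustrates this multiscale loop for a single test image. It is not the exact detector code: it assumes a grayscale image already converted to single precision, the trained w and b from above, a chosen score threshold, and the same feature_params struct; the resulting boxes would still need non-maximum suppression to remove overlapping duplicates.

scale = 1.0;
cur_bboxes = zeros(0,4);
cur_confidences = zeros(0,1);
cell_size = feature_params.hog_cell_size;
cells_per_template = feature_params.template_size / cell_size;
while min(size(image,1), size(image,2)) * scale >= feature_params.template_size
    % HOG of the whole downscaled image; each candidate window is a block of cells
    scaled = imresize(image, scale);
    HOG = vl_hog(scaled, cell_size);
    for r = 1:(size(HOG,1) - cells_per_template + 1)
        for c = 1:(size(HOG,2) - cells_per_template + 1)
            window = HOG(r:r+cells_per_template-1, c:c+cells_per_template-1, :);
            score = reshape(window, 1, []) * w + b;
            if score > threshold
                % Convert cell coordinates back to pixels in the original image
                x_min = ((c-1) * cell_size + 1) / scale;
                y_min = ((r-1) * cell_size + 1) / scale;
                x_max = (c-1+cells_per_template) * cell_size / scale;
                y_max = (r-1+cells_per_template) * cell_size / scale;
                cur_bboxes(end+1,:) = [x_min, y_min, x_max, y_max];
                cur_confidences(end+1,1) = score;
            end
        end
    end
    scale = scale * 0.9;   % shrink the image so larger faces fit the template
end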

Results

I played around with lots of different values for the threshold and scaling factor, as well as the HOG cell size. The scaling factor seemed to be a trade off between accuracy and speed; I finally settled on scaling the image down by a factor of 0.9 at every iteration as the best combination of the two. As it turned out, different thresholds work better for different cell sizes. A threshold of 1 got me the best results for a cell size of 6, but a threshold of 0.5 fared better than 1 for cell sizes of 4 and 3. The table below compares the results of running the detector at different settings for a few images. The first row is the optimal setting of the single scale detector, while the remaining three rows are the optimal settings for multiscale detectors with cell sizes of 6, 4, and 3. Sometimes it seems like the results actually get worse, but if you look closely each progressive step captures more true positives than the previous one.

Face template HoG visualization for cell sizes of 6, 4, and 3.

Precision Recall curves for best single scale, multiscale with a cell size of 6, then 4, then 3.

As can be seen from the figures above, the classifier gets more and more accurate as the HOG cell size gets smaller. The features become much more detailed; the last stencil almost looks like a face as is. The only downside is that the smaller cell sizes take much longer to run.