Project 5 / Face Detection with a Sliding Window

The purpose of this project is to implement a face detection algorithm using a sliding window with histogram of oriented gradients (HoG) features and a support vector machine (SVM). This project is divided into four sections.

  1. Build positive training samples (w/ features)
  2. Build negative training samples (w/ features) using random sampling of (sub-)scenes
  3. Construct (SVM) classifier
  4. Run sliding window classification with non-maximum suppression and scaling

In order to perform supervised machine learning with an SVM, we need to provide features and their corresponding classifications. Feature extraction is important in machine learning, and here we use the HoG feature descriptor on images. HoG divides an image into subsections (cells), similar to SIFT (the scale-invariant feature transform descriptor), and counts gradient orientation occurrences in each cell; overlapping local contrast normalization is then applied. HoG was the go-to descriptor for object detection before ConvNets took over. In this project, the CalTech Web Face and SUN scene databases were used to build positive (value = 1) face samples and negative (value = -1) scene samples. All face samples were 32x32 grayscale images, but the scene samples had varying sizes, so multiple random patches from each scene were used to train the SVM model. A linear SVM (from the VLFeat library) was chosen, as research has shown SVMs to perform well with HoG feature descriptors.
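As a rough sketch of the feature extraction step (assuming VLFeat's vl_hog and the template/cell sizes listed below; the file name is made up for illustration), the HoG descriptor for one face crop can be computed like this:

% Sketch: HoG descriptor for one positive (face) example using VLFeat.
img = im2single(imread('face_0001.jpg'));      % hypothetical face crop
img = imresize(img, [36 36]);                  % resize to the template size
hog = vl_hog(img, 6);                          % 6x6 grid of cells, 31 values each
feat = hog(:)';                                % flatten into a 1 x 1116 feature vector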

I initialized parameters as follows:


num_negative_examples = 10000; % Number of random scene examples
LAMBDA = 1e-5; % SVM parameter
scales = [1, .89, .55, .34, .21, .13, .08, .05]; % Scales used in run_detector.m
MIN_CONF = 0; % Minimum value required from SVM testing to be recognized as a face
feature_params = struct('template_size', 36, 'hog_cell_size', 6); % HoG template and cell sizes used

Additional Face Examples

In addition to the CalTech Web Face database, I also added 11765 face examples from the LFW (Labeled Faces in the Wild) dataset, grayscaled and resized to 32x32.
You can see these saved on my website: https://fluf.me/cv/additional_faces/
Warning: it may freeze your browser for a minute since it loads 11765 files.
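The preprocessing was straightforward; a minimal sketch (the file paths here are hypothetical):

% Sketch: convert one LFW image to a grayscale 32x32 crop.
img = imread('lfw/some_person_0001.jpg');      % hypothetical input path
if size(img, 3) == 3
    img = rgb2gray(img);                       % LFW images are color
end
img = imresize(img, [32 32]);                  % match the size of the other face samples
imwrite(img, 'additional_faces/some_person_0001.jpg');   % hypothetical output path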

The top row shows faces from LFW and the bottom row faces from CalTech Web Face.
The difference is easy to see: the LFW images include not only the face but also the shoulders and background. Not surprisingly, average precision dropped by 17.1%, on average, when the LFW dataset was combined with the CalTech Web Face dataset.
I decided to use only CalTech Web Face. The combined set likely would have performed better if the LFW images had been cropped more tightly around the faces.

Varying number of negative examples

In get_random_negative_features.m, I sampled a number of random patches from each scene image, chosen so that the total across all scenes matches the requested number of negative examples. I was told that increasing the number of negative examples should improve precision; however, my detector appeared to perform worse. This may be due to class imbalance, as SVMs are known to perform poorly on unbalanced datasets. Since the patches are selected at random, I ran four iterations per configuration.
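A minimal sketch of the per-scene sampling (variable names are mine and do not necessarily match the original get_random_negative_features.m):

% Sketch: sample random template-sized patches from one scene and compute HoG features.
num_per_scene = ceil(num_negative_examples / num_images);
img = im2single(imread(fullfile(scene_dir, image_files(i).name)));
if size(img, 3) == 3, img = rgb2gray(img); end
t = feature_params.template_size;
D = (t / feature_params.hog_cell_size)^2 * 31;
feats = zeros(num_per_scene, D, 'single');
for j = 1:num_per_scene
    r = randi(size(img, 1) - t + 1);           % random top-left corner of the patch
    c = randi(size(img, 2) - t + 1);
    patch = img(r:r+t-1, c:c+t-1);
    hog = vl_hog(patch, feature_params.hog_cell_size);
    feats(j, :) = hog(:)';
end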

I chose 10000 negative examples, which gave 0.83675 average precision.
Note: There were 6713 positive examples.

Varying SVM lambda parameter

The linear SVM's lambda is a regularization parameter that controls the trade-off between bias and overfitting to the training examples. I tried multiple values of lambda, and the results are shown below.
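Training itself is a single call to VLFeat's vl_svmtrain; roughly (assuming the positive and negative HoG features are stacked row-wise in pos_feats and neg_feats):

% Sketch: train the linear SVM on the HoG features.
X = single([pos_feats; neg_feats]');                              % D x N feature matrix
Y = [ones(size(pos_feats, 1), 1); -ones(size(neg_feats, 1), 1)];  % +1 for faces, -1 for scenes
[w, b] = vl_svmtrain(X, Y, LAMBDA);                               % LAMBDA is the value being varied
train_conf = X' * w + b;                                          % signed confidence for each example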

I chose lambda = 1e-6, which improved average precision by 0.01825, to 0.855.

Varying number of scales

We are not guaranteed that the faces in test images appear at the same scale as our training examples. Therefore, I rescaled each image by different ratios ([1, .89, .55, .34, .13, .08, .05, .02, .01]) and ran the sliding window detector at each scale. I chose these ratios based on the Fibonacci sequence.
The chart below shows performance as a function of the number of scales, starting with the 1:1 ratio and appending smaller ratios in order.
As the number of scales increased, performance generally improved.
I chose the scales [1, .89, .55, .34, .13, .08, .05, .02], with an average precision of 0.855.
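Conceptually, the multi-scale loop in run_detector.m looks something like the sketch below (simplified; w and b are the trained SVM parameters, and detections are mapped back to original-image coordinates by dividing by the scale):

% Sketch: multi-scale sliding window detection for one test image.
cell_sz = feature_params.hog_cell_size;
n_cells = feature_params.template_size / cell_sz;         % template width in cells
cur_bboxes = zeros(0, 4);  cur_confs = zeros(0, 1);
for s = scales
    hog = vl_hog(imresize(img, s), cell_sz);
    for r = 1:(size(hog, 1) - n_cells + 1)
        for c = 1:(size(hog, 2) - n_cells + 1)
            window = hog(r:r+n_cells-1, c:c+n_cells-1, :);
            conf = window(:)' * w + b;                     % SVM score for this window
            if conf > MIN_CONF
                x1 = ((c-1)*cell_sz + 1) / s;  y1 = ((r-1)*cell_sz + 1) / s;
                x2 = ((c-1+n_cells)*cell_sz) / s;  y2 = ((r-1+n_cells)*cell_sz) / s;
                cur_bboxes(end+1, :) = [x1, y1, x2, y2];   % box in original-image coordinates
                cur_confs(end+1, 1) = conf;
            end
        end
    end
end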

Varying MIN_CONF

Whenever a test sample is fed into our SVM, we get a (classification) confidence value; I set the face class value to 1 and the scene class to -1. I varied the cutoff value for face classification to see whether precision improved. After this filtering process, we also perform non-maximum suppression per image to keep the most confident detections while removing multiple detections of the same face at the same location across scales.
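After thresholding at MIN_CONF (as in the loop above), duplicate detections are removed with a non-maximum suppression step; roughly (I assume here the non_max_supr_bbox helper shipped with the starter code, so substitute your own NMS routine if your code base differs):

% Sketch: suppress overlapping detections, keeping the most confident ones.
is_max = non_max_supr_bbox(cur_bboxes, cur_confs, size(img));   % logical mask of boxes to keep
cur_bboxes = cur_bboxes(is_max, :);
cur_confs  = cur_confs(is_max, :);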

I chose MIN_CONF = -0.1, with an average precision of 0.86.

Varying HoG pixel cell size

The greatest precision improvement came from varying the HoG cell size. As we decrease the cell size, we should, in theory, capture more spatial detail at the cost of a larger feature vector and longer runtime.
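To make that trade-off concrete (assuming vl_hog's default 31 values per cell and the 36-pixel template), the feature dimension grows quadratically as the cell size shrinks:

% Feature dimension of a 36x36 template for different HoG cell sizes.
for cell_sz = [6, 4, 3]
    n_cells = (feature_params.template_size / cell_sz)^2;
    fprintf('cell size %d -> %d cells -> %d-dimensional feature\n', cell_sz, n_cells, n_cells * 31);
end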

A HoG cell size of 3 pixels performed the best, with an average precision of 0.917.
Below are some of the results with this final configuration.