The purpose of this project is to implement a face detection algorithm using a sliding window with histogram of oriented gradients (HoG) features and a support vector machine (SVM). This project is divided into four sections.
In order to perform supervised machine learning with an SVM, we need to provide features and their corresponding labels. Feature extraction is important in machine learning, and here we use the HoG feature descriptor on images. HoG divides an image into subsections (cells), similar to SIFT (the scale-invariant feature transform descriptor), and counts gradient orientation occurrences in each cell; overlapping local contrast normalization is then applied across blocks of cells. HoG was popular for object detection before ConvNets became a thing (LOL). In this project, the CalTech Web Face database and the SUN scene database were used to build positive (label = 1) face samples and negative (label = -1) scene samples. All face samples were 32x32 grayscale images, but the scene images had varying sizes, so multiple random patches from each scene were used to train the SVM. A linear SVM model (from the VLFeat library) was chosen, as research has shown linear SVMs to perform well with HoG features.
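The per-cell orientation histogram at the core of HoG can be sketched in a few lines. This is an illustrative Python/numpy toy, not the VLFeat implementation this project actually uses, and it omits the overlapping block normalization for brevity:

```python
import numpy as np

def cell_orientation_histograms(img, cell_size=6, n_bins=9):
    """Toy HoG-style descriptor: per-cell histograms of gradient
    orientation, weighted by gradient magnitude. Omits the overlapping
    block normalization of the full HoG for brevity."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    # unsigned orientation folded into [0, pi)
    ori = np.mod(np.arctan2(gy, gx), np.pi)
    h, w = img.shape
    ch, cw = h // cell_size, w // cell_size
    hist = np.zeros((ch, cw, n_bins))
    bin_idx = np.minimum((ori / np.pi * n_bins).astype(int), n_bins - 1)
    for i in range(ch):
        for j in range(cw):
            sl = (slice(i * cell_size, (i + 1) * cell_size),
                  slice(j * cell_size, (j + 1) * cell_size))
            for b in range(n_bins):
                hist[i, j, b] = mag[sl][bin_idx[sl] == b].sum()
    return hist

feat = cell_orientation_histograms(np.random.rand(36, 36), cell_size=6)
print(feat.shape)  # (6, 6, 9)
```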
I initialized parameters as follows:
num_negative_examples = 10000; % Number of random scene examples
LAMBDA = 1e-5; % SVM parameter
scales = [1, .89, .55, .34, .21, .13, .08, .05]; % Scales used in run_detector.m
MIN_CONF = 0; % Minimum value required from SVM testing to be recognized as a face
feature_params = struct('template_size', 36, 'hog_cell_size', 6); % HoG template and cell sizes used
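To illustrate how the scales list drives the detector, here is a hedged Python sketch of multi-scale window enumeration. The step of one HoG cell and the mapping of coordinates back to the original image are assumptions for illustration, not the actual code of run_detector.m:

```python
import numpy as np

def sliding_windows(img_shape, template=36, step=6, scales=(1, .89, .55)):
    """Enumerate (row, col, size) of template-sized windows over a
    virtually rescaled image; the step matches the HoG cell size so
    windows stay aligned with the cell grid."""
    boxes = []
    h, w = img_shape
    for s in scales:
        sh, sw = int(h * s), int(w * s)
        for r in range(0, sh - template + 1, step):
            for c in range(0, sw - template + 1, step):
                # map the window back to original-image coordinates
                boxes.append((r / s, c / s, template / s))
    return boxes

boxes = sliding_windows((240, 320))
print(len(boxes) > 0)  # True
```

Smaller scale factors make the fixed 36x36 template cover a larger region of the original image, which is how a single template detects faces of different sizes.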
In addition to the CalTech Web Face database, I also added 11765 face examples from the LFW dataset, grayscaled and resized to 32x32. You can see them saved on my website: https://fluf.me/cv/additional_faces/ Warning: it may freeze your browser for a minute, since the page loads 11765 files.
The first row shows faces from LFW and the bottom row faces from CalTech Web Face. The difference is easy to tell: LFW images do not contain only the face, but also shoulders and background. Not surprisingly, average precision dropped by 17.1% on average when the LFW dataset was combined with CalTech Web Face, so I decided to use only CalTech Web Face. It likely would have performed better if the LFW images had been cropped more tightly around the faces.
In get_random_negative_features.m, for each scene image I sampled a number of patches determined by the total number of negative examples divided by the number of scenes. I was told that increasing the number of negative examples should improve precision; however, mine appeared to perform worse. This may be due to class imbalance, as SVMs are known to perform poorly on unbalanced datasets. Since the patches are selected at random, I ran four iterations per configuration.
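A rough Python equivalent of the sampling logic described above (the function name and the even-split strategy are illustrative, not the actual MATLAB code):

```python
import numpy as np

def random_negative_patches(scenes, num_negatives, patch=36, rng=None):
    """Draw roughly num_negatives patch x patch crops, spread evenly
    across the scene images (mirrors the idea in
    get_random_negative_features.m)."""
    rng = rng or np.random.default_rng(0)
    per_scene = int(np.ceil(num_negatives / len(scenes)))
    patches = []
    for img in scenes:
        h, w = img.shape
        for _ in range(per_scene):
            r = rng.integers(0, h - patch + 1)
            c = rng.integers(0, w - patch + 1)
            patches.append(img[r:r + patch, c:c + patch])
    return patches[:num_negatives]

scenes = [np.zeros((100, 150)), np.zeros((80, 90))]
negs = random_negative_patches(scenes, 10)
print(len(negs))  # 10
```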
A linear SVM's lambda is a regularization parameter that controls the trade-off between a large margin and fitting the training examples: the smaller the lambda, the more the model can overfit to the training set. I tried multiple values of lambda, and the results are shown below.
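To make the role of lambda concrete, here is a toy hinge-loss linear SVM trained by subgradient descent in Python. VLFeat's vl_svmtrain uses different solvers, so this is only a sketch of the objective that lambda weights, not the project's training code:

```python
import numpy as np

def train_linear_svm(X, y, lam=1e-5, lr=0.01, epochs=200, rng=None):
    """Toy linear SVM: minimize lam/2 * ||w||^2 + mean hinge loss by
    stochastic subgradient descent. lam plays the role of LAMBDA above."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:
                # hinge is active: pull w toward the example
                w -= lr * (lam * w - y[i] * X[i])
                b += lr * y[i]
            else:
                # only the regularizer shrinks w
                w -= lr * lam * w
    return w, b

# separable toy data: label is the sign of the first feature
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] > 0, 1, -1)
w, b = train_linear_svm(X, y)
acc = np.mean(np.sign(X @ w + b) == y)
print(acc)
```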
Whenever a test sample is fed into our SVM, we get a classification score. I set the face class to 1 and the scene class to -1, and varied the cutoff score for face classification to look for improvements in precision. After this filtering step, we also perform non-maximum suppression per image to keep the most confident detections while removing the multiple overlapping detections that appear at the same location across scales.
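Greedy non-maximum suppression can be sketched as follows (an illustrative Python version; the IoU threshold of 0.3 is an assumed value, not one taken from the project code):

```python
import numpy as np

def nms(boxes, confidences, iou_thresh=0.3):
    """Greedy non-maximum suppression: keep the most confident box,
    drop boxes that overlap it too much, repeat. Boxes are
    (r1, c1, r2, c2)."""
    boxes = np.asarray(boxes, float)
    order = np.argsort(confidences)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # IoU of the top box with every remaining box
        r1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        c1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        r2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        c2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(r2 - r1, 0, None) * np.clip(c2 - c1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]
    return keep

dets = [(0, 0, 36, 36), (2, 2, 38, 38), (100, 100, 136, 136)]
print(nms(dets, [0.9, 0.8, 0.7]))  # [0, 2]
```

The second box overlaps the first heavily, so only the more confident of the two survives, while the distant third box is kept.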
The greatest precision improvement came from varying the HoG cell size. As we decrease the cell size, we should, in theory, capture more spatial detail, at the cost of a larger feature dimension and longer runtime.
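Assuming VLFeat's vl_hog (UoCTTI variant) with 31 values per cell, the feature dimension of the 36x36 template grows quickly as the cell size shrinks, which is where the runtime cost comes from:

```python
# Feature dimension of a 36x36 HoG template as the cell size shrinks,
# assuming 31 values per cell as in VLFeat's vl_hog (UoCTTI variant).
template = 36
for cell in (6, 4, 3):
    cells_per_side = template // cell
    print(cell, cells_per_side ** 2 * 31)  # 6 -> 1116, 4 -> 2511, 3 -> 4464
```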