Project 5 / Face Detection with a Sliding Window

This project implements face detection, one of the most noticeable successes of computer vision. Following the approach of Dalal and Triggs, a sliding window is run over each test image independently, and every patch is classified as a face or not a face. The following steps are implemented:

  1. Generate positive HOG features
  2. Generate negative HOG features
  3. Train an SVM classifier using the above features
  4. Run the trained classifier on the test image set

Computer Vision - 2016 - Gatech (Easy)

Computer Vision - 2016 - Gatech (Hard - Detections on the screen are still good!!)

For both images, HOG cell size = 6, threshold = 1.5, Lambda = 0.0001.

Generate HOG features

Positive Images

The positive training set consists of 36x36 cropped face images from the Caltech Web Faces project, 6713 images in total. I used vl_hog to generate histogram of oriented gradients (HOG) features; one of its arguments is the cell size. As suggested in the project description, I also added the mirror image of every face, which doubles the positive training set to 13426 examples.
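The mirroring step can be sketched in a few lines. This is a minimal Python illustration, not the project's MATLAB code: images are assumed to be stored as lists of pixel rows, and `mirror` and `augment_with_mirrors` are hypothetical helper names.

```python
def mirror(image):
    """Return the left-right mirror of an image stored as a list of rows."""
    return [list(reversed(row)) for row in image]


def augment_with_mirrors(images):
    """Return the original images followed by their mirrored copies."""
    return list(images) + [mirror(img) for img in images]


# Toy 2x3 "image" standing in for a 36x36 face crop.
patch = [[1, 2, 3],
         [4, 5, 6]]
augmented = augment_with_mirrors([patch])
print(len(augmented))  # 2
print(augmented[1])    # [[3, 2, 1], [6, 5, 4]]
```

Applied to 6713 face crops, the same doubling gives the 13426 positive examples quoted above.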

Negative Features

Negative features come from the non-face scenes of Wu et al. and the SUN scene database. I cropped random 36x36 patches from these images and computed a HOG feature for each patch. Patches were drawn from all the images to obtain 10000 random negative features for SVM training.
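The random patch selection can be sketched as below. This Python sketch only picks the crop positions; the helper name, the fixed seed and the toy image sizes are assumptions for illustration, and the HOG computation on each crop is left out.

```python
import random


def sample_patch_corners(image_sizes, num_samples, patch=36, seed=0):
    """Pick (image_index, top, left) for random patch x patch crops.

    image_sizes is a list of (height, width) tuples. Images smaller than
    the patch are skipped.
    """
    rng = random.Random(seed)
    usable = [i for i, (h, w) in enumerate(image_sizes)
              if h >= patch and w >= patch]
    if not usable:
        raise ValueError("no image is large enough for the patch size")
    corners = []
    for _ in range(num_samples):
        i = rng.choice(usable)
        h, w = image_sizes[i]
        top = rng.randint(0, h - patch)    # randint is inclusive on both ends
        left = rng.randint(0, w - patch)
        corners.append((i, top, left))
    return corners


# Three hypothetical scene sizes; the 20x20 one is too small and is never used.
corners = sample_patch_corners([(480, 640), (20, 20), (100, 36)], 10000)
print(len(corners))  # 10000
```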

vl_hog configuration:

  vl_hog: descriptor: [6 x 6 x 31]
  vl_hog: glyph image: [126 x 126]
  vl_hog: number of orientations: 9
  vl_hog: variant: UOCTTI

Train SVM Classifier

The labeled HOG features from the previous section are used to train a linear SVM classifier. I trained it with the vl_svmtrain function from the VLFeat library. I varied the Lambda parameter, with the results shown below, and obtained the best average precision for Lambda = 0.00001. For these runs the SVM classifier threshold was 1 and the HOG cell size was 6.

Lambda (HOG cell size = 6)    Average precision (%)
0.01                          78.2
0.001                         81.4
0.0001                        82.1
0.00001                       83.4
0.000001                      83.2
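To make the role of Lambda concrete: vl_svmtrain is understood to minimize a regularized hinge loss in which Lambda weights the penalty on the size of w, so a smaller Lambda means weaker regularization. The Python sketch below evaluates an objective of that general form on toy data. The function name, the explicit bias term and the two toy samples are assumptions for illustration, not the project's MATLAB code.

```python
def svm_objective(w, b, features, labels, lam):
    """Return lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b)))."""
    reg = 0.5 * lam * sum(wi * wi for wi in w)
    hinge = 0.0
    for x, y in zip(features, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        hinge += max(0.0, 1.0 - y * score)   # zero once the margin is at least 1
    return reg + hinge / len(features)


# Two toy samples: the positive one is beyond the margin, the negative one is
# on the correct side but inside the margin, so it contributes a loss of 0.5.
features = [[2.0, 0.0], [-0.5, 0.0]]
labels = [1, -1]
obj = svm_objective([1.0, 0.0], 0.0, features, labels, lam=0.0001)
print(round(obj, 5))  # 0.25005
```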

Running trained Classifier on test image set

A test set of 130 images was used to measure the accuracy of the classifier. Faces were detected with the sliding window approach of Dalal and Triggs. To speed up the process, the entire image is converted to HOG feature space first, and the window is then moved over this HOG space for classification. I ran the tests for different HOG cell sizes. The classifier's output for each window is compared against a threshold to decide whether the window contains a face. Because faces can appear at different scales in an image, I run the detector at several scales to find as many faces as possible. A non-maximum suppression pass then removes redundant overlapping detections.
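Sliding the window in HOG space amounts to moving a template of cells over the cell grid and scoring each position with the linear classifier. The following Python sketch shows the idea on a tiny grid; the nested-list layout, the function name and the toy values are assumptions, and a real 36x36 template with cell size 6 would be 6 x 6 cells with 31 values per cell.

```python
def detect_in_hog_space(hog, template, bias, threshold):
    """Slide a t x t cell template over a HOG cell grid.

    hog:      rows x cols x dims nested lists (HOG of the whole image)
    template: t x t x dims nested lists (SVM weights reshaped to cells)
    Returns (row, col, confidence) for every window scoring above threshold.
    """
    t = len(template)
    rows, cols = len(hog), len(hog[0])
    detections = []
    for r in range(rows - t + 1):
        for c in range(cols - t + 1):
            score = bias
            for dr in range(t):
                for dc in range(t):
                    cell = hog[r + dr][c + dc]
                    weights = template[dr][dc]
                    score += sum(wv * xv for wv, xv in zip(weights, cell))
            if score > threshold:
                detections.append((r, c, score))
    return detections


# Toy 3x3 grid with one value per cell and a 2x2 template of ones.
hog = [[[0], [0], [0]],
       [[0], [1], [1]],
       [[0], [1], [1]]]
template = [[[1], [1]],
            [[1], [1]]]
print(detect_in_hog_space(hog, template, bias=-2.0, threshold=1.0))
# [(1, 1, 2.0)]
```

Because the HOG grid is computed once per image, each window only costs a dot product rather than a fresh feature extraction.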

I ran the detector on 10 different scales: 1, 0.9, 0.81, 0.72, 0.63, 0.54, 0.45, 0.36, 0.27, 0.18
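Detections found at a reduced scale have to be mapped back to the original image before overlapping boxes are pruned. The Python sketch below shows one way to do both steps. The greedy overlap rule, the 0.3 overlap limit, the scale of 0.5 in the example and all function names are assumptions; the non-max suppression used in the project may apply a different criterion.

```python
def window_to_bbox(row, col, cell_size, template_cells, scale):
    """Map a HOG-cell window found at a given scale back to original pixels."""
    x_min = col * cell_size / scale
    y_min = row * cell_size / scale
    side = template_cells * cell_size / scale
    return (x_min, y_min, x_min + side, y_min + side)


def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def non_max_suppression(boxes, scores, max_overlap=0.3):
    """Greedily keep the highest-scoring boxes that do not overlap a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= max_overlap for k in kept):
            kept.append(i)
    return kept


# A window at cell (row 2, col 3) found on a half-size image covers 72 pixels.
print(window_to_bbox(2, 3, cell_size=6, template_cells=6, scale=0.5))
# (36.0, 24.0, 108.0, 96.0)

boxes = [(0, 0, 36, 36), (3, 3, 39, 39), (100, 100, 136, 136)]
scores = [2.0, 1.5, 1.2]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

Smaller scales shrink the image, so the fixed 36x36 template covers a larger region of the original and picks up larger faces.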

The table below shows the effect of different threshold values on the average precision (Lambda = 0.0001, cell size = 6).

For lower threshold values the average precision was higher, but more false positives were generated.

SVM classifier threshold (HOG cell size = 6)    Average precision (%)
-1                                              84.7
-0.5                                            84.7
0                                               84.1
1                                               83.5
2                                               72.0
3                                               44.0
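The trade-off behind the threshold can be illustrated with a small Python sketch: lowering the threshold keeps more windows, which recovers more faces but also lets more false positives through. The scores and labels below are invented toy values, not measurements from this project, and the function name is hypothetical.

```python
def precision_recall_at(threshold, scores, is_face, num_faces):
    """Precision and recall of the detections scoring above threshold."""
    kept = [face for score, face in zip(scores, is_face) if score > threshold]
    true_positives = sum(kept)
    # With no detections, precision is reported as 1.0 by convention.
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / num_faces
    return precision, recall


# Six toy detections; four faces exist in total, one of which is never detected.
scores = [2.5, 1.8, 0.9, 0.2, -0.4, -0.8]
is_face = [True, True, False, True, False, False]

print(precision_recall_at(1.0, scores, is_face, num_faces=4))   # (1.0, 0.5)
print(precision_recall_at(-0.5, scores, is_face, num_faces=4))  # (0.6, 0.75)
```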

Results

I ran the test image set with different HOG cell sizes (the cell size is one of the arguments of vl_hog) to measure the average precision. As the cell size is reduced, the feature dimension increases and the speed goes down: computing the features took much longer, and classification also slowed with the higher-dimensional features. The average precision improved, but not significantly. Results for cell sizes 3, 4 and 6 are shown below.
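The growth in feature dimension follows directly from the template size. A short Python calculation, assuming the 36x36 template divides evenly into cells and that each cell contributes 31 values as in the [6 x 6 x 31] descriptor reported for cell size 6 (the function name is hypothetical):

```python
def hog_feature_length(template_px=36, cell_size=6, dims_per_cell=31):
    """Length of the flattened HOG descriptor for a square template."""
    cells = template_px // cell_size
    return cells * cells * dims_per_cell


for cell_size in (3, 4, 6):
    print(cell_size, hog_feature_length(cell_size=cell_size))
# 3 4464
# 4 2511
# 6 1116
```

Going from cell size 6 to cell size 3 therefore quadruples the length of every feature vector.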

HOG Cell Size = 3

Face template HOG visualization. HOG cell size = 3

Precision Recall curve on 10 scales, threshold = 0.5 and Lambda = 0.0003.

Example of detection on the test set

HOG Cell Size = 4

Face template HOG visualization. HOG cell size = 4

Precision Recall curve on 10 scales, threshold = 0.7 and Lambda = 0.0003.

Example of detection on the test set

HOG Cell Size = 6

Face template HOG visualization. HOG cell size = 6

Precision Recall curve on 10 scales, threshold = 0.8 and Lambda = 0.0001.

Example of detection on the test set

Extra Credits

Mine Hard Negatives

The SVM classifier is retrained with hard negative examples. The negative training images are searched for false positives above a certain confidence level. After obtaining the SVM classifier parameters (w and b) from the initial training set, I applied the same weights and bias to patches from the negative images to find hard negatives. The function loops over all the images, randomly selects patches, computes their HOG features and compares the classifier confidence with the set threshold. If a feature's confidence is above the threshold, it is kept as a hard example. I added a loop that ensures I collect at least 5000 hard examples.
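The selection rule for hard negatives can be sketched as follows. This Python sketch works on feature vectors that have already been extracted; the function name and the toy two-dimensional features are assumptions, and the retraining step on the enlarged negative set is not shown.

```python
def mine_hard_negatives(negative_features, w, b, threshold):
    """Return the negative features the current classifier scores above threshold."""
    hard = []
    for x in negative_features:
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        if score > threshold:
            hard.append(x)   # a confident false positive: worth retraining on
    return hard


# Toy weights and three non-face features; two of them look like faces to the model.
w, b = [1.0, -1.0], 0.0
negatives = [[0.2, 0.9], [1.5, 0.1], [0.9, 0.2]]
print(mine_hard_negatives(negatives, w, b, threshold=0.5))
# [[1.5, 0.1], [0.9, 0.2]]
```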

Observation: I saw very little improvement in precision. However, I could reduce the threshold to a negative value without getting more false positives. Audrey's image could now be detected, although with some false positives.

Face template HOG visualization. Number of HOG cells = 3 (Mine Hard Negatives)

Precision Recall curve on 10 scales, threshold = -0.6 and Lambda = 0.0003.

Example of detection on the test set

Mirroring, Rotating the Positive Training Set

To increase the number of positive training features, I took mirror images of the pictures using MATLAB's flipdim function. To enlarge the dataset further, I rotated the positive images by +5 and -5 degrees using MATLAB's imrotate function with options such as 'crop'. Rotation did not give any significant improvement, so I kept only the original and mirrored images in the positive features. (The rotation code is left commented out in the source.)

HOG Descriptor Implementation

I also tried the Dalal and Triggs formulation of HOG, starting from an implementation found online (https://www.mathworks.com/matlabcentral/fileexchange/28689-hog-descriptor-for-matlab). That implementation uses a cell size of 8, which I changed to 6. Each block has four cells, and 9 orientation bins with bilinear interpolation cover the range [0, pi]. The steps are as follows: gradients are computed in the X and Y directions, and the angle and magnitude are derived from them. Bins are assigned based on the angle, and the bin votes are weighted by the gradient magnitude. The histograms are then normalized (L1 followed by L2 norm). The resulting feature length is 5 x 5 x 9 x 4 = 900. The accuracy, however, was very poor. I tried changing the code, the parameter values and the scaling, but could not get a significant improvement.
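The magnitude-weighted voting with interpolation between neighbouring bins can be sketched for a single pixel as below. This Python sketch is not the File Exchange code: the function name, the placement of bin centres at (k + 0.5) bin widths and the wrap-around between the first and last bin are assumptions. The last lines reproduce the feature length of 900, assuming 2 x 2 cell blocks that shift by one cell.

```python
import math


def vote_into_bins(gx, gy, num_bins=9):
    """Split one pixel's gradient magnitude between its two nearest bins."""
    magnitude = math.hypot(gx, gy)
    angle = math.atan2(gy, gx) % math.pi        # unsigned orientation
    bin_width = math.pi / num_bins
    position = angle / bin_width - 0.5          # bin centres at (k + 0.5) widths
    lower = math.floor(position)
    frac = position - lower
    hist = [0.0] * num_bins
    hist[lower % num_bins] += magnitude * (1.0 - frac)
    hist[(lower + 1) % num_bins] += magnitude * frac
    return hist


# A horizontal gradient sits on the boundary between the last and first bin,
# so its magnitude of 1 is shared equally between them.
hist = vote_into_bins(1.0, 0.0)
print(round(hist[8], 6), round(hist[0], 6))  # 0.5 0.5

cells_per_side = 36 // 6                  # 6 cells across the template
blocks_per_side = cells_per_side - 1      # 2x2 cell blocks, one-cell stride
print(blocks_per_side ** 2 * 4 * 9)       # 900
```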

Conclusion

A face detection algorithm was implemented in this project. HOG features with an SVM classifier applied over a sliding window were used to detect faces. The experiments included running the detector with different HOG cell sizes (3, 4 and 6) and tuning Lambda (the SVM training parameter) and the SVM classifier threshold. Average precision is reported for different values of Lambda and the threshold. Many test iterations were performed to fine-tune the classifier for the test image dataset, as shown above. With smaller HOG cell sizes, average precision improved, but at the cost of slower training and testing. Changing the threshold affects the precision, and smaller values also produce more false positives. Hard negatives, the examples that lie close to the SVM hyperplane, make the classifier more robust: it produces fewer false positives even for small threshold values.