Project 5: Sliding Window Face Detection

Ian Buckley

November 11, 2016

1 Introduction

The purpose of this project was to implement a simplified sliding window face detection algorithm using histogram of oriented gradients (HOG) feature descriptors and linear SVM classification. The face detection pipeline is described first. Then, an implementation of the HOG feature descriptor is detailed. Lastly, results of the pipeline are presented for various parameter settings.

2 Approach

The sliding window face detection algorithm can be broken down into the following steps:

  1. Generating positive HOG feature descriptors
  2. Generating negative HOG feature descriptors
  3. Training a linear SVM classifier
  4. Sliding window face detection at multiple scales

Positive HOG Features. Positive HOG feature descriptors were generated by using vl_feat to compute HOG on approximately 6700 36×36 pixel face images. The default parameters were used, giving 6×6 pixel HOG cells (36 pixels per cell), each described by a 31-dimensional feature from the UoCTTI variant.
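As a worked check on the descriptor length: a 36×36 template divided into 6×6 pixel cells gives a 6×6 grid of cells, and at 31 values per cell each face yields 6 × 6 × 31 = 1116 values. The Python/NumPy sketch below is only an illustration (the original work used vl_feat, and `stack_hog_features` is a hypothetical name). It assumes each image's HOG has already been computed as a (rows, cols, 31) array and flattens those arrays into one training matrix, using random placeholder arrays instead of real HOG output.

```python
import numpy as np


def stack_hog_features(hogs):
    """Flatten a list of per-image HOG arrays of shape (rows, cols, dims)
    into an (N, rows * cols * dims) matrix, one row per image."""
    shapes = {h.shape for h in hogs}
    if len(shapes) != 1:
        raise ValueError(f"all HOG arrays must share one shape, got {shapes}")
    return np.stack([h.ravel() for h in hogs]).astype(np.float32)


# Placeholder data standing in for real HOG output: a 36x36 template with
# 6x6 pixel cells gives a 6x6 grid of cells, each with 31 UoCTTI values.
rng = np.random.default_rng(0)
fake_hogs = [rng.random((6, 6, 31)) for _ in range(5)]
features = stack_hog_features(fake_hogs)
print(features.shape)  # (5, 1116)
```

Any flattening order works as long as the same order is used at detection time. NumPy's default row-major `ravel` differs from MATLAB's column-major `(:)`, so features flattened by the two should not be mixed.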

Negative HOG Features. Negative HOG feature descriptors were generated using the same procedure as the positive HOG features, applied to 35000 36×36 image patches chosen at random from a set of approximately 250 images that contain no faces.
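Random negative sampling can be sketched as follows. This is a minimal illustration, assuming grayscale images stored as 2-D NumPy arrays and uniform sampling (first an image, then a top-left corner) at a single scale. The report does not say how patches were distributed across images or scales, and `sample_negative_patches` is a hypothetical name.

```python
import numpy as np


def sample_negative_patches(images, num_patches, patch_size=36, rng=None):
    """Crop `num_patches` random square patches from 2-D grayscale images
    that contain no faces, choosing an image and then a location uniformly."""
    rng = np.random.default_rng() if rng is None else rng
    # Images smaller than the patch cannot contribute a full crop.
    usable = [im for im in images if min(im.shape) >= patch_size]
    if not usable:
        raise ValueError("no image is at least patch_size pixels on each side")
    patches = np.empty((num_patches, patch_size, patch_size))
    for i in range(num_patches):
        im = usable[rng.integers(len(usable))]
        top = rng.integers(im.shape[0] - patch_size + 1)
        left = rng.integers(im.shape[1] - patch_size + 1)
        patches[i] = im[top:top + patch_size, left:left + patch_size]
    return patches


# Stand-ins for the face-free scene images; the 20x20 one is too small and is skipped.
rng = np.random.default_rng(0)
scenes = [rng.random((120, 160)), rng.random((200, 150)), rng.random((20, 20))]
negatives = sample_negative_patches(scenes, num_patches=100, rng=rng)
print(negatives.shape)  # (100, 36, 36)
```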

Classifier Training. Using the flattened positive and negative HOG feature descriptors, a linear SVM was trained to classify image patches as either containing a face or not. A regularization parameter of λ = 0.00001 was used for all tests.
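The report does not state which solver was used, so the following is only an assumed stand-in. It minimizes the regularized hinge loss λ/2 ||w||² + (1/N) Σ max(0, 1 - y_i (w·x_i + b)) by full-batch subgradient descent, with faces labeled +1 and non-faces -1. The function name, learning rate, and epoch count are hypothetical. A dedicated SVM solver would be preferable for the roughly 41700 high-dimensional examples used here.

```python
import numpy as np


def train_linear_svm(X, y, lam=1e-5, lr=0.1, epochs=500):
    """Minimize lam/2 * ||w||^2 + mean(max(0, 1 - y * (X @ w + b))) by
    full-batch subgradient descent. Labels must be +1 (face) or -1 (non-face)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1  # examples inside the margin contribute to the hinge subgradient
        grad_w = lam * w - (y[active] @ X[active]) / n
        grad_b = -np.sum(y[active]) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data standing in for flattened HOG features: two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, size=(200, 2)),
               rng.normal(-2.0, 1.0, size=(200, 2))])
y = np.concatenate([np.ones(200), -np.ones(200)])
w, b = train_linear_svm(X, y)
accuracy = np.mean(np.where(X @ w + b >= 0, 1.0, -1.0) == y)
print(f"training accuracy: {accuracy:.3f}")
```

A window is then labeled a face when its score `x @ w + b` is at least 0, which matches the rejection threshold used during detection.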

Face Detection. Faces were detected by using the trained classifier to score sliding window image patches at multiple scales. The HOG representation of each test image was generated with the same method used in training, and a window was slid over the HOG representation of each scaled image and classified. The image was rescaled by factors of 0.9^α for α = 0, 1, ..., 30. Windows with a classifier score less than 0 were rejected outright. Non-maximal suppression was then applied to the remaining windows to select the bounding box that best covered each face.
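The detection loop can be sketched as below. This is a simplified illustration with several assumptions. The caller supplies a hypothetical `hog_for_scale` function that returns the HOG grid of the test image resized by a given factor (resizing and HOG extraction are left out). The window steps one cell at a time. Non-maximal suppression is the common greedy intersection-over-union (IoU) variant with an assumed overlap threshold of 0.3. The scale schedule 0.9^α for α = 0, ..., 30 and the score threshold of 0 come from the text above.

```python
import numpy as np

# Scale schedule from the text: resize factors 0.9**alpha for alpha = 0..30.
SCALES = [0.9 ** alpha for alpha in range(31)]


def score_windows(hog, w, b, cells_per_window):
    """Slide a square window over a HOG grid of shape (rows, cols, dims),
    one cell at a time, and yield (row, col, score) for every position."""
    rows, cols, _ = hog.shape
    k = cells_per_window
    for r in range(rows - k + 1):
        for c in range(cols - k + 1):
            feature = hog[r:r + k, c:c + k, :].ravel()
            yield r, c, float(feature @ w + b)


def window_to_box(r, c, cells_per_window, cell_size, scale):
    """Map a window's top-left cell to (x_min, y_min, x_max, y_max) in
    original-image pixels by undoing the resize factor."""
    x_min, y_min = c * cell_size / scale, r * cell_size / scale
    side = cells_per_window * cell_size / scale
    return (x_min, y_min, x_min + side, y_min + side)


def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, overlap=0.3):
    """Greedy NMS: visit boxes from highest to lowest score and keep a box
    only if it overlaps every already-kept box by at most `overlap` IoU."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= overlap for j in keep):
            keep.append(i)
    return keep


def detect_multiscale(hog_for_scale, w, b, cells_per_window=6, cell_size=6,
                      threshold=0.0, scales=SCALES):
    """Score every window at every scale, drop scores below `threshold`,
    then apply NMS to the survivors across all scales."""
    boxes, scores = [], []
    for scale in scales:
        hog = hog_for_scale(scale)  # hypothetical: HOG grid of the resized image
        for r, c, score in score_windows(hog, w, b, cells_per_window):
            if score >= threshold:
                boxes.append(window_to_box(r, c, cells_per_window, cell_size, scale))
                scores.append(score)
    keep = non_max_suppression(boxes, scores)
    return [boxes[i] for i in keep], [scores[i] for i in keep]


# Toy check: a 2x2-cell template with one value per cell, and a HOG grid
# that matches the template at exactly one position (rows 1-2, cols 2-3).
w, b = np.ones(4), -3.0


def toy_hog(scale):
    grid = np.zeros((5, 5, 1))
    if scale == 1.0:
        grid[1:3, 2:4, 0] = 1.0
    return grid


boxes, scores = detect_multiscale(toy_hog, w, b, cells_per_window=2,
                                  cell_size=6, scales=[1.0, 0.9])
print(boxes, scores)  # [(12.0, 6.0, 24.0, 18.0)] [1.0]
```

In the toy run only the window whose cells match the template scores at least 0, so a single box survives. Partially overlapping windows score -1 or lower and are rejected before NMS.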

3 Histogram of Oriented Gradients

A HOG feature descriptor based on the work of Dalal and Triggs was implemented to satisfy the graduate requirements of the assignment. The algorithm is summarized as follows:

  1. The gradient magnitude and orientation are computed for each pixel in the image.
  2. The image is divided into square cells of a user-defined size.
  3. For each cell, a 9-bin histogram is created by binning the cell's gradients according to their orientation on [0°, 180°).
  4. The histogram is normalized in 4 different ways (L2, L2 saturated at 0.2, L1, and √L1).
  5. The four normalized histograms are concatenated into a 36-element feature vector for the cell.
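The five steps can be sketched in Python/NumPy as follows. Two details are assumptions not stated in the report. First, each pixel votes into a single orientation bin weighted by its gradient magnitude (Dalal and Triggs additionally interpolate between bins). Second, "L2 saturated at 0.2" is taken to mean Dalal and Triggs' L2-Hys, i.e. clipping at 0.2 followed by renormalization. The function name `hog_cells` is hypothetical.

```python
import numpy as np


def hog_cells(image, cell_size=4, n_bins=9, eps=1e-6):
    """Per-cell HOG: 9-bin unsigned orientation histograms, each normalized
    four ways and concatenated into 4 * n_bins values per cell."""
    img = np.asarray(image, dtype=np.float64)
    # Step 1: gradient magnitude and orientation per pixel (central differences).
    gy, gx = np.gradient(img)
    magnitude = np.hypot(gx, gy)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0  # unsigned angle
    bin_width = 180.0 / n_bins
    # Clamp guards against an angle that rounds up to exactly 180.
    bins = np.minimum((orientation // bin_width).astype(int), n_bins - 1)
    # Step 2: split into square cells (partial cells at the border are dropped).
    rows, cols = img.shape[0] // cell_size, img.shape[1] // cell_size
    features = np.zeros((rows, cols, 4 * n_bins))
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * cell_size, (r + 1) * cell_size),
                  slice(c * cell_size, (c + 1) * cell_size))
            # Step 3: magnitude-weighted orientation histogram for this cell.
            hist = np.bincount(bins[sl].ravel(), weights=magnitude[sl].ravel(),
                               minlength=n_bins)
            # Step 4: four normalizations of the same histogram.
            l2 = hist / np.sqrt(np.sum(hist ** 2) + eps ** 2)
            l2_hys = np.minimum(l2, 0.2)
            l2_hys = l2_hys / np.sqrt(np.sum(l2_hys ** 2) + eps ** 2)
            l1 = hist / (np.sum(hist) + eps)
            l1_sqrt = np.sqrt(l1)
            # Step 5: concatenate into one 4 * n_bins vector for the cell.
            features[r, c] = np.concatenate([l2, l2_hys, l1, l1_sqrt])
    return features


ramp = np.tile(np.arange(8.0), (8, 1))  # intensity increases left to right
feats = hog_cells(ramp, cell_size=4)
print(feats.shape)                      # (2, 2, 36)
print(feats[0, 0, [0, 9, 18, 27]])      # bin 0 of each normalization, all close to 1
```

On the horizontal ramp every gradient points along 0°, so each cell's mass lands in bin 0 of all four normalized histograms and the remaining 32 entries are zero.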

The implemented HOG was used, with a cell size of 4, to train the face detector shown in Figure 1; it can be compared to the detector trained on vl_HOG features with the same cell size, shown in Figure 2. The implemented HOG face detector was viable, but it used the Dalal-Triggs variant, whereas the vl_HOG face detector used the UoCTTI variant, whose 31-dimensional cell features also encode contrast-sensitive orientations and gradient energy. As expected, the implemented HOG face detector performed worse.



Figure 1: Implemented HOG face detector




Figure 2: vl_feat HOG face detector


4 Results

The results of sliding window face detection using the HOG face detector are summarized in Table 1. Performing face detection at multiple scales yields an enormous improvement in average precision, from 34.7% to 88.5% at a cell size of 6. Decreasing the cell size of the HOG feature descriptor gives further, smaller gains: 92.5% at a cell size of 4 and 93.1% at a cell size of 3. The best average precision, 93.1% at a cell size of 3, is shown in Figure 3. Decreasing the cell size to 2 might improve results further, but memory constraints made this infeasible for the purposes of this project.
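The memory pressure at smaller cell sizes follows from the descriptor length. With a 36×36 template and 31 values per cell, a cell size c gives (36/c)² × 31 values per window: 1116 at c = 6, 2511 at c = 4, 4464 at c = 3, and 10044 at c = 2. The sketch below estimates only the size of the training matrix, assuming the roughly 6700 + 35000 examples from Section 2 are stored as 4-byte single-precision floats. That gives about 186 MB at c = 6 versus about 1675 MB at c = 2. The report does not say which stage of the pipeline exceeded the available memory; test-time HOG grids also grow roughly as 1/c². Both function names are hypothetical.

```python
def template_dims(template_size=36, cell_size=6, dims_per_cell=31):
    """Length of a flattened HOG template: (cells per side)^2 * dims per cell."""
    cells_per_side = template_size // cell_size
    return cells_per_side * cells_per_side * dims_per_cell


def training_matrix_mb(num_examples, cell_size, bytes_per_value=4):
    """Size in megabytes (10^6 bytes) of a num_examples x template_dims matrix,
    assuming bytes_per_value bytes per entry (4 = single precision)."""
    return num_examples * template_dims(cell_size=cell_size) * bytes_per_value / 1e6


num_examples = 6700 + 35000  # approximate positive + negative counts from Section 2
for cell_size in (6, 4, 3, 2):
    print(f"cell size {cell_size}: {template_dims(cell_size=cell_size)} dims, "
          f"~{training_matrix_mb(num_examples, cell_size):.0f} MB")
```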


Method                            Average Precision
------------------------------    -----------------
Single Scale (Cell Size = 6)      34.7%
Multiple Scale (Cell Size = 6)    88.5%
Multiple Scale (Cell Size = 4)    92.5%
Multiple Scale (Cell Size = 3)    93.1%

Table 1: Summary of Results



Figure 3: Average Precision of Sliding Window Face Detection
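For reference, average precision can be computed from ranked detections in the style of the PASCAL VOC devkit: sort detections by score, accumulate precision and recall, make precision monotonically non-increasing, and sum precision over the recall steps. The report does not say how its AP values were computed, so this Python/NumPy sketch is an assumption. It also assumes each detection has already been marked as a true or false positive (for example, by an IoU ≥ 0.5 match to an unclaimed ground-truth face), and `average_precision` is a hypothetical name.

```python
import numpy as np


def average_precision(scores, is_true_positive, num_ground_truth):
    """VOC-style AP: area under the precision-recall curve after making
    precision monotonically non-increasing in recall."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_ground_truth
    precision = cum_tp / (cum_tp + cum_fp)
    # Pad the curve, then take the running maximum of precision from the right.
    mrec = np.concatenate([[0.0], recall, [1.0]])
    mpre = np.concatenate([[0.0], precision, [0.0]])
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    # Sum precision * (recall step) wherever recall changes.
    steps = np.where(mrec[1:] != mrec[:-1])[0] + 1
    return float(np.sum((mrec[steps] - mrec[steps - 1]) * mpre[steps]))


ap = average_precision([0.9, 0.8, 0.7], [True, True, False], num_ground_truth=2)
print(ap)  # 1.0
```

Here both faces are found before the single false positive, so precision stays at 1 up to full recall and the AP is 1.0. Ranking the false positive first would lower it.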


Face detection results on the Caltech dataset can be seen in the following figures. Although there are a significant number of false positives, the vast majority of faces in the scenes are detected by the sliding window approach.


[8 figures: example sliding window face detections]


5 Conclusion

The sliding window approach to face detection using HOG feature descriptors was successful, achieving an average precision of 93.1%. While most faces were detected, the large number of false positives was the biggest limitation of this approach. Future work will explore probabilistically rejecting false positives.