Project 5: Face Detection with a Sliding Window

Introduction

The goal of this project was, given an input image, to output a set of bounding boxes overlaid on the faces in the scene. I implemented the HoG-template detection method of Dalal and Triggs. The project has four major components: 1) feature extraction from training images; 2) classification of face versus non-face image patches with a linear SVM; 3) sliding-window detection at multiple scales; and 4) evaluation of the bounding boxes produced. I implemented parts 1, 2, and 3 from the skeleton code provided. Below I describe the steps involved in each part, the different approaches and parameter manipulations I tried, and how they affected the accuracy and performance metrics computed in part 4.

Component 1: HoG Feature Representation of Positive and Negative Training Data

Face detection is a binary task: in this image patch, is there or is there not a face? The critical part of an object classification algorithm is a data representation that allows a supervised learning method to distinguish face from non-face data. Since we are replicating the Dalal and Triggs pipeline, we use Histogram of Oriented Gradients (HoG) features. The first step is to compute HoG features for positive instances (faces) and for negative instances (non-faces), which serve as training data for the classifier. The positive instances are given as 36x36 image patches. My default configuration uses HoG cells of size 6, so each face patch yields a 6x6x31 feature, which linearizes to a 1116-dimensional vector. For negative training data, I collected 10000 random 36x36 crops from the non-face scenes provided.
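This stage can be sketched as follows. This is a minimal numpy-only illustration, not the actual pipeline: the report uses 31-dimensional UoCTTI HoG cells, while here a simplified per-cell gradient-orientation histogram stands in for the descriptor, and the `images` input is a hypothetical list of grayscale non-face scenes.

```python
import numpy as np

def hog_like_descriptor(patch, cell_size=6, n_bins=9):
    """Simplified HoG stand-in: per-cell histograms of gradient
    orientation, weighted by gradient magnitude (no block
    normalization, unlike real Dalal-Triggs HoG)."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    n = patch.shape[0] // cell_size
    feat = np.zeros((n, n, n_bins))
    for cy in range(n):
        for cx in range(n):
            sl = (slice(cy * cell_size, (cy + 1) * cell_size),
                  slice(cx * cell_size, (cx + 1) * cell_size))
            np.add.at(feat[cy, cx], bins[sl].ravel(), mag[sl].ravel())
    return feat.ravel()

def sample_negative_features(images, num_samples=10000, dim=36, rng=None):
    """Describe random 36x36 crops drawn from non-face scenes."""
    rng = np.random.default_rng(rng)
    feats = []
    for _ in range(num_samples):
        img = images[rng.integers(len(images))]
        y = rng.integers(img.shape[0] - dim + 1)
        x = rng.integers(img.shape[1] - dim + 1)
        feats.append(hog_like_descriptor(img[y:y + dim, x:x + dim]))
    return np.asarray(feats)
```

With a cell size of 6 this toy descriptor is 6x6x9 = 324-dimensional per patch, versus the 6x6x31 = 1116 dimensions of the real feature.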

Component 2: SVM Linear Separator for Face Detection

Once I had gathered the training data, I labeled all positive instances (~6000 faces) +1 and all negative instances (initially 10000) -1. I then trained a linear SVM with a regularization parameter of 0.0001, achieving 99% training accuracy, which indicates that the face/non-face distinction is close to linearly separable in HoG space. Visualizing the learned weight vector W for a cell size of 6 shows the semblance of a face (especially eyes, nose, and mouth). Much as with Gaussian filters, finer resolutions reveal more detail: at cell sizes of 4 and 3, the facial features in W become sharper (see below). Also below is a plot of the effect of lambda on average precision, for a cell size of 6, 10 detection scales, and 10000 negative training examples. Based on this analysis, I settled on an optimal lambda of 1e-8.
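The training step can be sketched with scikit-learn's linear SVM. This is an illustration under assumptions, not the project's actual code: sklearn parameterizes regularization as C (roughly 1/(lambda * n)) rather than lambda directly, so the mapping below is approximate.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_face_classifier(pos_feats, neg_feats, lam=1e-4):
    """Fit a linear SVM on HoG features, labeling faces +1 and
    non-faces -1. `lam` is mapped onto sklearn's C approximately."""
    X = np.vstack([pos_feats, neg_feats])
    y = np.concatenate([np.ones(len(pos_feats)), -np.ones(len(neg_feats))])
    clf = LinearSVC(C=1.0 / (lam * len(y)), loss="hinge", max_iter=10000)
    clf.fit(X, y)
    return clf
```

The learned weights `clf.coef_` can then be reshaped back into the HoG template grid to produce the face-like visualization of W described above.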

HoG Cell 6

HoG Cell 4

HoG Cell 3

Average Precision vs. Lambda

Component 3: Sliding Window Detection

The last major component of the pipeline I implemented was the sliding-window detector. Initially, I tried a discrete, block-based window detector, which had terrible average precision (0.02). Not only was I capturing partial faces in windows whose confidence was too low to trigger a detection, but my window was also stepping over the HoG features rather than the original image, so with a HoG cell size of 6 I was jumping 36 pixels at a time. With a true sliding window, but no image scaling and a single 36x36 template, I obtained an average precision of 0.36. A single 36x36 window (6x6 in HoG feature space), however, is not adequate to detect faces when the input images come at many different scales and viewpoints. So I ran the sliding window at 10 scales for each image, and used non-maximum suppression to remove overlapping windows of different sizes. Results improved significantly with more scales. With the additional enhancements of 30000 negative training examples and a HoG cell size of 4, average precision improved to 0.847.
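The multi-scale detection loop can be sketched as below. This is a simplified, dependency-free illustration: `score_fn` stands in for scoring a window's HoG features with the trained SVM, nearest-neighbor indexing stands in for proper image resizing, and the greedy NMS is one common variant, not necessarily the one in the provided skeleton.

```python
import numpy as np

def multiscale_detect(image, score_fn, template=36, step=6,
                      scales=None, threshold=0.5):
    """Slide a fixed 36x36 template over progressively downscaled
    copies of the image, so each window covers a larger region of the
    original. Boxes are reported in original-image coordinates."""
    if scales is None:
        scales = [0.9 ** i for i in range(10)]
    boxes = []
    for s in scales:
        h, w = int(image.shape[0] * s), int(image.shape[1] * s)
        if min(h, w) < template:
            break
        # Nearest-neighbor "resize" keeps the sketch dependency-free.
        ys = (np.arange(h) / s).astype(int)
        xs = (np.arange(w) / s).astype(int)
        small = image[np.ix_(ys, xs)]
        for y in range(0, h - template + 1, step):
            for x in range(0, w - template + 1, step):
                score = score_fn(small[y:y + template, x:x + template])
                if score > threshold:
                    boxes.append((x / s, y / s, (x + template) / s,
                                  (y + template) / s, score))
    return non_max_suppression(boxes)

def non_max_suppression(boxes, iou_thresh=0.3):
    """Greedy NMS: keep the highest-scoring box, drop overlaps."""
    kept = []
    for b in sorted(boxes, key=lambda b: -b[4]):
        if all(iou(b, k) < iou_thresh for k in kept):
            kept.append(b)
    return kept

def iou(a, b):
    """Intersection-over-union of (x1, y1, x2, y2, ...) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)
```

Note that each scale shrinks the image rather than growing the template, which is why a single 36x36 template can still cover large faces.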

I made further parameter enhancements: doubling the number of positive examples by mirroring each one, combined with 16 scales at a cell size of 6, improved average precision to 0.865; with a cell size of 4, it improved to 0.912. Finally, with a cell size of 3, 30 scales, and the same amount of training data as described above, the final average precision is 0.922. Figures and examples below profile the effect of the different parameters on accuracy.
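The mirroring augmentation amounts to one line of numpy. A small sketch, assuming the positives are stacked as an (N, 36, 36) array of grayscale crops:

```python
import numpy as np

def augment_with_mirrors(patches):
    """Double the positive set by horizontally flipping each face
    crop. Faces are roughly bilaterally symmetric, so the mirrored
    copies are valid positives and cost nothing to label."""
    flipped = patches[:, :, ::-1]
    return np.concatenate([patches, flipped], axis=0)
```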

Best Viola-Jones Plot of False Positives

Best Average Precision Recall Curve

Below are example images from the standard face detector that runs out of the box when you unzip my files, with no parameters changed. The detector works well, reaching a high average precision of 0.867, but it still detects many false positives. For example, the class of 57 is detected entirely correctly, but many extra red boxes surround the faces. Also below are the precision-recall curve and the figure meant to replicate Figure 6 of the Viola-Jones paper. The latter is essentially an inverse view of the precision-recall curve, since the number of false positives grows as recall approaches 1.

Out-of-Box Plot of False Positives

Out-of-Box Precision Recall Curve

Group of People - A lot of people detected, but also many false positives

Pixel Faces work As Well

Sketches

A very negative image

Another Group of People

Component 4: Conclusion

Conclusion: By representing image patches with HoG features, training a linear SVM to separate faces from non-faces, and running the learned template as a multi-scale sliding-window detector with non-maximum suppression, this project demonstrates an implementation of face detection that reaches a final average precision of 0.922.