Project 5: Face detection with a sliding window


Detection Results on Class Photo

The objective of this project is to implement a simplified sliding-window classifier for face detection. The classifier uses multi-scale HoG features for recognition. The highest average precision achieved by the classifier is 91.8%.


Methodology


1. Training an SVM classifier from facial and non-facial images

A number of facial and non-facial images are provided as the training set. Windows of a fixed size are extracted from the training set and used to train a linear SVM classifier. With about 6000 positive windows and 10000 negative ones, the SVM classifier trains in under 30 seconds. The figure below shows the SVM training result for different cell sizes. The learned HoG templates look plausible, as they resemble the outline of a human face.
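As a rough illustration of the training step, the sketch below trains a linear SVM by sub-gradient descent on the hinge loss. It is a minimal stand-in and not the solver used in this project. The function name `train_linear_svm`, its hyperparameters, and the two-dimensional toy features are all hypothetical; the real inputs would be HoG descriptors of the positive and negative windows.

```python
def train_linear_svm(features, labels, epochs=100, learning_rate=0.1,
                     regularization=0.01):
    """Train a linear SVM with per-sample sub-gradient descent.

    features: list of equal-length feature vectors (lists of floats).
    labels:   list of +1 (face) or -1 (non-face).
    Returns (weights, bias).
    """
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            for j in range(dim):
                # The regularization term always shrinks the weights.
                grad = regularization * weights[j]
                # The hinge loss contributes only when the margin is violated.
                if margin < 1:
                    grad -= y * x[j]
                weights[j] -= learning_rate * grad
            if margin < 1:
                bias += learning_rate * y
    return weights, bias


def classify(weights, bias, x):
    """Return the signed confidence of feature vector x."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


# Hypothetical toy data standing in for HoG descriptors.
toy_features = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]]
toy_labels = [1, 1, -1, -1]
svm_weights, svm_bias = train_linear_svm(toy_features, toy_labels)
```

The sign of `classify(svm_weights, svm_bias, x)` gives the predicted class, and its magnitude plays the role of the confidence that is thresholded during detection.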

Figure: learned HoG templates for cellsize = 6, cellsize = 4, and cellsize = 3.

2. Multi-scale Face detection with a Sliding Window

The image is scaled down by a factor of 0.9 in each iteration, for a total of 20 iterations. The confidence threshold is set to 0.7 to eliminate negative results as well as positive results without enough confidence. In general, average precision increases as the cell size decreases.
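The scale schedule and threshold described above can be sketched as follows. This is only an outline under stated assumptions: the first iteration is assumed to run at full resolution, the window size passed to `window_to_original` is a hypothetical parameter, and all function names are made up for illustration.

```python
def pyramid_scales(factor=0.9, iterations=20):
    """Scale applied to the image at each iteration (assumes 1.0 first)."""
    return [factor ** i for i in range(iterations)]


def window_to_original(x, y, window_size, scale):
    """Map a window found at (x, y) in the scaled image back to a
    bounding box (x_min, y_min, x_max, y_max) in the original image."""
    x_min = int(round(x / scale))
    y_min = int(round(y / scale))
    side = int(round(window_size / scale))
    return (x_min, y_min, x_min + side, y_min + side)


def keep_confident(detections, threshold=0.7):
    """Keep (box, confidence) pairs whose confidence exceeds the threshold."""
    return [d for d in detections if d[1] > threshold]


scales = pyramid_scales()
# A hypothetical 36-pixel window found at (10, 20) in a half-size image.
box = window_to_original(10, 20, 36, 0.5)
```

Because each window keeps the same pixel size while the image shrinks, windows found at smaller scales correspond to larger faces in the original image.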

The table below shows the precision-recall curves for different setups and cell sizes. Note that with a low threshold the system can reach maximum recall, even if it returns mostly false positives. Because the results are ranked by confidence, the extra low-confidence detections are appended to the end of the ranking, so a low threshold can actually boost the average precision. This is not practical for actual detection, but it can be used as a trick to achieve the best possible average precision.
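A small numeric sketch shows why a lower threshold can raise average precision. It uses a plain, un-interpolated definition of average precision, which may differ from the evaluation code used for the reported numbers, and the detection list is invented purely for illustration.

```python
def average_precision(detections, num_ground_truth):
    """Un-interpolated average precision.

    detections: list of (confidence, is_true_positive) pairs.
    num_ground_truth: number of faces that should be found.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
            # Precision measured at the rank of each recovered face.
            precision_sum += true_positives / rank
    return precision_sum / num_ground_truth


def apply_threshold(detections, threshold):
    """Keep detections whose confidence exceeds the threshold."""
    return [d for d in detections if d[0] > threshold]


# Hypothetical detections: (confidence, matches a ground-truth face?).
all_detections = [(1.5, True), (1.2, True), (0.9, False), (0.4, True),
                  (0.1, False), (-0.3, True), (-0.8, False)]
ap_reasonable = average_precision(apply_threshold(all_detections, 0.7), 4)
ap_low = average_precision(apply_threshold(all_detections, -1.0), 4)
```

In this toy case the low threshold recovers two additional faces that the 0.7 threshold discards, so `ap_low` is larger than `ap_reasonable` even though the extra detections also include false positives.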

Table: precision-recall curves for cell sizes 6, 4, and 3. Columns: Cell Size, Precision-Recall Curve with reasonable threshold, Precision-Recall Curve with low threshold.

The following table shows the detection results obtained with different cell sizes.

Table: detection results for cell sizes 6, 4, and 3. Columns: Cell Size, Detection.

Extra Points Implementation


Hard Negative Mining

I implemented hard negative mining to improve the quality of the training set. Using the parameters obtained from the first run of SVM training, every window of each non-facial image is classified. A window is added to the training set as a negative example if the classifier reports it as positive, that is, if its score exceeds a threshold. This mining threshold is set to 1.0, and the threshold for the actual detection is kept at 0.6.
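The mining step can be sketched as below. This is a simplified outline rather than the project's code: the function names are hypothetical, and the feature vectors are short toy lists standing in for the HoG descriptors of windows taken from non-facial images.

```python
def score(weights, bias, feature):
    """Linear SVM confidence for a single feature vector."""
    return sum(w * x for w, x in zip(weights, feature)) + bias


def mine_hard_negatives(weights, bias, negative_features, threshold=1.0):
    """Return the non-face windows that the first-round classifier
    wrongly scores above the mining threshold."""
    return [f for f in negative_features
            if score(weights, bias, f) > threshold]


def augment_negatives(negatives, hard_negatives):
    """Add mined windows to the negative set before retraining."""
    return negatives + hard_negatives


# Hypothetical first-round parameters and non-face window features.
first_weights, first_bias = [1.0, -0.5], 0.0
non_face_windows = [[2.0, 0.0], [0.5, 0.5], [3.0, 2.0]]
hard = mine_hard_negatives(first_weights, first_bias, non_face_windows)
augmented = augment_negatives([[0.0, 1.0]], hard)
```

The classifier would then be retrained on the original positives together with the augmented negative set.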

As I still used a linear classifier for hard negative mining, the result may not be perfect. Hard negative mining assigns some of the features that lie in the positive cluster to the negative class, and a linear SVM cannot separate them properly. However, the following inferences can be made, and both are observed in the implementation.
1. Fewer results are returned if the threshold is kept at the same level. Including hard negative features makes the linear classification boundary stricter, which lowers the confidence of detections, so fewer of them pass the same threshold.
2. In the precision-recall curve, for the same level of recall, the precision should be at least slightly better than without hard negative mining. This is intuitive: as more hard negative features are included, the classifier should become better at rejecting negative windows.


The following tables show that the results match the inferences I made above.

Without Hard Negative Mining With Hard Negative Mining

The above table shows the detection results with and without hard negative mining. Hard negative mining eliminates several false positives and achieves better precision.

Without Hard Negative Mining With Hard Negative Mining

The above table shows the precision-recall curves with and without hard negative mining. At the same threshold, hard negative mining returns fewer detections. To make the two curves comparable, I decreased the threshold to 0.6 so that both reach almost the same final recall.
The two curves look nearly identical, but a careful comparison shows that the curve with hard negative mining is always slightly higher than the curve without it, which matches my earlier inference.


Detection Results


The results are decent for easier images with frontal faces, especially large faces. For hard images, the system can still recognize most faces that do not differ much from frontal faces.



Detection Results on Easy Class Photo


Detection Results on Hard Class Photo