This project is an example of the sliding window classification model in computer vision. This model is widely used and accurate for object detection, in particular face detection. The implementation is based on the 2005 paper by Dalal and Triggs, which uses histograms of oriented gradients (HoG) as the feature representation; here they are used to represent faces. Given a training set of known faces and non-faces, a HoG can be generated for each positive and negative datapoint and used to train a classifier. After the classifier is trained, the general idea is to slide a window across an unseen image or scene and let the classifier determine whether each window contains a face. If the classifier's confidence for a window is above a set threshold, the pipeline reports that window as a face.
The first 3 parts of the project involved generating positive and negative features for a classifier and then training it. The positive features were created from known ground-truth faces, while negative features were randomly sampled from images known not to contain faces. In this pipeline, the features were histograms of gradients. After the features were extracted, a support vector machine (a linear classifier) was trained on the positive and negative labeled data. As a sanity check, the SVM was tested on the training data and, as expected, performed well.
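To make the feature step concrete, here is a minimal sketch of a per-cell gradient histogram in plain Python. It is a simplified stand-in, not the project's actual extractor: the function name `hog_cells`, the 9 unsigned orientation bins, and the hard bin assignment are assumptions, and the block normalization and interpolation of the full Dalal and Triggs descriptor are left out.

```python
import math

def hog_cells(image, cell_size=6, n_bins=9):
    """Return a grid of unsigned-orientation histograms, one per cell.

    `image` is a list of rows of grayscale values. Simplified sketch:
    no block normalization and no interpolation between bins.
    """
    h, w = len(image), len(image[0])
    rows, cols = h // cell_size, w // cell_size
    grid = [[[0.0] * n_bins for _ in range(cols)] for _ in range(rows)]
    for y in range(rows * cell_size):
        for x in range(cols * cell_size):
            # Differences between neighbouring pixels, clamped at the border.
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            magnitude = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % math.pi  # unsigned, in [0, pi)
            b = min(int(angle / math.pi * n_bins), n_bins - 1)
            # Each pixel votes for one orientation bin, weighted by magnitude.
            grid[y // cell_size][x // cell_size][b] += magnitude
    return grid

# A horizontal intensity ramp: every gradient points along +x (bin 0).
ramp = [[x for x in range(12)] for _ in range(12)]
cells = hog_cells(ramp, cell_size=6)
```

Flattening the histograms of all cells inside one window gives the feature vector that is passed to the classifier.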
Initial classifier performance on train data:
accuracy: 0.999
true positive rate: 0.398
false positive rate: 0.000
true negative rate: 0.601
false negative rate: 0.000
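The reported true positive and true negative rates add up to roughly the accuracy (0.398 + 0.601 = 0.999), which suggests that each rate is a fraction of all training samples rather than of its own class. The sketch below computes the rates under that assumption; the function name and the decision rule (score above zero means face) are hypothetical.

```python
def confusion_fractions(scores, labels):
    """Confusion counts divided by the total number of samples.

    `scores` are classifier confidences, `labels` are +1 (face) or -1.
    A score above zero is treated as a positive prediction (assumption).
    """
    n = len(labels)
    tp = sum(1 for s, y in zip(scores, labels) if s > 0 and y > 0)
    fp = sum(1 for s, y in zip(scores, labels) if s > 0 and y < 0)
    tn = sum(1 for s, y in zip(scores, labels) if s <= 0 and y < 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= 0 and y > 0)
    return {
        "accuracy": (tp + tn) / n,
        "true positive rate": tp / n,
        "false positive rate": fp / n,
        "true negative rate": tn / n,
        "false negative rate": fn / n,
    }

rates = confusion_fractions([2.0, 1.0, -1.0, -3.0, 0.5], [1, 1, -1, -1, -1])
```

With this convention the four rates always sum to 1, and the true positive rate is bounded by the share of positive samples in the training set.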
The first face detector implemented was a single-scale detector, which did not resize the image in any way. It simply slid a window over each test image, generated a HoG for each window, and queried the SVM to decide whether that window contained a face.
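The single-scale step can be sketched as a loop over a grid of HoG cells. This is an illustration under assumptions, not the project's code: `sliding_window_detect`, `score_fn`, and `window_cells` are hypothetical names, the window advances one cell at a time, and `score_fn` stands in for the trained SVM.

```python
def sliding_window_detect(cell_grid, score_fn, window_cells, cell_size, threshold):
    """Score every window_cells x window_cells window of a HoG cell grid.

    Returns (x0, y0, x1, y1, score) in pixel coordinates for each window
    whose score exceeds `threshold`.
    """
    rows, cols = len(cell_grid), len(cell_grid[0])
    detections = []
    for r in range(rows - window_cells + 1):
        for c in range(cols - window_cells + 1):
            # Flatten the histograms of all cells covered by this window.
            feature = [
                value
                for rr in range(r, r + window_cells)
                for cc in range(c, c + window_cells)
                for value in cell_grid[rr][cc]
            ]
            score = score_fn(feature)
            if score > threshold:
                x0, y0 = c * cell_size, r * cell_size
                side = window_cells * cell_size
                detections.append((x0, y0, x0 + side, y0 + side, score))
    return detections

# Toy grid with one-bin histograms; only one 2x2 block is "face-like".
grid = [[[0.0] for _ in range(4)] for _ in range(4)]
for rr in (1, 2):
    for cc in (2, 3):
        grid[rr][cc] = [1.0]
found = sliding_window_detect(grid, sum, window_cells=2, cell_size=6, threshold=3.5)
```

Because the window has a fixed pixel size, a detector like this can only match faces that are about as large as the training template.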
Average precision: 0.314
True positives: 175
False positives: 392.33
False negatives: 336
Total time: 108.848 s
These results were used as a baseline for the future changes and tweaks of the detection pipeline. As a note, the cell size was 6, the lambda value was 0.0001, and the confidence threshold was 0.75. These values may become relevant later.
Multi-scale detection fixes one of the biggest problems in single-scale detection, which is that faces can appear at different scales in an image. Multi-scale detection runs the same sliding window detector at different scales of the image so that faces of different sizes can still be found.
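One way to sketch the multi-scale idea is an image pyramid: shrink the image repeatedly, run the same detector at each scale, and map the boxes back to the original coordinates. The scale step of 0.9, the nearest-neighbour resize, and all function names here are assumptions made for illustration.

```python
def pyramid_scales(min_side, window_side, step=0.9):
    """Scales 1, step, step^2, ... while the image still fits one window."""
    scales, s = [], 1.0
    while min_side * s >= window_side:
        scales.append(s)
        s *= step
    return scales

def resize_nearest(image, scale):
    """Nearest-neighbour resize of a list-of-rows grayscale image."""
    h, w = len(image), len(image[0])
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    return [
        [image[min(h - 1, int(y / scale))][min(w - 1, int(x / scale))]
         for x in range(new_w)]
        for y in range(new_h)
    ]

def detect_multiscale(image, detect_fn, window_side, step=0.9):
    """Run detect_fn at every pyramid level and rescale its boxes."""
    h, w = len(image), len(image[0])
    boxes = []
    for s in pyramid_scales(min(h, w), window_side, step):
        scaled = resize_nearest(image, s)
        for (x0, y0, x1, y1, score) in detect_fn(scaled):
            # Divide by the scale to return to original pixel coordinates.
            boxes.append((x0 / s, y0 / s, x1 / s, y1 / s, score))
    return boxes

# Stub detector that reports the whole (scaled) image as one detection.
def whole_image(img):
    return [(0, 0, len(img[0]), len(img), 1.0)]

blank = [[0] * 72 for _ in range(72)]
boxes = detect_multiscale(blank, whole_image, window_side=36, step=0.5)
```

A smaller step visits more scales, which costs more time but leaves fewer gaps between the face sizes the detector can match.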
Average of 3 runs
Average precision: 0.858
True positives: 460
False positives: 438.33
False negatives: 51
Total time: 195.121 s
Run 1
Average precision: 0.873
True positives: 467
False positives: 431
False negatives: 44
Total time: 196.815 s
Run 2
Average precision: 0.853
True positives: 459
False positives: 363
False negatives: 52
Total time: 195.234 s
Run 3
Average precision: 0.847
True positives: 454
False positives: 521
False negatives: 57
Total time: 193.313 s
Empirically, the most noticeable improvement was in the average precision, which rose from 0.314 for the single-scale detector to an average of 0.858. Additionally, the average number of false negatives fell to double digits as the algorithm was able to identify faces at different scales.
In an attempt to further improve the precision of the face detector, the cell size was decreased from 6 pixels to 4 pixels and then to 3 pixels. In theory, reducing the cell size should increase the average precision because the detector uses a more granular step across each image.
Cell size 4, average of 3 runs
Average precision: 0.881
True positives: 480.67
False positives: 306.67
False negatives: 30.33
Total time: 594.173 s
Run 1
Average precision: 0.865
True positives: 485
False positives: 223
False negatives: 26
Total time: 601.425 s
Run 2
Average precision: 0.887
True positives: 474
False positives: 444
False negatives: 37
Total time: 588.612 s
Run 3
Average precision: 0.891
True positives: 483
False positives: 253
False negatives: 28
Total time: 592.481 s
Cell size 3, average of 3 runs
Average precision: 0.923
True positives: 485
False positives: 261
False negatives: 26
Total time: 1620.273 s
Run 1
Average precision: 0.923
True positives: 485
False positives: 141
False negatives: 26
Total time: 1569.855 s
Run 2
Average precision: 0.934
True positives: 489
False positives: 386
False negatives: 22
Total time: 1602.842 s
Run 3
Average precision: 0.913
True positives: 481
False positives: 256
False negatives: 30
Total time: 1688.123 s
It is clear that lowering the cell size improves the precision of the pipeline. As expected, a cell size of 3 performed best, with an average precision of 0.923 and the lowest false positive and false negative counts. However, the finer cell size comes at a large cost in run time: the pipeline with a cell size of 3 takes nearly 30 minutes, while the pipeline with a cell size of 6 takes around 3 minutes. When time is not a constraint, the reduction in both false positives and false negatives can be worth that cost.
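The time cost can be checked with a little arithmetic on the run times reported above. The sketch below averages the three runs per cell size and compares that with how the number of cells grows; the 360 by 360 image size is a hypothetical value chosen only to make the cell counts concrete.

```python
from statistics import mean

# Total times in seconds for the three runs at each cell size, as reported.
run_times = {
    6: [196.815, 195.234, 193.313],
    4: [601.425, 588.612, 592.481],
    3: [1569.855, 1602.842, 1688.123],
}

def cells_per_image(width, height, cell_size):
    """Number of whole HoG cells that fit in an image of the given size."""
    return (width // cell_size) * (height // cell_size)

# Mean run time per cell size, rounded to milliseconds.
mean_times = {cell: round(mean(times), 3) for cell, times in run_times.items()}

# How much slower each setting is than the cell-size-6 baseline.
slowdown_vs_6 = {cell: mean_times[cell] / mean_times[6] for cell in mean_times}

# Halving the cell size quadruples the number of cells (hypothetical image).
cell_growth = cells_per_image(360, 360, 3) / cells_per_image(360, 360, 6)
```

The mean times come out at about 195 s, 594 s, and 1620 s, so a cell size of 3 is roughly 8 times slower than a cell size of 6. That is more than the fourfold growth in the number of cells alone would suggest.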
Multi-scale detection was a very significant improvement over single-scale detection, and varying the cell size tuned the results to be slightly more accurate. However, one glaring issue remained: the large number of false positives. To try to reduce them, the confidence threshold for a positive match was varied.
Threshold: 0.75 (baseline)
True positives: 467
False positives: 431
False negatives: 44
Threshold: 0.5
True positives: 478
False positives: 787
False negatives: 33
Threshold: 0.75
True positives: 467
False positives: 431
False negatives: 44
Threshold: 0.80
True positives: 452
False positives: 344
False negatives: 59
Threshold: 0.85
True positives: 469
False positives: 388
False negatives: 42
Threshold: 0.90
True positives: 470
False positives: 589
False negatives: 41
Varying the confidence threshold did not really reduce the false positives, but there was a sweet spot between true positives and false positives at a threshold of around 0.85. The lack of improvement was likely due to the classifier being very confident on the false positive matches, so raising the threshold did not filter them out.
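To illustrate how the threshold feeds into the true positive, false positive, and false negative counts, here is a hedged sketch of a common evaluation scheme. The write-up does not state how detections were matched to ground truth, so the intersection-over-union criterion of 0.5, the greedy matching order, and the function names are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def count_matches(detections, truths, threshold, min_iou=0.5):
    """Return (true positives, false positives, false negatives).

    Detections are (x0, y0, x1, y1, score). Those above `threshold` are
    matched greedily, highest score first, to unmatched ground-truth boxes.
    """
    kept = sorted((d for d in detections if d[4] > threshold),
                  key=lambda d: -d[4])
    matched = set()
    tp = fp = 0
    for d in kept:
        best, best_iou = None, min_iou
        for i, t in enumerate(truths):
            if i in matched:
                continue
            overlap = iou(d[:4], t)
            if overlap >= best_iou:
                best, best_iou = i, overlap
        if best is None:
            fp += 1  # no free ground-truth box overlaps enough
        else:
            matched.add(best)
            tp += 1
    return tp, fp, len(truths) - len(matched)

truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0, 0, 10, 10, 0.9), (1, 1, 11, 11, 0.8), (50, 50, 60, 60, 0.6)]
strict = count_matches(dets, truths, threshold=0.75)
loose = count_matches(dets, truths, threshold=0.5)
```

In this toy case, lowering the threshold only adds a false positive, and a confident wrong detection survives any threshold below its score. That matches the observation that highly confident false positives cannot be removed by thresholding alone.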
Threshold: 0.85
True positives: 452
False positives: 344
False negatives: 59
Note: The confidence threshold was tested with a cell size of 6 for practical time reasons, but the best result was then generated using a cell size of 3.
Because changing the confidence threshold did not really lower the number of false positives, the SVM lambda value (the regularization parameter) was tuned instead to see whether it gave better results. Perhaps the classifier was not being trained with the right parameters.
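To show where lambda enters the training, here is a minimal linear SVM trained by batch subgradient descent on the objective lambda/2 * ||w||^2 plus the mean hinge loss. The write-up does not say which solver or objective was used, so this formulation, the learning rate, the epoch count, and the function names are assumptions.

```python
def train_linear_svm(features, labels, lam=0.0001, epochs=200, lr=0.1):
    """Minimise lam/2 * ||w||^2 + mean hinge loss with subgradient descent.

    `features` is a list of equal-length vectors, `labels` are +1 or -1.
    """
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        # Gradient of the regularizer; larger lam shrinks w harder.
        grad_w = [lam * wi for wi in w]
        grad_b = 0.0
        for x, y in zip(features, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # sample violates the margin: hinge is active
                for i in range(dim):
                    grad_w[i] -= y * x[i] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def confidence(w, b, x):
    """Signed score of one feature vector; positive means 'face'."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Toy, linearly separable data standing in for HoG features.
X = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -1], [-1, -3]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y, lam=0.0001)
```

A small lambda lets the weights grow to fit the training data closely, while a large lambda keeps them small at the cost of more margin violations. The confidence returned here is the value that the detection threshold is compared against.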
Lambda: 0.0001 (baseline)
True positives: 469
False positives: 388
False negatives: 42
Lambda: 0.00001
True positives: 477
False positives: 924
False negatives: 34
Lambda: 0.000075
True positives: 467
False positives: 379
False negatives: 44
Lambda: 0.0001
True positives: 469
False positives: 388
False negatives: 42
Lambda: 0.001
True positives: 457
False positives: 417
False negatives: 54
Lambda: 0.01
True positives: 431
False positives: 612
False negatives: 80
Again, there was very little improvement over the baseline, but arguably the best-performing lambda value was the same as the best-performing value from the previous project (0.000075). This value was chosen as the default for the best results.
Lambda: 0.000075
True positives: 469
False positives: 187
False negatives: 42
Note: The lambda value was tested with a cell size of 6 for practical time reasons, but the best result was then generated using a cell size of 3.
These were the best results, using the best parameters found in each experiment. The parameters of note were a lambda value of 0.000075, a threshold of 0.85, and a cell size of 3 pixels.
Histogram of Gradients.
Precision graph.
Cards Perp: A strong showing from the algorithm!
Arsenal: One of the most difficult images in the testing set that performed relatively well, despite the many false positives.
Jackson: Arguably one of the most difficult images in the testing set, with tough lighting for the algorithm to deal with.
Well-performing results on top and troublesome results on bottom.