Positive training examples were taken from a database of 6,713 cropped 36x36 face images from the Caltech Web Faces project. Negative examples were 10,000 randomly cropped 36x36 patches taken from the face-free scene images of Wu et al. and the SUN scene database.
Histogram of Oriented Gradients (HoG) features are used to characterize faces as part of the detection pipeline. The figure below shows an example of a learned template for facial features.
```matlab
% HoG features: compute a HoG descriptor for each positive training image
D = (feature_params.template_size / feature_params.hog_cell_size)^2 * 31;
features_pos = zeros(num_images, D);
for i = 1:num_images
    f = fullfile(train_path_pos, image_files(i).name);
    I = single(imread(f)) / 255;                    % normalize to [0,1]
    hog = vl_hog(I, feature_params.hog_cell_size);
    features_pos(i,:) = reshape(hog, 1, D);         % flatten to a row vector
end
```
A linear SVM classifier is trained on the collected positive and negative training examples in HoG feature space. The trained model consists of a weight vector and a bias term. The graph below shows a visualization of the separation between positive and negative training examples.
```matlab
% SVM classifier: train a linear SVM on the HoG features
X = [features_pos' features_neg'];                                 % D x N, one column per example
Y = [ones(1,size(features_pos,1)) -ones(1,size(features_neg,1))];  % +1 face, -1 non-face
[w, b] = vl_svmtrain(X, Y, lambda);                                % w: weights, b: bias
```
The classifier is applied at test time within a multiscale sliding-window detector. HoG features are first computed for the test image, and every window of the feature map is scored against the SVM weights. The resulting score matrix indicates the confidence that a particular window contains a face. The score matrix is then thresholded at a pre-defined absolute confidence value, and detections too close to the image boundary are rejected. The surviving x and y coordinates form bounding boxes around the detected regions. This process is repeated at multiple scales to account for faces of different sizes. Finally, non-maximum suppression merges overlapping bounding boxes.
```matlab
% Multiscale sliding window detector
numCells = feature_params.template_size / feature_params.hog_cell_size;
l = 0;                                        % detection counter
for k = 1:feature_params.scales
    hog = vl_hog(img, feature_params.hog_cell_size);
    for x = 1:size(hog,1)-numCells+1          % rows of the HoG map
        for y = 1:size(hog,2)-numCells+1      % columns of the HoG map
            % score this window against the linear SVM
            score = reshape(hog(x:x+numCells-1, y:y+numCells-1, :), ...
                            1, numCells*numCells*31) * w + b;
            if score > feature_params.abs_threshold
                l = l + 1;
                x_min = (y-1) * feature_params.hog_cell_size + 1;  % cols -> image x
                y_min = (x-1) * feature_params.hog_cell_size + 1;  % rows -> image y
                % map the box back to the original image scale
                bb = [x_min, y_min, x_min+feature_params.template_size-1, ...
                      y_min+feature_params.template_size-1] / (scale_factor ^ (k-1));
                cur_bboxes(l,:) = bb;
                cur_confidences(l) = score;
            end
        end
    end
    img = imresize(img, scale_factor, 'bilinear');  % shrink for the next scale
end
```
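The non-maximum suppression step is not shown in the listing above. A greedy IoU-based NMS can be sketched as follows, in Python/NumPy for illustration (the project itself is in MATLAB, and the 0.3 overlap threshold here is an assumed value):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression.
    boxes: (N,4) array of [x_min, y_min, x_max, y_max]; scores: (N,) confidences."""
    order = np.argsort(scores)[::-1]          # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of box i with the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes that overlap too much
    return keep
```

Keeping the highest-scoring box and discarding its strong overlaps ensures each face is reported once, even when many nearby windows fire.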
The results below are obtained using a multiscale detector with a cell size of 3, a lambda of 0.0001, and an absolute threshold of -0.5. The graphs show average precision and recall versus false positives.
The results are also compared between single-scale, multi-scale, and detectors with different cell sizes.
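Average precision here is the area under the precision-recall curve over ranked detections. A minimal NumPy sketch of the metric (a simplified stand-in for the project's MATLAB evaluation code, ignoring ground-truth box matching):

```python
import numpy as np

def average_precision(scores, labels):
    """AP over ranked detections: scores are confidences,
    labels are 1 for a true face and 0 for a false positive."""
    order = np.argsort(np.asarray(scores))[::-1]   # rank by descending confidence
    labels = np.asarray(labels, dtype=float)[order]
    tp = np.cumsum(labels)                         # true positives so far at each rank
    precision = tp / np.arange(1, len(labels) + 1)
    npos = max(labels.sum(), 1)
    # each true positive contributes its precision, weighted by its 1/npos recall step
    return float(np.sum(precision * labels) / npos)
```

This makes the trade-off explicit: a lower detection threshold raises recall but admits false positives that drag precision, and hence AP, down.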
| Detector | Average precision (AP) |
|---|---|
| Single scale, cell size = 6 | 0.331 |
| Multiscale, cell size = 6 | 0.714 |
| Multiscale, cell size = 4 | 0.760 |
| Multiscale, cell size = 3 | 0.762 |
The technique of hard negative mining trains the classifier on challenging negative examples in order to improve its discriminative ability. A face detector is first trained on randomly sampled negatives and then run on the negative training images; samples that receive positive scores (hard negatives) are used to train a separate classifier. In the table below, the number of negative training samples is fixed at 100, 1000, and 10000, and the average precision is compared between random sampling and hard negative mining. Based on the results, hard negative mining does not have a positive effect on the average precision. This may be because the hard negative examples lower the overall confidence level, so fewer true positive detections are made.
| # -ve training samples | Average precision (random sampling) | Average precision (hard negative mining) |
|---|---|---|
| 100 | 0.594 | 0.577 |
| 1000 | 0.639 | 0.577 |
| 10000 | 0.665 | 0.537 |
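The mining step itself reduces to scoring candidate negative patches with the current model and keeping the false positives. A NumPy sketch of this step (a hypothetical helper; the report's pipeline is MATLAB):

```python
import numpy as np

def mine_hard_negatives(neg_features, w, b, threshold=0.0):
    """Return the negative examples the current SVM scores above threshold,
    i.e. the false positives ('hard negatives') to emphasize in retraining.
    neg_features: (N, D) HoG descriptors taken from face-free images."""
    scores = neg_features @ w + b        # linear SVM confidence for each patch
    return neg_features[scores > threshold]

# toy example with 1-D features, w = [1], b = 0:
# only patches the model (wrongly) scores positive are returned
feats = np.array([[-1.0], [0.5], [2.0]])
hard = mine_hard_negatives(feats, np.array([1.0]), 0.0)
```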
Data augmentation is used to increase the number of training examples when training data is limited. In this case, new training samples are artificially created by applying transformations such as horizontal flipping and brightness scaling to existing samples. The results show that data augmentation does not significantly increase the average precision. This suggests that the original dataset already contained enough variation that data augmentation does not create any new information.
| Detector | # +ve training samples | # -ve training samples | Average precision (AP) |
|---|---|---|---|
| Baseline | 6713 | 10000 | 0.665 |
| Horizontal Flip | 13426 | 20000 | 0.636 |
| Brightness Scaling | 13426 | 20000 | 0.663 |
| Horizontal Flip + Brightness Scaling | 20139 | 30000 | 0.668 |
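The two perturbations above can be sketched as follows, in Python/NumPy for illustration (the 1.2 brightness factor is an assumed value). Applying both at once triples the sample counts, matching the last row of the table:

```python
import numpy as np

def augment(images, brightness=1.2):
    """Augment a stack of training patches with a horizontally flipped copy
    and a brightness-scaled copy. images: (N, H, W) float array in [0, 1]."""
    flipped = images[:, :, ::-1]                         # mirror each patch left-right
    brightened = np.clip(images * brightness, 0.0, 1.0)  # scale intensities, keep [0,1]
    return np.concatenate([images, flipped, brightened], axis=0)
```

Horizontal flipping is label-preserving for faces, and mild brightness scaling simulates illumination changes without altering the HoG-relevant structure much, which is consistent with the flat AP numbers observed.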