The sliding window model independently classifies all image patches as being object or non-object. Here, we implement the sliding window detector of Dalal and Triggs 2005. Our train data set contains 6,713 cropped 36x36 from Caltech web Faces project. For negative images, we randomly sample 36x36 patches of scenes from Wu et al. and the SUN scene database. We use an improved version of HoG descriptors proposed by Dalal and Triggs 2005. Our test set is CMU+MIT test set which contains 130 images with 511 faces.
image_files = dir( fullfile( train_path_pos, '*.jpg') ); %Caltech Faces stored as .jpg
num_images = length(image_files);
D = (feature_params.template_size / feature_params.hog_cell_size)^2 * 31;
cell_size = feature_params.hog_cell_size;
features_pos = zeros(num_images, D);
mirror_features_pos = zeros(num_images, D);
parfor i=1:num_images
img = imread(fullfile(train_path_pos, image_files(i).name));
if size(img, 3) == 3
img = rgb2gray(img);
end
img = im2single(img);
hog = vl_hog(img, cell_size, 'NumOrientations',...
feature_params.num_orientations, 'Variant', feature_params.variant);
mirror_hog = vl_hog(flip(img, 2), feature_params.hog_cell_size,...
'NumOrientations', feature_params.num_orientations, 'Variant', feature_params.variant);
features_pos(i, :) = reshape(hog, 1, []);
mirror_features_pos(i, :) = reshape(mirror_hog, 1, []);
end
features_pos = [features_pos; mirror_features_pos];
Then, we sample random negative examples from scenes which contain no faces and convert them to HoG features. First, we randomly select an image. Then, we rescale it to a random scale since our template size is fixed. We select a random number of patches. For each patch, we select a random bounding box top-left corner to get a negative example. This process is repeated until we get the desired number of negative examples.
D = (feature_params.template_size / feature_params.hog_cell_size)^2 * 31;
features_neg = zeros(num_samples, D);
patches_per_image = num_samples / num_images;
t_size = feature_params.template_size;
cell_size = feature_params.hog_cell_size;
i = 1;
while i <= num_samples
% Random image
idx = ceil(rand(1) * num_images);
img = imread(fullfile(non_face_scn_path, image_files(idx).name));
if size(img, 3) == 3
img = rgb2gray(img);
end
img = im2single(img);
% Random number of patches
num_patches = round(rand(1) * 2 * patches_per_image);
if (i + num_patches - 1 > num_samples)
num_patches = num_samples - i + 1;
end
for j = 1 : num_patches
% Random scale for each patch
min_scale = t_size / min(size(img, 1), size(img, 2));
scale = min_scale + rand(1) * (1 - min_scale);
scaled_img = imresize(img, scale);
% Random crop of scaled image as patch
[h, w] = size(scaled_img);
x = ceil(rand(1) * (h - t_size));
y = ceil(rand(1) * (w - t_size));
hog = vl_hog(scaled_img(x: x + t_size - 1, y: y + t_size -1),...
cell_size, 'NumOrientations', feature_params.num_orientations,...
'Variant', feature_params.variant);
features_neg(i, :) = reshape(hog, 1, []);
i = i + 1;
end
end
Finally we train a linear SVM classifier on the features. Here, we selected lambda = 0.0025. Varying lambda did not have much effect on average precision with slight variations of +/- 2% for lambda values from 0.005 to 0.0001. We erred on the side of caution and stopped when the performance on the training set settled at around 95% accuracy to avoid overfitting.
features_pos = get_positive_features( train_path_pos, feature_params );
num_negative_examples = 20000;
features_neg = get_random_negative_features( non_face_scn_path, feature_params, num_negative_examples);
X = [features_pos; features_neg];
Y = [ones(size(features_pos, 1), 1); -1 * ones(size(features_neg, 1), 1)];
lambda = 0.0025;
[w, b] = vl_svmtrain(X', Y, lambda);
The learned HoG template is visualized below. For this visualization, we used a cell size of 3.
Finally, we run the classifier on the test set to evaluate performance. For each image, we run the classifier at multiple scales and then use non-maximum suppression to remove duplicate detections. We run a sliding window of (template_size / cell_size) over the HoG features computed for each image. Since our template size is fixed at 36x36, we resize the image to multiple scales and run the sliding window detector to find hits at each scale. This is a naive implementation which ensures that we evaluate all possible face candidates at the cost of runtime. Performance can be greatly improved by using the cascade architecture proposed by Viola-Jones 2001. For each image at each scale, we compute possible positive face candidates and their corresponding confidences using our trained SVM classifier. We keep candidate bounding boxes when the confidence is above a certain threshold. Finally, we use remove duplicate bounding boxes around face candidates for an image using non-maximal supression to extract a single candidate bounding box.
Here we use scales from 0.1, 0.2, ..., 1.0. We also use a step size or stride which is rescaled as the image itself is rescaled in order to do a coarser and faster detection.
t_size = feature_params.template_size;
cell_size = feature_params.hog_cell_size;
window_size = t_size / cell_size;
scales = 1:-0.1:0.1;
step = feature_params.step_size;
bboxes = zeros(0,4);
confidences = zeros(0,1);
image_ids = cell(0,1);
threshold = -1.5;
for i = 1:length(test_scenes)
% Read image
cur_confidences = [];
cur_bboxes = [];
cur_image_ids = [];
for scale = scales
scaled_img = imresize(img, scale);
[img_r, img_c] = size(scaled_img);
if img_r < t_size || img_c < t_size
% Exit if image is smaller than template size
break
end
hog = vl_hog(scaled_img, cell_size,...
'NumOrientations', feature_params.num_orientations, ...
'Variant', feature_params.variant);
[hog_r, hog_c, ~] = size(hog);
scaled_step = max(floor(step * scale), 1);
for r_start = 1:scaled_step:hog_r - window_size
for c_start = 1:scaled_step:hog_c - window_size
r_end = r_start + window_size - 1;
c_end = c_start + window_size - 1;
% Get patch features
hog_feat = reshape(hog(r_start:r_end, c_start:c_end, :), 1, []);
% Threshold confidence
confidence = hog_feat * w + b;
if confidence < threshold
continue
end
% Bounding box
act_row_start = 1 + (r_start - 1) * cell_size;
act_col_start = 1 + (c_start - 1) * cell_size;
act_row_end = act_row_start + t_size - 1;
act_col_end = act_col_start + t_size - 1;
bbox = [act_col_start act_row_start act_col_end act_row_end];
bbox = bbox / scale;
cur_bboxes = [cur_bboxes; bbox];
cur_confidences = [cur_confidences; confidence];
cur_image_ids = [cur_image_ids; {test_scenes(i).name}];
end
end
end
if size(cur_bboxes, 1) == 0
continue
end
% Non-maximum suppression to eliminate duplicates
[is_maximum] = non_max_supr_bbox(cur_bboxes, cur_confidences, size(img));
cur_confidences = cur_confidences(is_maximum,:);
cur_bboxes = cur_bboxes( is_maximum,:);
cur_image_ids = cur_image_ids( is_maximum,:);
bboxes = [bboxes; cur_bboxes];
confidences = [confidences; cur_confidences];
image_ids = [image_ids; cur_image_ids];
end
Here, we tried to tune mainly the cell size and step. We report average precision for various settings of these parameters. As discussed above, varying lambda for the SVM classifier did not have much of an effect. So, we have not reported those numbers. Also, we noticed a severe degradation in performance for step sizes 6 and above and have thus not reported average precision for that.
Cell Size | Step Size | Average Precision (%) |
---|---|---|
6 | 1 | 86.7% |
6 | 3 | 81.6% |
4 | 1 | 91.5% |
4 | 3 | 87.7% |
3 | 1 | 91.9% |
3 | 3 | 89.0% |
The precision-recall curve for cell size 3 and step size 1 is shown below.
Here are some sample detections on an image.
This is for cell size 6 and step size 3. Note that bounding boxes are not very accurate and one face is not detected.
This is for cell size 3 and step size 1. Note that bounding boxes are very accurate.
Finally, we note that reducing the cell size and the step size greatly increase run time and we need to consider the tradeoff between accuracy and run time if we were to deploy such a system.
We implemented hard negative mining, as discussed in Dalal and Triggs which should actually be called adversarial training. Here, we run our face detector on negative examples and use patches identified as faces with a relatively high confidence as new negative training examples.
t_size = feature_params.template_size;
cell_size = feature_params.hog_cell_size;
window_size = t_size / cell_size;
scales = 1:-0.1:0.1;
threshold = 0;
features_neg = [];
confidences = [];
for i = 1:length(test_scenes)
% Read image
cur_confidences = [];
cur_bboxes = [];
hog_feats = [];
for scale = scales
scaled_img = imresize(img, scale);
[img_r, img_c] = size(scaled_img);
if img_r < t_size || img_c < t_size
break
end
hog = vl_hog(scaled_img, cell_size,...
'NumOrientations', feature_params.num_orientations, ...
'Variant', feature_params.variant);
[hog_r, hog_c, ~] = size(hog);
for r_start = 1:hog_r - window_size
for c_start = 1:hog_c - window_size
r_end = r_start + window_size - 1;
c_end = c_start + window_size - 1;
% Get patch features
hog_feat = reshape(hog(r_start:r_end, c_start:c_end, :), 1, []);
% Threshold confidence
confidence = hog_feat * w + b;
if confidence < threshold
continue
end
% Bounding box
act_row_start = 1 + (r_start - 1) * cell_size;
act_col_start = 1 + (c_start - 1) * cell_size;
act_row_end = act_row_start + t_size - 1;
act_col_end = act_col_start + t_size - 1;
bbox = [act_col_start act_row_start act_col_end act_row_end];
bbox = bbox / scale;
cur_bboxes = [cur_bboxes; bbox];
cur_confidences = [cur_confidences; confidence];
hog_feats = [hog_feats; hog_feat];
end
end
end
if size(cur_bboxes, 1) == 0
continue
end
% Non-maximum suppressioin
[is_maximum] = non_max_supr_bbox(cur_bboxes, cur_confidences, size(img));
hog_feats = hog_feats(is_maximum, :);
cur_confidences = cur_confidences(is_maximum, :);
features_neg = [features_neg; hog_feats];
confidences = [confidences; cur_confidences];
end
if size(features_neg, 1) < num_hard_neg_examples
% Oversample to get required number
features_neg = datasample(features_neg, num_hard_neg_examples);
else
% Take hardest examples
[~, idx] = sort(confidences, 'descend');
features_neg = features_neg(idx, :);
features_neg = features_neg(1:num_hard_neg_examples, :);
end
To isolate the effect of hard negative mining, we keep the same number of negative training examples and replace half of the negative training examples with samples we get from hard negative mining.
if hard_neg_mining
num_hard_neg_examples = 10000;
features_hard_neg = hard_negative_mining(non_face_scn_path, w, b, feature_params, num_hard_neg_examples);
sampled_neg = datasample(features_neg, num_negative_examples - num_hard_neg_examples);
new_features_neg = [sampled_neg; features_hard_neg];
X = [features_pos; features_neg];
Y = [ones(size(features_pos, 1), 1); -1 * ones(size(features_neg, 1), 1)];
[w, b] = vl_svmtrain(X', Y, lambda);
end
Hard Negative Mining was not very effective on this dataset. It led to a modest improvement with higher cell sizes and step sizes with little improvement to slight degradation with lower cell sizes and step sizes.
Without hard negative mining | With hard negative mining | ||
---|---|---|---|
Cell Size | Step Size | Average Precision (%) | Average Precision (%) |
6 | 1 | 86.2% | 87.1% |
6 | 3 | 81.6% | 83.7% |
4 | 1 | 91.5% | 91.6% |
4 | 3 | 87.7% | 87.7% |
3 | 1 | 91.9% | 91.7% |
3 | 3 | 89.0% | 90.1% |
We use a different dataset, Labeled Faces in the Wild, a database of face photographs designed for studying the problem of unconstrained face recognition. The data set contains more than 13,000 images of faces collected from the web. Each face has been labeled with the name of the person pictured. 1680 of the people pictured have two or more distinct photos in the data set. We use a cropped version of the dataset which focusses only on faces. The dataset has also been converted to grayscale.
Using purely this dataset leads to an average precision of only 55.1%. This is likely due to the inefficiencies in cropping as most images in the database have been cropped such that the person's hair and neck area are also removed which is slightly different from our bounding boxes in the test set
Augmenting our dataset modestly with 2000 of these images leads to a slight improvement. For cell size 6 and step size 1, the average precision goes up from 86.2% to 87.6%. Using the full dataset to augment our original dataset leads to a degradation to 84.3%.
features_pos = get_positive_features( train_path_pos, feature_params );
% features_pos = [];
new_features_pos = datasample(get_positive_features( new_train_path_pos, feature_params ), 2000);
features_pos = [features_pos; new_features_pos];
num_negative_examples = 20000; %Higher will work strictly better, but you should start with 10000 for debugging
features_neg = get_random_negative_features( non_face_scn_path, feature_params, num_negative_examples);
features_pos = datasample(features_pos, size(features_neg, 1));
The precision-recall curve for cell size 6 and step size 1 with augmentation with 2000 images from the new dataset is shown below.
Without augmentation, for the same setup, we get the curve shown below.
We tested decision trees and random forest classifiers for this task but we found that they were extremely slow and led to too many false positives. The default decision tree in Matlab led to an average precision of 10.1%. So this avenue was not explored further even though we have the code for it.