Project 5 / Face Detection with a Sliding Window

The sliding window model independently classifies all image patches as object or non-object. Here, we implement the sliding window face detector of Dalal and Triggs 2005. Our training set contains 6,713 cropped 36x36 face images from the Caltech Web Faces project. For negative examples, we randomly sample 36x36 patches from face-free scenes from Wu et al. and the SUN scene database. We use HoG descriptors, originally proposed by Dalal and Triggs 2005, in the improved UoCTTI variant. Our test set is the CMU+MIT test set, which contains 130 images with 511 faces.

Feature Generation and Classifier Training

First, we load the cropped positive training examples (faces) and convert them to HoG features with a call to vl_hog from VLFeat. We also include mirrored versions of the images to augment our dataset. Unless specified otherwise, we use a template size of 36x36, 9 orientations, and the UoCTTI HoG variant. The cell size is varied in the experiments.


image_files = dir( fullfile( train_path_pos, '*.jpg') ); %Caltech Faces stored as .jpg
num_images = length(image_files);

D = (feature_params.template_size / feature_params.hog_cell_size)^2 * 31; % UoCTTI HoG has 31 dimensions per cell
cell_size = feature_params.hog_cell_size;

features_pos = zeros(num_images, D);
mirror_features_pos = zeros(num_images, D);

parfor i=1:num_images
    img = imread(fullfile(train_path_pos, image_files(i).name));
    if size(img, 3) == 3
        img = rgb2gray(img);
    end
    img = im2single(img);
    hog = vl_hog(img, cell_size, 'NumOrientations',...
        feature_params.num_orientations, 'Variant', feature_params.variant);
    % Horizontally mirrored copy to augment the positive set
    mirror_hog = vl_hog(flip(img, 2), cell_size,...
        'NumOrientations', feature_params.num_orientations, 'Variant', feature_params.variant);
    features_pos(i, :) =  reshape(hog, 1, []);
    mirror_features_pos(i, :) =  reshape(mirror_hog, 1, []);
end
features_pos = [features_pos; mirror_features_pos];

Then, we sample random negative examples from scenes that contain no faces and convert them to HoG features. First, we randomly select an image and a random number of patches to draw from it. For each patch, we rescale the image to a random scale (since our template size is fixed) and select a random top-left corner for the bounding box to crop a negative example. This process is repeated until we have the desired number of negative examples.


image_files = dir( fullfile( non_face_scn_path, '*.jpg') ); % face-free scene images
num_images = length(image_files);
D = (feature_params.template_size / feature_params.hog_cell_size)^2 * 31;
features_neg = zeros(num_samples, D);
patches_per_image = num_samples / num_images;
t_size = feature_params.template_size;
cell_size = feature_params.hog_cell_size;

i = 1;
while i <= num_samples
    % Random image
    idx = ceil(rand(1) * num_images);
    img = imread(fullfile(non_face_scn_path, image_files(idx).name));
    if size(img, 3) == 3
        img = rgb2gray(img);
    end
    img = im2single(img);
    
    % Random number of patches
    num_patches = round(rand(1) * 2 * patches_per_image);
    if (i + num_patches - 1 > num_samples)
        num_patches = num_samples - i + 1;
    end
    
    for j = 1 : num_patches
        % Random scale for each patch
        min_scale = t_size / min(size(img, 1), size(img, 2));
        scale = min_scale + rand(1) * (1 - min_scale);
        scaled_img = imresize(img, scale);
        % Random crop of the scaled image as the patch
        [h, w] = size(scaled_img);
        r = 1 + floor(rand(1) * (h - t_size + 1)); % avoids a zero index when h == t_size
        c = 1 + floor(rand(1) * (w - t_size + 1));
        hog = vl_hog(scaled_img(r: r + t_size - 1, c: c + t_size - 1),...
            cell_size, 'NumOrientations', feature_params.num_orientations,...
            'Variant', feature_params.variant);
        features_neg(i, :) = reshape(hog, 1, []);
        i = i + 1;
    end
end

Finally, we train a linear SVM classifier on the features. Here, we selected lambda = 0.0025. Varying lambda did not have much effect on average precision, with slight variations of +/- 2% for lambda values from 0.005 to 0.0001. We erred on the side of caution and stopped once accuracy on the training set settled at around 95%, to avoid overfitting.


features_pos = get_positive_features( train_path_pos, feature_params );
num_negative_examples = 20000;
features_neg = get_random_negative_features( non_face_scn_path, feature_params, num_negative_examples);
X = [features_pos; features_neg]; % one training example per row
Y = [ones(size(features_pos, 1), 1); -1 * ones(size(features_neg, 1), 1)]; % +1 face, -1 non-face
lambda = 0.0025;
[w, b] = vl_svmtrain(X', Y, lambda); % vl_svmtrain expects one example per column
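
A quick way to verify the roughly 95% training accuracy mentioned above is to score the training set directly; a minimal sketch, assuming the X, Y, w, and b from the block above:


scores = X * w + b;                       % signed distance from the separating hyperplane
train_accuracy = mean(sign(scores) == Y); % fraction of correctly classified examples
fprintf('Training accuracy: %.1f%%\n', 100 * train_accuracy);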

The learned HoG template is visualized below. For this visualization, we used a cell size of 3.
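
A minimal sketch of producing this visualization with vl_hog's built-in renderer, assuming the w and feature_params from the training code above (the reshape must match the layout the features were flattened from):


n_cells = feature_params.template_size / feature_params.hog_cell_size;
hog_template = single(reshape(w, [n_cells, n_cells, 31])); % 31 dims per cell, UoCTTI variant
imhog = vl_hog('render', hog_template, 'Variant', feature_params.variant);
imagesc(imhog); colormap gray; axis image;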

Multiscale Sliding Window Face Detector

Finally, we run the classifier on the test set to evaluate performance. We slide a window of (template_size / hog_cell_size) cells over the HoG features computed for each image. Since our template size is fixed at 36x36, we resize the image to multiple scales and run the sliding window detector to find hits at each scale. This is a naive implementation that evaluates all possible face candidates at the cost of runtime; performance could be greatly improved with the cascade architecture proposed by Viola-Jones 2001. For each image at each scale, we compute candidate face windows and their confidences using our trained SVM classifier, and we keep a candidate bounding box when its confidence is above a certain threshold. Finally, we remove duplicate bounding boxes around each face candidate using non-maximum suppression, extracting a single bounding box per detection.

Here we use scales 1.0, 0.9, ..., 0.1. We also use a step size, or stride, over the HoG grid that is rescaled along with the image; this keeps the effective stride roughly constant in the original image coordinates and gives a coarser, faster scan than visiting every cell.


t_size = feature_params.template_size;
cell_size = feature_params.hog_cell_size;
window_size = t_size / cell_size;
scales = 1:-0.1:0.1;
step = feature_params.step_size;

bboxes = zeros(0,4);
confidences = zeros(0,1);
image_ids = cell(0,1);

threshold = -1.5;

for i = 1:length(test_scenes)
    % Read the image and convert to grayscale single, as in training
    img = imread(fullfile(test_scn_path, test_scenes(i).name)); % test_scn_path: test scene directory (assumed)
    if size(img, 3) == 3
        img = rgb2gray(img);
    end
    img = im2single(img);

    cur_confidences = [];
    cur_bboxes = [];
    cur_image_ids = [];
    
    for scale = scales
        scaled_img = imresize(img, scale);
        [img_r, img_c] = size(scaled_img);
        if img_r < t_size || img_c < t_size
        	% Exit if image is smaller than template size
            break
        end

        hog = vl_hog(scaled_img, cell_size,...
            'NumOrientations', feature_params.num_orientations, ...
            'Variant', feature_params.variant);
        [hog_r, hog_c, ~] = size(hog);
        
        scaled_step = max(floor(step * scale), 1);
        
        for r_start = 1:scaled_step:hog_r - window_size + 1 % +1 keeps the last valid window
            for c_start = 1:scaled_step:hog_c - window_size + 1
                r_end = r_start + window_size - 1;
                c_end = c_start + window_size - 1;
                
                % Get patch features
                hog_feat = reshape(hog(r_start:r_end, c_start:c_end, :), 1, []);
        
                % Threshold confidence
                confidence = hog_feat * w + b;
                if confidence < threshold
                    continue
                end
                
                % Bounding box
                act_row_start = 1 + (r_start - 1) * cell_size;
                act_col_start = 1 + (c_start - 1) * cell_size;
                act_row_end = act_row_start + t_size - 1;
                act_col_end = act_col_start + t_size - 1;
                bbox = [act_col_start act_row_start act_col_end act_row_end];
                bbox = bbox / scale;
                
                cur_bboxes = [cur_bboxes; bbox];
                cur_confidences = [cur_confidences; confidence];
                cur_image_ids = [cur_image_ids; {test_scenes(i).name}];
            end
        end
    end
    
    if size(cur_bboxes, 1) == 0
        continue
    end
 	
    % Non-maximum suppression to eliminate duplicate detections
    [is_maximum] = non_max_supr_bbox(cur_bboxes, cur_confidences, size(img));

    cur_confidences = cur_confidences(is_maximum,:);
    cur_bboxes      = cur_bboxes(     is_maximum,:);
    cur_image_ids   = cur_image_ids(  is_maximum,:);
 
    bboxes      = [bboxes;      cur_bboxes];
    confidences = [confidences; cur_confidences];
    image_ids   = [image_ids;   cur_image_ids];
end

Performance

Here, we mainly tuned the cell size and the step size, and we report average precision for various settings of these parameters. As discussed above, varying lambda for the SVM classifier did not have much of an effect, so we have not reported those numbers. We also noticed a severe degradation in performance for step sizes of 6 and above, and thus do not report average precision for those settings.

Cell Size | Step Size | Average Precision (%)
6         | 1         | 86.7
6         | 3         | 81.6
4         | 1         | 91.5
4         | 3         | 87.7
3         | 1         | 91.9
3         | 3         | 89.0

The precision-recall curve for cell size 3 and step size 1 is shown below.

Here are some sample detections on an image.
The first is for cell size 6 and step size 3; note that the bounding boxes are not very accurate and one face is missed.
The second is for cell size 3 and step size 1; here the bounding boxes are very accurate.

Finally, we note that reducing the cell size and the step size greatly increases run time, so the tradeoff between accuracy and run time must be considered if such a system were to be deployed.

Graduate/Extra Credit

Hard negative mining

We implemented hard negative mining, as discussed in Dalal and Triggs 2005 (this retraining scheme is also known as bootstrapping). Here, we run our face detector on the negative scenes and use patches identified as faces with relatively high confidence as new negative training examples.


t_size = feature_params.template_size;
cell_size = feature_params.hog_cell_size;
window_size = t_size / cell_size;
scales = 1:-0.1:0.1;
threshold = 0;

features_neg = [];
confidences = [];

image_files = dir( fullfile( non_face_scn_path, '*.jpg') ); % mine hard negatives from the face-free scenes
for i = 1:length(image_files)
    % Read the image and convert to grayscale single, as in training
    img = imread(fullfile(non_face_scn_path, image_files(i).name));
    if size(img, 3) == 3
        img = rgb2gray(img);
    end
    img = im2single(img);
    cur_confidences = [];
    cur_bboxes = [];
    hog_feats = [];
    for scale = scales
        scaled_img = imresize(img, scale);
        [img_r, img_c] = size(scaled_img);
        
        if img_r < t_size || img_c < t_size
            break
        end
        
        hog = vl_hog(scaled_img, cell_size,...
            'NumOrientations', feature_params.num_orientations, ...
            'Variant', feature_params.variant);
        [hog_r, hog_c, ~] = size(hog);
        
        for r_start = 1:hog_r - window_size + 1 % +1 keeps the last valid window
            for c_start = 1:hog_c - window_size + 1
                r_end = r_start + window_size - 1;
                c_end = c_start + window_size - 1;
                
                % Get patch features
                hog_feat = reshape(hog(r_start:r_end, c_start:c_end, :), 1, []);
                
                % Threshold confidence
                confidence = hog_feat * w + b;
                if confidence < threshold
                    continue
                end
                
                % Bounding box
                act_row_start = 1 + (r_start - 1) * cell_size;
                act_col_start = 1 + (c_start - 1) * cell_size;
                act_row_end = act_row_start + t_size - 1;
                act_col_end = act_col_start + t_size - 1;
                bbox = [act_col_start act_row_start act_col_end act_row_end];
                bbox = bbox / scale;
                
                cur_bboxes = [cur_bboxes; bbox];
                cur_confidences = [cur_confidences; confidence];
                hog_feats = [hog_feats; hog_feat];
            end
        end
    end
    if size(cur_bboxes, 1) == 0
        continue
    end
    % Non-maximum suppression
    [is_maximum] = non_max_supr_bbox(cur_bboxes, cur_confidences, size(img));
    hog_feats = hog_feats(is_maximum, :);
    cur_confidences = cur_confidences(is_maximum, :);
    features_neg = [features_neg; hog_feats];
    confidences = [confidences; cur_confidences];
end

if size(features_neg, 1) < num_hard_neg_examples
    % Oversample rows (with replacement) to reach the required number
    features_neg = datasample(features_neg, num_hard_neg_examples);
else
    % Take hardest examples
    [~, idx] = sort(confidences, 'descend');
    features_neg = features_neg(idx, :);
    features_neg = features_neg(1:num_hard_neg_examples, :);
end

To isolate the effect of hard negative mining, we keep the same number of negative training examples and replace half of the negative training examples with samples we get from hard negative mining.


if hard_neg_mining
    num_hard_neg_examples = 10000;
    features_hard_neg = hard_negative_mining(non_face_scn_path, w, b, feature_params, num_hard_neg_examples);
    sampled_neg = datasample(features_neg, num_negative_examples - num_hard_neg_examples);
    new_features_neg = [sampled_neg; features_hard_neg];
    X = [features_pos; new_features_neg]; % use the mixed negative set built above
    Y = [ones(size(features_pos, 1), 1); -1 * ones(size(new_features_neg, 1), 1)];
    [w, b] = vl_svmtrain(X', Y, lambda);
end

Hard negative mining was not very effective on this dataset. It led to a modest improvement with larger cell and step sizes, and little improvement or a slight degradation with smaller cell and step sizes.

Cell Size | Step Size | AP without hard negative mining (%) | AP with hard negative mining (%)
6         | 1         | 86.2                                | 87.1
6         | 3         | 81.6                                | 83.7
4         | 1         | 91.5                                | 91.6
4         | 3         | 87.7                                | 87.7
3         | 1         | 91.9                                | 91.7
3         | 3         | 89.0                                | 90.1

Alternative positive examples

We use a different dataset, Labeled Faces in the Wild, a database of face photographs designed for studying unconstrained face recognition. The dataset contains more than 13,000 images of faces collected from the web, each labeled with the name of the person pictured; 1,680 of the people pictured have two or more distinct photos in the dataset. We use a cropped version of the dataset which focuses only on faces and has been converted to grayscale.

Using purely this dataset leads to an average precision of only 55.1%. This is likely due to differences in cropping: most images in this database are cropped so that the person's hair and neck area are removed, which differs slightly from the bounding boxes in our test set.

Augmenting our dataset modestly with 2000 of these images leads to a slight improvement. For cell size 6 and step size 1, the average precision goes up from 86.2% to 87.6%. Using the full dataset to augment our original dataset leads to a degradation to 84.3%.


features_pos = get_positive_features( train_path_pos, feature_params );
% features_pos = []; % uncomment to train on the new dataset alone
new_features_pos = datasample(get_positive_features( new_train_path_pos, feature_params ), 2000);
features_pos = [features_pos; new_features_pos];

num_negative_examples = 20000; %Higher will work strictly better, but you should start with 10000 for debugging
features_neg = get_random_negative_features( non_face_scn_path, feature_params, num_negative_examples);
features_pos = datasample(features_pos, size(features_neg, 1)); % resample positives to match the number of negatives

The precision-recall curve for cell size 6 and step size 1 with augmentation with 2000 images from the new dataset is shown below.

Without augmentation, for the same setup, we get the curve shown below.

Other classification schemes

We tested decision tree and random forest classifiers for this task, but found that they were extremely slow and led to too many false positives. The default decision tree in MATLAB yielded an average precision of only 10.1%, so this avenue was not explored further even though we have the code for it.
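
A minimal sketch of how such a classifier can be trained and scored with MATLAB's Statistics and Machine Learning Toolbox, assuming the X and Y from the SVM training code above (the settings shown are illustrative, not necessarily the exact ones we used):


% Decision tree / random forest in place of the linear SVM
tree = fitctree(X, Y);                                     % default CART decision tree
forest = TreeBagger(50, X, Y, 'Method', 'classification'); % bagged tree ensemble

% Scoring a candidate window: the class +1 posterior replaces w' * x + b
[~, tree_scores] = predict(tree, hog_feat);
confidence = tree_scores(:, 2); % columns follow tree.ClassNames, here [-1; 1]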