Project 5 / Face Detection with a Sliding Window

Face detection is one of the most successful computer vision applications. As an important preprocessing front end, it plays a significant role in face recognition systems. In this project, I implement a baseline sliding window face detection algorithm and complete an extra-credit item by augmenting the positive training data.

My design decisions are summarized as follows (a sketch of the overall pipeline is given after the list):

  1. I first use the default settings (single scale) to establish a face detection baseline, which works reasonably well with 0.42 average precision.
  2. Inspecting the baseline detections, I find that the detector cannot handle large faces, because a single scale is not powerful enough. I therefore implement a multi-scale version with a small amount of parameter tuning. The average precision becomes 0.607.
  3. After implementing the multi-scale detector, I try to improve the linear SVM itself. I complete an extra-credit item (augmenting the positive training samples by mirroring the faces) and increase the number of negative training samples to 60000. The average precision becomes 0.722.
  4. I further add a dynamic search step (different step sizes at different scales). The average precision becomes 0.727.
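
To make the steps above concrete, the following is a minimal sketch of the overall pipeline in proj5.m. It is not the full script; the path variables (train_path_pos, non_face_scn_path, test_scn_path) follow the starter code, and the functions are the ones listed later in this report.

% minimal outline of the training/detection pipeline (sketch, not the full proj5.m)
feature_params = struct('template_size', 36, 'hog_cell_size', 6);

% HoG features of face crops and of random non-face patches
% (10000 negatives for the baseline, increased to 60000 later)
features_pos = get_positive_features(train_path_pos, feature_params);
features_neg = get_random_negative_features(non_face_scn_path, feature_params, 10000);

% linear SVM: +1 for faces, -1 for non-faces, lambda = 0.0001
X = [features_pos', features_neg'];
Y = [ones(size(features_pos,1),1); -ones(size(features_neg,1),1)];
[w, b] = vl_svmtrain(X, Y, 0.0001);

% sliding window detection on the test scenes
[bboxes, confidences, image_ids] = run_detector(test_scn_path, w, b, feature_params);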

Baseline Sliding Window Face Detection

I set the number of negative training samples to 10000, the SVM regularization lambda to 0.0001, the HoG cell size to 6, and the template size to 36. The detector scans the entire HoG feature map with a step size of 1 HoG cell (namely 6 pixels, since the cell size is 6), and the confidence threshold for the linear SVM is -0.2. Only a single scale is used in the baseline, which achieves an average precision of 0.42.

The code of function get_positive_features.m is given as follows:


% extract HoG features from positive training samples (faces).
function features_pos = get_positive_features(train_path_pos, feature_params)
    image_files = dir( fullfile( train_path_pos, '*.jpg') ); %Caltech Faces stored as .jpg
    num_images = length(image_files);
    for i = 1:num_images
        crop_face = imread(fullfile(train_path_pos, image_files(i).name));
        crop_face_feat = vl_hog(single(crop_face)/255,feature_params.hog_cell_size);
        features_pos(i,:) = reshape(crop_face_feat,1,numel(crop_face_feat));
    end
end
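
For reference, with a 36 x 36 template and a HoG cell size of 6, vl_hog (VLFeat's default UoCTTI variant) returns a 6 x 6 x 31 array, so each face becomes an 1116-dimensional feature vector. A quick sanity check of this dimensionality (a sketch, assuming VLFeat is on the path):

% sanity check of the HoG feature dimensionality
hog = vl_hog(single(rand(36,36)), 6);  % one 36x36 patch, cell size 6
disp(size(hog));                       % expected: 6 6 31, i.e. 1116 values per face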

The code of function get_random_negative_features.m is given as follows:


% extract HoG features from random negative training samples (non-face scenes).
function features_neg = get_random_negative_features(non_face_scn_path, feature_params, num_samples)
    image_files = dir( fullfile( non_face_scn_path, '*.jpg' ));
    num_images = length(image_files);
    step_i = 20;
    step_j = 20;
    k = 1;
    for i = 1:num_images
        non_face = imread(fullfile(non_face_scn_path, image_files(i).name));
        if size(non_face,3) > 1
            non_face = rgb2gray(non_face);
        end
        for slide_i = 1:step_i:size(non_face,1)-feature_params.template_size+1
            for slide_j = 1:step_j:size(non_face,2)-feature_params.template_size+1
                non_face_patch{k} = non_face(slide_i:slide_i+feature_params.template_size-1,...
                    slide_j:slide_j+feature_params.template_size-1);
                k = k+1;
            end
        end
    end
    % randomly sample num_samples of the collected patches
    rand_perm = randperm(k-1);
    for i = 1:num_samples
        neg_feat = non_face_patch{rand_perm(i)};
        non_face_feat = vl_hog(single(neg_feat)/255,feature_params.hog_cell_size);
        features_neg(i,:) = reshape(non_face_feat,1,numel(non_face_feat));
    end
end

The linear SVM is trained with just a few lines in proj5.m:


X = [features_pos',features_neg'];
Y = [ones(size(features_pos,1),1);-ones(size(features_neg,1),1)];
[w b] = vl_svmtrain(X, Y, 0.0001);
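
A quick way to sanity-check the trained SVM is to measure its training accuracy and to render the learned weights as a HoG template with vl_hog's 'render' command. A minimal sketch (the reshape assumes the 6 x 6 x 31 feature layout noted above):

% training accuracy on the pooled positive/negative features
train_conf = X' * w + b;
fprintf('training accuracy: %.3f\n', mean(sign(train_conf) == Y));

% render the learned detector as a HoG template
template = single(reshape(w, [6, 6, 31]));
imagesc(vl_hog('render', template)); colormap gray; axis image;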

The code of function run_detector.m is given as follows:


% face detector using sliding window (single scale).
function [bboxes, confidences, image_ids] = ...
    run_detector(test_scn_path, w, b, feature_params)

test_scenes = dir( fullfile( test_scn_path, '*.jpg' ));

%initialize these as empty and incrementally expand them.
bboxes = zeros(0,4);
confidences = zeros(0,1);
image_ids = cell(0,1);

for i = 1:length(test_scenes)
    
    fprintf('Detecting faces in %s\n', test_scenes(i).name)
    img = imread( fullfile( test_scn_path, test_scenes(i).name ));
    img = single(img)/255;
    if(size(img,3) > 1)
        img = rgb2gray(img);
    end

    k = 1;
    cur_bboxes_candi = [];
    cur_x_min_candi = [];
    cur_y_min_candi = [];
    test_hog = [];
    scale_set = [];
    scale_preset = 1; % single scale for the baseline detector
    
    for scale_i = 1:size(scale_preset,2)
        scale = scale_preset(scale_i);
        resize_img = imresize(img,scale);
        img_hog = vl_hog(resize_img,feature_params.hog_cell_size);
        step_i = 1;
        step_j = 1;
        % the template spans template_size/hog_cell_size = 6 HoG cells per side
        cells_per_template = feature_params.template_size/feature_params.hog_cell_size;
        for slide_i = 1:step_i:size(img_hog,1)-cells_per_template+1
            for slide_j = 1:step_j:size(img_hog,2)-cells_per_template+1
                test_hog(k,:) = reshape(img_hog(slide_i:slide_i+cells_per_template-1,...
                    slide_j:slide_j+cells_per_template-1,:),1,cells_per_template^2*size(img_hog,3));
                scale_set(k,1) = scale;
                cur_x_min_candi(k,1) = fix(slide_j*feature_params.hog_cell_size/scale);
                cur_y_min_candi(k,1) = fix(slide_i*feature_params.hog_cell_size/scale);
                k = k+1;
            end
        end
    end

    threshold = -0.2;
    cur_confidences_candi = test_hog*w + b;
    idx = find(cur_confidences_candi>threshold);
    cur_confidences = cur_confidences_candi(idx,1);
    cur_x_min = cur_x_min_candi(idx,1);
    cur_y_min = cur_y_min_candi(idx,1);
    cur_bboxes = [[cur_x_min, cur_y_min,...
        fix(cur_x_min + feature_params.template_size./scale_set(idx)),...
        fix(cur_y_min + feature_params.template_size./scale_set(idx))]];
    cur_image_ids = cell(size(idx,1),1);
    cur_image_ids(1:size(idx,1),1) = {test_scenes(i).name};
    
    [is_maximum] = non_max_supr_bbox(cur_bboxes, cur_confidences, size(img));

    cur_confidences = cur_confidences(is_maximum,:);
    cur_bboxes      = cur_bboxes(     is_maximum,:);
    cur_image_ids   = cur_image_ids(  is_maximum,:);
 
    bboxes      = [bboxes;      cur_bboxes];
    confidences = [confidences; cur_confidences];
    image_ids   = [image_ids;   cur_image_ids];
end

I visualize the learned face detector (HoG template) and plot the precision-recall curve below (the average precision is 0.42):

To better show the detection results, we give a few detection examples:

Multi-Scale Face Detector

Unlike the baseline, the multi-scale detector evaluates the template over several resized versions of each test image, making it adaptive to faces of multiple sizes. The average precision is further increased to 0.607. The threshold is set to -0.5 and the other parameters remain the same as in the baseline. To speed up detection, the step size is increased from 1 HoG cell to 2 HoG cells.
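
Mapping detections back to the original image is what makes the extra scales useful: a window found at scale s in the resized image corresponds to a box of side template_size/s in the original image, with its top-left corner divided by s as well. For example, at scale 0.4 the 36-pixel template covers 36/0.4 = 90 pixels of the original image, which is how the detector now fires on large faces; these are exactly the /scale terms in the code below.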

The only difference is in the function run_detector.m, which is given as follows:


% face detector using sliding window (multiple scale).
function [bboxes, confidences, image_ids] = ...
    run_detector(test_scn_path, w, b, feature_params)
test_scenes = dir( fullfile( test_scn_path, '*.jpg' ));

bboxes = zeros(0,4);
confidences = zeros(0,1);
image_ids = cell(0,1);

for i = 1:length(test_scenes)
    
    fprintf('Detecting faces in %s\n', test_scenes(i).name)
    img = imread( fullfile( test_scn_path, test_scenes(i).name ));
    img = single(img)/255;
    if(size(img,3) > 1)
        img = rgb2gray(img);
    end

    k = 1;
    cur_bboxes_candi = [];
    cur_x_min_candi = [];
    cur_y_min_candi = [];
    test_hog = [];
    scale_set = [];
    scale_preset = [0.2,0.4,0.7,0.9];
    
    for scale_i = 1:size(scale_preset,2)
        scale = scale_preset(scale_i);
        resize_img = imresize(img,scale);
        img_hog = vl_hog(resize_img,feature_params.hog_cell_size);
        step_i = 2;
        step_j = 2;
        % the template spans template_size/hog_cell_size = 6 HoG cells per side
        cells_per_template = feature_params.template_size/feature_params.hog_cell_size;
        for slide_i = 1:step_i:size(img_hog,1)-cells_per_template+1
            for slide_j = 1:step_j:size(img_hog,2)-cells_per_template+1
                test_hog(k,:) = reshape(img_hog(slide_i:slide_i+cells_per_template-1,...
                    slide_j:slide_j+cells_per_template-1,:),1,cells_per_template^2*size(img_hog,3));
                scale_set(k,1) = scale;
                cur_x_min_candi(k,1) = fix(slide_j*feature_params.hog_cell_size/scale);
                cur_y_min_candi(k,1) = fix(slide_i*feature_params.hog_cell_size/scale);
                k = k+1;
            end
        end
    end

    threshold = -0.5;
    cur_confidences_candi = test_hog*w + b;
    idx = find(cur_confidences_candi>threshold);
    cur_confidences = cur_confidences_candi(idx,1);
    cur_x_min = cur_x_min_candi(idx,1);
    cur_y_min = cur_y_min_candi(idx,1);
    cur_bboxes = [[cur_x_min, cur_y_min,...
        fix(cur_x_min + feature_params.template_size./scale_set(idx)),...
        fix(cur_y_min + feature_params.template_size./scale_set(idx))]];
    cur_image_ids = cell(size(idx,1),1);
    cur_image_ids(1:size(idx,1),1) = {test_scenes(i).name};
    
    % non-maximum suppression over all candidate detections in this image
    [is_maximum] = non_max_supr_bbox(cur_bboxes, cur_confidences, size(img));

    cur_confidences = cur_confidences(is_maximum,:);
    cur_bboxes      = cur_bboxes(     is_maximum,:);
    cur_image_ids   = cur_image_ids(  is_maximum,:);
 
    bboxes      = [bboxes;      cur_bboxes];
    confidences = [confidences; cur_confidences];
    image_ids   = [image_ids;   cur_image_ids];
end

The precision-recall curve and some detection results are given as follows:

Multi-Scale Face Detector with Augmented Positive Training Samples

I further improve the linear SVM by augmenting the positive training samples and increasing the number of negative training samples. The average precision is boosted to 0.722. I visualize the face detectors of three SVM variants. From left to right: the SVM trained with the original positive samples and 10000 negative samples, the SVM trained with augmented positive samples (originals + mirrors) and 40000 negative samples, and the SVM trained with augmented positive samples and 60000 negative samples. The last one is used in this experiment. For the multi-scale search, I also increase the number of scales compared to the previous detector. The get_positive_features.m function is modified as follows:


% get augmented positive training samples
function features_pos = get_positive_features(train_path_pos, feature_params)

    image_files = dir( fullfile( train_path_pos, '*.jpg') ); %Caltech Faces stored as .jpg
    num_images = length(image_files);

    % augment positive training samples
    for i = 1:num_images
        ori_face{1,i} = imread(fullfile(train_path_pos, image_files(i).name));
        width = size(ori_face{1,i}, 2);
        % horizontal mirror: the affine map sends x to width - x
        tform = maketform('affine',[-1 0 0; 0 1 0; width 0 1]);
        aug_face{1,i} = imtransform(ori_face{1,i}, tform, 'nearest');
    end
    all_face = cat(2,ori_face,aug_face);
    for i = 1:size(all_face,2)
        crop_face = all_face{i};
        crop_face_feat = vl_hog(single(crop_face)/255,feature_params.hog_cell_size);
        features_pos(i,:) = reshape(crop_face_feat,1,numel(crop_face_feat));
    end
end
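
As a side note, the horizontal mirror could equally be produced with fliplr instead of an affine transform. A minimal sketch of that variant (assuming the Caltech crops are single-channel, as they are passed to vl_hog without rgb2gray; for color images flip(im, 2) would be used instead):

% equivalent mirroring of the positive crops with fliplr
for i = 1:num_images
    aug_face{1,i} = fliplr(ori_face{1,i});
end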

The code of run_detector.m is shown as follows:


% face detector using sliding window (multiple scale).
function [bboxes, confidences, image_ids] = ...
    run_detector(test_scn_path, w, b, feature_params)
test_scenes = dir( fullfile( test_scn_path, '*.jpg' ));

bboxes = zeros(0,4);
confidences = zeros(0,1);
image_ids = cell(0,1);

for i = 1:length(test_scenes)
    
    fprintf('Detecting faces in %s\n', test_scenes(i).name)
    img = imread( fullfile( test_scn_path, test_scenes(i).name ));
    img = single(img)/255;
    if(size(img,3) > 1)
        img = rgb2gray(img);
    end

    k = 1;
    cur_bboxes_candi = [];
    cur_x_min_candi = [];
    cur_y_min_candi = [];
    test_hog = [];
    scale_set = [];
    scale_preset = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9];
    
    for scale_i = 1:size(scale_preset,2)
        scale = scale_preset(scale_i);
        resize_img = imresize(img,scale);
        img_hog = vl_hog(resize_img,feature_params.hog_cell_size);
        step_i = 2;
        step_j = 2;
        % the template spans template_size/hog_cell_size = 6 HoG cells per side
        cells_per_template = feature_params.template_size/feature_params.hog_cell_size;
        for slide_i = 1:step_i:size(img_hog,1)-cells_per_template+1
            for slide_j = 1:step_j:size(img_hog,2)-cells_per_template+1
                test_hog(k,:) = reshape(img_hog(slide_i:slide_i+cells_per_template-1,...
                    slide_j:slide_j+cells_per_template-1,:),1,cells_per_template^2*size(img_hog,3));
                scale_set(k,1) = scale;
                cur_x_min_candi(k,1) = fix(slide_j*feature_params.hog_cell_size/scale);
                cur_y_min_candi(k,1) = fix(slide_i*feature_params.hog_cell_size/scale);
                k = k+1;
            end
        end
    end

    threshold = -0.5;
    cur_confidences_candi = test_hog*w + b;
    idx = find(cur_confidences_candi>threshold);
    cur_confidences = cur_confidences_candi(idx,1);
    cur_x_min = cur_x_min_candi(idx,1);
    cur_y_min = cur_y_min_candi(idx,1);
    cur_bboxes = [[cur_x_min, cur_y_min,...
        fix(cur_x_min + feature_params.template_size./scale_set(idx)),...
        fix(cur_y_min + feature_params.template_size./scale_set(idx))]];
    cur_image_ids = cell(size(idx,1),1);
    cur_image_ids(1:size(idx,1),1) = {test_scenes(i).name};
    
    % non-maximum suppression over all candidate detections in this image
    [is_maximum] = non_max_supr_bbox(cur_bboxes, cur_confidences, size(img));

    cur_confidences = cur_confidences(is_maximum,:);
    cur_bboxes      = cur_bboxes(     is_maximum,:);
    cur_image_ids   = cur_image_ids(  is_maximum,:);
 
    bboxes      = [bboxes;      cur_bboxes];
    confidences = [confidences; cur_confidences];
    image_ids   = [image_ids;   cur_image_ids];
end

The comparison of the visualized face detectors is shown below (the rightmost one is trained with the augmented positive samples and 60000 negative training samples):

The precision-recall curve and some detection results are given as follows:

Multi-Scale Face Detector with Augmented Positive Training Samples and Dynamic Search Step

I further improve both the average precision and the detection speed by using a different search step at each scale: a small step at small scales (where the resized image has few HoG cells, so a fine search is cheap and large faces are not missed) and a larger step at large scales (where there are many cells and a coarse step saves time). This achieves an average precision of 0.727. The code of run_detector.m is modified as follows:


% face detector using sliding window (multiple scale and dynamic search step).
function [bboxes, confidences, image_ids] = ...
    run_detector(test_scn_path, w, b, feature_params)

test_scenes = dir( fullfile( test_scn_path, '*.jpg' ));

%initialize these as empty and incrementally expand them.
bboxes = zeros(0,4);
confidences = zeros(0,1);
image_ids = cell(0,1);

for i = 1:length(test_scenes)
    
    fprintf('Detecting faces in %s\n', test_scenes(i).name)
    img = imread( fullfile( test_scn_path, test_scenes(i).name ));
    img = single(img)/255;
    if(size(img,3) > 1)
        img = rgb2gray(img);
    end

    k = 1;
    cur_bboxes_candi = [];
    cur_x_min_candi = [];
    cur_y_min_candi = [];
    test_hog = [];
    scale_set = [];
    scale_preset = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1,1.3,1.5];
    for scale_i = 1:size(scale_preset,2)
        scale = scale_preset(scale_i);
        resize_img = imresize(img,scale);
        img_hog = vl_hog(resize_img,feature_params.hog_cell_size);
        % finer steps at small scales (few HoG cells, large faces),
        % coarser steps at large scales (many HoG cells) to save time
        if scale < 0.5
            step_i = 1;
            step_j = 1;
        elseif scale < 0.9
            step_i = 2;
            step_j = 2;
        else
            step_i = 3;
            step_j = 3;
        end
        % the template spans template_size/hog_cell_size = 6 HoG cells per side
        cells_per_template = feature_params.template_size/feature_params.hog_cell_size;
        for slide_i = 1:step_i:size(img_hog,1)-cells_per_template+1
            for slide_j = 1:step_j:size(img_hog,2)-cells_per_template+1
                test_hog(k,:) = reshape(img_hog(slide_i:slide_i+cells_per_template-1,...
                    slide_j:slide_j+cells_per_template-1,:),1,cells_per_template^2*size(img_hog,3));
                scale_set(k,1) = scale;
                cur_x_min_candi(k,1) = fix(slide_j*feature_params.hog_cell_size/scale);
                cur_y_min_candi(k,1) = fix(slide_i*feature_params.hog_cell_size/scale);
                k = k+1;
            end
        end
    end

    threshold = -0.5;
    cur_confidences_candi = test_hog*w + b;
    idx = find(cur_confidences_candi>threshold);
    cur_confidences = cur_confidences_candi(idx,1);
    cur_x_min = cur_x_min_candi(idx,1);
    cur_y_min = cur_y_min_candi(idx,1);
    cur_bboxes = [[cur_x_min, cur_y_min,...
        fix(cur_x_min + feature_params.template_size./scale_set(idx)),...
        fix(cur_y_min + feature_params.template_size./scale_set(idx))]];
    cur_image_ids = cell(size(idx,1),1);
    cur_image_ids(1:size(idx,1),1) = {test_scenes(i).name};
    
    % non-maximum suppression over all candidate detections in this image;
    % pre-filtering very low-confidence candidates would speed this up,
    % but non_max_supr_bbox itself does not need to be modified
    [is_maximum] = non_max_supr_bbox(cur_bboxes, cur_confidences, size(img));

    cur_confidences = cur_confidences(is_maximum,:);
    cur_bboxes      = cur_bboxes(     is_maximum,:);
    cur_image_ids   = cur_image_ids(  is_maximum,:);
 
    bboxes      = [bboxes;      cur_bboxes];
    confidences = [confidences; cur_confidences];
    image_ids   = [image_ids;   cur_image_ids];
end

The precision-recall curve and some detection results are given as follows: