Face detection is one of the most successful applications of computer vision. As a preprocessing front end, it plays a significant role in the performance of a face recognition system. In this project, I implement a baseline sliding-window face detection algorithm. I also completed an extra-credit task by augmenting the positive training data.
My design decisions are summarized as follows:
I set the number of negative training samples to 10000, the lambda of the linear SVM to 0.0001, the HoG cell size to 6, and the template size to 36. The detector scans the entire HoG feature map with a step size of 1 cell (that is, 6 pixels, since the HoG cell size is 6), and the threshold on the linear SVM score is -0.2. The baseline uses only a single scale and achieves an average precision of 0.42.
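The arithmetic behind these settings can be checked with a short sketch. It assumes that vl_hog is used with its default variant, which as far as I know produces 31 channels per cell; the variable names other than template_size and hog_cell_size are introduced here for illustration only.

```matlab
% Feature-size arithmetic for the settings above (a sketch, no VLFeat needed).
template_size = 36;   % window side length in pixels
hog_cell_size = 6;    % HoG cell side length in pixels
hog_dims      = 31;   % assumption: channels per cell of the default vl_hog variant

cells_per_side = template_size / hog_cell_size;   % 6 cells per side
feature_length = cells_per_side^2 * hog_dims;     % 6 * 6 * 31 = 1116

% A detector step of 1 HoG cell moves the window by hog_cell_size pixels.
step_cells  = 1;
step_pixels = step_cells * hog_cell_size;         % 6 pixels

fprintf('cells per side: %d, feature length: %d, step: %d px\n', ...
    cells_per_side, feature_length, step_pixels);
```

Under these assumptions every training sample and every detector window is described by a vector of the same length, which is what allows a single linear SVM to score them.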
The code of the function get_positive_features.m is given as follows:
% extract HoG features from positive training samples (faces).
function features_pos = get_positive_features(train_path_pos, feature_params)
image_files = dir( fullfile( train_path_pos, '*.jpg') ); %Caltech Faces stored as .jpg
num_images = length(image_files);
for i = 1:num_images
crop_face = imread(fullfile(train_path_pos, image_files(i).name)); % fullfile keeps the path portable across platforms
crop_face_feat = vl_hog(single(crop_face)/255,feature_params.hog_cell_size);
features_pos(i,:) = reshape(crop_face_feat,1,numel(crop_face_feat));
end
end
The code of the function get_random_negative_features.m is given as follows:
% extract HoG features from randomly sampled negative patches (non-face scenes).
function features_neg = get_random_negative_features(non_face_scn_path, feature_params, num_samples)
image_files = dir( fullfile( non_face_scn_path, '*.jpg' ));
num_images = length(image_files);
step_i = 20;
step_j = 20;
k = 1;
for i = 1:num_images
non_face = rgb2gray(imread(fullfile(non_face_scn_path, image_files(i).name))); % fullfile keeps the path portable across platforms
for slide_i = 1:step_i:size(non_face,1)-feature_params.template_size+1
for slide_j = 1:step_j:size(non_face,2)-feature_params.template_size+1
non_face_patch{k} = non_face(slide_i:slide_i+feature_params.template_size-1,...
slide_j:slide_j+feature_params.template_size-1);
k = k+1;
end
end
end
rand_perm = randperm(k-1);
for i = 1:num_samples
neg_feat = non_face_patch{rand_perm(i)}; % index through the permutation so that the patches are actually sampled at random
non_face_feat = vl_hog(single(neg_feat)/255,feature_params.hog_cell_size);
features_neg(i,:) = reshape(non_face_feat,1,numel(non_face_feat));
end
end
The linear SVM is trained with a few lines in proj5.m:
X = [features_pos',features_neg'];
Y = [ones(size(features_pos,1),1);-ones(size(features_neg,1),1)];
[w b] = vl_svmtrain(X, Y, 0.0001);
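Once w and b are available, each feature vector is scored as a dot product plus the bias, which is the same expression the detector uses later. The following is a minimal sketch on toy data; the weights and bias are hypothetical values standing in for the output of vl_svmtrain, and the features are stored one sample per row, as in features_pos and features_neg.

```matlab
% Toy data: 4 samples with 3-dimensional features (one sample per row).
features = [ 1.0  0.5  0.0;
             0.8  0.7  0.1;
            -0.9 -0.4  0.2;
            -1.1 -0.6  0.0];
Y = [1; 1; -1; -1];          % +1 for faces, -1 for non-faces

% Hypothetical weights and bias, standing in for the trained SVM.
w = [1; 0.5; 0];
b = -0.1;

scores         = features * w + b;     % one confidence per sample
predicted      = sign(scores);         % classify by the sign of the score
train_accuracy = mean(predicted == Y);

fprintf('training accuracy: %.2f\n', train_accuracy);
```

Checking the training accuracy in this way is a quick sanity test before running the detector; note that the training call above passes the features transposed, with one sample per column.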
The code of the function run_detector.m is given as follows:
% face detector using sliding window (single scale).
function [bboxes, confidences, image_ids] = ....
run_detector(test_scn_path, w, b, feature_params)
test_scenes = dir( fullfile( test_scn_path, '*.jpg' ));
%initialize these as empty and incrementally expand them.
bboxes = zeros(0,4);
confidences = zeros(0,1);
image_ids = cell(0,1);
for i = 1:length(test_scenes)
fprintf('Detecting faces in %s\n', test_scenes(i).name)
img = imread( fullfile( test_scn_path, test_scenes(i).name ));
img = single(img)/255;
if(size(img,3) > 1)
img = rgb2gray(img);
end
k = 1;
cur_bboxes_candi = [];
cur_x_min_candi = [];
cur_y_min_candi = [];
test_hog = [];
scale_set = [];
scale_preset = 1;%[0.6,1,1.2,1.4];
for scale_i = 1:size(scale_preset,2)
scale = scale_preset(scale_i);
resize_img = imresize(img,scale);
img_hog = vl_hog(resize_img,feature_params.hog_cell_size);
step_i = 1;
step_j = 1;
% window size in HoG cells; using hog_cell_size here only works because
% template_size/hog_cell_size = 36/6 happens to equal hog_cell_size
num_cells = feature_params.template_size/feature_params.hog_cell_size;
for slide_i = 1:step_i:size(img_hog,1)-num_cells+1
for slide_j = 1:step_j:size(img_hog,2)-num_cells+1
test_hog(k,:) = reshape(img_hog(slide_i:slide_i+num_cells-1,...
slide_j:slide_j+num_cells-1,:),1,num_cells^2*size(img_hog,3));
scale_set(k,1) = scale;
cur_x_min_candi(k,1) = fix(slide_j*feature_params.hog_cell_size/scale);
cur_y_min_candi(k,1) = fix(slide_i*feature_params.hog_cell_size/scale);
k = k+1;
end
end
end
threshold = -0.2;
cur_confidences_candi = test_hog*w + b;
idx = find(cur_confidences_candi>threshold);
cur_confidences = cur_confidences_candi(idx,1);
cur_x_min = cur_x_min_candi(idx,1);
cur_y_min = cur_y_min_candi(idx,1);
cur_bboxes = [[cur_x_min, cur_y_min,...
fix(cur_x_min + feature_params.template_size./scale_set(idx)),...
fix(cur_y_min + feature_params.template_size./scale_set(idx))]];
cur_image_ids = cell(size(idx,1),1);
cur_image_ids(1:size(idx,1),1) = {test_scenes(i).name};
[is_maximum] = non_max_supr_bbox(cur_bboxes, cur_confidences, size(img));
cur_confidences = cur_confidences(is_maximum,:);
cur_bboxes = cur_bboxes( is_maximum,:);
cur_image_ids = cur_image_ids( is_maximum,:);
bboxes = [bboxes; cur_bboxes];
confidences = [confidences; cur_confidences];
image_ids = [image_ids; cur_image_ids];
end
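The detector above relies on non_max_supr_bbox to discard overlapping detections. Its implementation is not shown in this report, so the following is only a generic sketch of greedy non-maximum suppression based on intersection over union, not the actual function; the overlap threshold and the example boxes are hypothetical.

```matlab
% boxes: one [x_min, y_min, x_max, y_max] per row; conf: one score per row.
boxes = [ 10  10  46  46;
          12  12  48  48;
         100 100 136 136];
conf  = [0.9; 0.6; 0.3];
iou_threshold = 0.3;                 % hypothetical overlap threshold

[~, order] = sort(conf, 'descend');  % visit boxes from most to least confident
keep = false(size(conf));
for n = 1:numel(order)
    cand = order(n);
    suppressed = false;
    for m = find(keep)'              % compare against boxes already kept
        iw = min(boxes(cand,3), boxes(m,3)) - max(boxes(cand,1), boxes(m,1));
        ih = min(boxes(cand,4), boxes(m,4)) - max(boxes(cand,2), boxes(m,2));
        inter  = max(iw, 0) * max(ih, 0);
        area_c = (boxes(cand,3) - boxes(cand,1)) * (boxes(cand,4) - boxes(cand,2));
        area_m = (boxes(m,3) - boxes(m,1)) * (boxes(m,4) - boxes(m,2));
        iou = inter / (area_c + area_m - inter);
        if iou > iou_threshold
            suppressed = true;       % overlaps a more confident box
            break;
        end
    end
    keep(cand) = ~suppressed;
end
disp(keep');
```

In this toy example the second box overlaps the first one heavily and is dropped, while the distant third box survives even though its score is lower.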
I visualize the face detector and plot the precision-recall curve as follows (the average precision is 0.42):
To better show the detection results, I give a few detection examples:
Unlike the baseline, this version is a multi-scale face detector, which makes it adaptive to faces of multiple scales. The average precision increases to 0.607. The threshold is set to -0.5 and the other parameters remain the same as in the baseline. To speed up detection, I increase the step size from 1 HoG cell to 2 HoG cells.
The only difference is the function run_detector.m, which is given as follows:
% face detector using sliding window (multiple scale).
function [bboxes, confidences, image_ids] = ....
run_detector(test_scn_path, w, b, feature_params)
test_scenes = dir( fullfile( test_scn_path, '*.jpg' ));
bboxes = zeros(0,4);
confidences = zeros(0,1);
image_ids = cell(0,1);
for i = 1:length(test_scenes)
fprintf('Detecting faces in %s\n', test_scenes(i).name)
img = imread( fullfile( test_scn_path, test_scenes(i).name ));
img = single(img)/255;
if(size(img,3) > 1)
img = rgb2gray(img);
end
k = 1;
cur_bboxes_candi = [];
cur_x_min_candi = [];
cur_y_min_candi = [];
test_hog = [];
scale_set = [];
scale_preset = [0.2,0.4,0.7,0.9];
for scale_i = 1:size(scale_preset,2)
scale = scale_preset(scale_i);
resize_img = imresize(img,scale);
img_hog = vl_hog(resize_img,feature_params.hog_cell_size);
step_i = 2;
step_j = 2;
% window size in HoG cells (6 for a 36-pixel template with 6-pixel cells)
num_cells = feature_params.template_size/feature_params.hog_cell_size;
for slide_i = 1:step_i:size(img_hog,1)-num_cells+1
for slide_j = 1:step_j:size(img_hog,2)-num_cells+1
test_hog(k,:) = reshape(img_hog(slide_i:slide_i+num_cells-1,...
slide_j:slide_j+num_cells-1,:),1,num_cells^2*size(img_hog,3));
scale_set(k,1) = scale;
cur_x_min_candi(k,1) = fix(slide_j*feature_params.hog_cell_size/scale);
cur_y_min_candi(k,1) = fix(slide_i*feature_params.hog_cell_size/scale);
k = k+1;
end
end
end
threshold = -0.5;
cur_confidences_candi = test_hog*w + b;
idx = find(cur_confidences_candi>threshold);
cur_confidences = cur_confidences_candi(idx,1);
cur_x_min = cur_x_min_candi(idx,1);
cur_y_min = cur_y_min_candi(idx,1);
cur_bboxes = [[cur_x_min, cur_y_min,...
fix(cur_x_min + feature_params.template_size./scale_set(idx)),...
fix(cur_y_min + feature_params.template_size./scale_set(idx))]];
cur_image_ids = cell(size(idx,1),1);
cur_image_ids(1:size(idx,1),1) = {test_scenes(i).name};
% apply non-maximum suppression to the thresholded detections
[is_maximum] = non_max_supr_bbox(cur_bboxes, cur_confidences, size(img));
cur_confidences = cur_confidences(is_maximum,:);
cur_bboxes = cur_bboxes( is_maximum,:);
cur_image_ids = cur_image_ids( is_maximum,:);
bboxes = [bboxes; cur_bboxes];
confidences = [confidences; cur_confidences];
image_ids = [image_ids; cur_image_ids];
end
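The multi-scale detector above has to map a window found in a resized image back to the coordinates of the original image. The sketch below isolates that step with the same formulas as the listing; the scale and the cell indices are hypothetical example values.

```matlab
% Map a window in the resized image back to original-image coordinates.
hog_cell_size = 6;
template_size = 36;
scale   = 0.5;   % hypothetical: the image was shrunk to half size
slide_i = 3;     % hypothetical row index of the window in the HoG map
slide_j = 5;     % hypothetical column index of the window in the HoG map

% Dividing by the scale undoes the resizing.
x_min = fix(slide_j * hog_cell_size / scale);
y_min = fix(slide_i * hog_cell_size / scale);
x_max = fix(x_min + template_size / scale);
y_max = fix(y_min + template_size / scale);

fprintf('bbox = [%d %d %d %d]\n', x_min, y_min, x_max, y_max);
```

At scale 0.5 the 36-pixel template covers 72 pixels of the original image, which is how a fixed-size template detects larger faces. One possible refinement, stated here as an observation rather than something that was tested: a 1-based cell index s strictly begins at pixel (s-1)*hog_cell_size + 1 of the resized image, so the formula as written places each box roughly one cell toward the lower right.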
The precision-recall curve and some detection results are given as follows:
I further improve the linear SVM by augmenting the positive training samples and increasing the number of negative training samples, which raises the average precision to 0.722. I visualize the face detectors learned by three SVMs. From left to right: the first is trained with the original positive samples and 10000 negative samples, the second with augmented positive samples (originals plus their mirror images) and 40000 negative samples, and the third with augmented positive samples and 60000 negative samples. The last one is used in this experiment. The multi-scale detector also uses more scales than in the previous section. The function get_positive_features.m is modified as follows:
% get augmented positive training samples
function features_pos = get_positive_features(train_path_pos, feature_params)
image_files = dir( fullfile( train_path_pos, '*.jpg') ); %Caltech Faces stored as .jpg
num_images = length(image_files);
% augment positive training samples
for i = 1:num_images
ori_face{1,i} = imread(fullfile(train_path_pos, image_files(i).name)); % fullfile keeps the path portable across platforms
[height,width,dim] = size(ori_face{i});
tform = maketform('affine',[-1 0 0;0 1 0;width 0 1]);
aug_face{1,i} = imtransform(ori_face{i},tform,'nearest');
end
all_face = cat(2,ori_face,aug_face);
for i = 1:size(all_face,2)
crop_face = all_face{i};
crop_face_feat = vl_hog(single(crop_face)/255,feature_params.hog_cell_size);
features_pos(i,:) = reshape(crop_face_feat,1,numel(crop_face_feat));
end
end
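The mirroring in the function above is expressed as an affine transform. As an illustration of what the augmentation does, a horizontal mirror can also be obtained by reversing the column order with plain indexing. This is a sketch on a toy array, not the code used for the reported results.

```matlab
% Toy 2x3 "image"; a real face crop would be 36x36.
face = uint8([1 2 3;
              4 5 6]);

% Reverse the column order to mirror the image left to right.
mirrored = face(:, end:-1:1, :);

% The augmented set holds each original followed by its mirror image.
all_face = {face, mirrored};
fprintf('number of positive samples after augmentation: %d\n', numel(all_face));
disp(mirrored);
```

Mirroring doubles the number of positive samples at no labeling cost, which is reasonable here because faces are roughly left-right symmetric.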
The code of run_detector.m is shown as follows:
% face detector using sliding window (multiple scale).
function [bboxes, confidences, image_ids] = ....
run_detector(test_scn_path, w, b, feature_params)
test_scenes = dir( fullfile( test_scn_path, '*.jpg' ));
bboxes = zeros(0,4);
confidences = zeros(0,1);
image_ids = cell(0,1);
for i = 1:length(test_scenes)
fprintf('Detecting faces in %s\n', test_scenes(i).name)
img = imread( fullfile( test_scn_path, test_scenes(i).name ));
img = single(img)/255;
if(size(img,3) > 1)
img = rgb2gray(img);
end
k = 1;
cur_bboxes_candi = [];
cur_x_min_candi = [];
cur_y_min_candi = [];
test_hog = [];
scale_set = [];
scale_preset = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9];
for scale_i = 1:size(scale_preset,2)
scale = scale_preset(scale_i);
resize_img = imresize(img,scale);
img_hog = vl_hog(resize_img,feature_params.hog_cell_size);
step_i = 2;
step_j = 2;
% window size in HoG cells (6 for a 36-pixel template with 6-pixel cells)
num_cells = feature_params.template_size/feature_params.hog_cell_size;
for slide_i = 1:step_i:size(img_hog,1)-num_cells+1
for slide_j = 1:step_j:size(img_hog,2)-num_cells+1
test_hog(k,:) = reshape(img_hog(slide_i:slide_i+num_cells-1,...
slide_j:slide_j+num_cells-1,:),1,num_cells^2*size(img_hog,3));
scale_set(k,1) = scale;
cur_x_min_candi(k,1) = fix(slide_j*feature_params.hog_cell_size/scale);
cur_y_min_candi(k,1) = fix(slide_i*feature_params.hog_cell_size/scale);
k = k+1;
end
end
end
threshold = -0.5;
cur_confidences_candi = test_hog*w + b;
idx = find(cur_confidences_candi>threshold);
cur_confidences = cur_confidences_candi(idx,1);
cur_x_min = cur_x_min_candi(idx,1);
cur_y_min = cur_y_min_candi(idx,1);
cur_bboxes = [[cur_x_min, cur_y_min,...
fix(cur_x_min + feature_params.template_size./scale_set(idx)),...
fix(cur_y_min + feature_params.template_size./scale_set(idx))]];
cur_image_ids = cell(size(idx,1),1);
cur_image_ids(1:size(idx,1),1) = {test_scenes(i).name};
% apply non-maximum suppression to the thresholded detections
[is_maximum] = non_max_supr_bbox(cur_bboxes, cur_confidences, size(img));
cur_confidences = cur_confidences(is_maximum,:);
cur_bboxes = cur_bboxes( is_maximum,:);
cur_image_ids = cur_image_ids( is_maximum,:);
bboxes = [bboxes; cur_bboxes];
confidences = [confidences; cur_confidences];
image_ids = [image_ids; cur_image_ids];
end
The visualized face detectors are compared as follows (the right one is trained with augmented positive samples and 60000 negative samples):
The precision-recall curve and some detection results are given as follows:
I developed a way to further improve both the average precision and the detection speed: the search step depends on the scale. I use a small search step for small scales and a large search step for large scales. This achieves an average precision of 0.727. The code of run_detector.m is modified as follows:
% face detector using sliding window (multiple scale and dynamic search step).
function [bboxes, confidences, image_ids] = ....
run_detector(test_scn_path, w, b, feature_params)
test_scenes = dir( fullfile( test_scn_path, '*.jpg' ));
%initialize these as empty and incrementally expand them.
bboxes = zeros(0,4);
confidences = zeros(0,1);
image_ids = cell(0,1);
for i = 1:length(test_scenes)
fprintf('Detecting faces in %s\n', test_scenes(i).name)
img = imread( fullfile( test_scn_path, test_scenes(i).name ));
img = single(img)/255;
if(size(img,3) > 1)
img = rgb2gray(img);
end
k = 1;
cur_bboxes_candi = [];
cur_x_min_candi = [];
cur_y_min_candi = [];
test_hog = [];
scale_set = [];
scale_preset = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1,1.3,1.5];
for scale_i = 1:size(scale_preset,2)
scale = scale_preset(scale_i);
resize_img = imresize(img,scale);
img_hog = vl_hog(resize_img,feature_params.hog_cell_size);
if scale < 0.5
step_i = 1;
step_j = 1;
elseif scale < 0.9
step_i = 2;
step_j = 2;
else
step_i = 3;
step_j = 3;
end
% window size in HoG cells (6 for a 36-pixel template with 6-pixel cells)
num_cells = feature_params.template_size/feature_params.hog_cell_size;
for slide_i = 1:step_i:size(img_hog,1)-num_cells+1
for slide_j = 1:step_j:size(img_hog,2)-num_cells+1
test_hog(k,:) = reshape(img_hog(slide_i:slide_i+num_cells-1,...
slide_j:slide_j+num_cells-1,:),1,num_cells^2*size(img_hog,3));
scale_set(k,1) = scale;
cur_x_min_candi(k,1) = fix(slide_j*feature_params.hog_cell_size/scale);
cur_y_min_candi(k,1) = fix(slide_i*feature_params.hog_cell_size/scale);
k = k+1;
end
end
end
threshold = -0.5;
cur_confidences_candi = test_hog*w + b;
idx = find(cur_confidences_candi>threshold);
cur_confidences = cur_confidences_candi(idx,1);
cur_x_min = cur_x_min_candi(idx,1);
cur_y_min = cur_y_min_candi(idx,1);
cur_bboxes = [[cur_x_min, cur_y_min,...
fix(cur_x_min + feature_params.template_size./scale_set(idx)),...
fix(cur_y_min + feature_params.template_size./scale_set(idx))]];
cur_image_ids = cell(size(idx,1),1);
cur_image_ids(1:size(idx,1),1) = {test_scenes(i).name};
%non_max_supr_bbox can actually get somewhat slow with thousands of
%initial detections. You could pre-filter the detections by confidence,
%e.g. a detection with confidence -1.1 will probably never be
%meaningful. You probably _don't_ want to threshold at 0.0, though. You
%can get higher recall with a lower threshold. You don't need to modify
%anything in non_max_supr_bbox, but you can.
[is_maximum] = non_max_supr_bbox(cur_bboxes, cur_confidences, size(img));
cur_confidences = cur_confidences(is_maximum,:);
cur_bboxes = cur_bboxes( is_maximum,:);
cur_image_ids = cur_image_ids( is_maximum,:);
bboxes = [bboxes; cur_bboxes];
confidences = [confidences; cur_confidences];
image_ids = [image_ids; cur_image_ids];
end
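The effect of the scale-dependent step on the number of evaluated windows can be estimated with a small sketch. The image size is hypothetical, and the size of the HoG map is approximated as floor(scale * image size / cell size), which may differ slightly from what imresize and vl_hog actually return.

```matlab
hog_cell_size = 6;
template_size = 36;
num_cells     = template_size / hog_cell_size;  % window size in HoG cells
img_size      = [360 480];                      % hypothetical height and width
scale_preset  = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1,1.3,1.5];

steps = zeros(size(scale_preset));
windows_dynamic = 0;   % windows visited with the scale-dependent step
windows_fixed   = 0;   % windows visited with a step of 1 at every scale
for s = 1:numel(scale_preset)
    scale = scale_preset(s);
    if scale < 0.5
        step = 1;
    elseif scale < 0.9
        step = 2;
    else
        step = 3;
    end
    steps(s) = step;
    hog_size = floor(scale * img_size / hog_cell_size);  % approximate HoG map size
    last = hog_size - num_cells + 1;                     % last valid start index
    windows_dynamic = windows_dynamic + ...
        numel(1:step:last(1)) * numel(1:step:last(2));
    windows_fixed = windows_fixed + numel(1:last(1)) * numel(1:last(2));
end
fprintf('windows: %d with dynamic step, %d with step 1\n', ...
    windows_dynamic, windows_fixed);
```

The large scales produce the largest HoG maps, so a coarser step there removes most of the windows, while the small scales are cheap enough to scan densely.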
The precision-recall curve and some detection results are given as follows: