Computer Vision is an area of research concerned with enabling computers to see the world the way humans do, even though they work with a very different domain of data. Something humans do a lot of is look at other humans: humans are remarkably good at recognizing other humans, and we would like computers to be good at this task as well. In this project we attempt to create a face detector by implementing the sliding window detector of Dalal and Triggs (2005).
To determine the success of our creation, we test our face detector against the CMU+MIT test dataset. Ideally, our face detector will find every face in each image while reporting few false positives.
The most important part of any object detector is training it properly. Training involves choosing a feature representation, gathering positive and negative feature examples, and training a classifier on those features.
As in Dalal and Triggs (2005), our detector uses a Histogram of Oriented Gradients (HoG) feature representation. After training, our learned HoG template looked like this:
Looking at the template, one can see the faint outline of a face: the eyes, the nose, and the shape of the head are all visible, even if faintly.
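The project computes HoG features with VLFeat's vl_hog; as an illustration of what one cell of such a feature holds, here is a minimal Python/NumPy sketch of the per-cell histogram idea. It is a simplification for intuition only: it omits the block normalization and bilinear vote interpolation of the full Dalal-Triggs pipeline, and the cell size and bin count are illustrative defaults.

```python
import numpy as np

def hog_features(img, cell_size=6, n_orientations=9):
    """Minimal HoG-like descriptor: per-cell histograms of gradient
    orientations, weighted by gradient magnitude. (No block
    normalization, unlike the full Dalal-Triggs formulation.)"""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    # Unsigned orientation folded into [0, pi)
    ori = np.arctan2(gy, gx) % np.pi
    bins = np.minimum((ori / np.pi * n_orientations).astype(int),
                      n_orientations - 1)
    n_cy = img.shape[0] // cell_size
    n_cx = img.shape[1] // cell_size
    hist = np.zeros((n_cy, n_cx, n_orientations))
    for cy in range(n_cy):
        for cx in range(n_cx):
            ys = slice(cy * cell_size, (cy + 1) * cell_size)
            xs = slice(cx * cell_size, (cx + 1) * cell_size)
            for b in range(n_orientations):
                # Sum gradient magnitude of pixels voting into bin b
                hist[cy, cx, b] = mag[ys, xs][bins[ys, xs] == b].sum()
    return hist.ravel()
```

For a 36x36 template with 6-pixel cells and 9 orientation bins, this yields a 6x6x9 = 324-dimensional feature, which is the kind of vector our classifier operates on.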
Positive features were gathered from an edited version of the Caltech Web Faces project. Each face image was converted to HoG feature space and added to the list of positive HoG features.
Gathering negative features is a little more interesting. As with most computer vision tasks, the more negative examples you have, the better performance you can expect. To gather a given number of negative HoG examples, a directory of face-free scene images is provided. For each image in the directory, patches are sampled at various positions and scales and converted to HoG feature space. The code for that can be seen here:
% Sample negative patches from each face-free scene at multiple scales
for i = 1:num_images
    img = imread(fullfile(non_face_scn_path, image_files(i).name));
    if size(img, 3) > 1
        img = rgb2gray(img);
    end
    cur_sample_count = 0;
    scaled_img = img;
    while cur_sample_count < samples_per_image
        % Stop once the image is too small to sample from comfortably
        if size(scaled_img, 1) < feature_params.template_size * 2 || ...
           size(scaled_img, 2) < feature_params.template_size * 2
            break;
        end
        % Pick random top-left corners for the patches at this scale
        rand_xs = randi([1, size(scaled_img, 2) - feature_params.template_size], ...
                        SAMPLES_PER_SCALE, 1);
        rand_ys = randi([1, size(scaled_img, 1) - feature_params.template_size], ...
                        SAMPLES_PER_SCALE, 1);
        for j = 1:SAMPLES_PER_SCALE
            x = rand_xs(j);
            y = rand_ys(j);
            % template_size x template_size patch (note the -1 to avoid
            % an off-by-one-pixel patch)
            samp = scaled_img(y : y + feature_params.template_size - 1, ...
                              x : x + feature_params.template_size - 1);
            hog = vl_hog(single(samp), feature_params.hog_cell_size);
            features_neg(samples_count, :) = hog(:);
            cur_sample_count = cur_sample_count + 1;
            samples_count = samples_count + 1;
        end
        % Shrink once per iteration; resizing by a cumulative scale each
        % pass would compound the shrinkage
        scaled_img = imresize(scaled_img, SCALE_FACTOR);
    end
end
Once all training data has been gathered, a classifier can be created. For this project, a linear SVM classifier was used with a regularization value of lambda = 0.001.
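To make the training step concrete, here is a hedged NumPy sketch of the objective a linear SVM minimizes: the L2-regularized hinge loss, optimized by plain subgradient descent. The project's actual training (via an SVM solver) is more sophisticated; the toy data, learning rate, and iteration count below are illustrative assumptions.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.001, lr=0.1, n_iters=500):
    """Full-batch subgradient descent on
    lam/2 * ||w||^2 + mean(max(0, 1 - y*(X w + b))), with y in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + b)
        mask = margins < 1          # samples violating the margin
        grad_w = lam * w - (y[mask, None] * X[mask]).sum(axis=0) / n
        grad_b = -y[mask].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy linearly separable data standing in for HoG features
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b = train_linear_svm(X, y)
preds = np.sign(X @ w + b)
```

The learned `w` and `b` are exactly what the detector below uses: a window's confidence is the dot product of its HoG vector with `w`, plus `b`.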
Once a classifier has been created, it's time to start using it on testing data. A sliding window detector was run over a variety of test images. Each image is converted to HoG feature space, and every template-sized window of that HoG space is passed through the classifier. If the resulting confidence is above a certain threshold, the window is declared a face. After all windows are tested, the image is shrunk by a scaling factor and all windows are tested again; each image is rescaled until it is too small to be tested against the template. The code for the detector can be seen here:
for i = 1:length(test_scenes)
    fprintf('Detecting faces in %s\n', test_scenes(i).name)
    img = imread(fullfile(test_scn_path, test_scenes(i).name));
    img = single(img) / 255;
    if size(img, 3) > 1
        img = rgb2gray(img);
    end
    % Window dimensions measured in HoG cells
    window_width  = feature_params.template_size / feature_params.hog_cell_size;
    window_height = feature_params.template_size / feature_params.hog_cell_size;
    scale = 1;
    cur_bboxes      = zeros(0, 4);
    cur_confidences = zeros(0, 1);
    cur_image_ids   = cell(0, 1);
    for scale_count = 1:15
        scaled_img = imresize(img, scale);
        % Stop once the image is smaller than the template
        if size(scaled_img, 1) < feature_params.template_size || ...
           size(scaled_img, 2) < feature_params.template_size
            break;
        end
        img_hog = vl_hog(scaled_img, feature_params.hog_cell_size);
        for j = 1:size(img_hog, 1) - window_height + 1
            for k = 1:size(img_hog, 2) - window_width + 1
                sample = img_hog(j:j+window_height-1, k:k+window_width-1, :);
                confidence = w' * sample(:) + b;
                if confidence > CONFIDENCE_THRESHOLD
                    % Map the window back to pixel coordinates in the
                    % original (unscaled) image
                    x_min = floor(((k - 1) * feature_params.hog_cell_size + 1) / scale);
                    y_min = floor(((j - 1) * feature_params.hog_cell_size + 1) / scale);
                    x_max = floor(((k - 1) * feature_params.hog_cell_size + ...
                                   feature_params.template_size) / scale);
                    y_max = floor(((j - 1) * feature_params.hog_cell_size + ...
                                   feature_params.template_size) / scale);
                    cur_bboxes      = [cur_bboxes; [x_min, y_min, x_max, y_max]];
                    cur_confidences = [cur_confidences; confidence];
                    cur_image_ids   = [cur_image_ids; test_scenes(i).name];
                end
            end
        end
        scale = scale * SCALE_FACTOR;
    end
    % Non-maximum suppression removes overlapping duplicate detections
    [is_maximum] = non_max_supr_bbox(cur_bboxes, cur_confidences, size(img));
    cur_confidences = cur_confidences(is_maximum, :);
    cur_bboxes      = cur_bboxes(is_maximum, :);
    cur_image_ids   = cur_image_ids(is_maximum, :);
    bboxes      = [bboxes;      cur_bboxes];
    confidences = [confidences; cur_confidences];
    image_ids   = [image_ids;   cur_image_ids];
end
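The non_max_supr_bbox call above comes from the project's starter code. The idea it implements, greedy non-maximum suppression by overlap, can be sketched in a few lines of NumPy; the IoU threshold here is an illustrative choice, not the starter code's actual criterion.

```python
import numpy as np

def nms(bboxes, confidences, iou_threshold=0.3):
    """Greedy non-maximum suppression: keep the highest-confidence box,
    drop every remaining box whose IoU with it exceeds the threshold,
    then repeat. bboxes is (N, 4) as [x_min, y_min, x_max, y_max]."""
    areas = (bboxes[:, 2] - bboxes[:, 0]) * (bboxes[:, 3] - bboxes[:, 1])
    order = np.argsort(-confidences)   # indices, best first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection of the top box with all remaining boxes
        xx1 = np.maximum(bboxes[i, 0], bboxes[rest, 0])
        yy1 = np.maximum(bboxes[i, 1], bboxes[rest, 1])
        xx2 = np.minimum(bboxes[i, 2], bboxes[rest, 2])
        yy2 = np.minimum(bboxes[i, 3], bboxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]
    return keep
```

This matters for a sliding window detector because a single face typically fires many overlapping windows across adjacent positions and scales; without suppression, each face would be reported dozens of times.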
We chose 200,000 as the number of negative samples. Since all of the negative features must be held in memory at once, there is an upper limit: MATLAB could not handle 1,000,000 samples. Because gathering more negative features tends to increase accuracy, further work could be done to handle larger numbers of samples.
An average precision of 91% was achieved on the CMU+MIT test set.
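Average precision summarizes the detector's precision-recall curve in a single number. Assuming the standard definition used in detection benchmarks (precision averaged over the ranks of the true positives, divided by the number of ground-truth faces), a minimal sketch:

```python
import numpy as np

def average_precision(labels, confidences, n_ground_truth):
    """AP as the mean of precision values at each true positive, with
    detections ranked by descending confidence. labels[i] is 1 if
    detection i matched a ground-truth face, else 0."""
    order = np.argsort(-np.asarray(confidences, dtype=float))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                      # true positives so far
    ranks = np.arange(1, len(labels) + 1)       # detections so far
    precision_at_tp = tp[labels == 1] / ranks[labels == 1]
    return precision_at_tp.sum() / n_ground_truth

# Four detections, three of which match one of three true faces
ap = average_precision([1, 1, 0, 1], [0.9, 0.8, 0.7, 0.6], n_ground_truth=3)
```

Note that undetected faces still count: dividing by the number of ground-truth faces means every missed face pulls the score down even though it contributes no ranked detection.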
Bad results tended to fall into three categories: false positives, false negatives, or both.
False positives occurred when a face was detected where no face truly existed.
False negatives occurred when a face was truly present but was not detected.
As with most computer vision tasks, the goal is to be as good as humans. However, humans are not always perfect at recognition tasks. Just as some images may confuse humans, they can also confuse an algorithm, as shown below.
Lastly, here are the results of face detection on pictures of our class. One is easy, and one is hard in that we were attempting to hide our faces.