Project 5 / Face Detection with a Sliding Window

To create a face detector, I implemented the pipeline described by Dalal and Triggs: train a linear classifier, then use a sliding-window model to locate the positive classifications in an image. The process I followed is:

  1. Create positive and negative training examples using Histogram of Oriented Gradients (HoG) features as attributes.
  2. Train a linear Support Vector Machine (SVM) on the training data.
  3. Transform the testing data into HoG features and use a sliding window to iterate over the HoG cells in each image.

Creating Training Data

For the positive training instances, I used the Caltech faces dataset, in which all the images are 36x36 pixels. I transformed each image into HoG features using VLFeat's vl_hog function and then flattened the result into a vector of 1116 attributes. For the negative instances, I randomly sampled and transformed portions of the non-face scene images, based on the number of samples desired and the number of images provided.
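
As a concrete illustration, here is a minimal sketch of the positive-feature extraction, assuming the faces are listed with dir, that train_path_pos and image_files are placeholder names, and that feature_params.hog_cell_size = 6 (so a 36x36 image yields a 6x6x31 HoG array, i.e. 1116 values):

% Sketch of positive-feature extraction; train_path_pos and image_files
% are assumed names, not necessarily the actual variables used.
image_files = dir(fullfile(train_path_pos, '*.jpg'));
D = (36 / feature_params.hog_cell_size)^2 * 31;      % 6*6*31 = 1116
features_pos = zeros(length(image_files), D);
for i = 1:length(image_files)
    img = im2single(imread(fullfile(train_path_pos, image_files(i).name)));
    hog = vl_hog(img, feature_params.hog_cell_size); % 6x6x31 HoG array
    features_pos(i,:) = reshape(hog, [1, D]);        % flatten to one row
end

The loop below then draws random windows of HoG cells from each scene image to build the negative examples: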


% Sample random window positions (without replacement) within the HoG grid.
w_sample = randsample(size(hog,1) - feature_params.hog_cell_size + 1, num_windows);
h_sample = randsample(size(hog,2) - feature_params.hog_cell_size + 1, num_windows);
for j = 1:num_windows
    w = w_sample(j);
    h = h_sample(j);
    % Extract a hog_cell_size x hog_cell_size block of HoG cells and flatten it.
    hog_cell = hog(w:w+feature_params.hog_cell_size-1, ...
                   h:h+feature_params.hog_cell_size-1, :);
    features_neg(idx,:) = reshape(hog_cell, [1, D]);
    idx = idx + 1;
end

Training the Support Vector Machine

I used VLFeat's vl_svmtrain with lambda = 0.00001 on the collected positive and negative training data to learn the parameters w and b; an instance x is then scored as w'x + b, with the sign and magnitude indicating how confidently it is classified as a face or non-face. Testing on the same training set used to fit the SVM, I got an accuracy of 0.999, and the separation between the positive and negative scores is shown below.
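
The training call itself is short. This sketch assumes features_pos and features_neg are N x D matrices of flattened HoG descriptors (vl_svmtrain expects one example per column):

% Stack positives and negatives; vl_svmtrain takes a D x N data matrix.
X = single([features_pos; features_neg])';
Y = [ones(size(features_pos,1),1); -ones(size(features_neg,1),1)];
lambda = 0.00001;
[w, b] = vl_svmtrain(X, Y, lambda);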

Classifying with a Sliding Window

For each image in the testing set, I scaled it to multiple sizes, computed a HoG for each scale, and then iterated through the HoG cells, multiplying the attributes of each window by the SVM parameters. I kept any windows with a score above 0.7. After doing this for every scale of the image, I used non-maximum suppression to determine which of the overlapping detections was most likely the true face location. I was able to reach an average precision of 80.4%, and some of the more successful images are shown in the bottom table.

To increase precision, I lowered the confidence threshold to 0.2, adjusted lambda around 0.0001, and varied the number of negative samples from 10,000 to 100,000. Lowering the confidence threshold classifies more non-face regions as faces than the previous thresholding, and increasing the number of negative training samples tended to lower precision. The highest average precision I saw was 84.2%.
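
The per-scale window scan is shown below. For context, here is a minimal sketch of the scale loop that surrounds it; scale_factor is an assumed name and value, and multiplier is taken to be the HoG cell size in pixels, which converts cell indices back to pixel coordinates:

% Sketch of the multi-scale loop around the window scan; scale_factor is
% an assumed name, and multiplier converts HoG cell indices to pixels.
scale = 1.0;
scale_factor = 0.9;                          % shrink by 10% each pass (assumed)
multiplier = feature_params.hog_cell_size;   % HoG cell size in pixels
while min(size(img)) * scale >= 36           % stop below the 36x36 template size
    img_scaled = imresize(img, scale);
    % ... per-scale window scan below runs here ...
    scale = scale * scale_factor;
end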


hog = vl_hog(img_scaled, feature_params.hog_cell_size);
for x = 1:size(hog,1) - feature_params.hog_cell_size + 1
    for y = 1:size(hog,2) - feature_params.hog_cell_size + 1
        % Score this window of HoG cells with the linear SVM.
        hog_cell = hog(x:x+feature_params.hog_cell_size-1, ...
                       y:y+feature_params.hog_cell_size-1, :);
        confidence = reshape(hog_cell, [1, D]) * w + b;
        if confidence > 0.2
            % Map the window back to pixel coordinates in the original image.
            cur_bboxes = [y*multiplier/scale, ...
                          x*multiplier/scale, ...
                          (y+feature_params.hog_cell_size)*multiplier/scale, ...
                          (x+feature_params.hog_cell_size)*multiplier/scale; cur_bboxes];
            cur_confidences = [confidence; cur_confidences];
            cur_image_ids = [{test_scenes(i).name}; cur_image_ids];
        end
    end
end
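
After all scales are scanned, overlapping detections are pruned. A minimal sketch, assuming a helper non_max_supr_bbox(bboxes, confidences, img_size) that returns a logical mask of the locally strongest windows (the project starter code provides a helper along these lines):

% Keep only the strongest detection among overlapping windows; the helper
% name and interface here are assumed from the starter code.
is_max = non_max_supr_bbox(cur_bboxes, cur_confidences, size(img));
cur_bboxes = cur_bboxes(is_max, :);
cur_confidences = cur_confidences(is_max, :);
cur_image_ids = cur_image_ids(is_max, :);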

Notable Examples