Project 5 / Face Detection with a Sliding Window

In this project, the task was to implement an object detection system for faces. Given an image, the program should draw a bounding box around every face. The pipeline consists of several parts:

  1. Convert positive face examples to HOG (Histogram of Oriented Gradients) features.
  2. Mine random patches from images with no faces and convert these patches to HOG features.
  3. Train a linear SVM that classifies square patches as faces or non-faces.
  4. Mine hard negatives, convert them to HOG features, and retrain the linear SVM (extra credit).
  5. Run the classifier as a sliding window over test images at multiple scales and highlight all detected faces.

Get positive features

This process is straightforward. Each 36x36 face image is converted to a HOG feature using the vl_hog function. The dimension of each feature is (template_size/hog_cell_size)^2 * 31.
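As a concrete check of that formula (a Python sketch; the project itself uses MATLAB and vl_hog, and the function name here is my own):

```python
def hog_feature_dim(template_size, hog_cell_size, dims_per_cell=31):
    """Length of a flattened HOG template: one 31-D vl_hog cell
    descriptor per cell in the (template/cell) x (template/cell) grid."""
    cells_per_side = template_size // hog_cell_size
    return cells_per_side ** 2 * dims_per_cell

print(hog_feature_dim(36, 6))  # 6*6 cells * 31 dims per cell = 1116
```

With the default 36-pixel template and 6-pixel cells this gives a 1116-dimensional feature; shrinking the cell size to 3 quadruples it to 4464, which is why smaller cells cost so much more running time.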

Get negative features

First, the number of HOG features needed from each image is calculated. Some images are close in size to the template and cannot yield the average number of features, so I perturb the per-image quota with a uniform random factor (up to ±25%) to make sure the requested total can still be met. Then, for each image: convert it to HOG features, randomly select feature locations according to the image's HOG size and the template's HOG size, and then downscale the image to generate features at another scale. The process continues until the image is smaller than the template or the requested number of features has been collected.


while i < num_images && total_samples < num_samples
    disp(i)
    % load the scene and normalize to [0, 1] grayscale
    img = single(rgb2gray(imread(fullfile(non_face_scn_path, image_files(i).name))))/255;
    % perturb the per-image quota by up to +/-25%
    num_hog_rand = hog_per_img*(1+(rand-0.5)*0.5);
    example_mined = 0;
    % number of 0.9x downscales before the image drops below the template;
    % also used as the per-scale sampling quota
    NumPerScales = log(feature_params.template_size/min(size(img)))/log(0.9);
    while example_mined < num_hog_rand && min(size(img))>feature_params.template_size
        hog_tmp = vl_hog(img,feature_params.hog_cell_size);
        num_tmp_scale = 0;
        while num_tmp_scale < NumPerScales
            % sample a random template-sized window in HOG space
            i_row = randsample(size(hog_tmp,1)-feature_params.template_size/feature_params.hog_cell_size+1,1);
            i_col = randsample(size(hog_tmp,2)-feature_params.template_size/feature_params.hog_cell_size+1,1);
            feature_tmp = hog_tmp(i_row:(i_row+feature_params.template_size/feature_params.hog_cell_size-1),...
                i_col:(i_col+feature_params.template_size/feature_params.hog_cell_size-1),:);
            features_neg(total_samples+1,:) = reshape(feature_tmp, [1,dim]);
            num_tmp_scale = num_tmp_scale + 1;
            total_samples = total_samples + 1;
            example_mined = example_mined + 1;
        end
        img = imresize(img,0.9);   % move to the next, smaller scale
    end
    i = i + 1;
end
features_neg = features_neg(1:num_samples,:);   % trim to the requested amount
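The NumPerScales term in the loop above is just the number of times the image can be shrunk by 0.9 before its smaller side falls below the template. A Python sketch of that arithmetic (the function name is my own):

```python
import math

def num_scales(min_image_dim, template_size=36, ratio=0.9):
    """How many downscales by `ratio` fit before the image's smaller
    side drops below template_size; mirrors the MATLAB expression
    log(template_size/min(size(img))) / log(0.9)."""
    return math.log(template_size / min_image_dim) / math.log(ratio)

# e.g. a scene whose smaller side is 300 px allows about 20 downscales:
print(num_scales(300))
```

An image already at template size yields zero scales, so the mining loop skips it and moves on to the next image.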

Train linear SVM classifier

The process is straightforward. Combine the positive and negative features together and generate the corresponding label set. vl_svmtrain then computes the weight vector and bias. After several experiments, I settled on lambda = 0.0001.
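For intuition, the training step can be sketched as sub-gradient descent on the L2-regularized hinge loss (a toy NumPy sketch of what a linear SVM solver such as vl_svmtrain optimizes, not its actual algorithm; all names here are my own):

```python
import numpy as np

def train_linear_svm(X, y, lam=1e-4, lr=0.1, epochs=200, seed=0):
    """Minimal sub-gradient descent on the regularized hinge loss.
    X is (n, d) features, y is +/-1 labels; returns weights and bias
    so that confidence = X @ w + b, as used by the detector."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            if y[i] * (X[i] @ w + b) < 1:      # margin violated: hinge active
                w += lr * (y[i] * X[i] - lam * w)
                b += lr * y[i]
            else:                               # only the regularizer acts
                w -= lr * lam * w
    return w, b

# toy data: positives clustered near +1, negatives near -1 in each dim
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1, 0.3, (50, 2)), rng.normal(-1, 0.3, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])
w, b = train_linear_svm(X, y)
acc = np.mean(np.sign(X @ w + b) == y)
```

The lambda parameter plays the same regularization role as in vl_svmtrain: larger values shrink the weights and trade training accuracy for generalization.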

Get Hard negative features [EXTRA]

Hard negative mining as described in Dalal and Triggs (2005) is also implemented. Using the learned SVM weight and bias, each non-face image is repeatedly rescaled and the confidence of every window is computed with the sliding-window technique. Windows with positive confidence would be false positives, so their features are appended to the negative feature list. Using the new negative features, the positive features, and the corresponding label set, a second pair of SVM weights and bias is trained.


for i = 1:length(test_scenes)
    img = imread( fullfile( non_face_scn_path, test_scenes(i).name ));
    img = single(img)/255;
    if(size(img,3) > 1)
        img = rgb2gray(img);
    end
    resize_ratio = 1.2;
    ratio = 0.9;
    % slide the learned template over every scale of the face-free scene
    while (min(size(img)) > feature_params.template_size)
        hog_tmp = vl_hog(img, feature_params.hog_cell_size);
        for i_row = hog_per_template : size(hog_tmp,1)
            for i_col = hog_per_template : size(hog_tmp,2)
                feature_tmp = hog_tmp((i_row - hog_per_template + 1):i_row,...
                    (i_col - hog_per_template + 1):i_col, :);
                feature_tmp = reshape(feature_tmp, [1 dim]);
                conf = feature_tmp*w+b;
                if (conf > thres)
                    % a confident detection in a face-free image is a
                    % false positive, i.e. a hard negative
                    features_hard_neg = [features_hard_neg;feature_tmp];
                end
            end
        end
        img = imresize(img, ratio);
        resize_ratio = resize_ratio * ratio;
    end
end

Sliding window detection

I first upscale the image by a factor of 1.2, in case the image is smaller than the template.

A sliding window is used to determine whether there is a potential detection: the previously trained linear classifier is applied to the HOG cells inside each window to compute a confidence. If the confidence is greater than the threshold, the window is recorded as a positive detection and added to the list. After the sliding window finishes, non-maximum suppression is used to remove nearby duplicates. Since image sizes and the sizes of the face patterns within each image vary, it is necessary to detect at multiple scales: I shrink the image by a factor of 0.95 each time until the image is smaller than the template.


    while min(size(img_resize))>feature_params.template_size
        hog_tmp = vl_hog(img_resize,feature_params.hog_cell_size);
        for i_row = hog_per_template : size(hog_tmp,1)
            for i_col = hog_per_template : size(hog_tmp,2)
                feature_tmp = hog_tmp((i_row - hog_per_template+1):i_row, ...
                    (i_col - hog_per_template+1):i_col, :);
                feature_tmp = reshape(feature_tmp, [1 dim]);
                conf = feature_tmp*w+b;
                if conf > thres
                    % map the window back to original-image pixel coordinates
                    bbox = [(i_col - hog_per_template+1),...
                        (i_row - hog_per_template+1),i_col,i_row]/resize_ratio*...
                        feature_params.hog_cell_size;
                    cur_bboxes = [cur_bboxes; bbox];
                    cur_confidences = [cur_confidences; conf];
                    cur_image_ids = [cur_image_ids; {test_scenes(i).name}];
                end
            end
        end
        img_resize = imresize(img_resize, ratio);
        resize_ratio = resize_ratio*ratio;
    end
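The duplicate-removal step can be sketched as greedy IoU-based non-maximum suppression (a Python sketch; the assignment's starter code supplies its own suppression routine, and the function and variable names here are my own):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression. boxes is (n, 4) as
    [x1, y1, x2, y2]; repeatedly keeps the highest-scoring box and
    discards remaining boxes whose IoU with it exceeds iou_thresh."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # intersection of the kept box with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]
    return keep

# two heavily overlapping detections plus one far away
boxes = np.array([[0, 0, 36, 36], [2, 2, 38, 38], [100, 100, 136, 136]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # the overlapping pair collapses to one box
```

Because detections from neighboring window positions and adjacent scales overlap heavily, this step is what keeps one box per face in the final output.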

Observation

Accuracy varies with the number of negative examples. In general, the more negative examples, the better the accuracy, and hard negative mining tends to improve accuracy more effectively when the number of negative examples is small.

neg = 20000 accuracy = 88.3 hard_accuracy = 89.5

neg = 10000 accuracy = 86.5 hard_accuracy = 86.0

neg = 5000 accuracy = 85.9 hard_accuracy = 87.6

I also varied the HOG cell size, and found that a smaller cell size does not improve accuracy despite the increased running time.

neg = 20000 hog_cell = 3, accuracy = 85.3

neg = 20000 hog_cell = 6, accuracy = 89.5

Results

Face template HoG visualization.

Precision-Recall curve.

Example of detections on the test set.
