Project 5 / Face Detection with a Sliding Window

Computer vision is an area of research concerned with enabling computers to see the world the way humans do, even though they work with a very different domain of data. Humans spend a lot of time looking at other humans, and we are very good at recognizing one another, so we would like computers to be good at this task as well. In this project we build a face detector by implementing the sliding window detector of Dalal and Triggs 2005.

To determine the success of our creation, we test our face detector against the CMU+MIT test set. Ideally, it will find the faces that are there while raising few false alarms.

Training

The most important part of any object detector is training it properly. Training involves choosing a feature representation, gathering positive and negative feature examples, and training a classifier on those features.

Feature Representation

Like Dalal and Triggs 2005, our detector uses a Histogram of Oriented Gradients (HoG) feature representation. After training, our HoG template looked like this:

Looking at the template, one can see the outline of a face beginning to form. The eyes, the nose, and the shape of the head can all be made out, even if they are faint.
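For reference, VLFeat can render a HoG array back into a glyph image like the one above. Here is a minimal sketch, assuming the learned SVM weights w (see the Classifier section) are reshaped back to the template's HoG dimensions and that the default 31-dimensional vl_hog cells are used:

    % Sketch: render the learned template as a HoG glyph (assumes w and
    % feature_params from training; 31 is vl_hog's default cell dimension).
    n_cells = feature_params.template_size / feature_params.hog_cell_size;
    hog_template = single(reshape(w, [n_cells, n_cells, 31]));
    imagesc(vl_hog('render', hog_template));
    colormap gray; axis image;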

Positive Training

Positive features were gathered from an edited version of the Caltech Web Faces project. Each cropped face image was converted to HoG feature space and added to the list of positive HoG features.
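As a rough sketch of that conversion (assuming the crops live in a directory train_path_pos as grayscale images the size of the template, with one feature row per image):

    % Sketch: convert each positive face crop to a HoG feature vector.
    % train_path_pos is an assumed directory of template-sized crops.
    image_files = dir(fullfile(train_path_pos, '*.jpg'));
    n_cells = feature_params.template_size / feature_params.hog_cell_size;
    features_pos = zeros(length(image_files), n_cells^2 * 31, 'single');
    for i = 1:length(image_files)
        img = imread(fullfile(train_path_pos, image_files(i).name));
        % normalize to [0, 1] to match the detection pipeline below
        hog = vl_hog(single(img) / 255, feature_params.hog_cell_size);
        features_pos(i, :) = hog(:);
    end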

Negative Training

Gathering negative features is a little more interesting. As with most computer vision tasks, the more negative examples you have, the better the performance you can expect. To gather a given number of negative HoG examples, a directory of face-free scene images is supplied. For each image in the directory, patches are sampled at several scales and converted to HoG feature space. The code for that can be seen here:


    for i = 1:num_images
        img = imread(fullfile(non_face_scn_path, image_files(i).name));
        
        % convert to grayscale so the patches match the HoG pipeline
        if size(img, 3) > 1
            img = rgb2gray(img);
        end
        
        cur_sample_count = 0;
        
        while cur_sample_count < samples_per_image
            % stop once the image is too small to keep sampling from
            if size(img, 1) < feature_params.template_size * 2 || ...
                    size(img, 2) < feature_params.template_size * 2
                break;
            end
            
            % pick random top-left corners for this scale's patches
            rand_xs = randi([1, size(img, 2) - feature_params.template_size], ...
                    SAMPLES_PER_SCALE, 1);
            rand_ys = randi([1, size(img, 1) - feature_params.template_size], ...
                    SAMPLES_PER_SCALE, 1);
            
            for j = 1:SAMPLES_PER_SCALE
                x = rand_xs(j);
                y = rand_ys(j);
                
                % crop a patch exactly template_size pixels on a side
                samp = img(y : y + feature_params.template_size - 1, ...
                    x : x + feature_params.template_size - 1);
                
                % normalize to [0, 1] to match the detection pipeline
                hog = vl_hog(single(samp) / 255, feature_params.hog_cell_size);
                
                features_neg(samples_count, :) = hog(:);
                
                cur_sample_count = cur_sample_count + 1;
                samples_count = samples_count + 1;
            end
            
            % shrink the image and sample again at the next scale
            img = imresize(img, SCALE_FACTOR);
        end 
    end

Classifier

Once all the training data has been gathered, a classifier can be trained. For this project, a linear SVM classifier was used with a lambda value of 0.001.
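VLFeat's vl_svmtrain can fit such a classifier. A minimal sketch, assuming features_pos and features_neg hold one HoG example per row as in the snippets above:

    % Sketch: train the linear SVM. vl_svmtrain expects one example per
    % column and labels of +1 (face) / -1 (non-face).
    X = [features_pos; features_neg]';
    Y = [ones(size(features_pos, 1), 1); -ones(size(features_neg, 1), 1)];
    lambda = 0.001;
    [w, b] = vl_svmtrain(X, Y, lambda);

Lambda controls the regularization strength: smaller values fit the training data more tightly at the risk of overfitting.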

Detection

Once a classifier has been trained, it's time to use it on testing data. A sliding window detector was run over a variety of test images. Each image is converted to HoG feature space; the detector then slides a template-sized window across the resulting HoG representation and sends each window through the classifier. If the resulting confidence is above a certain threshold, the window is declared a face. After all windows are tested, the image is shrunk by a scaling factor and the windows are tested again. Each image is rescaled until it is too small to be tested against the template. The code for the detector can be seen here:


    for i = 1:length(test_scenes)
          
        fprintf('Detecting faces in %s\n', test_scenes(i).name)
        img = imread(fullfile(test_scn_path, test_scenes(i).name));
        img = single(img) / 255;
        if size(img, 3) > 1
            img = rgb2gray(img);
        end
        
        % window size in HoG cells
        window_width = feature_params.template_size / feature_params.hog_cell_size;
        window_height = feature_params.template_size / feature_params.hog_cell_size;
        scale = 1;
 
        cur_bboxes = zeros(0, 4);
        cur_confidences = zeros(0, 1);
        cur_image_ids = cell(0, 1);
        
        for scale_count = 1:15
            scaled_img = imresize(img, scale);
            
            % stop once the image is smaller than the template
            if size(scaled_img, 1) < feature_params.template_size || ...
                    size(scaled_img, 2) < feature_params.template_size
                break;
            end
            
            img_hog = vl_hog(scaled_img, feature_params.hog_cell_size);
            
            % slide the template over every window position in HoG space
            for j = 1:size(img_hog, 1) - window_height + 1
                for k = 1:size(img_hog, 2) - window_width + 1
                    sample = img_hog(j:j+window_height-1, k:k+window_width-1, :);  
                    
                    confidence = w' * sample(:) + b;
                    
                    if confidence > CONFIDENCE_THRESHOLD
                        % map the window back to pixel coordinates in the
                        % original, unscaled image
                        x_min = floor((k * feature_params.hog_cell_size) / scale);
                        y_min = floor((j * feature_params.hog_cell_size) / scale);
                        x_max = floor((k * feature_params.hog_cell_size + ...
                                feature_params.template_size) / scale);
                        y_max = floor((j * feature_params.hog_cell_size + ...
                                feature_params.template_size) / scale);
                        
                        cur_bboxes = [cur_bboxes; [x_min, y_min, x_max, y_max]];
                        cur_confidences = [cur_confidences; confidence];
                        cur_image_ids = [cur_image_ids; test_scenes(i).name];
                    end
                end
            end
                    
            scale = scale * SCALE_FACTOR;
        end

        % keep only locally-maximal detections
        [is_maximum] = non_max_supr_bbox(cur_bboxes, cur_confidences, size(img));
 
        cur_confidences = cur_confidences(is_maximum, :);
        cur_bboxes      = cur_bboxes(is_maximum, :);
        cur_image_ids   = cur_image_ids(is_maximum, :);
  
        bboxes      = [bboxes;      cur_bboxes];
        confidences = [confidences; cur_confidences];
        image_ids   = [image_ids;   cur_image_ids];
    end

Parameters

Number of Negative Samples

200,000 was chosen as the number of negative samples. Since all of the negative features had to be held in memory at once, there was an upper limit: MATLAB could not allocate the feature matrix for 1,000,000 samples. Since gathering more negative features tends to increase accuracy, further work could be done to handle larger sample counts.
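A back-of-the-envelope estimate shows why: with a 36-pixel template and 3-pixel cells, each vl_hog feature has (36/3)^2 * 31 = 4,464 dimensions, so the negative feature matrix alone occupies several gigabytes:

    % Rough memory footprint of the negative feature matrix (single precision)
    dims = (36 / 3)^2 * 31;                            % 4,464 values per sample
    fprintf('%.1f GB\n', 200000  * dims * 4 / 2^30);   % ~3.3 GB at 200,000 samples
    fprintf('%.1f GB\n', 1000000 * dims * 4 / 2^30);   % ~16.6 GB at 1,000,000 samples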

HoG Cell Size

A HoG cell size of 3 was chosen. This slowed detection compared to the initial value of 6, but increased accuracy from around 88% to around 91%.

Scaling Factor

A scaling factor of 0.9 was chosen. Again, this slowed performance, but increased accuracy from around 83% to around 88%.

Confidence Threshold

The confidence threshold was set to 0.6. A higher value would decrease accuracy, but would also reduce false positives. Depending on the situation, fewer false positives could be more desirable, but for this project, higher accuracy was favored.

Results

An average precision of 91% was achieved.

Good Results

Bad Results

Bad results tended to fall into three categories: false positives, false negatives, or both.

False Positives

False positives occurred when a face was detected where no face truly existed.

False Negatives

False negatives occurred when a face was truly there but was not detected.

Questionable Results

Like most computer vision tasks, the goal is to be as good as humans. However, humans are not always perfect at recognition tasks. Just as some things may confuse humans, they can also confuse an algorithm, as shown below.

Our detector was more of a head detector than a true face detector. Faces not attached to actual heads can confuse it.
When asked to find faces, humans might not include Minnie Mouse, but Minnie definitely has a face, and while it might not be a human face, it has human-like qualities.
A baby's face has different qualities than an adult face. A separate detector might need to be created to find baby faces.
If a face is partially occluded, it is harder to detect. A half-face or quarter-face detector might need to be created to handle occluded faces.
Noise can confuse a face detector, as can faces whose features have unusual proportions.

Lastly, here is the result of face detection on a picture of our class. One image is easy, and one is hard in that we were attempting to hide our faces.