Project 5 / Face Detection with a Sliding Window

Training data for face detection

Positive training examples were taken from a database of 6,713 cropped 36x36 faces from the Caltech Web Faces project. Negative examples were obtained by randomly cropping 10,000 36x36 patches from scene images in Wu et al. and the SUN scene database.
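A minimal sketch of how one such negative patch might be cropped is shown below. The path and file-list variable names (train_path_neg, image_files) are borrowed from the positive-example loop in the next section and are assumptions, not the project's exact code.

%Random negative patch (illustrative sketch; variable names are assumptions)
img = imread(fullfile(train_path_neg, image_files(i).name));
if size(img,3) == 3
    img = rgb2gray(img);        % work in grayscale, as for the positive examples
end
img = single(img) / 255;
r = randi(size(img,1) - 35);    % top-left row, chosen so the patch stays inside the image
c = randi(size(img,2) - 35);    % top-left column
patch = img(r:r+35, c:c+35);    % one 36x36 negative training patch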

HoG features

Histogram of Oriented Gradients (HoG) is the feature used to characterize faces in the detection pipeline. The figure below shows an example of a learned template for facial features.

%HoG features
D = (feature_params.template_size / feature_params.hog_cell_size)^2 * 31;  % HoG feature dimensionality
features_pos = zeros(num_images, D);
for i=1:num_images
    f = fullfile(train_path_pos,image_files(i).name);
    I = single(imread(f)) / 255;                 % scale pixel values to [0,1]
    hog = vl_hog(I,feature_params.hog_cell_size);
    features_pos(i,:) = reshape(hog,1,D);        % flatten the HoG template into a row vector
end

A linear SVM classifier is trained on the collected positive and negative training examples in HoG feature space. The result is stored in a weight vector and a bias. The graph below shows a visualization of the separation between positive and negative training examples.

%SVM classifier
% X is D x N (one column per example); Y holds +1 for faces and -1 for non-faces
X = [features_pos' features_neg'];
Y = [ones(1,size(features_pos,1)) -ones(1,size(features_neg,1))];
[w,b] = vl_svmtrain(X, Y, lambda);   % lambda = 0.0001 in the experiments below
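The learned template shown in the HoG features section can be reproduced by reshaping the weight vector back into its HoG cell layout and rendering it with VLFeat's vl_hog('render', ...). A minimal sketch, assuming the 31-channel HoG layout used in the feature code above:

%Visualize the learned template (sketch; assumes the 31-channel HoG layout used above)
numCells = feature_params.template_size / feature_params.hog_cell_size;
w_hog = single(reshape(w, [numCells numCells 31]));
imagesc(vl_hog('render', w_hog));
colormap gray; axis image;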

The classifier is applied at test time in the framework of a multiscale sliding window detector. HoG features are first calculated from the test image and then convolved with the SVM weights. The resulting score matrix indicates the confidence of a particular window being detected as a face. The score matrix is then filtered for values that are greater than a pre-defined absolute threshold. Detections that are too close to the boundary are rejected. The derived x and y coordinates are used to form bounding boxes around detected regions. This process is repeated at multiple scales in order to account for faces of different sizes. Finally, non-maximum suppression is used to combine overlapping bounding boxes.

%Multiscale sliding window detector
numCells = feature_params.template_size / feature_params.hog_cell_size;  % cells per template side
l = 0;                                                                    % detection counter for this image
for k=1:feature_params.scales
    hog = vl_hog(img,feature_params.hog_cell_size);
    for x=1:size(hog,1)-numCells+1
        for y=1:size(hog,2)-numCells+1
            % score the current window of HoG cells with the linear SVM
            score = reshape(hog(x:x+numCells-1,y:y+numCells-1,:),1,numCells*numCells*31)*w + b;
            if score > feature_params.abs_threshold
                l = l + 1;
                % convert cell indices to pixel coordinates (HoG rows map to image y)
                x_min = (y-1) * feature_params.hog_cell_size + 1;
                y_min = (x-1) * feature_params.hog_cell_size + 1;
                % map the box back to coordinates in the original, unscaled image
                bb = [x_min,y_min,x_min+feature_params.template_size-1,y_min+feature_params.template_size-1] / (scale_factor ^ (k-1));
                cur_bboxes(l,:) = bb;
                cur_confidences(l) = score;
            end
        end
    end
    img = imresize(img, scale_factor, 'bilinear');  % shrink the image for the next scale
end
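The non-maximum suppression step mentioned above is applied once all scales have been scanned. A minimal sketch, assuming a starter-code style helper non_max_supr_bbox that returns a logical mask of the boxes to keep; the helper's name and signature, as well as the variable original_img, are assumptions:

%Non-maximum suppression over the detections from all scales (sketch)
%original_img denotes the unscaled test image; non_max_supr_bbox is assumed to return a keep-mask
is_maximum = non_max_supr_bbox(cur_bboxes, cur_confidences, size(original_img));
cur_bboxes = cur_bboxes(is_maximum, :);
cur_confidences = cur_confidences(is_maximum);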

Face detection results

The results below are obtained using a multiscale detector with a cell size of 3, a lambda of 0.0001, and an absolute threshold of -0.5. The graphs show average precision and recall versus false positives.
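For reference, these settings correspond to the parameters referenced in the code above; the struct layout is an assumption, and the number of scales and the scale factor are not restated because the report does not list them.

%Detector settings for the results below (struct layout assumed)
feature_params.template_size = 36;     % training faces are 36x36 pixels
feature_params.hog_cell_size = 3;      % HoG cell size
feature_params.abs_threshold = -0.5;   % minimum SVM score to accept a detection
lambda = 0.0001;                       % SVM regularization parameter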

The results are also compared across single-scale and multiscale detectors with different cell sizes.

Detector                       Average precision (AP)
Single scale, cell size = 6    0.331
Multiscale, cell size = 6      0.714
Multiscale, cell size = 4      0.760
Multiscale, cell size = 3      0.762

Hard negative mining (EXTRA CREDIT)

The technique of hard negative mining is used to expose the classifier to challenging negative examples and thereby improve its discriminative ability. A face detector is first trained on randomly sampled negatives and then run over the negative training images; windows that receive positive scores (hard negatives) are used to train a separate classifier. In the table below, the number of negative training samples is fixed at 100, 1000, and 10000, and the average precision is compared between random sampling and hard negative mining; a sketch of the mining step follows the table. Based on the results, hard negative mining does not improve the average precision. This may be because the hard negative examples lower the overall confidence scores, so fewer true positive detections clear the threshold.

# -ve Training samples   Average precision (random sampling)   Average precision (hard negative mining)
100                      0.594                                 0.577
1000                     0.639                                 0.577
10000                    0.665                                 0.537
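A minimal sketch of the mining step, using a single-scale scan for brevity; the loop structure mirrors the detector above, and the variable names (num_neg_images, train_path_neg, features_hard_neg) are illustrative assumptions rather than the project's exact code.

%Hard negative mining (sketch; single-scale scan for brevity, variable names illustrative)
features_hard_neg = [];
for i = 1:num_neg_images
    img = imread(fullfile(train_path_neg, image_files(i).name));
    if size(img,3) == 3
        img = rgb2gray(img);
    end
    img = single(img) / 255;
    hog = vl_hog(img, feature_params.hog_cell_size);
    for x = 1:size(hog,1)-numCells+1
        for y = 1:size(hog,2)-numCells+1
            feat = reshape(hog(x:x+numCells-1, y:y+numCells-1, :), 1, numCells*numCells*31);
            if feat * w + b > 0                      % false positive on a face-free image
                features_hard_neg(end+1, :) = feat;  %#ok<AGROW> hard negative
            end
        end
    end
end
%Retrain a separate classifier using the hard negatives
X = [features_pos' features_hard_neg'];
Y = [ones(1,size(features_pos,1)) -ones(1,size(features_hard_neg,1))];
[w_hard, b_hard] = vl_svmtrain(X, Y, lambda);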

Data augmentation (EXTRA CREDIT)

Data augmentation is used to increase the number of training examples when training data is limited. Here, new training samples are created artificially by applying transformations such as horizontal flipping and brightness scaling to the existing samples; a sketch of the augmentation step follows the results table. The results show that data augmentation does not significantly increase the average precision, which suggests that the original dataset already contains enough variation for these transformations to add little new information.

Detector                               # +ve Training samples   # -ve Training samples   Average precision (AP)
Baseline                               6713                     10000                    0.665
Horizontal Flip                        13426                    20000                    0.636
Brightness Scaling                     13426                    20000                    0.663
Horizontal Flip + Brightness Scaling   20139                    30000                    0.668
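A minimal sketch of the two augmentations, applied to each positive training image before HoG feature extraction; the brightness factor of 1.2 is an illustrative assumption, not the value used for the results above.

%Data augmentation (sketch; the 1.2 brightness factor is an illustrative assumption)
I = single(imread(fullfile(train_path_pos, image_files(i).name))) / 255;
I_flip   = fliplr(I);                % a mirrored face is still a valid face
I_bright = min(I * 1.2, 1);          % brightened copy, clipped to the valid [0,1] range
features_pos(end+1,:) = reshape(vl_hog(I_flip,   feature_params.hog_cell_size), 1, D);
features_pos(end+1,:) = reshape(vl_hog(I_bright, feature_params.hog_cell_size), 1, D);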