Project 5: Face detection with a sliding window


Get positive features

This function should return all positive training examples (faces). Each face is converted into a HOG [1] template according to 'feature_params'.


	image_files = dir( fullfile( train_path_pos, '*.jpg') ); %Caltech Faces stored as .jpg

	N = length(image_files);
	%each HOG cell yields a 31-dimensional descriptor in VLFeat
	D = (feature_params.template_size / feature_params.hog_cell_size)^2 * 31;
	features_pos = zeros(N, D);

	for i = 1 : N
	    fprintf('get positive feature of image %d\n', i);
	    image = im2single(imread(fullfile(train_path_pos, image_files(i).name)));
	    %one HOG template per cropped face image
	    hog_feature = vl_hog(image, feature_params.hog_cell_size);
	    hog_feature = reshape(hog_feature, [1, D]);
	    features_pos(i, :) = hog_feature;
	end

Get random negative features

This function should return negative training examples (non-faces), selected at random. I extract roughly the same number of features from each face-free image.


	image_files = dir( fullfile( non_face_scn_path, '*.jpg' ));

	N = length(image_files);
	D = (feature_params.template_size / feature_params.hog_cell_size)^2 * 31;
	nSamplePerImage = ceil(num_samples / N);
	%preallocate for the worst case; unused rows are trimmed at the end
	features_neg = zeros(N * nSamplePerImage, D);
	index = 0;

	for i = 1 : N
	    fprintf('get negative non-face feature of image %d\n', i);
	    image = imread(fullfile(non_face_scn_path, image_files(i).name));
	    if size(image, 3) > 1 %guard: some scene images are already grayscale
	        image = rgb2gray(image);
	    end
	    image = im2single(image);
	    %largest valid top-left corner of a template-sized patch
	    maxPatchXIndex = size(image, 2) - feature_params.template_size;
	    maxPatchYIndex = size(image, 1) - feature_params.template_size;
	    sampleNum = min([nSamplePerImage, maxPatchXIndex, maxPatchYIndex]);
	    if sampleNum < 1
	        continue; %image too small to hold a template
	    end
	    %sample patch corners without replacement
	    samplePatchXindex = randsample(maxPatchXIndex, sampleNum);
	    samplePatchYindex = randsample(maxPatchYIndex, sampleNum);
	    for j = 1 : sampleNum
	        patch = image(samplePatchYindex(j) : samplePatchYindex(j) + feature_params.template_size - 1, ...
	                      samplePatchXindex(j) : samplePatchXindex(j) + feature_params.template_size - 1);
	        hog_feature = vl_hog(patch, feature_params.hog_cell_size);
	        index = index + 1;
	        features_neg(index, :) = reshape(hog_feature, [1, D]);
	    end
	end
	features_neg = features_neg(1 : index, :); %drop unused preallocated rows

Train Classifier

I train a linear SVM with vl_svmtrain and choose lambda = 0.0001.


	%stack positive (+1) and negative (-1) examples
	X = cat(1, features_pos, features_neg);
	Y = cat(1, ones(size(features_pos, 1), 1), -1 * ones(size(features_neg, 1), 1));
	lambda = 0.0001;
	%vl_svmtrain expects one example per column, hence the transposes
	[w, b] = vl_svmtrain(X', Y', lambda);
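
As a quick sanity check (not part of the required pipeline), the learned weights can be rendered as a HOG glyph with VLFeat's built-in visualization; if training worked, the template looks vaguely face-like. A minimal sketch, reusing w and feature_params from above:

	%reshape the weight vector back into its cell grid and render it
	%(sketch; assumes the same column-major reshape order used above)
	nCell = feature_params.template_size / feature_params.hog_cell_size;
	w_hog = reshape(single(w), [nCell, nCell, 31]);
	imagesc(vl_hog('render', w_hog)); colormap gray; axis image;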

Run detector

I choose a confidence threshold of 0.2. HOG features are extracted at scales [1.0, 0.8, 0.6, 0.5, 0.25, 0.125]. When the image is scaled, the coordinates of each detection bounding box must be mapped back to the original image by dividing by the scale.


	test_scenes = dir(fullfile(test_scn_path, '*.jpg'));
	
	%initialize these as empty and incrementally expand them.
	bboxes = zeros(0, 4);
	confidences = zeros(0, 1);
	image_ids = cell(0, 1);

	threshold = 0.2;
	nCellPerTemplate = feature_params.template_size / feature_params.hog_cell_size;
	D = nCellPerTemplate ^ 2 * 31;
	Scales = [1.0, 0.8, 0.6, 0.5, 0.25, 0.125];

	for i = 1 : length(test_scenes)
	    fprintf('Detecting faces in %s\n', test_scenes(i).name)
	    image = imread(fullfile( test_scn_path, test_scenes(i).name));
	    image = single(image) / 255;
	    if(size(image,3) > 1)
	        image = rgb2gray(image);
	    end
  
	    cur_x_min = []; cur_y_min = [];
	    cur_x_max = []; cur_y_max = [];
	    cur_confidences = [];
	    for s = 1 : length(Scales)
	        scaled_image = imresize(image, Scales(s));
	        %HOG for the entire scaled image; slide a template-sized window
	        %over the cell grid
	        hog_features = vl_hog(scaled_image, feature_params.hog_cell_size);
	        for j = 1 : size(hog_features, 1) - nCellPerTemplate
	            for k = 1 : size(hog_features, 2) - nCellPerTemplate
	                template_hog_feature = hog_features(j : j + nCellPerTemplate - 1, k : k + nCellPerTemplate - 1, :);
	                template_hog_feature = reshape(template_hog_feature, [1, D]);
	                %linear SVM score for this window
	                score = template_hog_feature * w + b;
	                if score > threshold
	                    %cell indices -> pixel coordinates in the scaled image
	                    y_min = (j - 1) * feature_params.hog_cell_size;
	                    x_min = (k - 1) * feature_params.hog_cell_size;
	                    y_max = y_min + feature_params.template_size - 1;
	                    x_max = x_min + feature_params.template_size - 1;
	                    %map back to the original image by dividing by the scale
	                    y_min = floor(y_min / Scales(s)) + 1; x_min = floor(x_min / Scales(s)) + 1;
	                    y_max = floor(y_max / Scales(s)) + 1; x_max = floor(x_max / Scales(s)) + 1;
	                    if x_max > size(image, 2) || y_max > size(image, 1)
	                        fprintf('j: %d  k: %d\n', j, k);
	                        fprintf('image size: %d %d\n', size(image, 1), size(image, 2));
	                        fprintf('x_min: %d  y_min: %d  y_max: %d  x_max: %d\n', x_min, y_min, y_max, x_max);
	                        error('out of bound!');
	                    else 
	                        cur_x_min = [cur_x_min; x_min]; cur_y_min = [cur_y_min; y_min];
	                        cur_x_max = [cur_x_max; x_max]; cur_y_max = [cur_y_max; y_max];
	                        cur_confidences = [cur_confidences; score];
	                    end
	                end
	            end
	        end
	    end
	    cur_bboxes = [cur_x_min, cur_y_min, cur_x_max, cur_y_max];
	    %rebuild the id list for this image so stale entries from a
	    %previous image cannot leak through
	    cur_image_ids = repmat({test_scenes(i).name}, size(cur_bboxes, 1), 1);
	    
	    if size(cur_bboxes, 1) ~= 0
	        [is_maximum] = non_max_supr_bbox(cur_bboxes, cur_confidences, size(image));
	        cur_confidences = cur_confidences(is_maximum,:);
	        cur_bboxes      = cur_bboxes(     is_maximum,:);
	        cur_image_ids   = cur_image_ids(  is_maximum,:);
	        bboxes      = [bboxes; cur_bboxes];
	        confidences = [confidences; cur_confidences];
	        image_ids   = [image_ids;   cur_image_ids];
	    else
	        fprintf('0 detection\n');
	    end
	end


Results


Accuracy


hog_cell_size = 6

hog_cell_size = 4

hog_cell_size = 3

Some Examples





Extra credit


1. Hard negative mining

I extract hard examples at the same scales as in run_detector.m. To avoid extracting too many hard examples, I set nMax to limit their number.

	image_files = dir( fullfile( non_face_scn_path, '*.jpg' ));
	nCellPerTemplate = feature_params.template_size / feature_params.hog_cell_size;
	D = nCellPerTemplate ^ 2 * 31;
	nMax = 5000;
	features_neg_hard = zeros(nMax, D);
	Scales = [1.0, 0.8, 0.6, 0.5, 0.25, 0.125];
	threshold = 0.2;
	index = 1;
	for i = 1 : length(image_files)
	    if index > nMax
	        break; %collected enough hard negatives
	    end
	    fprintf('get hard negative feature of image %d\n', i);
	    image = imread(fullfile(non_face_scn_path, image_files(i).name));
	    if size(image, 3) > 1 %guard: some scene images are already grayscale
	        image = rgb2gray(image);
	    end
	    image = im2single(image);
	    for s = 1 : length(Scales)
	        scaled_image = imresize(image, Scales(s));
	        hog_features = vl_hog(scaled_image, feature_params.hog_cell_size);
	        for j = 1 : size(hog_features, 1) - nCellPerTemplate
	            for k = 1 : size(hog_features, 2) - nCellPerTemplate
	                template_hog_feature = hog_features(j : j + nCellPerTemplate - 1, k : k + nCellPerTemplate - 1, :);
	                template_hog_feature = reshape(template_hog_feature, [1, D]);
	                score = template_hog_feature * w + b;
	                %a confident detection on a face-free image is a hard negative
	                if score > threshold && index <= nMax
	                    features_neg_hard(index, :) = template_hog_feature;
	                    index = index + 1;
	                end
	            end
	        end
	    end
	end
	features_neg_hard = features_neg_hard(1 : index - 1, :);
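
Mining by itself does not change the detector; the SVM has to be retrained with the mined features appended to the negative set. A minimal sketch of that step, reusing features_pos, features_neg, and lambda from the training section:

	%retrain the linear SVM on the enlarged negative set (sketch)
	X = cat(1, features_pos, features_neg, features_neg_hard);
	Y = cat(1, ones(size(features_pos, 1), 1), ...
	        -1 * ones(size(features_neg, 1) + size(features_neg_hard, 1), 1));
	[w, b] = vl_svmtrain(X', Y', lambda);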

In my experiment, hard negative mining does not improve the accuracy. In fact, it even reduces the accuracy, even though I restrict the number of random negatives to 5000.

Including hard negatives

Not including hard negatives

Perhaps learning from the hard negatives helps exclude false positives, but at the same time it also rejects some true positives. Since the accuracy computation provided in the template code doesn't penalize false positives, suppressing them brings no benefit to the reported accuracy, while the lost true positives lower it.

With hard negatives

Without hard negatives


2. Utilize alternative positive training data

I flip and rotate the original face images and extract HOG features from the transformed copies to augment the positive training data.

	%rotate_angle holds the rotation angles used for augmentation
	rotate_image = imrotate(image, rotate_angle(j), 'bilinear', 'crop');
	hog_feature1 = vl_hog(rotate_image, feature_params.hog_cell_size);
	%a horizontal flip gives a second augmented example
	flip_image = fliplr(image);
	hog_feature2 = vl_hog(flip_image, feature_params.hog_cell_size);
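
For completeness, here is a sketch of how the augmented descriptors could be appended to the positive feature matrix; the row-growing below is illustrative bookkeeping, not the exact code used (it reuses D from get_positive_features.m):

	%append the rotated and flipped descriptors as extra positive rows
	%(sketch; growing the matrix row by row is simple but not the fastest)
	features_pos(end + 1, :) = reshape(hog_feature1, [1, D]);
	features_pos(end + 1, :) = reshape(hog_feature2, [1, D]);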

Augmentation may not show an improvement when the original method already achieves high accuracy at a low threshold. So, to see whether augmentation can increase the accuracy, I set a high threshold of 1.5 to push the baseline accuracy down.


Augmented

Original


References

[1] Navneet Dalal and Bill Triggs. Histograms of Oriented Gradients for Human Detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.