Project 5 / Face Detection with a Sliding Window


Table of Contents:

Motivation:

The sliding window plays an integral role in object detection, as it allows us to localize exactly where in an image an object resides. A sliding window is a rectangular region of fixed height and width that slides across an image. A visualization of the process is as follows:

The sliding window will effectively allow us to independently classify all image patches as being object or non-object. For faces, the sliding window detector is one of the most noticeable successes of computer vision. Navneet Dalal and Bill Triggs' Histograms of Oriented Gradients for Human Detection outlines a simple but effective sliding window detection algorithm, which we adapt here to face detection. Dalal-Triggs focuses on representation more than learning and introduces the SIFT-like Histogram of Oriented Gradients (HoG) representation. The feature extraction and object detection pipeline that Dalal-Triggs introduced is as follows:

Let us quickly overview this pipeline:

Project Outline:

In our past projects we have implemented a SIFT descriptor, and thus we will not implement the SIFT-like Histogram of Oriented Gradients representation. However, we will implement the rest of the pipeline: handling heterogeneous training and testing data, training a linear classifier (a HoG template), and using our classifier to classify millions of sliding windows at multiple scales.

Code Details:

Our project will consist of the following MATLAB files:


Let us run proj5.m without any implementation and observe the initial results:

Initial classifier performance on train data:

accuracy: 0.500
true positive rate: 0.500
false positive rate: 0.500
true negative rate: 0.000
false negative rate: 0.000

Plots: Precision/Recall curve; Recall/False Positives curve.

The precision/recall chart plots precision against recall. Precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved. We can quantify precision and recall as follows:

$$\text{Precision} = \frac{\text{true positives}}{\text{true positives + false positives}}$$

$$\text{Recall} = \frac{\text{true positives}}{\text{true positives + false negatives}}$$

The trade-off between precision and recall can be observed using a precision-recall curve.
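As a quick worked example (the counts below are invented purely to illustrate the formulas, not measured from this project), both quantities come straight from the confusion counts:

```matlab
% Illustrative counts only (made up for this example).
tp = 80;    % detections that really are faces
fp = 20;    % detections that are not faces
fn = 40;    % faces the detector missed

precision = tp / (tp + fp)   % = 0.8000
recall    = tp / (tp + fn)   % ~ 0.6667
```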

In our precision-recall curve we see only a tiny curve because a random detector performs poorly. It is very difficult to randomly guess face locations, unlike scene classification, where a random guess has a $\frac{1}{15}\approx 7\%$ chance of being correct. So both precision and recall are very low, and the average precision is essentially 0.000. This is to be expected, though, as we have not yet implemented any part of the pipeline.

Let us now implement get_positive_features.m and see how our training data statistics change.

Get_Positive_Features:

Summary:

This function will return all positive training examples (faces) from 36x36 images in 'train_path_pos'. Each face will be converted into a HoG template according to 'feature_params'.

Code Details:
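A minimal sketch of what this function might look like is below. It assumes VLFeat's vl_hog is on the path and that feature_params carries template_size and hog_cell_size fields (as in the starter code); the loop is illustrative rather than a copy of the submitted version.

```matlab
function features_pos = get_positive_features(train_path_pos, feature_params)
% Convert every 36x36 face crop in train_path_pos into a flattened HoG vector.
image_files = dir(fullfile(train_path_pos, '*.jpg'));
cell_size   = feature_params.hog_cell_size;
template    = feature_params.template_size;              % 36 pixels
dim         = (template / cell_size)^2 * 31;             % vl_hog uses 31-d cell histograms
features_pos = zeros(length(image_files), dim, 'single');

for i = 1:length(image_files)
    img = im2single(imread(fullfile(train_path_pos, image_files(i).name)));
    if size(img, 3) > 1
        img = rgb2gray(img);                             % HoG on the grayscale crop
    end
    hog = vl_hog(img, cell_size);                        % (36/cell) x (36/cell) x 31
    features_pos(i, :) = reshape(hog, 1, dim);           % one row per positive example
end
end
```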

Let us now run proj5.m with the newly implemented get_positive_features.m function. The results are as follows:

Initial classifier performance on train data:

accuracy: 0.985
true positive rate: 0.985
false positive rate: 0.015
true negative rate: 0.000
false negative rate: 0.000

Plots: Precision/Recall curve; Recall/False Positives curve.

Compared to the unimplemented baseline, we see that our training accuracy is much better, while our average precision is still 0.000 because we have not yet run anything on the test data. We also note that our true positive rate increased and our false positive rate decreased, while our true negative rate and false negative rate remained the same.

Let us now move on to implement get_random_negative_features.m.

Get_Random_Negative_Features:

Summary:

This function will return negative training examples (non-faces) from any images in 'non_face_scn_path'.

Code Details:
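A hedged sketch of the negative sampler, under the same assumptions as above (VLFeat on the path, the same feature_params fields); the sampling scheme below is one straightforward choice, not necessarily the exact submitted code.

```matlab
function features_neg = get_random_negative_features(non_face_scn_path, feature_params, num_samples)
% Sample random template-sized patches from face-free scenes and HoG-encode them.
image_files = dir(fullfile(non_face_scn_path, '*.jpg'));
cell_size   = feature_params.hog_cell_size;
template    = feature_params.template_size;
dim         = (template / cell_size)^2 * 31;
per_image   = ceil(num_samples / length(image_files));
features_neg = zeros(per_image * length(image_files), dim, 'single');

idx = 0;
for i = 1:length(image_files)
    img = im2single(imread(fullfile(non_face_scn_path, image_files(i).name)));
    if size(img, 3) > 1, img = rgb2gray(img); end
    [h, w] = size(img);
    for s = 1:per_image
        r = randi(h - template + 1);                      % random top-left corner
        c = randi(w - template + 1);
        patch = img(r:r+template-1, c:c+template-1);
        idx = idx + 1;
        features_neg(idx, :) = reshape(vl_hog(patch, cell_size), 1, dim);
    end
end
features_neg = features_neg(1:min(idx, num_samples), :);  % trim to the requested count
end
```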

Let us now run proj5.m with the newly improved get_random_negative_features.m function. The results are as follows:

Initial classifier performance on train data:

accuracy: 0.405
true positive rate: 0.405
false positive rate: 0.595
true negative rate: 0.000
false negative rate: 0.000

Plots: Precision/Recall curve; Recall/False Positives curve.

We see that our average precision is still 0, as we have not yet implemented classifier training or testing on our test data. Once we do this we should see a boost in precision. We notice that our training accuracy went down to 0.405, our true positive rate went down, and our false positive rate went up. This is fine, as we still have more files to implement. Let us now implement our classifier training and examine how the accuracy changes.

Classifier Training:

Summary:

We will use vl_svmtrain on our training features to get a linear classifier specified by w and b. We will implement this in the file proj5.m in the section Step 2. Train Classifier.

Code Details:
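A minimal sketch of the training step (variable names are illustrative; the 0.0005 lambda is the value settled on later in this report):

```matlab
% vl_svmtrain expects one example per COLUMN and labels in {+1, -1}.
features = [features_pos; features_neg]';                 % D x N, single
labels   = [ones(size(features_pos, 1), 1); ...
            -ones(size(features_neg, 1), 1)];
lambda   = 0.0005;                                        % SVM regularization strength
[w, b]   = vl_svmtrain(features, labels, lambda);         % linear classifier: w' * x + b
```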

Let us now run proj5.m and take a look at the output:

Initial classifier performance on train data:

accuracy: 0.999
true positive rate: 0.404
false positive rate: 0.000
true negative rate: 0.595
false negative rate: 0.001

Plots: Precision/Recall curve; Recall/False Positives curve.

We see that our accuracy has increased, but our average precision is still zero. Let us implement run_detector.m next; then we can run the full pipeline, tune parameters, and finally obtain a non-zero average precision.

Run Detector:

Summary:

This function returns detections on all of the images in a given path. We apply non-maximum suppression to our detections on a per-image basis to improve performance. Initially the code returns random bounding boxes in each test image. However, we will change it so that it converts each test image to HoG feature space with a single call to vl_hog for each scale. We will then step over the HoG cells, take groups of cells that are the same size as our learned template, and classify them. If the classification confidence is above some threshold, we keep the detection and then pass all of the detections for an image to non-maximum suppression.

Code Details:
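A condensed sketch of the per-image detection loop (the scale list, the threshold variable, and the bookkeeping are illustrative; the non-maximum suppression call assumes the helper bundled with the starter code). The real run_detector.m wraps this in a loop over every test image and also records image ids:

```matlab
scales     = 1.0 : -0.1 : 0.3;                            % multi-scale search (illustrative values)
threshold  = 0.7;                                         % confidence cutoff before NMS
cell_size  = feature_params.hog_cell_size;
tmpl_cells = feature_params.template_size / cell_size;    % template width in HoG cells

cur_bboxes = zeros(0, 4); cur_confidences = zeros(0, 1);
for scale = scales
    hog = vl_hog(imresize(im2single(img), scale), cell_size);
    for r = 1:(size(hog, 1) - tmpl_cells + 1)
        for c = 1:(size(hog, 2) - tmpl_cells + 1)
            window = hog(r:r+tmpl_cells-1, c:c+tmpl_cells-1, :);
            score  = reshape(window, 1, []) * w + b;      % linear classifier response
            if score > threshold
                % map HoG-cell coordinates back to pixels in the original image
                x_min = ((c - 1) * cell_size + 1) / scale;
                y_min = ((r - 1) * cell_size + 1) / scale;
                x_max = ((c + tmpl_cells - 1) * cell_size) / scale;
                y_max = ((r + tmpl_cells - 1) * cell_size) / scale;
                cur_bboxes      = [cur_bboxes; x_min, y_min, x_max, y_max]; %#ok<AGROW>
                cur_confidences = [cur_confidences; score];                 %#ok<AGROW>
            end
        end
    end
end
% per-image non-maximum suppression keeps only the strongest overlapping boxes
is_max = non_max_supr_bbox(cur_bboxes, cur_confidences, size(img));
cur_bboxes      = cur_bboxes(is_max, :);
cur_confidences = cur_confidences(is_max, :);
```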

With these steps we should now get a precision that is not zero. Let us run our pipeline with different thresholds, lambdas, HoG cell sizes, and negative sample counts and examine the average precision:

Threshold   Lambda    Cell Size   Num Negative Samples   Average Precision
0.7         0.0001    6           10,000                 0.841
0.6         0.0001    6           20,000                 0.832
0.8         0.0001    6           10,000                 0.818
0.7         0.0001    6           11,500                 0.813
0.4         0.0001    6           15,000                 0.841
0.67        0.0001    6           15,000                 0.803
0.7         0.001     6           12,000                 0.835
0.7         0.0005    6           12,000                 0.852

I found that a lambda of 0.0005 worked best for me. This regularization parameter is important for training our linear SVM; if it is too high or too low we will underfit or overfit our training data. A threshold of 0.7 worked nicely for me: it gave a good balance between accuracy and minimizing the number of red (false positive) boxes on our images. If I lower the threshold too much, the average precision improves slightly but there is much more red when we examine the test output. I noticed that the number of negative samples did not make a drastic difference. Bumping up from 10,000 to 11,000 or 12,000 gave a slight positive difference; with 20,000 I did not notice enough of a difference to justify the additional computational expense incurred. To get the best precision I used a lambda of 0.0005, a threshold of 0.7, and 12,000 as num_negative_examples.

Some output for this set of parameters is as follows:

Plots: Precision/Recall curve; Recall/False Positives curve.

Some test output is as follows:













We see that for the most part our face detection came out pretty nicely. There are some false positives in the bottom images, but we are finding faces a good percentage of the time.

Let us take a look at another set of parameters for comparison: lambda = 0.0001, a threshold of 0.7, and 15,000 negative samples:





Some test output is as follows:













We see that the lambda of 0.0005 did better for our program: we found more faces and had fewer false positives. If we were to lower our threshold further we would see many more false positives but most likely a higher average precision.

Let us now use our parameters lambda = 0.0005, threshold = 0.7, and num_negative_samples = 12,000 and examine the average precision with different pixel cell sizes:

Threshold   Lambda    Cell Size   Num Negative Samples   Average Precision
0.7         0.0005    6           12,000                 0.852
0.7         0.0005    4           12,000                 0.886
0.7         0.0005    3           12,000                 0.918

With a 4 pixel cell size our results were as follows:





Some test output is as follows:









Some output images for the 3 pixel cell size are as follows:









We note that the 3 pixel cell size with our combination of parameters gives us the best precision and facial matching; however, it is computationally expensive. The best combination of precision and speed was the 4 pixel cell size, which took roughly 6 minutes and 45 seconds to run and produced an average precision of approximately 0.873 across 10 runs of the program.

Let us now run our code with the class test images. The results are as follows:

We see that for the most part the face detection is pretty good, despite the few incorrect green boxes. Note: I had memory issues on my machine and had to rescale some of the images so that the feature vectors would not exceed memory allocations.
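One way such a rescaling could look (a hedged sketch; the 1200-pixel cap is an arbitrary choice, not necessarily the value used):

```matlab
% Shrink very large class-photo images before HoG extraction so the
% multi-scale feature arrays fit in memory. Detections found on the
% shrunken image then need to be divided by 'shrink' to map back to
% the original image coordinates.
max_dim = 1200;                                            % arbitrary size cap
shrink  = min(1, max_dim / max(size(img, 1), size(img, 2)));
if shrink < 1
    img = imresize(img, shrink);
end
```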

Graduate Extra Credit:

Hard Negative Mining:

We will be implementing hard negative mining for our graduate extra credit. Let us quickly summarize what hard negative mining does for us. Say we are given a collection of images and bounding boxes for each image. Our classifier needs both positive training examples (faces) and negative training examples (non-faces). Our positive training examples come from looking inside the bounding box for each person/image. However, how do we create useful negative training examples? We can generate a bunch of random bounding boxes and keep each one that does not overlap any positive as a negative. With some positives and negatives, we can train a classifier and test it on our training images with a sliding window. However, this may give us a high number of false positives: the classifier may think there are faces where there are none. Whenever the classifier falsely detects a patch (a hard negative), we explicitly create a negative example from that patch and add it to our training set. Upon retraining, the classifier should perform better, as it has this additional knowledge, and we should see fewer false positives.

Our implementation is as follows:
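A hedged sketch of what hard_mining.m could look like (the body here is my own illustration, not necessarily the exact submitted code; t is the mining confidence threshold referred to below). It scores every window of every face-free scene with the current classifier and keeps the confident false positives as new negatives:

```matlab
function hard_negs = hard_mining(non_face_scn_path, w, b, feature_params, t)
% Collect HoG features for windows the current SVM wrongly scores as faces.
image_files = dir(fullfile(non_face_scn_path, '*.jpg'));
cell_size   = feature_params.hog_cell_size;
tmpl_cells  = feature_params.template_size / cell_size;
dim         = tmpl_cells^2 * 31;
hard_negs   = zeros(0, dim, 'single');

for i = 1:length(image_files)
    img = im2single(imread(fullfile(non_face_scn_path, image_files(i).name)));
    if size(img, 3) > 1, img = rgb2gray(img); end
    hog = vl_hog(img, cell_size);
    for r = 1:(size(hog, 1) - tmpl_cells + 1)
        for c = 1:(size(hog, 2) - tmpl_cells + 1)
            feat = reshape(hog(r:r+tmpl_cells-1, c:c+tmpl_cells-1, :), 1, dim);
            if feat * w + b > t              % confident detection in a face-free image
                hard_negs(end+1, :) = feat;  %#ok<AGROW>  guaranteed false positive
            end
        end
    end
end
end
```

The mined rows are then appended to features_neg and the SVM is retrained.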

Let us now parameter tune and see which values work best for our hard_mining.m function:

With t = 0.85, a threshold of 0.8 in run_detector2, classifier_lambda = 0.0005, and a 6 pixel cell size, our results are the following:





Some test output is as follows:





Previously, we were getting around 81-83%, but now we are getting over 85% precision. After tuning parameters, with a cell size of 6, I ultimately arrived at a precision of 89% on one run of my program. The plot is as follows:

When we use a cell size of 4, our precision is slightly higher than what we previously had. However, the improvement is not as large as it was with the 6 pixel cell size.

Thus, we see that hard negative mining has helped! In terms of computational complexity it does not cost too much and gives us slightly better results, so it is definitely a worthwhile implementation.

Additional Training Data:

Here, we are going to augment our data and see how the precision changes.

To track all our changes we will implement a new file: augmented.m that will depend on augmented_pos_feats.m and augmented_neg_feats.m. These two files are copies of our get_positive_features and get_random_negative_features except they have some code that alters the training data.
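A hedged sketch of the kind of per-image alteration these copies might apply before the HoG call (the mode switch and its case names are mine, used here only to collect the variants explored below; the report also mentions decorrstretch alongside imadjust, which is skipped for brevity). Here img is a grayscale single image as loaded in the original feature functions:

```matlab
switch mode
    case 'flip'      % upside-down faces: reverse the row order
        img = flipud(img);
    case 'sharpen'   % sharpen the slightly blurry training examples
        img = imsharpen(img);
    case 'blur'      % Gaussian blur
        img = imfilter(img, fspecial('gaussian', 7, 2), 'replicate');
    case 'color'     % contrast/gamma adjustment used in the report
        img = imadjust(img, [0.10 0.79], [0.00 1.00], 1.10);
    case 'shake'     % rotate, then fuse with the original in false color
        rot = imrotate(img, 5, 'bicubic', 'crop');
        img = im2single(imfuse(img, rot, 'falsecolor', ...
                        'Scaling', 'joint', 'ColorChannels', [1 2 0]));
    case 'canny'     % binary edge map as the training image
        img = single(edge(img, 'canny'));
end
```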

Let us first flip our training images vertically by reversing the order of the rows, turning each face upside down. We can make use of MATLAB's built-in function flipud for this. After flipping our training data and using the same parameters as above with a 6 pixel cell size, we see that our output is as follows:





Compared to our previous output of ~83%, we see that our precision has dropped a great deal, but we are still getting around 50% precision! This is very interesting, as it seems we can still detect faces even with upside-down faces as our training data.

Looking through some of the training examples, I noticed that some of them were slightly blurry. Let us now sharpen our training images and see if we get any improvement in accuracy.





We see that our accuracy went up a slight bit. I re-ran the program and consistently achieved a slightly higher precision with the sharpening. It seems to make a few of the blurry images more easily recognizable for face detection.

Let us try filtering our images with a Gaussian filter so that we have a blurred effect. My hypothesis is that the precision will drop, as the blurred faces will be harder for our classifier to learn and detect. Let us check out the results below:





The precision dropped significantly! We are now down from the low-to-mid 80s to the high 60s/low 70s. Blurring our training images really did make a difference in terms of face detection.

Now, let's really enhance the colors of our images using MATLAB's decorrstretch and imadjust. We will call imadjust with ([.10 .79], [0.00 1.00], 1.10). Here the image will look more vibrant. The results are as follows:





We see that our precision here is on par with our original test data.

Let us now overlay a cropped version of the image over itself and give it a slight green/yellow tint. We will rotate the original image with imrotate using 5, 'bicubic', 'crop', and then fuse this rotated, cropped image with the original using imfuse with the parameters 'falsecolor', 'Scaling', 'joint', 'ColorChannels', [1 2 0]. This might cause some confusion in the image; it will look like it has been all shaken up. Let's take a look at the precision plots below:





Our precision here is terrible: we have dropped into the 50s from the 80s. The shaken double-image effect really made it difficult to detect a face here.

Let us run Canny edge detection on each image and use the resulting edge map as our training data. I have some worries here, as our pictures are not just faces on bland backgrounds; there is scenery in our images, so I am predicting that our classifier is going to perform very poorly. If there are buildings and other scenery, I foresee the detector treating their edges as faces. The results are as follows:





As we can see, this did in fact do very poorly. If we take a look at one of the produced outputs below,

we see that the bounding boxes latched onto the edges that defined the man in the figure. Using the Canny edge detector might be better suited to classifying some other object that is not a face, such as a particular car model.

All in all, we see that augmenting and filtering our training data did have an effect on the final average precision. When we sharpened our images we got slightly higher average precision. When we blurred our training data we had lower average precision. When we applied some strange filters to our training data we also had lower average precision, especially with the Canny edge detector. So, augmenting our training data did make a substantial impact. There is no huge computational expense associated with augmenting the data, as my program only took a few more seconds to run, and sharpening the images gave a nice small bump in average precision. The files used in the above implementations are augmented.m, augmented_pos_feats.m, and augmented_neg_feats.m. To run the program you simply go into augmented_pos_feats.m and augmented_neg_feats.m, select the augmentation you want applied to the training data, and then run augmented.m.

Conclusion:

As we implemented each stage of the pipeline, we saw the average precision increase. There was a good bit of parameter tuning, but after finding the right parameters the average precision was very nice. For extra credit I implemented hard negative mining, where I saw a nice boost in average precision and noticed far fewer false positives. In addition, I created new training data through augmentation. I looked at a variety of transformations and combinations of filters on the data to see how the average precision changed as a result. I noticed that for some augmentations, such as sharpening, the average precision increased, but for others, such as the color-intensified cropped version of the image overlaid with itself, the average precision dropped sharply. This makes it easy to understand that our training data is important: if we have poor quality or confusing training data, then our resulting test precision will not be as strong as we would hope. I had a great time implementing this project. I have always been interested in facial recognition, and I was finally able to implement my own face detection program! This was a great project!

Sources: