Modern cameras and photo-organization tools have prominent face detection capabilities. For this project I implement a sliding window detector for face detection, following Dalal and Triggs 2005. That work focuses more on the representation than on the learning, and introduces the SIFT-like Histogram of Gradients (HoG) representation. The pipeline of the project is: extract features from positive and negative training samples, train an SVM as the classifier, and use the learnt classifier to label sliding windows as either a face or not. This is done at multiple scales of the image.
The steps involved in this project are:
Some good face detections made with the detector.
Left to Right: The first figure represents the HoG features for a cell size of 6. The second figure is the average precision curve for the same cell size. The third plot represents the curve of the number of correct detections vs. the false positives, and is similar to the one in Viola-Jones.
The images provided for training are well structured and already in the required format (36 X 36). HoG features are computed for these images using the vl_hog function. This gives a 6 X 6 X 31 sized output, which is reshaped into a 1 X 1116 feature vector.
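As a quick sketch of the dimension bookkeeping (in Python rather than the MATLAB used for the project): a 36 X 36 patch with a HoG cell size of 6 gives a 6 X 6 grid of cells, each carrying 31 values, which flattens to 1116 dimensions.

```python
# Dimension bookkeeping for the HoG features used in this project.
# Illustrative Python sketch; the project itself uses MATLAB's vl_hog.
img_size = 36      # training patches are 36 x 36
cell_size = 6      # HoG cell size used here
bins = 31          # vl_hog returns 31 values per cell

cells_per_side = img_size // cell_size        # 6 cells along each side
feature_len = cells_per_side ** 2 * bins      # 6 * 6 * 31

print(cells_per_side, feature_len)  # -> 6 1116
```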
Here the images given are in color. These have to be converted to grayscale to match the positive samples. Also, the sizes of the images are not consistent, so we crop random patches from the given non-face images to obtain 36 X 36 patches. The HoG features for these random patches are computed, and after flattening we obtain the 1 X 1116 feature vector. I have also implemented hard mining for the negative features, adding the detected false positives as additional negatives. Please look at the Graduate/Extra credit section for this.
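The random cropping step can be sketched as follows (pure Python, with the image as a list of rows; the function name is my own, and the project's MATLAB code does the equivalent before calling vl_hog):

```python
import random

def random_patch(image, patch=36):
    """Crop one random patch x patch window from a 2-D grayscale image
    given as a list of rows. Illustrative sketch of the negative-sample
    cropping step; the real pipeline then computes HoG on the patch."""
    h, w = len(image), len(image[0])
    top = random.randint(0, h - patch)
    left = random.randint(0, w - patch)
    return [row[left:left + patch] for row in image[top:top + patch]]

# Usage: crop a 36 x 36 patch from a 100 x 120 dummy "image".
img = [[r * 120 + c for c in range(120)] for r in range(100)]
p = random_patch(img)
print(len(p), len(p[0]))  # -> 36 36
```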
Now that we have the positive and negative features, we need a classifier trained on this data. A linear SVM is chosen for this purpose, and vl_svmtrain() is used to train it. The lambda value I have used is 0.0001, and the labels corresponding to the positive and negative features are 1 and -1 respectively.
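The learnt classifier is just a linear scoring rule w'x + b. A minimal sketch of training one with subgradient descent on the regularized hinge loss, on toy 2-D data (pure Python; the project itself uses vl_svmtrain with lambda = 0.0001, and these hyperparameters are illustrative):

```python
def train_linear_svm(X, y, lam=1e-4, epochs=200, lr=0.1):
    """Subgradient descent on the L2-regularized hinge loss.
    Toy stand-in for vl_svmtrain; labels y are +1 (face) / -1 (non-face)."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:          # point violates the margin
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:                   # only the regularizer acts
                w = [wj - lr * lam * wj for wj in w]
    return w, b

# Toy separable data: positives near (1, 1), negatives near (-1, -1).
X = [[1.0, 1.2], [0.8, 1.0], [-1.0, -0.9], [-1.1, -1.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
scores = [sum(wj * xj for wj, xj in zip(w, xi)) + b for xi in X]
print([1 if s > 0 else -1 for s in scores])  # -> [1, 1, -1, -1]
```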
The model trained so far is now tested on the test set of images. A sliding window is used to find potential detections: for each window position, the HoG features of the patch are computed and scored with the trained SVM. The confidence value (for the patch being a possible detection) is computed as w'x + b, where w and b are the parameters learnt by the SVM. If this confidence value exceeds a given threshold, the patch is identified as a positive detection. All positive detections are collected, after which non-maximum suppression is performed on them to remove duplicates or near-duplicates. These are then visualized. Performing the above steps at different scales of the image helps increase the accuracy of the detector.
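The non-maximum suppression step can be sketched as a greedy loop over detections sorted by confidence (pure Python; the overlap threshold of 0.3 is an illustrative choice, not the project's exact setting):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.3):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop near-duplicates that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < thresh for j in keep):
            keep.append(i)
    return keep

# Two near-duplicate detections of one face, plus a separate detection.
boxes = [(10, 10, 46, 46), (12, 11, 48, 47), (80, 80, 116, 116)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # -> [0, 2]
```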
The average precision value that I obtained for this base implementation is 82.1%. This score improved to around 85% on performing hard mining which I have included in the Graduate/Extra credit section.
Hard mining is performed to artificially generate more negative training samples. For this purpose, we run the detector on the non-face images and collect all the false positives, which are definitely not faces since they come from the non-face set, and append these to our negative training set. The HoG features are computed for these and appended to the negative features, and the detector is run again with the retrained model. The performance changes by around 3% (base implementation --> base implementation + hard mining). Although this is not a huge difference, one observation is that it reduces the number of false positives detected and also produces detections in some cases that the base implementation missed.
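The mining loop above can be sketched as follows (Python, with hypothetical stand-in functions for the real MATLAB training and detection steps; only the loop structure is taken from the writeup):

```python
def hard_negative_mining(train_svm, run_detector, pos_feats, neg_feats,
                         nonface_images, rounds=1):
    """Sketch of the hard-mining loop (helper names are hypothetical):
    every detection on a non-face image is by definition a false positive,
    so its features are appended to the negatives and the SVM retrained."""
    model = train_svm(pos_feats, neg_feats)
    for _ in range(rounds):
        false_pos = [f for img in nonface_images
                       for f in run_detector(model, img)]
        neg_feats = neg_feats + false_pos
        model = train_svm(pos_feats, neg_feats)
    return model, neg_feats

# Tiny usage example with stubs standing in for the real pipeline.
train = lambda pos, neg: ("svm", len(pos), len(neg))
detect = lambda model, img: [img]   # pretend each image yields one FP
model, negs = hard_negative_mining(train, detect, ["p1"], ["n1"],
                                   ["fp1", "fp2"])
print(model, negs)  # -> ('svm', 1, 3) ['n1', 'fp1', 'fp2']
```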
The comparison between the first two images (in both sections above) shows the reduction in the number of false positives detected. The comparison between the next two images shows how adding extra data can produce a detection that did not show up in the base implementation.
From Left to Right: The precision curve and the curve matching the one in Viola Jones.
Since we were allowed to experiment with different classifiers, I tried Decision Trees (a non-linear classifier). This did not perform well on the dataset, so I tried an ensemble classifier - Random Forests. The following are the observations I made:
This is how the trained ensemble parameters look:

    classreg.learning.classif.
              ResponseName: 'Y'
     CategoricalPredictors: []
                ClassNames: [-1 1]
            ScoreTransform: 'none'
           NumObservations: 16851
                NumTrained: 3
                    Method: 'AdaBoostM1'
              LearnerNames: {'Tree'}
      ReasonForTermination: 'Terminated normally after completing the requested number of training cycles.'
                   FitInfo: [3x1 double]
        FitInfoDescription: {2x1 cell}
When you observe the results, you can see how the decision boundaries of the tree are being generated. Another interesting observation is that this classifier produces detections for some of the images which did not show up earlier.
From Left to Right: For the second and the fourth images, the way the decision tree builds its boundaries is clearly visible from the detection windows it produces. The third image is interesting because neither face shows up in the base implementation, but at least one shows up here; similarly for the second image.
From Left to Right: Figure 1 represents the ensemble for the first layer. We can observe that the number of splits is one here. The curves show that these values are not great for this dataset. However, changing the parameter values may result in some good and interesting predictions.
Since we were allowed to experiment with different feature descriptors, I tried the SIFT and PHOW feature descriptors appended to the HoG features. The following are the observations I made:
I have also tried to implement the HoG feature extractor myself, computing an orientation histogram over every 30-degree span. This is done for the given cell size of 6 and computed for each such cell.
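The per-cell histogram at the core of such an extractor can be sketched as below (pure Python; I assume unsigned orientations over 0-180 degrees, giving six 30-degree bins, and magnitude-weighted voting - the writeup only specifies the 30-degree span):

```python
import math

def cell_histogram(gx, gy, nbins=6):
    """Gradient-magnitude-weighted orientation histogram for one cell,
    with one bin per 30-degree span of unsigned orientation (0-180).
    Sketch of the core of a hand-rolled HoG extractor; gx and gy are
    the cell's horizontal/vertical gradients as flat lists."""
    hist = [0.0] * nbins
    for dx, dy in zip(gx, gy):
        mag = math.hypot(dx, dy)
        ang = math.degrees(math.atan2(dy, dx)) % 180.0   # unsigned angle
        hist[min(int(ang // 30), nbins - 1)] += mag
    return hist

# A purely horizontal and a purely vertical gradient land in bins 0 and 3.
print(cell_histogram([1.0, 0.0], [0.0, 1.0]))
# -> [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```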