Example face detection results from this project. We're intentionally trying to foil the face detector in this photo.

Project 5: Face detection with a sliding window
CS 6476: Computer Vision




The sliding window model is conceptually simple: independently classify all image patches as object or non-object. Sliding window classification is the dominant paradigm in object detection, and for one object category in particular -- faces -- it is one of the most noticeable successes of computer vision. For example, modern cameras and photo organization tools have prominent face detection capabilities. These successes of face detection (and object detection in general) can be traced back to influential works such as Rowley et al. 1998 and Viola-Jones 2001. You can look at these papers for suggestions on how to implement your detector. However, for this project you will be implementing the simpler (but still very effective!) sliding window detector of Dalal and Triggs 2005. Dalal-Triggs focuses on representation more than learning and introduces the SIFT-like Histogram of Gradients (HoG) representation (pictured to the right). Because you have already implemented the SIFT descriptor, you will not be asked to implement HoG. You will be responsible for the rest of the detection pipeline, though -- handling heterogeneous training and testing data, training a linear classifier (a HoG template), and using your classifier to classify millions of sliding windows at multiple scales. Fortunately, linear classifiers are compact, fast to train, and fast to execute. A linear SVM can also be trained on large amounts of data, including mined hard negatives.


Please reinstall your conda environment as we have made changes to the list of packages installed.

  1. Install Miniconda. It doesn't matter whether you use the Python 2.7 or 3.6 installer, because we will create our own environment anyway.
  2. Create a conda environment using the appropriate command. On Windows, open the installed "Conda prompt" to run this command. On MacOS or Linux, you can just use a terminal window to run the command. Modify the command based on your OS (linux, mac, or win).
    conda env create -f environment_<OS>.yml
  3. This should create an environment named cs6476p5. Activate it using the following Windows command:
    activate cs6476p5
    or the following MacOS/Linux command:
    source activate cs6476p5
  4. Run the notebook using:
    jupyter notebook ./code/proj5.ipynb
  5. Generate the submission once you've finished the project using:
    python zip_submission.py


For this project, you need to implement the following parts:

Get features from positive samples

You will implement get_positive_features() to load cropped positive training examples (faces) and convert them to HoG features. We provide the default feature parameters in proj5.ipynb. (You are free to try different parameters, but different values do not guarantee better performance.) For improved performance, you can try mirroring or warping the positive training examples to augment your training data. Please refer to the documentation for more details.
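As a rough sketch of this step, the loop below converts in-memory 36x36 crops to flattened HoG features and augments the set with mirrored copies. The hog_stub function here is a hypothetical, simplified stand-in (per-cell gradient-orientation histograms) so the example is self-contained and runnable; your actual implementation should call vlfeat.hog.hog, and the loading of crops from disk is omitted.

```python
import numpy as np

def hog_stub(im, cell_size=6, n_orient=9):
    """Toy HoG stand-in: per-cell histograms of unsigned gradient
    orientation, weighted by gradient magnitude. NOT vlfeat.hog.hog."""
    gy, gx = np.gradient(im.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientation
    h, w = im.shape
    cells_y, cells_x = h // cell_size, w // cell_size
    bins = np.minimum((ang / np.pi * n_orient).astype(int), n_orient - 1)
    feat = np.zeros((cells_y, cells_x, n_orient))
    for cy in range(cells_y):
        for cx in range(cells_x):
            sl = (slice(cy * cell_size, (cy + 1) * cell_size),
                  slice(cx * cell_size, (cx + 1) * cell_size))
            feat[cy, cx] = np.bincount(bins[sl].ravel(),
                                       weights=mag[sl].ravel(),
                                       minlength=n_orient)
    return feat

def get_positive_features(crops, cell_size=6):
    """Featurize each crop and its horizontal mirror."""
    feats = []
    for im in crops:
        feats.append(hog_stub(im, cell_size).ravel())
        feats.append(hog_stub(im[:, ::-1], cell_size).ravel())  # mirror aug.
    return np.vstack(feats)
```

With 36x36 crops and a 6-pixel cell, each feature vector has 6 x 6 x 9 = 324 dimensions, and mirroring doubles the number of positives.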

Get features from negative samples

You will implement get_random_negative_features() to sample random negative examples from scenes which contain no faces and convert them to HoG features. The output feature dimension from get_random_negative_features() and get_positive_features() should be the same. For best performance, you should sample random negative examples at multiple scales. Please refer to the documentation for more details.
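A minimal sketch of the sampling loop, using raw pixels as a placeholder feature so it runs standalone; the real function should featurize each patch with vlfeat.hog.hog (with the same parameters as the positives) so the feature dimensions match. The function signature, the crude subsampling, and the scale set {1, 2, 4} are all illustrative assumptions.

```python
import numpy as np

def get_random_negative_features(scenes, n_samples, win=36, seed=0):
    """Draw random win x win patches from face-free scenes at several
    scales. Raw pixels stand in for the HoG featurization here."""
    rng = np.random.default_rng(seed)
    feats = []
    while len(feats) < n_samples:
        im = scenes[rng.integers(len(scenes))]
        step = int(rng.choice([1, 2, 4]))       # crude multi-scale: subsample
        sm = im[::step, ::step]
        if sm.shape[0] < win or sm.shape[1] < win:
            continue                             # scene too small at this scale
        y = rng.integers(sm.shape[0] - win + 1)
        x = rng.integers(sm.shape[1] - win + 1)
        feats.append(sm[y:y + win, x:x + win].ravel())  # featurize patch here
    return np.vstack(feats)
```

Sampling at coarser scales lets the negatives cover large structures (e.g. whole windows or tree canopies) that a single-scale crop would miss.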

Mine hard negatives

You will implement mine_hard_negs() as discussed in Dalal and Triggs, and demonstrate the effect on performance. The main idea of hard negative mining is that you use the trained classifier to find false-positive examples, and then include them in your negative training data, so you can train the classifier again to improve the performance. This might not be very effective for frontal faces unless you use a more complex feature or classifier. You might notice a bigger difference if you artificially limit the amount of negative training data (e.g. a total budget of only 5000 negatives).
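The mining step itself can be sketched in a few lines, assuming the candidate windows from face-free scenes have already been featurized; the variable names and the decision threshold of 0 are illustrative choices, not the required interface.

```python
import numpy as np
from sklearn.svm import LinearSVC

def mine_hard_negs(candidate_feats, clf):
    """Keep candidate windows (all from face-free scenes, hence all true
    negatives) that the current SVM wrongly scores as faces."""
    scores = clf.decision_function(candidate_feats)
    return candidate_feats[scores > 0]
```

You would then stack the returned rows onto your negative feature matrix and retrain the classifier.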

Train a linear classifier

You will implement train_classifier() to train a linear classifier from the positive and negative features using scikit-learn's LinearSVC. The regularization constant C is an important parameter that affects the classification performance. Small values seem to work better, e.g. 1e-4, but you can try other values.
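In sketch form, assuming pos_feats and neg_feats are the feature matrices from the earlier steps (max_iter is raised here only to quiet convergence warnings; the +1/-1 label convention is one reasonable choice):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_classifier(pos_feats, neg_feats, C=1e-4):
    """Stack features, build +1/-1 labels, and fit a linear SVM."""
    X = np.vstack([pos_feats, neg_feats])
    y = np.hstack([np.ones(len(pos_feats)), -np.ones(len(neg_feats))])
    return LinearSVC(C=C, max_iter=20000).fit(X, y)
```

After fitting, clf.coef_ and clf.intercept_ give you the learned HoG template and bias, which is what the sliding-window detector scores against.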

Detect faces on the test set

You will implement run_detector() to detect faces in the test images. For each image, you will run the classifier with a sliding window at multiple scales and then call the provided function non_max_suppression_bbox() to remove duplicate detections.

Your code should convert each test image to HoG feature space at each scale. Then you step over the HoG cells, taking groups of cells that are the same size as your learned template and classifying them. If the classification score is above some confidence threshold, keep the detection, and then pass all the detections for an image to non_max_suppression_bbox(). The outputs are the coordinates ([xl, yl, xh, yh]) of all the detected faces, their corresponding confidences, and their test image indices. Please see the run_detector() documentation for more details.
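The per-scale inner loop might look like the sketch below, with the multi-scale wrapping and the non_max_suppression_bbox() call omitted. The function name detect_one_scale and all of its parameters are illustrative; w and b would come from your trained LinearSVC (clf.coef_.ravel() and clf.intercept_).

```python
import numpy as np

def detect_one_scale(feat_map, w, b, template_cells, cell_size, conf_thresh):
    """Slide the learned HoG template (SVM weights w, bias b) over one
    image's cell grid at a single scale. The full run_detector repeats
    this for several downscaled copies of the image, maps the boxes back
    to original-image coordinates, and finally calls the provided
    non_max_suppression_bbox on the pooled detections."""
    n_y, n_x, _ = feat_map.shape
    boxes, confs = [], []
    for cy in range(n_y - template_cells + 1):
        for cx in range(n_x - template_cells + 1):
            window = feat_map[cy:cy + template_cells,
                              cx:cx + template_cells].ravel()
            score = float(window @ w + b)
            if score > conf_thresh:              # keep confident windows only
                xl, yl = cx * cell_size, cy * cell_size
                side = template_cells * cell_size
                boxes.append([xl, yl, xl + side, yl + side])
                confs.append(score)
    return np.array(boxes), np.array(confs)
```

Note how cell coordinates are converted back to pixel coordinates by multiplying by the cell size; at a downscaled level you would additionally divide by the scale factor.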

run_detector() will have (at least) two parameters which can heavily influence performance:

Using the starter code (proj5.ipynb)

The top-level starter code proj5.ipynb provides data loading, evaluation, and visualization functions, and calls the functions in student_code.py. If you run the code unmodified, it will return random bounding boxes in each test image. It will even perform non-maximum suppression on these random bounding boxes to give you an example of how to call the function. We provide the following functions to help you examine the performance:


The choice of training data is critical for this task. While an object detection system would typically be trained and tested on a single database (as in the Pascal VOC challenge), face detection papers were previously trained on heterogeneous, even proprietary, datasets. As with most of the literature, we will use three databases: (1) positive training crops, (2) non-face scenes to mine for negative training data, and (3) test scenes with ground truth face locations.

You are provided with a positive training database of 6,713 cropped 36x36 faces from the Caltech Web Faces project. We arrived at this subset by filtering away faces which were not high enough resolution, upright, or front facing. There are many additional databases available; for example, see Figure 3 in Huang et al. and the LFW database described in that paper. You are free to experiment with additional or alternative training data for extra credit.

Non-face scenes, the second source of your training data, are easy to collect. We provide a small database of such scenes from Wu et al. and the SUN scene database. You can add more non-face training scenes, although you are unlikely to need more negative training data unless you are doing hard negative mining for extra credit.

The most common benchmark for face detection is the CMU+MIT test set. This test set contains 130 images with 511 faces. The test set is challenging because the images are highly compressed and quantized. Some of the faces are illustrated faces, not human faces. For this project, we have converted the test set's ground truth landmark points into Pascal VOC style bounding boxes. We have inflated these bounding boxes to cover most of the head, as the provided training data does. For this reason, you are arguably training a "head detector" not a "face detector" for this project.

Please do not include the data sets in your handin.

Useful Functions

vlfeat.hog.hog(). The main function to extract the HOG feature. For detailed information, please refer to the link or the comments in student_code.py and proj5.ipynb.

sklearn.svm.LinearSVC(). Linear support vector classification. The regularization constant C is the most critical parameter affecting the classification performance.

Banned Functions

You should not use any third-party library that does multi-scale object detection, such as cv2.HOGDescriptor().detectMultiScale(). You need to implement the detection function yourself. You may also not use anyone else's code. The main principle is that you should not use any third-party library that can directly perform any of these functions: get_positive_features(), get_random_negative_features(), mine_hard_negs(), train_classifier(), or run_detector(). If you still have questions, feel free to ask on Piazza.

Write up

For this project, and all other projects, you must write a project report in HTML. In the report you will describe your algorithm and any decisions you made in implementing it a particular way. Then you will show and discuss the results of your algorithm (e.g. if you discuss a parameter you experimented with, we expect you to show how that parameter affects performance, supported by your experimental results, along with a description of how and why). In particular, we want to see the effect of the C parameter of the SVM, multi-scale detection, and hard negative mining. Discuss any extra credit you did, and clearly show what contribution it had on the results (e.g. performance with and without each extra credit component).

We suggest you show your face detections on the class photos in the data/extra_test_scenes directory.

You should include the precision-recall curve of your final classifier, the visualization of the learned HoG template, and any interesting variants of your algorithm. Do not include code or your GTID in your report.

Extra Credit

For all extra credit, be sure to analyze on your web page whether your extra credit has improved classification accuracy. Each item is "up to" some amount of points because trivial implementations may not be worthy of full extra credit. You will not get full extra credit if you only implement a feature without sufficient analysis in your report. Some ideas:

Finally, there will be extra credit and recognition for the students who achieve the highest average precision. You aren't allowed to modify evaluate_detections() which measures your accuracy.

Web-Publishing Results

All the results for each project will be put on the course website so that the students can see each other's results. In class we will highlight the best projects as determined by the professor and TAs. If you do not want your results published to the web, you can choose to opt out.

Handing in

This is very important as you will lose points if you do not follow instructions. Every time after the first that you do not follow instructions, you will lose 5 points. The folder you hand in must contain the following:

Hand in your project as a zip file through Canvas. This zip must be less than 5MB. If images are taking up too much space, you can use tools like ImageMagick to shrink them. Please note that when grading, we will not be using your notebook, so make sure extra credit is thoroughly documented in the README, and the function signatures and return values in student_code.py are unchanged.


Final Advice


Project description and code by James Hays; expanded and edited by Samarth Brahmbhatt, Amit Raj, and Min-Hung (Steve) Chen. Figures in this handout are from Dalal and Triggs. Thanks to Jianxin Wu and Jim Rehg for suggestions in developing this project.

We tried to make an especially easy test case with neutral, frontal faces.

The 2011 class effectively demonstrated how not to be seen by a robot.

Fall 2016 faces detected by Wenqi Xian.