AlexNet / VGG-F network visualized by mNeuron.

Project 6: Deep Learning
Introduction to Computer Vision



This project is an introduction to deep learning tools for computer vision. You will design and train deep convolutional networks for scene recognition using the MatConvNet toolbox.

Remember project 4: Scene recognition with bag of words. You worked hard to design a bag-of-features representation that achieved 60 to 70% accuracy (most likely) on 15-way scene classification. You might have done the spatial pyramid extra credit and gotten up to nearly 80% accuracy. We're going to attack the same task with deep learning and get higher accuracy. Sort of -- training from scratch won't work quite as well as project 4, but fine-tuning an existing network will work much better than project 4.

In Part 1 of the project you will train a deep convolutional network from scratch to recognize scenes. The starter code gives you a very simple network architecture which doesn't work that well and you will add jittering, normalization, regularization, and more layers to increase recognition accuracy to 50, 60, or perhaps 70%. Unfortunately, we only have 1,500 training examples so it doesn't seem possible to train a network from scratch which outperforms hand-crafted features (extra credit to anyone who proves us wrong).

For Part 2 you will instead fine-tune a pre-trained deep network to achieve more than 85% accuracy on the task. We will use the pretrained VGG-F network which was not trained to recognize scenes at all.

These two approaches are the most common strategies for recognition problems in computer vision today -- train a deep network from scratch if you have enough data (it's not always obvious whether or not you do), and if you do not, fine-tune a pre-trained network instead.

Starter Code Outline

The following is an outline of the stencil code:

The deep network training will be performed by cnn_train.m which in turn calls vl_simplenn.m but you will not need to modify those functions for this project.

Part 0

First make sure that MatConvNet is working. Step through the following MatConvNet "Quick start" demo. You can simply copy and paste the commands into the Matlab command window.

% install and compile MatConvNet 
% (you can skip this if you already installed MatConvNet beta 16 and the mex files)
% untar('') ;
% cd matconvnet-1.0-beta16
% run matlab/vl_compilenn

% download a pre-trained CNN from the web (needed once)
urlwrite('', ...
         'imagenet-vgg-f.mat') ;

% setup MatConvNet. Your path might be different.
run  '../matconvnet-1.0-beta16/matlab/vl_setupnn'

% load the 233MB pre-trained CNN
net = load('imagenet-vgg-f.mat') ;

% load and preprocess an image
im = imread('peppers.png') ;
im_ = single(im) ; % note: 0-255 range
im_ = imresize(im_, net.normalization.imageSize(1:2)) ;
im_ = im_ - net.normalization.averageImage ;

% run the CNN
res = vl_simplenn(net, im_) ;

% show the classification result
scores = squeeze(gather(res(end).x)) ;
[bestScore, best] = max(scores) ;
figure(1) ; clf ; imagesc(im) ;
title(sprintf('%s (%d), score %.3f',...
net.classes.description{best}, best, bestScore)) ;

It might take a while to download the 233MB VGG-F network used in the demo, but you will need it for part 2 of the project anyway.

Troubleshooting. If you encounter errors trying to run this demo, make sure: (1) You have MatConvNet 1.0 beta 16 (not a later version). (2) You download imagenet-vgg-f.mat from the course website and not from MatConvNet (because it was changed for beta 17 and is not backwards compatible). (3) Your mex files are in the correct location [MatConvNetPath]/matlab/mex/. If you encounter errors about invalid mex files in Windows you may be missing Visual C++ Redistributable Packages. If you encounter an error about labindex being undefined you may be missing the parallel computing toolbox for Matlab. That toolbox is available with the Georgia Tech student version of Matlab.

Before we start building our own deep convolutional networks, it might be useful to have a look at MatConvNet's tutorial. In particular, you should be able to understand Part 1 of the tutorial. In addition to the examples shown in parts 3 and 4 of the tutorial, MatConvNet has example code for training networks to recognize the MNIST and CIFAR datasets. Your project follows the same outline as those examples. Feel free to take a look at that code for inspiration. You can run the example code to watch the training process on MNIST and CIFAR. Training will take about 5 and 15 minutes for those datasets, respectively.

Compiling MatConvNet with GPU support is more complex and not needed for this project. If you're trying to do extra credit and find yourself frustrated with training times you can try, though.

Part 1: training a deep network from scratch

Run proj6_part1.m and bask in the glory of deep learning. Gone are the days of hand designed features. Now we have end-to-end learning in which a highly non-linear representation is learned for our data to maximize our objective (in this case, 15-way classification accuracy). Instead of an anemic 70% accuracy we can now recognize scenes with... 25% accuracy. OK, that didn't work at all. What's going on?

First, let's take a look at the network architecture used in this experiment. Here is the code from proj6_part1_cnn_init.m that specifies the network structure:

Let's make sure we understand what's going on here. This simple baseline network has 4 layers -- a convolutional layer, followed by a max pool layer, followed by a rectified linear layer, followed by another convolutional layer. This last convolutional layer might be called a "fully connected" or "fc" layer because its output has a spatial resolution of 1x1. Equivalently, every unit in the output of that layer is a function of the entire previous layer (thus "fully connected"). But mathematically there's not really any difference from "convolutional" layers so we specify them in the same way in MatConvNet.

Let's look at the first convolutional layer. The "weights" are the filters being learned. They are initialized with random numbers from a Gaussian distribution. The inputs to randn(9,9,1,10) mean the filters have a 9x9 spatial resolution, span 1 filter depth (because the input images are grayscale), and that there are 10 filters. The network also learns a bias or constant offset to associate with the output of each filter. This is what zeros(1,10) initializes.

The next layer is a max pooling layer. It will take a max over a 7x7 sliding window and then subsample the resulting image / map with a stride of 7. Thus the max pooling layer will decrease the spatial resolution by a factor of 7 according to the stride parameter. The filter depth will remain the same (10). There are other pooling possibilities (e.g. average pooling) but we will only use max pooling in this project.

The next layer is the non-linearity. Any values in the feature map from the max pooling layer which are negative will be set to 0. There are other non-linearity possibilities (e.g. sigmoid) but we will use only rectified linear in this project.

Note that the pool layer and relu layer have no learned parameters associated with them. We are hand-specifying their behavior entirely, so there are no weights to initialize as in the convolutional layers.

Finally, we have the last layer which is convolutional (but might be called "fully connected" because it happens to reduce the spatial resolution to 1x1). The filters learned at this layer operate on the rectified, subsampled, maxpooled filter responses from the first layer. The output of this layer must be 1x1 spatial resolution (or "data size") and it must have a filter depth of 15 (corresponding to the 15 categories of the 15 scene database). This is achieved by initializing the weights with randn(8,8,10,15). 8x8 is the spatial resolution of the filters. 10 is the number of filter dimensions that each of these filters take as input and 15 is the number of dimensions out. 10 is highlighted in green to emphasize that it must be the same in those 3 places -- if the first convolutional layer has weights for 10 filters, it must also have offsets for 10 filters, and the next convolutional layer must take as input 10 filter dimensions.

At the top of our network we add one more layer which is only used for training. This is the "loss" layer. There are many possible loss functions but we will use the "softmax" loss for this project. This loss function will measure how badly the network is doing for any input (i.e. how different its final layer activations are from the ground truth, where ground truth in our case is category membership). The network weights will update, through backpropagation, based on the derivative of the loss function. With each training batch the network weights will take a tiny gradient descent step in the direction that should decrease the loss function (but isn't actually guaranteed to, because the steps are of some finite length, or because dropout regularization will turn off part of the network).
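
The architecture described above can be written down as a list of layer structs in MatConvNet's SimpleNN format. The following is a sketch reconstructed from the prose, not the contents of proj6_part1_cnn_init.m itself. The initialization scale f, the layer names, and the stride and pad of the convolutional layers are assumptions; the filter sizes, pooling window, and layer order come from the description.

```matlab
% Sketch of the baseline network as SimpleNN layer structs.
% Assumed (not stated in the text): f, layer names, conv stride 1 and pad 0.
f = 1/100;                      % assumed scale for the random initial weights
net.layers = {};

% conv1: ten 9x9 filters over a 1-channel (grayscale) input, plus ten biases
net.layers{end+1} = struct('type', 'conv', ...
    'weights', {{f*randn(9,9,1,10,'single'), zeros(1,10,'single')}}, ...
    'stride', 1, 'pad', 0, 'name', 'conv1');

% pool1: max over a 7x7 window, subsampled with a stride of 7
net.layers{end+1} = struct('type', 'pool', 'method', 'max', ...
    'pool', [7 7], 'stride', 7, 'pad', 0);

% relu1: negative values are set to 0 (no learned parameters)
net.layers{end+1} = struct('type', 'relu');

% fc1: fifteen 8x8 filters over the 10 channels produced by conv1
net.layers{end+1} = struct('type', 'conv', ...
    'weights', {{f*randn(8,8,10,15,'single'), zeros(1,15,'single')}}, ...
    'stride', 1, 'pad', 0, 'name', 'fc1');

% loss layer, used only for training
net.layers{end+1} = struct('type', 'softmaxloss');
```

Note how 10 appears three times: as the number of filters in conv1, the number of biases in conv1, and the input depth of fc1.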

How did we know to make the final layer filters have a spatial resolution of 8x8? It's not obvious because we don't directly specify output resolution. Instead it is derived from the input image resolution and the filter widths, padding, and strides of the previous layers. Luckily MatConvNet provides a visualization function vl_simplenn_display to help us figure this out. Here is what it looks like if we specify the net as shown above and then call vl_simplenn_display(net, 'inputSize', [64 64 1 50]).
If the last convolutional layer had a filter size of 6x6 that would lead to a "data size" in the network visualization of 3x3 and we would know we need to change things (subsample more in previous layers or create wider filters in the final layer). In general it is not at all obvious what the right network architecture is. It takes a lot of artistry (read as: black magic) to design the right network and training strategy for optimal performance.
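
The "data size" values can also be worked out by hand. The sketch below uses the standard output-size formula for convolution and pooling layers, and assumes the first convolutional layer uses stride 1 and no padding (the text does not state this, but it is consistent with the 8x8 final filter size).

```matlab
% Output size of a conv or pool layer along one spatial dimension.
outSize = @(in, k, stride, pad) floor((in + 2*pad - k) / stride) + 1;

in    = 64;                        % input images are 64x64
conv1 = outSize(in,    9, 1, 0);   % 9x9 filters, assumed stride 1 and pad 0
pool1 = outSize(conv1, 7, 7, 0);   % 7x7 max pool with stride 7
fc    = outSize(pool1, 8, 1, 0);   % 8x8 filters in the final layer
fc6x6 = outSize(pool1, 6, 1, 0);   % what-if: 6x6 filters in the final layer

fprintf('conv1 %d, pool1 %d, final %d (6x6 variant: %d)\n', ...
        conv1, pool1, fc, fc6x6);
```

The sizes come out as 56, 8, and 1 for the baseline, and 3 for the 6x6 variant, matching the discussion above.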

We just said the network has 4 real layers but this visualization shows 6. That's because it includes a layer 0 which is the input image and a layer 5 which is the loss layer. For each layer this visualization shows several useful attributes. "data size" is the spatial resolution of the feature maps at each level. In this network and most deep networks this will decrease as you move up the network. "data depth" is the number of channels or filters in each layer. This will tend to increase as you move up a network. "rf size" is the receptive field size. That is how large an area in the original image a particular network unit is sensitive to. This will increase as you move up the network. Finally this visualization shows us that the network has roughly 10,000 free parameters, the vast majority of them associated with the last convolutional layer.
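
The parameter count can be checked by hand from the filter shapes given earlier. This is plain arithmetic over the weight and bias shapes described in the text, not output from vl_simplenn_display.

```matlab
% Free parameters = filter weights + one bias per filter.
conv1Params = 9*9*1*10 + 10;             % first convolutional layer
fcParams    = 8*8*10*15 + 15;            % last ("fully connected") layer
totalParams = conv1Params + fcParams;
fcShare     = fcParams / totalParams;    % fraction owned by the last layer

fprintf('conv1: %d, fc: %d, total: %d (%.0f%% in the last layer)\n', ...
        conv1Params, fcParams, totalParams, 100*fcShare);
```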

OK, now we understand a bit about the network. Let's analyze its performance. After 30 training epochs (30 passes through the training data) Matlab's Figure 1 should look like this:
We'll be studying these figures quite a bit during this project so it's important to understand what they show.

The left pane shows the training error (blue) and validation error (dashed orange) across training epochs. Each training epoch is a pass over the entire training set of 1500 images broken up into "batches" of 50 training instances. The code shuffles the order of the training instances randomly each epoch. When the network makes mistakes, it incurs a "loss" and backpropagation updates the weights of the network in a direction that should decrease the loss. Therefore the blue line should more or less decrease monotonically. On the other hand, the orange dashed line is the error incurred on the held out test set. The figure refers to it as "val" or "validation". In a realistic recognition scenario we might have three sets of data: train, validation, and test. We would use validation to assess how well our training is working and to know when to stop training and then we would test on a completely held out test set. For this project the validation set is our test set. We're trying to maximize performance on the validation set and that's it. The pass through the validation set does not change the network weights in any way. The pass through the validation set is also 3 times faster than the training pass because it does not have the "backwards" pass to update network weights.

The right pane shows the error on the train and test (val) data sets across the same training epochs. It shows top 1 error -- how often the highest scoring guess is wrong -- and top 5 error -- how often all of the 5 highest scoring guesses are wrong. We're interested in top 1 error, specifically the top 1 error on the held out validation / test set.
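
To make the two error measures concrete, here is a small sketch that computes them from a matrix of class scores. The scores and labels are made-up values for three images, and the variable names are hypothetical; this is not how cnn_train.m is necessarily implemented.

```matlab
% Hypothetical scores: one row per category (15), one column per image (3).
scores = zeros(15, 3);
labels = [4 7 10];                 % ground truth category of each image

scores(4, 1)   = 1;                % image 1: true class scores highest
scores(2, 2)   = 5;                % image 2: class 2 beats the true class 7
scores(7, 2)   = 4;
scores(1:5, 3) = [5 4 3 2 1]';     % image 3: true class 10 is not in the top 5

% Rank the categories for each image, best first.
[~, order] = sort(scores, 1, 'descend');

% Top 1 error: the single best guess is wrong.
top1err = mean(order(1, :) ~= labels);

% Top 5 error: none of the five best guesses is right.
inTop5  = any(bsxfun(@eq, order(1:5, :), labels), 1);
top5err = mean(~inTop5);
```

Here two of the three top guesses are wrong, and one image has its true category outside the top five.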

In this experiment, the training and test top 1 error started out around 93% which is exactly what we would expect. If you have 15 categories and you make a random guess on each test case, you will be wrong 14/15 of the time, or about 93%. As the training progressed and the network weights moved away from their random initialization, accuracy increased.

Note the areas circled in green corresponding to the first 8 training epochs. During these epochs, the training and validation error were decreasing which is exactly what we want to see. Beyond that point the error on the training dataset kept decreasing, but the validation error did not. Our lowest error on the validation/test set is around 75% (or 25% accuracy). We are overfitting to our training data. This is hard to avoid with a small training set. In fact, if we let this experiment run for 200 epochs we see that it is possible for the training accuracy to become perfect with no appreciable increase in test accuracy:

Now we are going to take several steps to improve the performance of our convolutional network. The modifications we make in Part 1 will familiarize you with the building blocks of deep learning that can lead to impressive performance with enough training data. In the end, you might decide that this isn't any simpler than hand-designing a feature. Also, with the relatively small amount of training data in the 15 scene database, it is very hard to outperform hand-crafted features.

Learning rate. Before we start making changes, there is a very important learning parameter that you might need to tune any time you change the network or the data being input to the network. The learning rate (set by default as opts.LearningRate = 0.0001 in proj6_part1.m) determines the size of the gradient descent steps taken by the network weights. If things aren't working, try making it much smaller or larger (e.g. by factors of 10). If the objective remains exactly constant over the first dozen epochs, the learning rate might have been too high and "broken" some aspect of the network. If the objective spikes or even becomes NaN then the learning rate may also be too large. However, a very small learning rate requires many training epochs.
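
The effect of the step size is easy to see on a toy problem. The sketch below runs plain gradient descent on f(w) = w^2 with three illustrative learning rates; the objective and the rates are chosen only for demonstration and have nothing to do with the values appropriate for the scene network.

```matlab
% Toy objective f(w) = w^2 with gradient 2*w; the minimum is at w = 0.
rates  = [0.001 0.1 1.5];          % too small, reasonable, too large
finalW = zeros(size(rates));

for r = 1:numel(rates)
    w = 1;                         % same starting point for every run
    for step = 1:50
        w = w - rates(r) * 2 * w;  % gradient descent update
    end
    finalW(r) = w;
end

disp(finalW);
```

After 50 steps the smallest rate has barely moved from the starting point, the middle rate is essentially at the minimum, and the largest rate has diverged.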

Problem 1: We don't have enough training data. Let's "jitter".

If you left-right flip (mirror) an image of a scene, it never changes categories. A kitchen doesn't become a forest when mirrored. This isn't true in all domains -- a "d" becomes a "b" when mirrored, so you can't "jitter" digit recognition training data in the same way. But we can synthetically increase our amount of training data by left-right mirroring training images during the learning process.

The learning process calls getBatch() in proj6_part1.m each time it wants training or testing images. Modify getBatch() to randomly flip some of the images (or entire batches). Useful functions: rand and fliplr.
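
One possible shape for the mirroring step is sketched below. It assumes the batch is a height x width x 1 x N single-precision array, as with the 64x64 grayscale input used by vl_simplenn_display above; the variable names are hypothetical and the random batch stands in for real images.

```matlab
% Stand-in for a batch of 50 grayscale 64x64 images (H x W x 1 x N).
im = single(rand(64, 64, 1, 50));

% Pick roughly half of the images in the batch at random.
flipMask = rand(1, size(im, 4)) > 0.5;

% Mirror each selected image left-to-right; leave the others untouched.
jittered = im;
for k = find(flipMask)
    jittered(:, :, 1, k) = fliplr(im(:, :, 1, k));
end
```

The labels are unchanged, since mirroring a scene does not change its category.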

You can try more elaborate forms of jittering -- zooming in a random amount, rotating a random amount, taking a random crop, etc. Mirroring helps quite a bit on its own, though, and is easy to implement. You should see a roughly 10% increase in accuracy by adding mirroring.

After you implement mirroring, you should notice that your training error doesn't drop as quickly. That's actually a good thing, because it means the network isn't overfitting to the 1,500 original training images as much (because it sees 3,000 training images now, although they're not as good as 3,000 truly independent samples). Because the training and test errors fall more slowly, you may need more training epochs or you may try modifying the learning rate.

Problem 2: The images aren't zero-centered.

One simple trick which can help a lot is to subtract the mean from every image. Modify proj6_part1_setup_data.m so that it computes the mean image and then subtracts the mean from all images before returning imdb. It would arguably be more proper to only compute the mean from the training images (since the test/validation images should be strictly held out) but it won't make much of a difference. After doing this you should see another 15% or so increase in accuracy.
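
A minimal sketch of the zero-centering step, assuming the images are stored as a height x width x 1 x N single array (the variable name data is hypothetical, and random values stand in for the real images):

```matlab
% Stand-in for the image stack: 20 grayscale 64x64 images in the 0-255 range.
data = single(rand(64, 64, 1, 20)) * 255;

% Mean image: average over the image index (4th dimension).
meanImage = mean(data, 4);

% Subtract the mean image from every image in the stack.
centered = bsxfun(@minus, data, meanImage);
```

To compute the mean from the training images only, index the 4th dimension of data with the training subset before calling mean.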

Problem 3: Our network isn't regularized.

If you train your network (especially for more than the default number of epochs) you'll see that the training error can decrease to zero while the val top1 error hovers at 40% to 50%. The network has learned weights which can perfectly recognize the training data, but those weights don't generalize to held out test data. The best regularization would be more training data but we don't have that. Instead we will use dropout regularization. We add a dropout layer to our convolutional net as follows:
What does dropout regularization do? It randomly turns off network connections at training time to fight overfitting. This prevents a unit in one layer from relying too strongly on a single unit in the previous layer. Dropout regularization can be interpreted as simultaneously training many "thinned" versions of your network. At test time, all connections are restored which is analogous to taking an average prediction over all of the "thinned" networks. You can see a more complete discussion of dropout regularization in this paper.

The dropout layer has only one free parameter -- the dropout rate -- the proportion of connections that are randomly deleted. The default of 0.5 should be fine. Insert a dropout layer between your convolutional layers. In particular, insert it directly before your last convolutional layer. Your test accuracy should increase by another 10%. Your train accuracy should decrease much more slowly. That's to be expected -- you're making life much harder for the training algorithm by cutting out connections randomly.
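
As a sketch of where the layer goes, the snippet below inserts a dropout struct directly before the last convolutional layer of a layer list. The abbreviated structs stand in for the real layers (weights are omitted), and the layer names are hypothetical; the 'dropout' type with a 'rate' field follows the SimpleNN layer convention.

```matlab
% Abbreviated stand-ins for the baseline layers (weights omitted).
layers = {struct('type', 'conv', 'name', 'conv1'), ...
          struct('type', 'pool'), ...
          struct('type', 'relu'), ...
          struct('type', 'conv', 'name', 'fc1'), ...
          struct('type', 'softmaxloss')};

% Locate the last convolutional layer.
isConv   = cellfun(@(l) strcmp(l.type, 'conv'), layers);
lastConv = find(isConv, 1, 'last');

% Insert a dropout layer with the default rate directly before it.
dropout = struct('type', 'dropout', 'rate', 0.5);
layers  = [layers(1:lastConv-1), {dropout}, layers(lastConv:end)];
```

In the starter code it is simpler to add the dropout line at the right place in proj6_part1_cnn_init.m; the splice above just makes the position explicit.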

If you increase the number of training epochs (and maybe decrease the learning rate) you should be able to achieve 60% test accuracy (40% val top1 error) or slightly better at this point. Notice how much more structured the learned filters are at this point compared to the initial network before we made improvements:

Problem 4: Our network isn't deep.

Let's take a moment to reflect on what our convolutional network is actually doing. We learn filters which seem to be looking for horizontal edges, vertical edges, and parallel edges. Some of the filters have diagonal orientations and some seem to be looking for high frequencies or center-surround. This learned filter bank is applied to each input image, the maximum response from each 7x7 block is taken by the max pooling, and then the rectified linear layer zeros out negative values. The fully connected layer sees a 10 channel image with 8x8 spatial resolution. It learns 15 linear classifiers (a linear filter with a learned threshold is basically a linear classifier) on this 8x8 filter response map. This architecture is reminiscent of hand-crafted features like the gist scene descriptor developed precisely for scene recognition (on 8 scene categories which would later be incorporated into the 15 scene database). The gist descriptor actually works better than our learned feature. The gist descriptor with a non-linear classifier can achieve 74.7% accuracy on the 15 scene database.

Our convolutional network to this point isn't "deep". It has two layers with learned weights. Contrast this with the example networks for MNIST and CIFAR in MatConvNet which contain 4 and 5 layers, respectively. AlexNet and VGG-F contain 8 layers. The VGG "very deep" networks contain 16 and 19 layers. ResNet contains up to 152 layers.

One quite unsatisfying aspect of our current network architecture is that the max-pooling operation covers a window of 7x7 and then is subsampled with a stride of 7. That seems overly lossy and deep networks usually do not subsample by more than a factor of 2 or 3 each layer.

Let's make our network deeper by adding an additional convolutional layer in proj6_part1_cnn_init.m. In fact, we probably don't want to add just a convolutional layer, but another max-pool layer and relu layer, as well. For example, you might insert a convolutional layer after the existing relu layer with a 5x5 spatial support followed by a max-pool over a 3x3 window with a stride of 2. You can reduce the max-pool window in the previous layer, adjust padding, and reduce the spatial resolution of the final layer until vl_simplenn_display(net, 'inputSize', [64 64 1 50]), which is called at the end of proj6_part1_cnn_init(), shows that your network's final layer (not counting the softmax) has a data size of 1 and a data depth of 15. You also need to make sure that the data depth output by any layer matches the data depth input to the following layer. For instance, maybe your new convolutional layer takes in the 10 channels of the first layer but outputs 15 channels. The final layer would then need to have its weights initialized accordingly to account for the fact that it operates on a 15 channel image instead of a 10 channel image.
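
Before training, it can help to check a candidate design on paper. The sketch below traces data size and data depth through one hypothetical deeper configuration; every number in the table is an illustrative assumption, not a recommended or required design, and relu and dropout layers are left out because they change neither size nor depth.

```matlab
% One row per layer: [window, stride, pad, depthIn, depthOut].
% Rows: conv 9x9, pool 3x3 stride 2, conv 5x5, pool 3x3 stride 2, final conv 11x11.
% Pooling rows keep the depth unchanged.
spec = [ 9 1 0  1 10
         3 2 0 10 10
         5 1 0 10 15
         3 2 0 15 15
        11 1 0 15 15];

sz    = 64;     % input images are 64x64
depth = 1;      % grayscale input

for i = 1:size(spec, 1)
    % The depth a layer expects must equal the depth the previous one outputs.
    assert(spec(i, 4) == depth, ...
           'layer %d expects depth %d but receives %d', i, spec(i, 4), depth);
    sz    = floor((sz + 2*spec(i, 3) - spec(i, 1)) / spec(i, 2)) + 1;
    depth = spec(i, 5);
    fprintf('layer %d: data size %d, data depth %d\n', i, sz, depth);
end
```

For this table the trace ends at a data size of 1 and a data depth of 15, which is the condition the final layer has to meet.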

We leave it up to you to decide the specifics of your slightly deeper network: filter depth, padding, max-pooling, stride, etc. The network will probably take longer to train because it will have more parameters and deeper networks take longer to converge. You might need to use more training epochs, but even then it will be difficult to outperform your shallow network. It is not required that your deeper network increases accuracy over the shallow network. As long as you can achieve 50% test accuracy for some epoch with a deeper network which uses mirroring to jitter, zero-centers the images as they are loaded, and regularizes the network with a dropout layer you will receive full credit for Part 1.

Additional optional improvements

Enjoy chasing higher accuracy? Here are some optional directions to investigate which might help improve your accuracy.

Part 2: fine-tuning a pre-trained deep network

One of the impressive things about the representations learned by deep convolutional networks is that they generalize surprisingly well to other recognition tasks (see DeCAF and the work of Razavian et al.). This is unexpected because these networks are discriminatively trained to perform well at a particular task so one might expect their representations to "overfit" for that task. And perhaps they do, but they still often exceed the performance of hand-crafted features when used in a new domain.

But how do we use an existing deep network for a new recognition task? Take, for instance, the VGG-F ("F" for "fast") network examined in Chatfield et al. BMVC 2014. This network is meant to be architecturally similar to the original AlexNet. Strategy A: The VGG-F network has 1000 units in the final layer corresponding to 1000 ImageNet categories. One could use those 1000 activations as a feature in place of a hand crafted feature such as a bag-of-features representation. You would train a classifier (typically a linear SVM) in that 1000 dimensional feature space. However, those activations are clearly very object specific and may not generalize well to new recognition tasks. It is generally better to use the activations in slightly earlier layers of the network, e.g. the 4096 activations in "fc6" or "fc7". You can often get away with sub-sampling those 4096 activations considerably, e.g. taking only the first 200 activations. This Strategy A for using an existing deep network was extra credit for project 4 and several students achieved high accuracy (especially when using a deep network trained on the Places database, but that isn't so much a testament to generalization because it's the same task with more training data).

Alternatively, Strategy B is to fine-tune an existing network. In this scenario you take an existing network, replace the final layer (or more) with random weights, and train the entire network again with images and ground truth labels for your recognition task. You are effectively treating the pre-trained deep network as a better initialization than the random weights used when training from scratch. When you don't have enough training data to train a complex network from scratch (e.g. with the 15 scene database) this is an attractive option. Fine-tuning can work far better than Strategy A of taking the activations directly from a pre-trained CNN. For example, in Lin et al.'s cross-view geolocalization work from CVPR 2015, there wasn't enough data to train a deep network from scratch, but fine tuning led to 4 times higher accuracy than using off-the-shelf networks directly.
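
The "replace the final layer with random weights" step might look roughly like the sketch below. It uses a stand-in for the tail of a pre-trained SimpleNN model rather than the real imagenet-vgg-f.mat. The layout (a 1000-way fc8 layer followed by a softmax layer), the 0.01 initialization scale, and the layer names are all assumptions to check against the actual network.

```matlab
% Stand-in for the last two layers of a pre-trained network (assumed layout).
net.layers = { ...
    struct('type', 'conv', 'name', 'fc8', ...
           'weights', {{zeros(1,1,4096,1000,'single'), zeros(1,1000,'single')}}), ...
    struct('type', 'softmax')};

numClasses = 15;                                      % 15 scene categories
inDepth    = size(net.layers{end-1}.weights{1}, 3);   % depth feeding fc8

% Replace the 1000-way classifier with a randomly initialized 15-way one.
net.layers{end-1}.weights = {0.01*randn(1,1,inDepth,numClasses,'single'), ...
                             zeros(1,numClasses,'single')};

% Swap the final softmax for a loss layer so the network can be trained.
net.layers{end} = struct('type', 'softmaxloss');
```

All earlier layers keep their pre-trained weights, which is what makes this a better starting point than random initialization.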

For Part 2 of this project you will fine-tune the VGG-F network to perform scene recognition. If you did not already run the MatConvNet quick start demo, use this command to download and save the VGG-F network: urlwrite('', 'imagenet-vgg-f.mat')

The code for Part 2 will largely follow the same outline or even use the same code as Part 1. You will need to do a handful of things differently, though:

With these issues addressed, proj6_part2.m will call cnn_train just as in Part 1 and you should see very high accuracy. Training will naturally be a bit slow because the network is far bigger than in Part 1. However, you don't need many training epochs. Five training epochs was enough to achieve 87% accuracy and it is possible to approach (or perhaps exceed) 90% test accuracy. Compare this with the 2010 state-of-the-art performance of 88.1% accuracy achieved by combining more than a dozen existing and new features with a non-linear SVM and multiple kernel learning in the SUN Database paper.

It isn't necessary to retrain the entire network to achieve high accuracy. You could retrain just the new fc8 layer by setting opts.backPropDepth = 2 in proj6_part2.m. This basically implements Strategy A and would not be considered fine-tuning. But it is possible to use a strategy that falls in between. What if you only retrain the fully connected layers? What if you prune the pre-trained network down to the convolutional layers and only add a single fully connected layer to dramatically reduce the number of parameters? There are many possible strategies to explore and you will receive full credit as long as you achieve 85% accuracy by starting from VGG-F. You can additionally experiment with fine-tuning other networks such as VGG very deep networks or networks trained on the Places database, but be sure to report performance for and turn in code for fine-tuning VGG-F.

Write up

For this project, and all other projects, you must do a project report in HTML. In the report you will describe your algorithm and any decisions you made to write your algorithm a particular way. Then you will show and discuss the results of your algorithm. Discuss any extra credit you did, and clearly show what contribution it had on the results (e.g. performance with and without each extra credit component).

We suggest showing results plots (Matlab figure 1) and filter visualization (Matlab figure 2) as needed.

Extra Credit / Graduate Credit

There is NO required extra credit for graduate students for this project. The following extra credit is available for students in both 4476 and 6476. The max score for all students is 110.

For all extra credit, be sure to analyze on your web page whether your extra credit has improved classification accuracy. Each item is "up to" some amount of points because trivial implementations may not be worthy of full extra credit. Some ideas:

Web-Publishing Results

All the results for each project will be put on the course website so that the students can see each other's results. In class we will highlight the best projects as determined by the professor and TAs. If you do not want your results published to the web, you can choose to opt out.

Handing in

This is very important as you will lose points if you do not follow instructions. Every time after the first that you do not follow instructions, you will lose 5 points. The folder you hand in must contain the following:

Hand in your project as a zip file through


Final Advice


Project description and code by James Hays. Thanks to the MatConvNet team.