In this (short) homework, we will implement vanilla recurrent neural networks (RNNs) and Long Short-Term Memory (LSTM) RNNs and apply them to image captioning on MS-COCO.

Note that this homework is adapted from the Stanford CS231n course.

Download the starter code here.

Setup

Assuming you already have homework 2 dependencies installed, here is some prep work you need to do. First, download the data:

cd cs231n/datasets
./get_assignment_data.sh

Part 1: Captioning with Vanilla RNNs (25 points)

Open the RNN_Captioning.ipynb Jupyter notebook, which will walk you through implementing the forward and backward pass for a vanilla RNN, first for a single timestep and then for entire sequences of data. Code to check gradients has already been provided.
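
For reference, a single vanilla RNN timestep with a tanh nonlinearity looks roughly like the sketch below (the function name and cache contents here are illustrative; follow the signatures given in the starter code):

import numpy as np

def rnn_step_forward(x, prev_h, Wx, Wh, b):
    # x: (N, D) inputs, prev_h: (N, H) previous hidden state
    # Wx: (D, H) input-to-hidden, Wh: (H, H) hidden-to-hidden, b: (H,) bias
    a = x.dot(Wx) + prev_h.dot(Wh) + b
    next_h = np.tanh(a)
    cache = (x, prev_h, Wx, Wh, next_h)   # saved for the backward pass
    return next_h, cache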

You will then overfit a captioning model on a tiny dataset, implement sampling from the softmax distribution, and visualize predictions on the training and validation sets.
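
As a rough illustration of the sampling step, one word can be drawn from the softmax distribution over vocabulary scores as below (a minimal sketch; the function name is hypothetical and not part of the starter code):

import numpy as np

def sample_from_scores(scores, rng=np.random):
    # scores: (V,) unnormalized scores over the vocabulary at one timestep
    shifted = scores - scores.max()                # subtract max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return rng.choice(len(probs), p=probs)         # index of the sampled word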

Part 2: Captioning with LSTMs (25 points)

Open the LSTM_Captioning.ipynb Jupyter notebook, which will walk you through implementing Long Short-Term Memory (LSTM) RNNs and applying them to image captioning on MS-COCO.
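
A single LSTM timestep combines input, forget, and output gates with a candidate cell update. Below is a minimal NumPy sketch, assuming the four gate weight matrices are stacked into single (D, 4H) and (H, 4H) arrays; the exact signature and gate ordering expected by the starter code may differ.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
    # x: (N, D), prev_h/prev_c: (N, H), Wx: (D, 4H), Wh: (H, 4H), b: (4H,)
    H = prev_h.shape[1]
    a = x.dot(Wx) + prev_h.dot(Wh) + b      # (N, 4H) gate pre-activations
    i = sigmoid(a[:, :H])                   # input gate
    f = sigmoid(a[:, H:2*H])                # forget gate
    o = sigmoid(a[:, 2*H:3*H])              # output gate
    g = np.tanh(a[:, 3*H:])                 # candidate cell update
    next_c = f * prev_c + i * g
    next_h = o * np.tanh(next_c)
    return next_h, next_c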

Part 3: Train a good captioning model (Extra credit: up to 15 points)

Using the pieces you implemented in Parts 1 and 2, train a captioning model that gives decent qualitative results (better than the random garbage you saw with the overfit models) when sampling on the validation set.

Code for evaluating models using the BLEU unigram precision metric has already been provided. Feel free to use PyTorch for this section if you’d like to train faster on a GPU.
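
For intuition, clipped unigram precision of a candidate caption against a single reference can be computed roughly as below (the provided evaluation code may handle multiple references, tokenization, and the brevity penalty differently):

from collections import Counter

def unigram_precision(candidate, reference):
    # candidate, reference: lists of word tokens
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    matched = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    return matched / max(len(candidate), 1)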

Here are a few pointers:

Deliverables

Submit the notebooks and code you wrote with all the generated outputs. Run collect_submission.sh to generate the zip file for submission.

References:

  1. CS231n Convolutional Neural Networks for Visual Recognition