Brief
- Due: Tuesday Mar 19 11:55pm
- Starter code: starter code
- Submit to Gradescope
- This portion (HW3) counts 12% of your total grade
In this (short) homework, we will implement vanilla recurrent neural networks (RNNs) and Long Short-Term Memory (LSTM) networks and apply them to image captioning on COCO.
Note that this homework is adapted from the Stanford CS231n course.
Setup
Assuming you already have homework 2 dependencies installed, here is some prep work you need to do. First, download the data:
cd cs231n/datasets
./get_assignment_data.sh
We will use PyTorch (v1.0) to complete the problems in this homework; the code has been tested with Python 2.7 on Linux and Mac.
Part 1: Captioning with Vanilla RNNs (25 points)
Open the RNN_Captioning.ipynb Jupyter notebook, which will walk you through implementing the forward and backward pass for a vanilla RNN, first 1) for a single timestep and then 2) for entire sequences of data. Code to check gradients has already been provided.
You will overfit a captioning model on a tiny dataset, implement sampling from the softmax distribution, and visualize predictions on the training and validation sets.
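As a reference for the single-timestep part, here is a minimal NumPy sketch of a vanilla RNN step and its backward pass. The function names and cache layout are illustrative and may differ from the starter code's exact API; the recurrence is h_t = tanh(x W_x + h_{t-1} W_h + b).

```python
import numpy as np

def rnn_step_forward(x, prev_h, Wx, Wh, b):
    # Single timestep: h_t = tanh(x @ Wx + h_{t-1} @ Wh + b)
    next_h = np.tanh(x @ Wx + prev_h @ Wh + b)
    cache = (x, prev_h, Wx, Wh, next_h)
    return next_h, cache

def rnn_step_backward(dnext_h, cache):
    x, prev_h, Wx, Wh, next_h = cache
    # Backprop through tanh: d tanh(a)/da = 1 - tanh(a)^2
    da = dnext_h * (1 - next_h ** 2)
    dx = da @ Wx.T
    dprev_h = da @ Wh.T
    dWx = x.T @ da
    dWh = prev_h.T @ da
    db = da.sum(axis=0)
    return dx, dprev_h, dWx, dWh, db
```

The provided gradient-checking code should confirm these analytic gradients against numerical ones once your implementation matches the expected interface.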
Part 2: Captioning with LSTMs (25 points)
Open the LSTM_Captioning.ipynb Jupyter notebook, which will walk you through implementing Long Short-Term Memory (LSTM) RNNs and applying them to image captioning on MS-COCO.
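For orientation, a single LSTM timestep can be sketched as follows. This is a hedged NumPy illustration, not the starter code's implementation: it assumes the four gate pre-activations are packed into one (N, 4H) matrix in the order input, forget, output, candidate, which is a common convention but may not match the notebook's exact layout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
    # Wx: (D, 4H), Wh: (H, 4H), b: (4H,) -- all four gates computed at once
    H = prev_h.shape[1]
    a = x @ Wx + prev_h @ Wh + b       # (N, 4H) pre-activations
    i = sigmoid(a[:, :H])              # input gate
    f = sigmoid(a[:, H:2 * H])         # forget gate
    o = sigmoid(a[:, 2 * H:3 * H])     # output gate
    g = np.tanh(a[:, 3 * H:])          # candidate cell state
    next_c = f * prev_c + i * g        # cell state update
    next_h = o * np.tanh(next_c)       # hidden state
    return next_h, next_c
```

The key difference from the vanilla RNN is the additive cell-state update, which is what lets gradients flow over longer sequences.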
Part 3: Train a good captioning model (15 points Extra Credit for CS4803DL; 5 points Regular Credit and 10 points Extra Credit for CS7643)
Using the pieces you implemented in parts 1 and 2, train a captioning model that gives decent qualitative results (better than the random garbage you saw with the overfit models) when sampling on the validation set.
Code for evaluating models using the BLEU unigram precision metric has already been provided. Feel free to use PyTorch for this section if you'd like to train faster on a GPU. The starter code is provided in LSTM_Captioning.ipynb.
Here is how the scoring works for this section: a BLEU score of 0.20-0.25 earns 5 points, 0.25-0.30 earns 10 points, and >0.30 earns 15 points. For CS4803, this section is entirely Extra Credit. For CS7643, we want you to achieve a score of >0.20, which gives you 5 points Regular Credit; if you beat 0.25, you get Extra Credit according to the score you achieve.
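To make the metric concrete, here is a toy sketch of clipped unigram precision, the core of the provided BLEU evaluation. This is an illustration only (it scores against a single reference and omits the brevity penalty); use the provided evaluation code for your actual scores.

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision of a candidate caption vs. one reference."""
    cand = candidate.split()
    ref_counts = Counter(reference.split())
    cand_counts = Counter(cand)
    # Each candidate word's count is clipped by its count in the reference,
    # so repeating a word cannot inflate the score.
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    return clipped / max(len(cand), 1)

# "a" appears twice in the candidate but never in the reference,
# so only cat/on/mat count: 3 matches out of 5 words -> 0.6
score = unigram_precision("a cat on a mat", "the cat sat on the mat")
```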
Here are a few pointers:
- Attention-based captioning models
- Discriminative captioning
- Novel object captioning
Summary Deliverables
Code Submission
Submit the results by uploading a zip file called hw3.zip created with the following command:
cd assignment/
./collect_submission.sh
Write-Up Submission
Convert all IPython notebooks to PDF files with the following command:
jupyter-nbconvert --to pdf filename.ipynb
Notes
- You should only upload ONE PDF file to the HW3 Writeup section, and then assign the pages properly as you did for PS3.
- You should upload hw3.zip, which includes no PDF file, to the HW3 Code section.