Deep Learning for Perception
Georgia Tech, Spring 2016

Note: Dhruv Batra will be teaching this course in Fall 2017. I will be teaching Machine Learning CS 4641.

Course: CS 8803DL
Instructor: Dr. Zsolt Kira
Location: Instructional Center 111 (NOTE recent change!)
Day/Time: MWF 2-3pm
Office Hours: MWF 12:30-2pm CCB 260
TA: Daehyung Park, Ashwin Shenoi


April 20, 2016: Final project requirements up here

March 10, 2016:
Project mid-term progress description up here

February 2, 2016:
Teams are out (see Piazza). Note that project proposals are due the week of the 15th (TBD, probably Friday). See here for information about projects.

Jan 25, 2016: Reminder: writeups are due Monday (01/25) night! See project information and Additional Readings for project ideas.

Jan 7, 2016:
First readings are up. NOTE the change in classroom location to the Instructional Center.

Dec, 2015: The course is now full. If you would like a spot but haven’t gotten one, I recommend 1) adding your name to the waitlist once Phase II registration starts, and 2) coming to the first week of class. If students drop, you can then fill their spots.

Nov 1, 2015: The course is now open to M.S. students as well!

October 29, 2015: Note that the CoC has a new system for registration; currently you can only waitlist, and students on that list will be given the opportunity to register on a first-come, first-served basis on Nov. 5th. See here and here for more information and instructions for joining the waitlist.

October 28, 2015: The course has been listed, and will be initially restricted to PhD students. Subsequently it will be opened to M.S. students. Please sign up as soon as possible if you are interested, as it fills up fast. For some idea about the previous version of the course, see here. Note that there will be adjustments/changes from the previous version.

Course Description and Goals

This course will cover deep learning and its applications to perception in many modalities, focusing on those relevant for robotics (images, videos, and audio). Deep learning is a sub-field of machine learning that deals with learning hierarchical feature representations in a data-driven manner, representing the input data at increasing levels of abstraction. This removes the need to hand-design features when applying machine learning to different modalities or problems. It has recently gained significant traction and media coverage due to its state-of-the-art performance in tasks such as object detection in computer vision (see ILSVRC2013 and 2014 as an example), terrain estimation for navigation in robotics, natural language processing, and others.
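
As a rough illustration of what "hierarchical feature representations" means, the sketch below stacks two small layers so that each layer's output becomes the next layer's input. It is a minimal sketch in plain Python, not course material: the function names (`relu`, `dense`, `forward`) and the hand-picked weights in `LAYERS` are assumptions made for this example, whereas in a real network the weights would be learned from data.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Affine map: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(inputs, layers):
    """Return the representation produced by every layer, lowest level first."""
    representations = []
    current = inputs
    for weights, biases in layers:
        # Each layer re-represents the output of the layer below it.
        current = relu(dense(current, weights, biases))
        representations.append(current)
    return representations


# Hypothetical hand-picked weights; in practice they are learned from data.
LAYERS = [
    # Layer 1: two difference detectors over a 3-value input.
    ([[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]], [0.0, 0.0]),
    # Layer 2: one unit that combines both layer-1 features.
    ([[1.0, 1.0]], [-0.5]),
]

features = forward([3.0, 1.0, 0.0], LAYERS)
print(features)  # [[2.0, 1.0], [2.5]]
```

The first inner list holds the low-level features computed directly from the input, and the second holds a higher-level feature computed only from those, which is the layering the paragraph above describes.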

The course will cover the fundamental theory behind these techniques, with topics ranging from sparse coding/filtering and autoencoders to convolutional neural networks and deep belief nets. We will cover both supervised and unsupervised variants of these algorithms, and motivate them by showing real-world examples in perception-related tasks, including computer vision (object recognition/classification, activity recognition, etc.), perception for robotics (obstacle avoidance, grasping), and more. We will discuss some of the previous state-of-the-art methods and how they relate to the deep learning algorithms that have recently replaced them. The principles will also be related to neuroscience and other fields to facilitate a discussion about what these new advancements mean for understanding intelligence more generally, as well as limitations and open problems.

The course will involve a project in which students take a relevant research problem in their particular field, apply the techniques and principles learned in the course to develop an approach, and implement it to investigate how these techniques are applicable. Results of the project will be presented at the end of the semester to fellow students to convey what was learned, what results were achieved, and what research areas remain open.

Learning Outcomes

This course has several learning outcomes. By the end of the course, students will be able to:


This is a graduate class. The course will cover advanced machine learning topics as related to perception, so a prior introductory machine learning, artificial intelligence, computer vision, or pattern recognition course is recommended. Strong math skills, especially linear algebra, will be essential to understanding many of these techniques. Since deep learning is new (or rather, has become mainstream recently), there is no book available yet, so the course readings will be tutorials and conference/journal papers, many of which are from machine learning conferences such as NIPS/ICML. Being able to read, analyze, and hopefully critique such research papers is therefore crucial!

Course Material

There is no textbook currently published, but we may refer to the in-progress book by Bengio et al. The rest of the course material will be publicly available publications or, when necessary, distributed material for tutorials that are not publicly accessible.

Write-Ups for Reading Material

In some weeks, there will be a quiz on the readings on Friday. Other weeks will require a write-up of the reading materials. There will be a link to a set of reading questions next to that week’s readings.


Many of the tutorials and assignments will use Torch, so you should set up the corresponding coding environment for Lua and go through the tutorials. There are a lot of public tutorials/resources for this framework, and we will also hold a lab. We will also provide a Docker image, which works like a lightweight virtual machine with everything set up. This is what will be used to grade assignments. If you wish to use this resource, I recommend setting Docker up on your machine.


Below is the tentative schedule of topics that will be covered. Note that there are optional additional readings if you would like to explore a particular topic in more depth.
Mind Map for this course
Week Topic Readings Slides Assignments
Introduction and Background:
Machine Learning, Features, and Non-linear mappings
[1] (Ch. 1-2, 5)
[2] [3]
Previous state of the art: Features, Sparse Coding, Pooling, and Classification
ML Background, Neural Networks
[03 Unsupervised Feature Learning] Quiz Friday
NOTE: No class Monday due to Holiday!
Neural Networks (Cont.)
[1] (Ch. 6) Feedforward Deep Networks
Friday: SNOW DAY
Monday: Torch Lab
Backpropagation, Theory and Practice
(Non-Linear Optimization, Stochastic/Mini-Batch Gradient Descent)
[1] (Ch. 7) Regularization
Writeup due Monday via T-square (Week 2 reading)
Assignment 1 Released
Sparse Autoencoders (SAE)
Convolutional Neural Networks (CNNs) and early applications
[1] (Ch 8.1,8.2 Optimization, 14 Autoencoders)
[08 NN Optimization] (updated)
Quiz Monday on Week 3 Readings
Architectures and Tasks (Classification, Detection, Segmentation)
[1] (Ch. 9 Convolutional Networks)
Optional: [11]
Assignment 1 Due Feb. 9
Updated: Feb 10

Quiz Monday
Visualization/Deconvolutional Networks, Pooling and invariance
Recurrent Neural Networks (RNNs)
Friday: Project Proposal Presentations
[15] [16] [17] [12 Visualization]
Writeup due Monday via T-square (Week 6 reading)
Recurrent Neural Networks (RNNs) and their optimization [1] 10.1 - 10.5 (Sequence Modeling)
Echo State Networks and RNN Applications
Hessian-Free Optimization
[1] 10.6 - 10.15
[20] [21]
Optional: [22]
Quiz Monday
Applications (CNN + RNN): Recent 2014 Successes: Descriptor Matching, Stereo-based Obstacle Avoidance for Robotics, Captioning, Encoder/Decoder
Deep learning for language: Word/sentence vectors, parsing, sentiment analysis, etc.
[19 Word Embeddings] (updated)
No quiz/writeup
Language continued, Attention Mechanisms
Other architectures: Highway Networks, etc.
Writeup due Friday
Mid-Term Project Status
Background: Probabilistic Graphical Models
Hopfield Nets, Boltzmann machines, Restricted Boltzmann Machines
Hopfield Networks, (Restricted) Boltzmann Machines
[1] Ch. 16 & 17 [22 Restricted Boltzmann Machines]
Deep Belief Nets, Stacked RBMs
Applications to NLP, Pose and Activity Recognition in Videos
Recent Advances
[1] Ch. 20
Student-requested topics, such as Reinforcement Learning/Atari, Large-Scale Learning, Neural Turing Machines
Project Poster Presentations
Poster presentations due (04/22 noon)
Assignment 4 due (04/24)
During Reading Week:
Monday: Project winner presentations, wrapup
Final Project Report Due (05/02)


  1. “Deep Learning”, Yoshua Bengio and Ian J. Goodfellow and Aaron Courville, Book in preparation for MIT Press.
  2. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. B.A. Olshausen, and D.J. Field. Nature, 1996.
  3. Efficient sparse coding algorithms. H. Lee, A. Battle, R. Raina, and A. Y. Ng. NIPS, 2007.
  4. Linear Classification, from Stanford’s CS231n Convolutional Neural Networks for Visual Recognition Course.
  5. Optional: What size neural network gives optimal generalization? Convergence properties of backpropagation. Lawrence, S., Giles, C.L., Tsoi, A.C., 1998.
  6. Optional: Efficient BackProp, Y. LeCun, L. Bottou, G. Orr and K. Muller. In Orr, G. and Muller K. (Eds), Neural Networks: Tricks of the trade, Springer, 1998.
  7. Stochastic Gradient Tricks, Léon Bottou: Neural Networks, Tricks of the Trade, Reloaded, 430–445, Edited by Grégoire Montavon, Genevieve B. Orr and Klaus-Robert Müller, Lecture Notes in Computer Science (LNCS 7700), Springer, 2012.
  8. Reducing the dimensionality of data with neural networks. G. Hinton and R. Salakhutdinov. Science 2006.
  9. Optional: Learning Mid-Level Features for Recognition. Y. Boureau, F. Bach, Y. LeCun and J. Ponce. CVPR, 2010.
  10. ImageNet Classification with Deep Convolutional Neural Networks. A. Krizhevsky, I. Sutskever, and G. Hinton. NIPS, 2012.
  11. Optional: Convolutional Neural Networks, from Stanford’s CS231n Convolutional Neural Networks for Visual Recognition Course.
  12. Optional: A theoretical analysis of feature pooling in visual recognition. Boureau, Y-Lan, Jean Ponce, and Yann LeCun. Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010.
  13. Optional: Measuring invariances in deep networks. I.J. Goodfellow, Q.V. Le, A.M. Saxe, H. Lee and A.Y. Ng. NIPS 2009.
  14. Optional: Online learning for offroad robots: Using spatial label propagation to learn long-range traversability, Hadsell, R., Sermanet, P., Ben, J., Erkan, A., Han, J., Flepp, B., Muller, U., LeCun, Y., 2007. in: Proc. of Robotics: Science and Systems (RSS). p. 32.
  15. Deep inside convolutional networks: Visualising image classification models and saliency maps, Simonyan, K., Vedaldi, A., Zisserman, A., 2013. arXiv preprint arXiv:1312.6034.
  16. Visualizing and Understanding Convolutional Neural Networks. Zeiler, M.D., Fergus, R., 2013. arXiv preprint arXiv:1311.2901.
  17. Striving for Simplicity: The All Convolutional Net, Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M., 2014, arXiv:1412.6806 [cs].
  18. Optional: Adaptive deconvolutional networks for mid and high level feature learning, Zeiler, M.D., Taylor, G.W., Fergus, R., 2011. in: Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, pp. 2018–2025.
  19. Optional: Adaptive deconvolutional networks for mid and high level feature learning
  20. Deep Learning via Hessian-Free Optimization. Martens, James. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 735–42, 2010.
  21. Generating Text with Recurrent Neural Networks. Sutskever, Ilya, James Martens, and Geoffrey E. Hinton. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 1017–24, 2011.
  22. Optional: “On the difficulty of training recurrent neural networks,” R. Pascanu, T. Mikolov, and Y. Bengio, arXiv preprint arXiv:1211.5063, 2012.
  23. “CNN Features off-the-shelf: an Astounding Baseline for Recognition,” A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, arXiv:1403.6382 [cs], Mar. 2014.
  24. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks, Graves, Alex, and Jürgen Schmidhuber. In Advances in Neural Information Processing Systems, 545–52, 2009.
  25. “Deep visual-semantic alignments for generating image descriptions,” A. Karpathy and L. Fei-Fei, arXiv preprint arXiv:1412.2306, 2014.
  26. Distributed representations of words and phrases and their compositionality, Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J., Advances in Neural Information Processing Systems. pp. 3111–3119.
  27. Skip-Thought Vectors, Kiros, R., Zhu, Y., Salakhutdinov, R.R., Zemel, R., Urtasun, R., Torralba, A., Fidler, S., 2015. in: Advances in Neural Information Processing Systems 28. Curran Associates, Inc., pp. 3294–3302.
  28. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., Bengio, Y., 2015. arXiv:1502.03044 [cs].
  29. Optional: A neural probabilistic language model. Bengio, Y., Ducharme, R., Vincent, P., Janvin, C., 2003. The Journal of Machine Learning Research 3, 1137–1155.
  30. Optional: Efficient Estimation of Word Representations in Vector Space. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. In Proceedings of Workshop at ICLR, 2013.
  31. Marginalized denoising autoencoders for domain adaptation. Chen, M., Xu, Z., Weinberger, K., Sha, F., 2012. arXiv preprint arXiv:1206.4683.
  32. Tutorial on RBMs
  33. Learning deep architectures for AI. Foundations and Trends in Machine Learning, Bengio, Y. , pp. 1–127, 2009.
  34. An introduction to restricted Boltzmann machines, Fischer, A., Igel, C., 2012. in: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. Springer, pp. 14–36.
  35. Bengio IFT 6266 Lecture Notes
  36. Greedy Layer-Wise Training of Deep Networks. Y. Bengio, P. Lamblin, P. Popovici, and H. Larochelle. NIPS 2006.
  37. Multimodal learning with deep boltzmann machines, Srivastava, N., Salakhutdinov, R.R., in: Advances in Neural Information Processing Systems. pp. 2222–2230, 2012.
  38. Graphical Models in a Nutshell, D. Koller, N. Friedman, L. Getoor, and B. Taskar (2007), in L. Getoor and B. Taskar, editors, Introduction to Statistical Relational Learning.
  39. Building high-level features using large scale unsupervised learning. Le, Q., Ranzato, M., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J., Ng, A., 2012. Presented at the 29th International Conference on Machine Learning (ICML 2012), pp. 81–88.
  40. Neural Turing Machines, Graves, A., Wayne, G., Danihelka, I., 2014. arXiv:1410.5401 [cs].
  41. Playing Atari with Deep Reinforcement Learning. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M., 2013. arXiv:1312.5602 [cs].
  42. Mastering the game of Go with deep neural networks and tree search, Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., Hassabis, D., 2016. Nature 529, 484–489. doi:10.1038/nature16961
  43. Human-level control through deep reinforcement learning, Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D., 2015. Nature 518, 529–533. doi:10.1038/nature14236

Collaboration/Honor Policy

Students are expected to adhere to the Honor Code in this class. All work is to be accomplished independently unless expressly stated otherwise in writing (e.g., as in a team project). Do not plagiarize from any sources, including the internet. Collaboration on other homework and/or take-home exams is not permitted.


There are three major components to this course. The first is participation and readings: You will be expected to read all of the assigned material for class, and this will be assessed by EITHER a quiz or a short write-up analyzing the material. There will only be one of these per week, and we will alternate between quizzes and writeups. Quizzes will be short (around 15 minutes) to make sure you have done the readings. The second component is assignments, of which there will be about 4-5 throughout the semester. These will typically involve some combination of written problems, coding, and a small writeup. There will be no exams, since you will be evaluated throughout based on these first two components.
The third and largest component of the course will be your semester-long project, in which you will take a problem of your choosing, design an approach or set of experiments, implement it, and analyze the results. Projects can involve teams. You will be able to propose your own ideas around the fifth week of the course, and I will approve or refine them with you. I will also provide a list of potential topics if you have trouble coming up with one. The project proposals will be presented to the class, there will be a mid-term review of progress to make sure the project is moving along, and a final presentation will describe the approach and results.

Office Hours

Office hours are TBD but will likely happen in the hours before the actual class. They are also possible by appointment; feel free to email me at any time you feel it necessary to meet!

Additional Resources