Deep Learning for Perception
Georgia Tech, Spring 2015
Zsolt Kira
Additional Readings

Please note: These are OPTIONAL readings in case you are interested in learning more about a particular topic. You will not be tested on them for the course.

1 Background

1.1 Computer Vision

Visual Object Recognition, Grauman and Leibe
- Section 3.3: Covers some of the hand-coded features (SIFT, etc.)
- Chapter 8: Representation of Object Categories - Covers various representations of objects, bag of words, parts-based representations, etc.
- Chapter 11: Example Systems: Generic Object Recognition - Covers typical features (Viola Jones, HOG, etc.) and pipelines (bag of words, etc.)

1.2 Optimization

Shewchuk, J.R., 1994. An introduction to the conjugate gradient method without the agonizing pain. Carnegie Mellon University, Pittsburgh, PA.

2 Sparse Coding

Spatial Pyramid Matching (SPM) and Locality-Constrained variants
- J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. Proc. of CVPR’09, 2009.
- Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y., 2010. Locality-constrained linear coding for image classification, in: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, pp. 3360–3367.
Structured Sparse Coding
- Bengio, S., Pereira, F., Singer, Y., Strelow, D., 2009. Group Sparse Coding, in: Bengio, Y., Schuurmans, D., Lafferty, J.D., Williams, C.K.I., Culotta, A. (Eds.), Advances in Neural Information Processing Systems 22. Curran Associates, Inc., pp. 82–89.
- Huang, J., Zhang, T., Metaxas, D., 2011. Learning with structured sparsity. The Journal of Machine Learning Research 12, 3371–3412.
- Jenatton, R., Mairal, J., Obozinski, G., Bach, F., 2011. Proximal methods for hierarchical sparse coding. The Journal of Machine Learning Research 12, 2297–2334.
Deep/Multi-layered Sparse Coding
- Bo, L., Ren, X., Fox, D., 2011. Hierarchical matching pursuit for image classification: Architecture and fast algorithms, in: Advances in Neural Information Processing Systems. pp. 2115–2123.
- Cheng, H., Liu, Z., Yang, L., Chen, X., 2013.
- Bo, L., Ren, X., Fox, D., 2013. Multipath sparse coding using hierarchical matching pursuit, in: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, pp. 660–667.
- Lin, Y., Tong, Z., Zhu, S., Yu, K., 2010. Deep coding network, in: Advances in Neural Information Processing Systems. pp. 1405–1413.
- Yu, K., Lin, Y., Lafferty, J., 2011. Learning image representations from the pixel level via hierarchical sparse coding, in: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, pp. 1713–1720.
Applications
- Dong, H., Wang, B., Lu, C.-T., 2013. Deep sparse coding based recursive disaggregation model for water conservation, in: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence. AAAI Press, pp. 2804–2810.

3 Optimization Advice

Bengio, Y., 2012. Practical recommendations for gradient-based training of deep architectures. arXiv:1206.5533 [cs].

4 Sparse Autoencoders

Liou, C.-Y., Cheng, W.-C., Liou, J.-W., Liou, D.-R., 2014. Autoencoder for words. Neurocomputing 139, 84–96. doi:10.1016/j.neucom.2013.09.055
Le, Q., Ranzato, M., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J., Ng, A., 2012. Building high-level features using large scale unsupervised learning. Presented at the 29th International Conference on Machine Learning (ICML 2012), pp. 81–88.

5 Visualization

Mahendran, A., Vedaldi, A., 2014. Understanding Deep Image Representations by Inverting Them. arXiv:1412.0035 [cs].
Zeiler, Matthew D., and Rob Fergus. “Visualizing and Understanding Convolutional Neural Networks.” arXiv Preprint arXiv:1311.2901, 2013. http://arxiv.org/abs/1311.2901.

6 Applications

Atari: Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M., 2013. Playing Atari with Deep Reinforcement Learning. arXiv:1312.5602 [cs].
Go: Clark, C., Storkey, A., 2014. Teaching Deep Convolutional Neural Networks to Play Go. arXiv:1412.3409 [cs].
Text
- Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher and Hal Daumé III, A Neural Network for Factoid Question Answering over Paragraphs, Conference on Empirical Methods in Natural Language Processing (EMNLP 2014)
- K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. EMNLP 2014.
- Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, Andrew Y. Ng. Grounded Compositional Semantics for Finding and Describing Images with Sentences, Transactions of the Association for Computational Linguistics (TACL 2014).
- Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. “Sequence to sequence learning with neural networks.” Advances in Neural Information Processing Systems. 2014.
- Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 [cs].
- Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. “Distributed representations of words and phrases and their compositionality.” In Advances in Neural Information Processing Systems, pp. 3111-3119. 2013.
- Antoine Bordes, Xavier Glorot, Jason Weston and Yoshua Bengio (2012), Joint Learning of Words and Meaning Representations for Open-Text Semantic Parsing, in: Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS)
- Sutskever, I., Martens, J., Hinton, G.E., 2011. Generating text with recurrent neural networks, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11). pp. 1017–1024.
- Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011a). Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS’2011.
- Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning, C. D. (2011b). Semi-supervised recursive autoencoders for predicting sentiment distributions. In EMNLP’2011.
- Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P., 2011. Natural Language Processing (Almost) from Scratch. J. Mach. Learn. Res. 12, 2493–2537.
Speech/Acoustic
- Graves, A., Mohamed, A., Hinton, G., 2013. Speech recognition with deep recurrent neural networks. arXiv preprint arXiv:1303.5778.
- Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., others, 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE 29, 82–97.
- Dahl, G.E., Yu, D., Deng, L., Acero, A., 2012. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. Audio, Speech, and Language Processing, IEEE Transactions on 20, 30–42.
- Seide, F., Li, G., Yu, D., 2011. Conversational Speech Transcription Using Context-Dependent Deep Neural Networks, in: Interspeech. pp. 437–440.
Multi-Modal
- Images+Text: Andrej Karpathy, Li Fei-Fei, Deep Visual-Semantic Alignments for Generating Image Descriptions, Technical Report
- Video+Audio: Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y., 2011. Multimodal deep learning, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11). pp. 689–696.
Vision: Object Recognition
- Large Scale Visual Recognition Challenge
  - Wu, R., Yan, S., Shan, Y., Dang, Q., Sun, G., 2015. Deep Image: Scaling up Image Recognition. arXiv:1501.02876 [cs].
  - Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2014. Going Deeper with Convolutions. arXiv:1409.4842 [cs].
  - Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, arXiv:1409.0575, 2014.
  - Simonyan, K., Zisserman, A., 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 [cs].
  - He, K., Zhang, X., Ren, S., Sun, J., 2014. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. arXiv:1406.4729 [cs].
  - Krizhevsky, Alex, Ilya Sutskever, and Geoff Hinton. “ImageNet Classification with Deep Convolutional Neural Networks.” In Advances in Neural Information Processing Systems 25, 1106–14, 2012.
- Other Datasets/Competitions
  - Pascal VOC: Agrawal, P., Girshick, R., & Malik, J. (2014). Analyzing the Performance of Multilayer Neural Networks for Object Recognition. http://arxiv.org/pdf/1407.1610v1.pdf
  - Pascal VOC: Szegedy, C., Toshev, A., & Erhan, D. (2013). Deep neural networks for object detection. In Advances in Neural Information Processing Systems 2013 (pp. 2553-2561).
- Detection
  - Iandola, Forrest, Matt Moskewicz, Sergey Karayev, Ross Girshick, Trevor Darrell, and Kurt Keutzer. “DenseNet: Implementing Efficient ConvNet Descriptor Pyramids.” arXiv Preprint arXiv:1404.1869, 2014. http://arxiv.org/abs/1404.1869.
  - Erhan, D., Szegedy, C., Toshev, A., & Anguelov, D. Scalable Object Detection using Deep Neural Networks. CVPR 2014
  - Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y., 2013. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229.
  - Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2013). Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv preprint arXiv:1311.2524.
  - Giusti, A., Ciresan, D.C., Masci, J., Gambardella, L.M., Schmidhuber, J.: Fast image scanning with deep max-pooling convolutional neural networks. arXiv technical report (2013).
- Object Localization
  - Oquab, Maxime, Léon Bottou, Ivan Laptev, and Josef Sivic. “Learning and Transferring Mid-Level Image Representations Using Convolutional Neural Networks,” 2013. https://hal.inria.fr/hal-00911179/document.
  - Bergamo, A., Bazzani, L., Anguelov, D., Torresani, L., 2014. Self-taught Object Localization with Deep Networks. arXiv:1409.3964 [cs].
- Object Candidates (Selective Search)
  - Erhan, D., Szegedy, C., Toshev, A., & Anguelov, D. Scalable Object Detection using Deep Neural Networks. CVPR 2014
Vision: Semantic Segmentation/Scene Labeling
- Simultaneous detection and segmentation
- Eigen, D. and Fergus, R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. arXiv:1411.4734, 2014.
- Mostajabi, M., Yadollahpour, P., and Shakhnarovich, G. Feedforward semantic segmentation with zoom-out features. arXiv:1412.0774, 2014.
- Long, J., Shelhamer, E., Darrell, T., 2014. Fully Convolutional Networks for Semantic Segmentation. arXiv:1411.4038 [cs].
- Clement Farabet, Camille Couprie, Laurent Najman and Yann LeCun, Learning Hierarchical Features for Scene Labeling, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
Vision: Place Recognition
- Chen, Zetao, Obadiah Lam, Adam Jacobson, and Michael Milford. “Convolutional Neural Network-Based Place Recognition.” In ARC Centre of Excellence for Robotic Vision; School of Electrical Engineering & Computer Science; Institute for Future Environments; Science & Engineering Faculty. The University of Melbourne, Victoria, Australia, 2014. http://eprints.qut.edu.au/79662/.
Vision: Descriptor Matching
- Fischer, P., Dosovitskiy, A., Brox, T., 2014. Descriptor Matching with Convolutional Neural Networks: a Comparison to SIFT. arXiv:1405.5769 [cs].
Vision: Pose recognition
- Toshev, A., & Szegedy, C. DeepPose: Human pose estimation via deep neural networks. CVPR 2014.
Other
- Dong, H., Wang, B., Lu, C.-T., 2013. Deep sparse coding based recursive disaggregation model for water conservation, in: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence. AAAI Press, pp. 2804–2810.

7 Recent Developments

Theoretical
- Arora, S., Bhaskara, A., Ge, R., & Ma, T. Provable bounds for learning some deep representations. ICML 2014
Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S., 2014. CNN Features off-the-shelf: an Astounding Baseline for Recognition. arXiv:1403.6382 [cs].
Salakhutdinov, R., Tenenbaum, J.B., Torralba, A., 2013. Learning with Hierarchical-Deep Models. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1958–1971. doi:10.1109/TPAMI.2012.269
