Course Information
This will be a seminar- and project-based class, with a reading list covering fundamentals (Transformers, etc.), Multi-Modal Models (LLaVA, BLIP, etc.), and emerging areas including Open-Vocabulary Perception, Vision-Language Reasoning, Multi-Modal Decision-Making Agents, and Embodied AI. Because the course is largely discussion-driven, participation will be required.
Recommended Background
We will assume that you already have some background in machine learning, deep learning/neural networks, and ideally at least one of the relevant modalities (e.g., computer vision or natural language processing). We do not expect deep experience in every modality; it is OK to be strong in one modality and weaker in another. We will also assume that you are comfortable executing an interesting project in this area.
Week # | Date | Topic | Presenters |
---|---|---|---|
1 | | Introduction to Vision-Language Models | |
2 | | Deep Dive into Transformers | |
3 | | Vision Transformers | |
4 | | Vision-Language Models: BLIP(-2), etc. | |
5 | | Vision-Language Models: LLaVA 1, 1.5, 1.6 | |
6 | | Open-Vocabulary Classification, Detection, Segmentation | |
7 | | General architectures (Unified-IO, etc.) and multi-modal models (audio, etc.) | |
8 | | Multi-Modal Reasoning and Question-Answering | |
9 | | VLMs for Decision-Making: Web and GUI Agents | |
10 | | VLMs for Embodied AI | |