Course Information
This will be a seminar- and project-based class, with a reading list covering fundamentals (Transformers, etc.), Multi-Modal Models (LLaVA, BLIP, etc.), and emerging areas including Open-Vocabulary Perception, Vision-Language Reasoning, Multi-Modal Decision-Making Agents, and Embodied AI. Because the course is largely discussion-driven, participation will be required.
Recommended Background
We will assume that you already have some background in machine learning, deep learning/neural networks, and ideally at least one of the relevant modalities (e.g., computer vision or natural language processing). We do not expect deep experience in every modality; it is OK to be strong in one modality and weaker in another. We will also assume that you are comfortable executing an interesting project in this area.
Week # | Date | Topic | Presenters |
---|---|---|---|
1 | | Introduction to Vision-Language Models | |
2 | | Deep Dive into Transformers | |
3 | | Vision Transformers | |
4 | | Vision-Language Models: BLIP(-2), etc. | |
5 | | Vision-Language Models: LLaVA 1, 1.5, 1.6 | |
6 | | Open-Vocabulary Classification, Detection, Segmentation | |
7 | | General architectures (Unified-IO, etc.) and multi-modal models (audio, etc.) | |
8 | | Multi-Modal Reasoning and Question-Answering | |
9 | | VLMs for Decision-Making: Web and GUI Agents | |
10 | | VLMs for Embodied AI | |