Computers cannot fully understand spoken language without access to the wide
range of modalities that accompany speech. This thesis addresses the
particularly expressive modality of hand gesture, and focuses on building
structured statistical models at the intersection of speech, vision, and meaning.
My approach is distinguished in two key respects. First, gestural patterns are leveraged to discover parallel structures in the meaning of the associated speech. This differs from prior work that attempted to interpret individual gestures directly, an approach that struggled to generalize across speakers. Second, I present novel, structured statistical models for multimodal language processing, which enable learning about gesture in its linguistic context, rather than in the abstract.
These ideas find successful application in a variety of language processing tasks: resolving ambiguous noun phrases, segmenting speech into topics, and producing keyframe summaries of spoken language. In all three cases, the addition of gestural features -- extracted automatically from video -- yields significantly improved performance over a state-of-the-art text-only alternative. This marks the first demonstration that hand gesture improves automatic discourse processing.
Winner of the George M. Sprowls Award for the best theses in Computer Science at MIT.
Cite as technical report MIT-CSAIL-TR-2008-027.
Defended April 14, 2008; submitted May 7, 2008.