| Sponsor | Irfan Essa, 206 CCB, http://www.cc.gatech.edu/~irfan |
| Area | GVU / Intelligent Systems (and Computational Perception Lab / FCE) |
Problem
One of the most promising ways to increase the robustness of
Automatic Speech Recognition (ASR) systems is to make efficient
use of the visual information that can be perceived from the speaker's
face, in addition to the standard acoustical information. In human communication,
it is well known that hearing-impaired people are able to retrieve
some linguistic information from a speaker's face (lip-reading). Over the
past decades, many perceptual experiments have shown that even listeners
with no hearing loss benefit from visual information when
perceiving speech, especially when the acoustical environment is degraded
(background noise, other speakers). These experiments indicate
that speech intelligibility for humans is consistently higher when both modalities,
audio and video, are used. This intrinsically bimodal nature of speech
has been exploited to build automatic Audio-Visual Speech Recognition
(AVSR) systems. Which visual features best represent
speech, and how to combine them with acoustical data, are still open questions.
In a recent article, Reveret et al. propose a representation of facial
movements for speech production, based on a set of visual speech parameters.
Using visual tracking, these parameters can be extracted from a video sequence
of a speaker. The goal of this project is to implement an Audio-Visual
Speech Recognition system that combines acoustical information with these
visual speech parameters, which serve as the visual information.
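
As an illustration of what "combining" the two streams can mean, the sketch below shows one simple option, feature-level (early) fusion. The project description does not prescribe a fusion strategy, so this is only an assumption: the function names `resample_linear` and `fuse_early` are hypothetical, and the premise that the visual parameters arrive at a lower frame rate than the acoustical features is also assumed. The visual stream is linearly interpolated to the audio frame count, then each audio frame is concatenated with the corresponding visual frame.

```python
from typing import List, Sequence

Frame = Sequence[float]


def resample_linear(frames: Sequence[Frame], n_out: int) -> List[List[float]]:
    """Linearly interpolate a sequence of feature vectors to n_out frames."""
    n_in = len(frames)
    if n_in == 0 or n_out <= 0:
        return []
    if n_in == 1:
        # A single input frame can only be repeated.
        return [list(frames[0]) for _ in range(n_out)]
    out = []
    for i in range(n_out):
        # Position of output frame i on the input frame axis.
        pos = i * (n_in - 1) / (n_out - 1) if n_out > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


def fuse_early(audio: Sequence[Frame], visual: Sequence[Frame]) -> List[List[float]]:
    """Concatenate each audio frame with the time-aligned visual frame.

    Assumes both streams cover the same time span; the visual stream is
    resampled to the number of audio frames before concatenation.
    """
    visual_aligned = resample_linear(visual, len(audio))
    return [list(a) + v for a, v in zip(audio, visual_aligned)]


# Toy data: 5 audio frames (2 features each), 3 visual frames (1 parameter each).
audio_feats = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
visual_params = [[0.0], [1.0], [2.0]]
fused = fuse_early(audio_feats, visual_params)
```

Each fused frame can then be passed to a recognizer in place of the audio-only feature vector. Other strategies, such as combining the outputs of separate audio and visual recognizers, are equally plausible and are not ruled out by the description above.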
Here is what you will need to do