Robust Speech Recognition using Audio and Visual Information


Sponsor Irfan Essa
206 CCB
http://www.cc.gatech.edu/~irfan
Area GVU / Intelligent Systems (and Computational Perception Lab / FCE)

Problem
One of the most promising ways to increase the robustness of Automatic Speech Recognition (ASR) systems is to make efficient use of the visual information available from the speaker's face, in addition to the standard acoustic information. In human communication, it is well known that hearing-impaired people can retrieve some linguistic information from a speaker's face (lip-reading). Over the past decades, many perceptual experiments have shown that listeners with no hearing loss also benefit from visual information when perceiving speech, especially when the acoustic environment is degraded (background noise, other speakers). These experiments indicate that human speech perception is consistently better when both modalities, audio and video, are available. This intrinsically bimodal nature of speech has been exploited to build automatic Audio-Visual Speech Recognition (AVSR) systems. Which visual features best represent speech, and how to combine them with acoustic data, are still open questions.

In a recent article, Reveret et al. propose a representation of facial movements for speech production, based on a set of visual speech parameters. Using visual tracking, these parameters can be extracted from a video sequence of a speaker. The goal of this project is to implement an Audio-Visual Speech Recognition system that combines acoustic information with these visual speech parameters.
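
One simple way to combine the two streams is feature-level ("early") fusion: bring the visual parameters to the audio frame rate, then concatenate the two vectors frame by frame before passing them to the recognizer. The sketch below illustrates that idea only. It is not the method of Reveret et al. or a required design for this project, and the function names, the linear interpolation, and the frame counts are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def upsample_linear(frames: List[Vector], n_out: int) -> List[Vector]:
    """Linearly interpolate a sequence of vectors to n_out frames.

    Assumption: video is sampled more slowly than audio features, so the
    visual parameters must be stretched to the audio frame count.
    """
    if not frames:
        raise ValueError("need at least one visual frame")
    if n_out <= 0:
        return []
    if len(frames) == 1 or n_out == 1:
        return [list(frames[0]) for _ in range(n_out)]
    scale = (len(frames) - 1) / (n_out - 1)
    out = []
    for i in range(n_out):
        pos = i * scale                       # fractional index into frames
        lo = min(int(pos), len(frames) - 2)   # left neighbour
        w = pos - lo                          # weight of right neighbour
        out.append([(1 - w) * a + w * b
                    for a, b in zip(frames[lo], frames[lo + 1])])
    return out


def fuse_early(audio: List[Vector], visual: List[Vector]) -> List[Vector]:
    """Concatenate each audio frame with the time-aligned visual frame."""
    visual_up = upsample_linear(visual, len(audio))
    return [a + v for a, v in zip(audio, visual_up)]


# Hypothetical toy data: 5 audio frames (2 features each) and
# 3 visual frames (1 parameter each) covering the same time span.
audio = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
visual = [[0.0], [1.0], [2.0]]
fused = fuse_early(audio, visual)
```

Each fused frame here has three components: two acoustic features followed by one interpolated visual parameter. The alternative, decision-level ("late") fusion, would run separate audio and visual recognizers and combine their scores; choosing between the two is part of the open question noted above.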
 

Here is what you will need to do

Background Deliverables Evaluation
Based on the report turned in to the sponsor of the project by the due date.