In this project we want to analyze lip motion together with the associated spoken words, in order to develop a lipreading system that uses visual features to recognize speech.
The questions we want to answer are: which visual features of the lips carry important speech information, and what representation would more accurately capture the characteristics of lip motion for speech recognition? We have several ideas for different approaches:
Speech recognition is a complex problem. Since we cannot solve the whole problem in this project, we want to understand the basic ideas behind extracting visual information that aids speech recognition when combined with audio input, and to apply some of the methods described in the paper.
To make our problem less complicated, we decided that the goal for this project is for the program to recognize a finite set of vowels. This is mainly because each vowel, when pronounced, produces a fairly distinct mouth shape. We will also place some restrictions on the data that we will be using.
To simplify the processing of the data, we have decided that the motion sequence for every mouth shape will have a fixed beginning and a fixed ending. The test subject will speak the words without moving their head.
2. Data Processing
Currently we plan to use an optical flow method, as described in the paper
"Automatic Lipreading by Optical Flow" by Kenji Mase and Alex Pentland (IEEE
paper: Special Issue on Computer Vision and Its Applications - June 1990).
That is, we estimate the direction of muscle actions around the lip region
by measuring optical flow. We then compute the optical flow in each of the
four windows around the mouth and examine the velocity patterns to define
a feature vector for each word.
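
As a rough illustration of the feature step above, the following Python
sketch averages the flow velocity inside each window and concatenates the
results. It assumes the dense flow field has already been computed by some
other routine and is stored as flow[y][x] = (u, v). The window coordinates,
the function names, and the choice of mean velocity as the per-window
feature are our own assumptions for illustration, not details taken from
the paper.

```python
from typing import List, Sequence, Tuple

Flow = List[List[Tuple[float, float]]]  # flow[y][x] = (u, v) velocity at a pixel
Window = Tuple[int, int, int, int]      # (x0, y0, x1, y1), end-exclusive (hypothetical layout)


def window_mean_velocity(flow: Flow, window: Window) -> Tuple[float, float]:
    """Average the (u, v) flow vectors over one rectangular window."""
    x0, y0, x1, y1 = window
    count = (x1 - x0) * (y1 - y0)
    if count <= 0:
        raise ValueError("window must have positive area")
    sum_u = sum_v = 0.0
    for y in range(y0, y1):
        for x in range(x0, x1):
            u, v = flow[y][x]
            sum_u += u
            sum_v += v
    return sum_u / count, sum_v / count


def frame_features(flow: Flow, windows: Sequence[Window]) -> List[float]:
    """Concatenate the mean (u, v) of every window into one vector per frame."""
    features: List[float] = []
    for window in windows:
        features.extend(window_mean_velocity(flow, window))
    return features


def sequence_features(flows: Sequence[Flow], windows: Sequence[Window]) -> List[float]:
    """Assumed word-level feature: per-frame vectors joined in time order."""
    features: List[float] = []
    for flow in flows:
        features.extend(frame_features(flow, windows))
    return features
```

With four windows, each frame contributes eight numbers. Joining frames in
time order only gives comparable vectors when every sequence has the same
number of frames, so this part in particular should be read as a
placeholder.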
3. Pattern Classification and Recognition
We will then define a similarity function that will help us recognize
unknown data: we calculate the feature vector for the unknown data and
match it against the feature vectors of the known words.
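
The similarity function has not been chosen yet. As one possible starting
point, the sketch below uses cosine similarity and picks the stored
template with the highest score. Both the choice of cosine similarity and
the names used here (cosine_similarity, classify, templates) are
assumptions made for illustration only.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no direction, treat as dissimilar
    return dot / (norm_a * norm_b)


def classify(unknown: Sequence[float],
             templates: Dict[str, Sequence[float]]) -> Tuple[str, float]:
    """Return the label of the most similar template and its score."""
    if not templates:
        raise ValueError("at least one template is required")
    best_label = ""
    best_score = -math.inf
    for label, vector in templates.items():
        score = cosine_similarity(unknown, vector)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score
```

For example, with templates {"a": [1.0, 0.0], "o": [0.0, 1.0]}, the call
classify([0.9, 0.1], templates) selects "a". A distance measure such as
Euclidean distance could be swapped in without changing the matching loop.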
4. Performance Measurement
To test whether our approach works, we need to obtain more unknown data
(test data separate from the training data) and measure the recognition
performance on it. (This is the very last part of the project, however,
and it might be skipped.)
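
If this step is carried out, one simple measure is the fraction of test
samples recognized correctly, together with a count of which vowels get
confused with which. The sketch below assumes a classifier is available as
a function from a feature vector to a label. The name evaluate and the data
layout are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def evaluate(predict: Callable[[Sequence[float]], str],
             test_set: Sequence[Tuple[Sequence[float], str]]):
    """Return (accuracy, confusion counts) over held-out labeled samples.

    confusion[(true_label, predicted_label)] counts how often each pair occurs.
    """
    if not test_set:
        raise ValueError("test_set must not be empty")
    confusion = Counter()
    correct = 0
    for features, true_label in test_set:
        predicted = predict(features)
        confusion[(true_label, predicted)] += 1
        if predicted == true_label:
            correct += 1
    return correct / len(test_set), confusion
```

The confusion counts show which pairs of vowels are mistaken for each
other, which a single accuracy number would hide.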