Progress Report: Lipreading


Group Member:

So far we have implemented the initial stages of the project (mostly data processing, plus some preliminary analysis) toward recognizing spoken words from the muscle movement around the mouth region.

The idea behind this approach is that the muscle movement that changes the shape of the lips should generalize better for speech recognition than direct matching of mouth shape. The shape of the mouth varies from person to person, but the muscle motion needed to produce a given sound should be largely the same, because it is determined by the physiology of speech production. It should therefore be possible to identify a spoken word from the motion data around the mouth region.

Our Overall Approach (see project proposal):

  1. Data Collection
  2. Data Processing
  3. Feature Analysis and Pattern Classifying
  4. Performance Measurement


PROCEDURES

Data Collection
We used the O2Cam on the SGI O2 machines to capture our sample sequences. Because the movie sequence must not lose any frames, we had to reduce the sampling rate to 10 frames per second. To save space, we also reduced the image size to 160x120, a quarter of the standard size. We store the movie sequences in uncompressed QuickTime format so that compression does not introduce unnecessary error into the analysis.

We had each person pronounce the sounds a, e, i, o, u in sequence, returning to the neutral mouth position between vowels as well as at the beginning and end of the whole sequence. We recorded two such sequences per person. To simplify the program, we asked each person not to move their head while pronouncing the sounds. We adjusted the camera so that the person's head is not at an angle and the area in view is the lower part of the face (a little of the nose and the full chin).

We currently have three volunteers. We hope to collect more samples over time, as we refine the data processing method.


Data Processing
Before converting the data into other formats, we used moviemaker to cut each sequence into its individual vowels. This could probably be done programmatically, since there is a neutral sequence between each pair of sounds: the optic-flow values in these intermediate frames are small, so a threshold on the flow should be enough to segment the sounds.
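The thresholding idea could be sketched roughly as follows. This is only an illustration, not something we have implemented: the function name, the threshold value, the minimum segment length, and the sample magnitudes are all hypothetical, and the input is assumed to be one mean flow magnitude per frame.

```python
def segment_by_motion(magnitudes, threshold, min_length=2):
    """Return (start, end) frame ranges (end exclusive) whose flow exceeds threshold.

    magnitudes: one mean optical-flow magnitude per frame (assumed input).
    min_length: hypothetical guard that drops very short bursts of motion.
    """
    segments = []
    start = None
    for i, m in enumerate(magnitudes):
        if m > threshold and start is None:
            start = i  # motion begins: open a segment
        elif m <= threshold and start is not None:
            if i - start >= min_length:
                segments.append((start, i))
            start = None  # back to the neutral position
    # Close a segment that runs to the end of the sequence.
    if start is not None and len(magnitudes) - start >= min_length:
        segments.append((start, len(magnitudes)))
    return segments


# Made-up magnitudes: neutral, sound, neutral, sound, neutral.
flow = [0.1, 0.1, 1.2, 1.5, 0.9, 0.1, 0.2, 1.1, 1.4, 0.1]
print(segment_by_motion(flow, threshold=0.5))  # [(2, 5), (7, 9)]
```

A real sequence may show little motion while a vowel is held, so a practical version would probably need to merge nearby segments; that is left out of this sketch.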

Because the program needs single-channel raw data as input, we first extract the QuickTime movie into rle images, convert these rle images to grayscale (one channel), convert the grayscale images to raw files, and cat them into one file. We use the following programs on the SGI machines for these tasks: dmconvert, tobw, and rletoraw. It would be simpler if there were a way to convert the movie file directly to gray-valued raw data.

Once we have the sequence of one-channel raw data, we pipe it into the optic-flow code provided on the SGI machines to generate the flow data. From there, we compute the average optical flow in each of 4 defined window regions, which gives 8 time series (4 regions, each with an average flow in the x and y directions) as the data for analysis.

We manually define four rectangular windows around the mouth (above, below, left, and right) for each sequence and each person. We tried to identify the regions automatically using matlab, but with edge detection it is difficult to tell the edges of the lips apart from other edges such as the nose and the background. Since we are more concerned with the data analysis and recognition task, we dropped the automatic tracking part and segment the images manually.

Once we have the position and size of the four windows, we compute the average of the optical flow vectors in each window. This gives two velocity components (u,v) for each of the four windows, so the entire action is described by eight feature vectors.
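As an illustration of the windowed averaging, here is a minimal sketch using NumPy. It is not the code we ran: the function name, the array layout (frames, height, width), the (x, y, width, height) window format, and the window coordinates in the example are all assumptions made for the sketch.

```python
import numpy as np


def window_features(u, v, windows):
    """Average the flow components inside each window, frame by frame.

    u, v: flow components, arrays of shape (frames, height, width) (assumed layout).
    windows: list of (x, y, width, height) rectangles in pixels (assumed format).
    Returns an array of shape (frames, 2 * len(windows)), ordered
    u1, v1, u2, v2, ... so four windows give eight feature vectors.
    """
    features = []
    for (x, y, w, h) in windows:
        features.append(u[:, y:y + h, x:x + w].mean(axis=(1, 2)))
        features.append(v[:, y:y + h, x:x + w].mean(axis=(1, 2)))
    return np.stack(features, axis=1)


# Hypothetical windows above, below, left of, and right of the mouth
# in a 160x120 image; the coordinates are made up for the example.
windows = [(60, 30, 40, 20), (60, 80, 40, 20), (20, 50, 30, 30), (110, 50, 30, 30)]
u = np.zeros((12, 120, 160))
v = np.zeros((12, 120, 160))
print(window_features(u, v, windows).shape)  # (12, 8)
```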

Initial Results
From the flow data computed in the previous stage, we obtain a temporal curve for each feature vector in each sequence. Examples of the template patterns for the vowels /i/ and /u/ are shown below. Each diagram contains eight curves, each representing one feature vector associated with the pronounced vowel.

template pattern for /i/

template pattern for /u/

As the flow data show, different people pronounce the same word differently, and even for the same person the pattern is not always consistent. Still, we can identify some similar patterns. The feature vectors for the same vowel have different lengths, because the duration of a vowel varies between people and between repetitions by the same person. To analyze the data, we first resample the sequences so that all instances of the same word have the same length: we take the shortest sequence length as the standard and downsample the other sequences to it. The result is shown in the following figures. In each diagram, the feature vectors are plotted over the whole sequence, which includes the vowels /a/, /e/, /i/, /o/, and /u/. Each vowel is 12 frames long.
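One way to resample to the shortest length is sketched below. The report does not depend on a particular resampling method; this sketch assumes linear interpolation over a normalized time axis (via NumPy's np.interp), and the function names are hypothetical.

```python
import numpy as np


def resample_to_length(sequence, target_length):
    """Resample a (frames, features) sequence to target_length frames.

    Linear interpolation is an assumption of this sketch.
    """
    sequence = np.asarray(sequence, dtype=float)
    old_t = np.linspace(0.0, 1.0, num=sequence.shape[0])
    new_t = np.linspace(0.0, 1.0, num=target_length)
    # Interpolate each feature curve independently along the time axis.
    columns = [np.interp(new_t, old_t, sequence[:, k])
               for k in range(sequence.shape[1])]
    return np.stack(columns, axis=1)


def resample_to_shortest(sequences):
    """Bring all sequences of the same vowel to the shortest length among them."""
    target = min(len(s) for s in sequences)
    return [resample_to_length(s, target) for s in sequences]


# Two made-up recordings of one vowel, 15 and 12 frames, 8 features each.
a = np.random.rand(15, 8)
b = np.random.rand(12, 8)
print([s.shape for s in resample_to_shortest([a, b])])  # [(12, 8), (12, 8)]
```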

Resampled data

Our next step is to apply principal component analysis to the data collected so far, identify the main feature vectors, and perform word matching. We will also collect more training data to refine our results.
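As a rough sketch of the planned analysis, principal component analysis could be computed with a singular value decomposition, as below. This is an assumption about how the step might be done, not a description of existing code: the function name is hypothetical, and each sample is assumed to be one resampled vowel flattened into a single vector (12 frames x 8 features = 96 values).

```python
import numpy as np


def pca(samples, n_components):
    """Principal component analysis via SVD of the mean-centered data.

    samples: array of shape (n_samples, n_dims), at least two samples.
    Returns (mean, components, explained_variance, projected).
    """
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)
    centered = samples - mean
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    explained_variance = singular_values[:n_components] ** 2 / (samples.shape[0] - 1)
    projected = centered @ components.T
    return mean, components, explained_variance, projected


# Made-up data: 6 vowel instances, each flattened to 96 values.
data = np.random.rand(6, 96)
mean, components, variance, projected = pca(data, n_components=3)
print(components.shape, projected.shape)  # (3, 96) (6, 3)
```

Word matching could then compare a new sample with the stored templates in the projected space, for example by nearest distance, though we have not yet settled on the matching method.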