So far we have implemented the initial stages (mostly data processing and some initial analysis) of using the muscle movement around the mouth region to recognize spoken words.
The idea behind this approach is that the muscle movement that alters the shape of the lips should generalize better for speech recognition than direct matching of mouth shape. The shape of the mouth varies from person to person, but the muscle motion needed to produce a given sound should be largely the same, since it is determined by the physiology of speech production. It should therefore be possible to identify the spoken word from the motion data around the mouth region.
Our Overall Approach (see project proposal):
We had each person pronounce the sounds a, e, i, o, u in sequence, with the neutral mouth position between the vowels as well as at the beginning and the end of the whole sequence. We recorded two such sequences per person. To simplify the program, we asked each person not to move their head while pronouncing these sounds. We adjusted the camera so that the head was not at an angle and the area in view was the lower part of the face (a little bit of the nose and the full chin).
Currently we have three volunteers. We hope to collect more samples over time, as we refine the data processing method.
Data Processing
Before converting the data into other formats, we used moviemaker
to cut each recording into its individual vowels. This could probably
be done programmatically, since we have the neutral position between
each pair of sounds: the optic-flow values at these intermediate frames
are small, so a threshold on the flow magnitude should be enough to
segment the sounds.
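The thresholding idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `segment_by_flow`, the `min_length` parameter, and the sample magnitudes are hypothetical, and the input is assumed to be one average flow magnitude per frame. Real recordings would probably need smoothing first, since the flow can also dip while a vowel is being held.

```python
def segment_by_flow(magnitudes, threshold, min_length=2):
    """Return (start, end) frame pairs (end exclusive) for runs of frames
    whose flow magnitude is above the threshold.

    Runs shorter than min_length frames are discarded as noise.
    """
    segments = []
    start = None
    for i, m in enumerate(magnitudes):
        if m > threshold and start is None:
            start = i                      # a new active run begins
        elif m <= threshold and start is not None:
            if i - start >= min_length:    # keep only long enough runs
                segments.append((start, i))
            start = None
    # Close a run that is still open at the end of the sequence.
    if start is not None and len(magnitudes) - start >= min_length:
        segments.append((start, len(magnitudes)))
    return segments


# Made-up per-frame magnitudes: two bursts of motion separated by neutral frames.
flow = [0.1, 0.1, 0.9, 1.2, 0.8, 0.1, 0.0, 0.1, 1.1, 1.4, 0.2]
segments = segment_by_flow(flow, threshold=0.5)
```

The threshold value itself would have to be chosen by looking at the flow magnitudes of the neutral frames in the actual recordings.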
Because the program needs single-channel raw data as input, we first extracted the frames from the QuickTime movie as rle images, converted these rle images to grayscale (one channel), converted the grayscale images to raw files, and concatenated them into one file. We used programs on the SGI machine for these tasks: dmconvert, tobw, and rletoraw. If only there were a way to convert the movie file directly to gray-valued raw data...
Once we have the sequence of one-channel raw data, we pipe it into the optic-flow code provided on the SGI to generate the flow data. From there, we compute the average optical flow in 4 defined window regions, which gives 8 time series (4 regions, average optical flow in the x and y directions) as data for analysis.
We manually defined four rectangular windows around the mouth (above, below, left, and right) for each sequence and each person. We tried to identify the regions automatically using matlab, but edge detection alone makes it difficult to tell the edges of the lips apart from other edges such as the nose and the background. Since we are more concerned with the data analysis and recognition task, we dropped the automatic tracking part and segmented the images manually.
Once we have the position and size of the four windows, we compute the average optical flow vector in each window. That is, we have two velocity components (u, v) for each of the four windows, so the entire action is described by eight feature vectors.
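For illustration, the grayscale and concatenation steps could look like the sketch below once the frames are available as RGB pixel arrays. It does not reproduce what dmconvert, tobw, or rletoraw do internally: the frame layout (a list of rows of `(r, g, b)` tuples), the function names, and the use of the standard Rec. 601 luminance weights are all assumptions.

```python
def frame_to_gray_bytes(frame):
    """Convert one RGB frame to single-channel bytes, one byte per pixel.

    frame: list of rows, each row a list of (r, g, b) tuples in 0..255.
    Uses the Rec. 601 luminance weights (an assumption, see above).
    """
    out = bytearray()
    for row in frame:
        for r, g, b in row:
            out.append(int(round(0.299 * r + 0.587 * g + 0.114 * b)))
    return bytes(out)


def frames_to_raw(frames):
    """Concatenate the grayscale bytes of all frames into one raw buffer."""
    return b"".join(frame_to_gray_bytes(f) for f in frames)


# Two tiny made-up 1x2 frames: white/black, then red/black.
frames = [
    [[(255, 255, 255), (0, 0, 0)]],
    [[(255, 0, 0), (0, 0, 0)]],
]
raw = frames_to_raw(frames)
```

The resulting buffer holds the frames back to back with no header, so the reader of the raw data has to know the frame width and height separately.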
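The window averaging can be sketched as below. The layout of the flow field (two 2D lists `u` and `v` indexed as `[row][column]`), the `(x, y, width, height)` window format, and the function names are assumptions made for this example, not the format of the SGI optic-flow output.

```python
def window_mean_flow(u, v, window):
    """Average the (u, v) flow components over one rectangular window.

    window: (x, y, width, height) in pixels, top-left corner first.
    """
    x0, y0, w, h = window
    total_u = total_v = 0.0
    for row in range(y0, y0 + h):
        for col in range(x0, x0 + w):
            total_u += u[row][col]
            total_v += v[row][col]
    n = w * h
    return total_u / n, total_v / n


def frame_features(u, v, windows):
    """Return the 8 features for one frame: (u, v) means for 4 windows."""
    features = []
    for window in windows:
        features.extend(window_mean_flow(u, v, window))
    return features


# Made-up 4x4 flow field: u is constant, v equals the row index.
u = [[1.0] * 4 for _ in range(4)]
v = [[float(r)] * 4 for r in range(4)]
# Windows above, below, left of, and right of a central "mouth".
windows = [(1, 0, 2, 1), (1, 3, 2, 1), (0, 1, 1, 2), (3, 1, 1, 2)]
features = frame_features(u, v, windows)
```

Calling `frame_features` on every frame of a sequence and collecting the results per component gives the eight curves over time.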
Initial Results
From the flow data computed in the previous stage, we obtain a temporal
curve for each feature vector in each sequence. Examples of the template
patterns for the vowels /i/ and /u/ are shown below. Each diagram
contains eight curves, each representing one feature vector associated
with the pronounced vowel.
template pattern for /i/
template pattern for /u/
As the flow data show, different people pronounce the same word
differently, and even for the same person the pattern is not always
consistent. We can still identify some similar patterns, however.
The feature vectors for the same vowel have different lengths, because
different people, and even the same person, pronounce the same word
differently each time. To analyze the data, we first resample the
sequences so that all instances of the same word have the same length.
We use the shortest sequence length as the standard length and
downsample the other sequences to it. The result is shown in the
following figures. In each diagram the feature vectors are plotted over
the whole sequence, including the vowels /a/, /e/, /i/, /o/, and /u/.
Each vowel is 12 frames long.
Resampled data (figure: Flow_2.jpg)
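One way to carry out the resampling step is sketched below. The text above does not say which downsampling method was used, so the linear interpolation here is an assumption, as are the function names and the sample values.

```python
def resample(sequence, target_length):
    """Resample a 1D sequence to target_length points by linear interpolation."""
    n = len(sequence)
    if target_length == 1 or n == 1:
        return [float(sequence[0])] * target_length
    out = []
    for k in range(target_length):
        # Position of output sample k on the input time axis.
        pos = k * (n - 1) / (target_length - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append((1 - frac) * sequence[lo] + frac * sequence[hi])
    return out


def resample_to_shortest(sequences):
    """Bring all sequences to the length of the shortest one."""
    target = min(len(s) for s in sequences)
    return [resample(s, target) for s in sequences]


# Made-up curves of one feature for the same vowel, spoken at two speeds.
slow = [0.0, 1.0, 2.0, 3.0, 4.0]
fast = [0.0, 2.0, 4.0]
aligned = resample_to_shortest([slow, fast])
```

Each of the eight feature curves of a vowel would be resampled in the same way, so that every instance of that vowel ends up with the same number of frames.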
Our next step is to apply principal component analysis to the data we
have so far, identify the main feature vectors, and perform word
matching.
Also, we will get more training data to refine our results.
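As a rough picture of what this next step might look like, the sketch below runs PCA through a singular value decomposition and matches a new sample to its nearest training sample in the reduced space. Everything here is an assumption rather than a result: the data are synthetic, each utterance is assumed to be flattened into one vector of 8 x 12 = 96 values, nearest-neighbour matching is only one possible matching rule, and NumPy is assumed to be available.

```python
import numpy as np


def fit_pca(samples, n_components):
    """Return the mean and the top principal directions of the samples.

    samples: array of shape (n_samples, n_features).
    """
    mean = samples.mean(axis=0)
    centered = samples - mean
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]


def project(samples, mean, components):
    """Project samples onto the principal directions."""
    return (samples - mean) @ components.T


def nearest_label(query, train_proj, labels):
    """Label of the training sample closest to the query in PCA space."""
    distances = np.linalg.norm(train_proj - query, axis=1)
    return labels[int(np.argmin(distances))]


# Synthetic stand-ins for two vowel templates plus noise (not real data).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 96)
prototypes = {"i": np.sin(t), "u": np.cos(t)}
labels = ["i", "i", "i", "u", "u", "u"]
train = np.stack(
    [prototypes[name] + 0.05 * rng.standard_normal(96) for name in labels]
)

mean, components = fit_pca(train, n_components=2)
train_proj = project(train, mean, components)

query = prototypes["u"] + 0.05 * rng.standard_normal(96)
predicted = nearest_label(project(query, mean, components), train_proj, labels)
```

With only three volunteers, the number of usable components is limited by the number of training samples, which is one more reason to collect more data.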