CS7322: Computer Vision II: Final Project Progress Report



 

Real Time Gesture Recognition on a Mobile Robot Platform

Rawesak Tanawongsuwan

Alexander Stoytchev



 
 

Tracking of hands:

So far we have achieved fairly robust tracking of human hands under adverse lighting conditions.

We maintain several different images: frame-by-frame differences, blobs of skin color, motion energy, and motion history. Using statistical techniques we can identify the regions of the image that have skin-like colors. However, some objects, such as doors, have the same color. To filter those out we perform a binary AND between the motion history image and the skin-color image. This leaves only the blobs of skin color that moved recently. We then dilate the resulting image and AND it again with the skin-color image. This dilate-and-AND step is repeated 5 times, which we found empirically to be enough to recover the whole blobs of the head and the two hands. Finally, a median filter is applied to the result to get rid of any single-pixel noise.
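The filtering steps described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not our actual code: the function names, the 3x3 structuring element, the 3x3 median window, and the representation of masks as lists of 0/1 rows are all assumptions made for the example.

```python
def binary_and(a, b):
    """Pixel-wise AND of two equally sized binary masks (lists of 0/1 rows)."""
    return [[p & q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def dilate(mask):
    """Dilate with a 3x3 square structuring element (an assumed choice)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for ny in range(max(0, y - 1), min(h, y + 2)):
                    for nx in range(max(0, x - 1), min(w, x + 2)):
                        out[ny][nx] = 1
    return out


def median3x3(mask):
    """3x3 median of a binary mask; pixels outside the image count as 0."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ones = sum(
                mask[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            )
            # The median of nine binary values is 1 when at least five are 1.
            out[y][x] = 1 if ones >= 5 else 0
    return out


def recover_moving_skin(skin, motion_history, iterations=5):
    """Keep skin blobs that moved recently, then regrow them inside the skin mask."""
    seed = binary_and(skin, motion_history)        # skin pixels that moved
    for _ in range(iterations):
        seed = binary_and(dilate(seed), skin)      # grow, but only into skin
    return median3x3(seed)                         # remove single-pixel noise
```

Because every dilation is followed by an AND with the skin mask, a blob can only grow into pixels that are skin colored, so a static skin-colored object such as a door never receives a seed and stays empty.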

Next we label the blobs that remain and, in the same pass, calculate the size of each blob. The three largest blobs are assumed to be the head and the two hands. Their centroids are tracked over time.
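The labeling step can be illustrated with a small flood-fill sketch. It is only an approximation of the idea: the function names are hypothetical, and the use of 4-connectivity is an assumption, since the choice of connectivity is not fixed by the description above.

```python
from collections import deque


def label_blobs(mask):
    """Return a list of blobs; each blob is a list of (row, col) pixels.

    Uses 4-connectivity (an assumption made for this sketch).
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue = deque([(y, x)])
                pixels = []
                while queue:
                    cy, cx = queue.popleft()
                    pixels.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                blobs.append(pixels)
    return blobs


def largest_centroids(mask, count=3):
    """Return (row, col, size) for the `count` largest blobs, largest first."""
    blobs = sorted(label_blobs(mask), key=len, reverse=True)[:count]
    return [
        (sum(p[0] for p in b) / len(b), sum(p[1] for p in b) / len(b), len(b))
        for b in blobs
    ]
```

Each blob's size is simply the number of pixels collected during the fill, so size and label are obtained together, and the centroid is the mean of the pixel coordinates.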
 
 

Recognition:

The main problem in recognition is that it is hard to know when a given gesture starts and ends. To ease this task we have integrated into our code the input from a commercially available speech recognition package (DragonDictate). The spoken command gives the algorithm an additional cue about when a gesture takes place. Right now the user is required to say something during or after performing a gesture. For example, the user will make a pointing gesture to the left or right side and say "fetch". Presumably there is a ball in that direction which the robot is supposed to fetch.

We perform gesture recognition by training a Neural Network (NN) on the motion energy of several gestures. When the user issues the voice command, the vision system takes the current motion energy profile and runs it through the already trained neural network, which reports which gesture was performed. If this was a "point to right" or "point to left" gesture, the tracking mode is switched from tracking hands to tracking a ball and the appropriate movement commands are sent to the robot.

The neural network was trained on three gestures (or, to be specific, on their motion energy). We have achieved 100% accuracy for the gestures that we have trained the network on. The result is not that impressive, because the number of gestures is small and the gestures are quite distinct from one another. A better approach would be to train the network on more gestures and on their motion history images. However, since we are not interested in this static type of recognition, we did not try to train the network on more gestures. This experiment was done to prove that we can have a complete system that combines vision, NN, speech, and robot commands. Later we will replace the recognition module with a learning module based on Hidden Markov Models (HMMs).
 

 

Future Work:

We have already started work on integrating our code with the HTK toolkit for modeling and training Hidden Markov Models (HMMs). We have familiarized ourselves with the capabilities of that package and we think that it is appropriate for our needs. In the next couple of days we will gather more training data and train a separate HMM for each gesture that we would like to recognize.
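To illustrate how one HMM per gesture would be used at recognition time, here is a minimal sketch that scores an observation sequence against each model with the scaled forward algorithm and picks the most likely gesture. It does not use HTK. The discrete symbols (0 = hand still, 1 = moving left, 2 = moving right), the two-state models, and all probabilities are hand-set, hypothetical values chosen for the example, not trained parameters.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")      # sequence impossible under this model
        alpha = [a / scale for a in alpha]   # rescale to avoid underflow
        log_prob += math.log(scale)
    return log_prob


def classify(obs, models):
    """Return the name of the model with the highest log-likelihood."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))


# Hypothetical models: (start, transition, emission) per gesture.
# State 0 = hand at rest, state 1 = hand moving; left-to-right topology.
MODELS = {
    "point_left": ([1.0, 0.0],
                   [[0.7, 0.3], [0.0, 1.0]],
                   [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]),
    "point_right": ([1.0, 0.0],
                    [[0.7, 0.3], [0.0, 1.0]],
                    [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]),
}
```

With these illustrative models, `classify([0, 0, 1, 1, 1], MODELS)` selects `"point_left"`, since only that model assigns high probability to the "moving left" symbol in its second state.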

One problem that may arise is that we still don't know how to make the HTK package accept live input from a network connection or from a pipe. The manual talks only about working with direct speech input. If HTK is not capable of accepting such input, the recognition will have to be done offline. For now we have a backup plan to use an existing C++ library for performing HMM training.
 
Some more work needs to be done to improve the tracking of the hands over time. One way is to use simple averaging. A more complicated approach would be to use a Kalman filter for each blob. We may or may not do this, depending on the time we have left after we have done everything else.
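As an illustration of the simple averaging option, the sketch below smooths a blob's centroid with a moving average over the most recent frames. The class name and the window length are hypothetical choices for the example; the Kalman filter alternative is not shown.

```python
from collections import deque


class CentroidSmoother:
    """Moving average of the last `window` centroids of one blob."""

    def __init__(self, window=5):
        # A bounded deque drops the oldest centroid automatically.
        self.history = deque(maxlen=window)

    def update(self, centroid):
        """Add the newest (row, col) centroid and return the smoothed one."""
        self.history.append(centroid)
        n = len(self.history)
        return (sum(p[0] for p in self.history) / n,
                sum(p[1] for p in self.history) / n)
```

One smoother would be kept per tracked blob (head and each hand). A longer window gives a steadier track but lags further behind fast hand movements.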