Tracking of hands:
So far we have achieved fairly robust tracking of human hands under adverse lighting conditions.
We maintain several different images: frame-by-frame differences, blobs of skin color, motion energy, and motion history. Using statistical techniques we can identify the regions of the image that have skin-like colors. However, some objects, such as doors, have similar colors. To filter those out we perform a binary AND between the motion history image and the skin-color image. This leaves only the skin-colored blobs that moved recently. We then dilate the resulting image and AND it again with the skin-color image. This dilate-and-AND step is repeated 5 times, which we found empirically to be enough to recover the complete blobs of the head and the two hands. Finally, a median filter is applied to the result to remove any single-pixel noise.
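The filtering steps above can be sketched as follows. This is a minimal illustration, not our actual code: it assumes both images have already been reduced to 0/1 masks stored as lists of rows, and it assumes a 3x3 structuring element for the dilation and a 3x3 window for the median filter, neither of which is fixed by the description above. All function names are made up for the example.

```python
def logical_and(a, b):
    """Pixel-wise AND of two 0/1 images of the same size."""
    return [[p & q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def dilate(img):
    """Binary dilation with an (assumed) 3x3 structuring element."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not img[y][x]:
                continue
            # Spread every set pixel to its in-bounds 3x3 neighbourhood.
            for ny in range(max(0, y - 1), min(h, y + 2)):
                for nx in range(max(0, x - 1), min(w, x + 2)):
                    out[ny][nx] = 1
    return out


def median3x3(img):
    """3x3 median of a 0/1 image; pixels outside the image count as 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ones = sum(
                img[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            )
            # The median of nine binary values is 1 when at least five are 1.
            out[y][x] = 1 if ones >= 5 else 0
    return out


def recover_moving_skin(skin, moved_recently, iterations=5):
    """Keep skin-colored blobs that contain recent motion.

    skin:           0/1 skin-color mask
    moved_recently: 0/1 mask derived from the motion history image
    """
    blobs = logical_and(skin, moved_recently)
    for _ in range(iterations):
        # Grow the moving regions, but never outside skin-colored pixels.
        blobs = logical_and(dilate(blobs), skin)
    return median3x3(blobs)
```

Because each dilation is immediately masked by the skin-color image, a static skin-colored object such as a door is never reached as long as it does not touch a moving blob. Note that a 3x3 median also rounds off the corners of the recovered blobs.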
Next we label the blobs that remain and, at the same time, calculate the
size of each blob. The three largest blobs are assumed to be the head and
the two hands. The centroids of these blobs are tracked over time.
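A rough sketch of the labeling step is given below. It assumes a 0/1 image stored as a list of rows and 4-connectivity between pixels; the connectivity and the function names are assumptions made for the example. Size and centroid are accumulated during the same pass that labels each blob.

```python
from collections import deque


def label_blobs(img):
    """Find 4-connected blobs in a 0/1 image.

    Returns one dict per blob with its pixel count ('size') and its
    centroid as (row, column).
    """
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y in range(h):
        for x in range(w):
            if not img[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited set pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            size = sum_y = sum_x = 0
            while queue:
                cy, cx = queue.popleft()
                size += 1
                sum_y += cy
                sum_x += cx
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append({"size": size, "centroid": (sum_y / size, sum_x / size)})
    return blobs


def three_largest(blobs):
    """Pick the three largest blobs (assumed to be the head and two hands)."""
    return sorted(blobs, key=lambda b: b["size"], reverse=True)[:3]
```

Calling three_largest(label_blobs(mask)) on each filtered frame gives the three centroids to track. Deciding which of the three is the head and which are the hands is not covered by this sketch.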
Recognition:
The main problem in recognition is that it is hard to know when a given gesture starts and ends. To ease this task we have integrated into our code the input from a commercially available speech recognition package (DragonDictate). The speech input gives the algorithm an additional cue that eases the problem of detecting when a given gesture starts. Right now the user is required to say something during or after performing a gesture. For example, the user makes a pointing gesture to the left or right and says "fetch". Presumably there is a ball in that direction which the robot is supposed to fetch.
We perform gesture recognition by training a neural network (NN) on the motion energy of several gestures. When the user issues the voice command, the vision system takes the current motion energy profile and runs it through the already trained network, which outputs the gesture that was performed. If this is a "point to right" or "point to left" gesture, the tracking mode is switched from tracking hands to tracking a ball and the appropriate movement commands are sent to the robot.
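This report does not spell out how the motion energy and motion history images are computed. One common formulation is sketched here purely as an illustration: frame differences are thresholded into a binary motion mask, the motion history stores a per-pixel counter that is reset to tau where motion occurs and decays by one otherwise, and the motion energy is the set of pixels with a non-zero counter. The threshold, the window length tau, and the function names are all assumptions.

```python
def frame_difference(prev, curr, threshold=20):
    """Binary motion mask: 1 where the grey level changed by more than threshold."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(row_p, row_c)]
        for row_p, row_c in zip(prev, curr)
    ]


def update_motion_history(history, motion, tau=10):
    """Reset moving pixels to tau; let all other pixels decay by 1 towards 0."""
    return [
        [tau if m else max(0, v - 1) for v, m in zip(row_h, row_m)]
        for row_h, row_m in zip(history, motion)
    ]


def motion_energy(history):
    """Binary image: 1 where motion occurred within the last tau frames."""
    return [[1 if v > 0 else 0 for v in row] for row in history]
```

Under this formulation the motion energy image records where motion happened during the gesture, while the motion history image also records how recently it happened. Either image could be flattened into a vector and passed to the network when the voice command arrives.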
The neural network was trained on three gestures (or, to be specific,
on their motion energy). We have achieved 100% accuracy for the gestures
that we have trained the network on. The result is not that impressive
because the number of gestures is small and the gestures are quite distinct
from one another. A better approach would be to train the network on more
gestures and on their motion history images. However, since we are not
interested in this static type of recognition, we did not try to train the
network on more gestures. This experiment was done to prove that we can
have a complete system that combines vision, NN, speech, and robot commands.
Later we will replace the recognition module with a learning module based
on Hidden Markov Models (HMMs).
One problem that may arise is that we still do not know how to make the
HTK package accept live input from a network connection or from a pipe. The
manual only discusses direct speech input. If HTK cannot do this, the
recognition will have to be done offline. For now our backup plan is to use
an existing C++ library for HMM training.
Some more work needs to be done to improve the tracking of the
hands over time. One way is to use simple averaging. A more complicated
approach would be to use a Kalman filter for each blob. We may or may not
do this, depending on the time we have left after everything else is done.
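The simple averaging option could look roughly like the sketch below, which smooths a blob centroid with a moving average over the most recent frames. The window length and the class name are assumptions. The sketch also presumes that each blob has already been matched to the same blob in the previous frame, which is a separate problem.

```python
from collections import deque


class CentroidSmoother:
    """Moving average of the last `window` centroids of one blob."""

    def __init__(self, window=5):
        # A bounded deque drops the oldest centroid automatically.
        self.history = deque(maxlen=window)

    def update(self, centroid):
        """Add the newest (row, column) centroid and return the smoothed one."""
        self.history.append(centroid)
        n = len(self.history)
        mean_row = sum(p[0] for p in self.history) / n
        mean_col = sum(p[1] for p in self.history) / n
        return (mean_row, mean_col)
```

One smoother would be kept per tracked blob. A longer window gives a steadier track but makes it lag further behind fast hand movements, which is the trade-off a Kalman filter is meant to handle better.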