Wednesday, January 30, 2008

Online Interactive Learning of Gestures for Human/Robot Interfaces

In this paper, the authors present an interesting gesture recognition system that can be trained on the fly. Keeping in mind the ability of humans to learn new gestures by watching a teacher, the authors have tried to develop a similar system, in which a human teacher can make a robot learn a new gesture on the fly rather than depending on time-consuming offline training. According to the authors, even 2 examples were sufficient to teach a new gesture to the recognition system, which is quite close to the human way of teaching and interacting. The authors demonstrate this approach by developing a gesture recognition system that can recognize 14 letters of the sign-language alphabet. For this system, they used an 18-sensor CyberGlove as the feature capture device. They chose the 14 letters such that there is little ambiguity between them, and they avoided using the 6D Polhemus sensor, which provides orientation and position information.

Keeping in mind the strength of temporal learning exhibited by HMMs and the highly stochastic behavior of human gestures, the authors used HMMs as classifiers for the gestures. For computational simplicity, they used discrete HMMs, and to that end they applied a preprocessing technique in which the continuous stream of data is discretized by vector quantization of a series of short-time FFTs, an approach popular in the speech recognition community.


This approach is basically a sampling technique that requires a windowing function to capture the prominent (dominant) frequencies in each window and use them as features for training the HMMs. The vector quantizer then encodes each feature vector, and a codebook is formed in which each vector is represented by a single index. As a new gesture is provided, its features are matched against the codebook entries, and each is assigned the index of the codeword with the lowest distance in the least-squares sense. The interesting part of such clustering (in the spectral domain) is that it is not task-specific and can easily be applied to many other recognition domains where features are consistent over time, such as handwriting recognition and face recognition.
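The preprocessing pipeline described above can be sketched roughly as follows; the window length, hop size, and codebook here are illustrative assumptions of mine, not the paper's actual parameters:

```python
import numpy as np

def stfft_features(signal, win_len=32, hop=16):
    """Slide a Hamming window over the signal and keep the FFT
    magnitudes of each windowed frame as one feature vector."""
    window = np.hamming(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = signal[start:start + win_len] * window
        # One-sided magnitude spectrum of the windowed frame.
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

def quantize(features, codebook):
    """Map each feature vector to the index of the nearest
    codeword in the least-squares (Euclidean) sense."""
    dists = np.linalg.norm(
        features[:, None, :] - codebook[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

# Toy usage: a synthetic 1-D sensor stream and a random codebook
# stand in for glove data and a trained vector quantizer.
rng = np.random.default_rng(0)
sig = np.sin(np.linspace(0, 20 * np.pi, 256))
feats = stfft_features(sig)           # (frames, 17) for win_len=32
codebook = rng.normal(size=(8, feats.shape[1]))
symbols = quantize(feats, codebook)   # discrete symbols for the HMM
```

The resulting symbol sequence is what a discrete HMM would be trained on, in place of the raw continuous sensor stream.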

As far as the implementation is concerned, they used a modified form of HMM called the Bakis HMM, which moves from a given state either to the same state or to a state within the next two states. This reflects the assumption that the gestures to be classified are simple sequences of motions, non-cyclical in nature.
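A minimal sketch of that Bakis (left-to-right) transition structure; the uniform distribution over allowed moves is purely for illustration, not the paper's learned probabilities:

```python
import numpy as np

def bakis_transitions(n_states, jump=2):
    """Left-to-right (Bakis) transition matrix: from state i the model
    may stay in i or advance at most `jump` states forward; it can
    never move backwards, so cyclic motions cannot be modeled."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        top = min(i + jump, n_states - 1)
        allowed = np.arange(i, top + 1)
        A[i, allowed] = 1.0 / len(allowed)  # uniform over allowed moves
    return A
```

Training would then re-estimate the non-zero entries while the structural zeros (the backward transitions) stay zero.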

In order to verify their classification rates, the authors used a confidence measure that estimates the misclassification rate. Their results are very impressive: in one sample trial, with just 2 examples, they had a 1% error rate, which dropped significantly to 0.1% after 4 examples. In another sample trial, the error rate dropped from 2.4% with 2 training examples to 0 after 6 examples.
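The paper's exact confidence formula isn't reproduced here; one common stand-in for a bank of HMM classifiers is the gap between the best and runner-up model log-likelihoods, sketched below purely as a hypothetical illustration:

```python
import numpy as np

def confidence_gap(log_likelihoods):
    """Hypothetical confidence score for a bank of per-class HMMs:
    the gap between the best and second-best log-likelihoods.
    A small gap suggests the winning class may be a misclassification."""
    ll = np.sort(np.asarray(log_likelihoods, dtype=float))[::-1]
    return ll[0] - ll[1]
```

A system could flag low-gap classifications and ask the teacher for another example, which fits the interactive, on-the-fly training loop the paper describes.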

Discussion

This paper presents a new approach for dealing with gestures when we need to train a system on the fly. The method is straightforward and is quite accurate with only a small amount of online training. I liked their approach of finding a relationship between the HMMs modeled for speech and those for gestures; it makes sense considering that both are temporal processes in practice. However, in the spectral domain we have a good chance of noise being added, which may lead to the wrong frequencies being selected during sampling. Windowing also brings another problem, called leakage. Ideally, for a pure tone at frequency ω, the FFT should be zero everywhere away from ±ω; windowing, however, causes the spectrum to have non-zero values of significant magnitude at frequencies close to ±ω, and small non-zero values at frequencies farther from ±ω. This leads to unwanted interference, which may corrupt the spectral information. I am not sure whether this effect would affect the formation of the input vector produced by the preprocessor.
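The leakage effect is easy to demonstrate numerically. The snippet below (with an arbitrary 64-sample frame and an off-bin test tone, both my own choices) compares a plain rectangular window against a tapered Hann window:

```python
import numpy as np

# A tone whose frequency falls between FFT bins leaks energy into
# neighbouring bins; a tapered (Hann) window suppresses leakage far
# from the tone at the cost of a slightly wider main lobe.
n = 64
t = np.arange(n)
tone = np.sin(2 * np.pi * 10.5 * t / n)  # 10.5 cycles: off-bin frequency

rect_spec = np.abs(np.fft.rfft(tone))                 # rectangular window
hann_spec = np.abs(np.fft.rfft(tone * np.hanning(n))) # Hann window

# Leakage far from the tone (bins 20 and up) relative to each peak:
rect_leak = rect_spec[20:].max() / rect_spec.max()
hann_leak = hann_spec[20:].max() / hann_spec.max()
```

With these numbers the Hann window's far-off leakage is orders of magnitude below the rectangular window's, which is exactly the trade-off a short-time FFT preprocessor has to make when choosing its window.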

Also, I believe the introduction of the acceleration-based method to segment the gestures is a nice way of dealing with the complexities of natural interaction and understanding, as it would be very inconvenient for people to pause between gestures; that is simply not natural for them.

Overall, I liked the approach and the insight provided, as I was not previously aware of the use of spectral HMMs in any domain other than speech.



