Keeping in mind the strength of temporal learning exhibited by HMMs and the highly stochastic behavior of human gestures, the authors have used HMMs as classifiers for the gestures. For computational simplicity, they have used discrete HMMs, which requires a preprocessing step in which the continuous stream of data is discretized using vector quantization of a series of short-time FFTs, an approach that is popular in the speech recognition community.
This approach is basically a sampling technique that requires a windowing function to capture the prominent (dominant) frequencies in each window and uses them as the features for training the HMMs. The vector quantizer then encodes each feature vector, and a codebook is formed in which each vector is represented by a single index. When a new gesture is provided, its features are matched against the codebook and assigned to the entry with the lowest distance in the least-squares sense. The interesting part of such clustering (in the spectral domain) is that it is not task specific and can easily be applied to many other recognition domains where features are consistent in nature over time, like handwriting recognition, face recognition, etc.
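To make the preprocessing concrete for myself, here is a minimal sketch of short-time FFT features followed by a nearest-codeword lookup. It is not the authors' implementation: the Hamming window, the window and hop lengths, the synthetic signal, and the way the codebook is picked (a few random frames rather than a trained quantizer) are all my own assumptions for illustration.

```python
import numpy as np

def stft_features(signal, win_len=32, hop=16):
    """Magnitude spectrum of each windowed frame (window choice is an assumption)."""
    window = np.hamming(win_len)
    frames = [signal[i:i + win_len] * window
              for i in range(0, len(signal) - win_len + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

def quantize(features, codebook):
    """Index of the nearest codebook entry (least-squares sense) for each frame."""
    # Squared Euclidean distance between every frame and every codeword.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

# Synthetic 1-D "gesture" signal, purely illustrative.
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 0.1 * np.arange(256)) + 0.05 * rng.standard_normal(256)

feats = stft_features(signal)
# Hypothetical codebook: four frames picked at random instead of a trained quantizer.
codebook = feats[rng.choice(len(feats), size=4, replace=False)]
symbols = quantize(feats, codebook)  # discrete observation sequence for the HMM
```

The resulting `symbols` array is the kind of discrete observation sequence a discrete HMM would consume; a real system would learn the codebook from training data (for example with a clustering algorithm) instead of sampling frames.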
As far as the implementation is concerned, they have used a restricted form of HMM called the Bakis HMM, in which a given state can transition only to itself or to one of the next two states. This restricts the models to gestures that are simple, non-cyclic sequences of motions.
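The Bakis constraint is easiest to see in the transition matrix. The sketch below builds one under my own assumptions: the helper name `bakis_transitions`, the number of states, and the uniform initial probabilities are illustrative and are not taken from the paper.

```python
import numpy as np

def bakis_transitions(n_states, max_jump=2):
    """Left-to-right transition matrix: stay, or move ahead by at most max_jump states."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        last = min(i + max_jump, n_states - 1)
        # Uniform initial guess over the allowed targets (an assumption);
        # training would re-estimate these values.
        A[i, i:last + 1] = 1.0 / (last - i + 1)
    return A

A = bakis_transitions(5)
```

Every entry below the diagonal is zero, so the model can never return to an earlier state, which is what rules out cyclic gestures. Zeros in the transition matrix stay zero under Baum-Welch re-estimation, so the structure survives training.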
In order to verify their classification rates, the authors have used a confidence measure that reflects the misclassification rate. Their results are very impressive: in one sample trial, with just 2 training examples, they had a 1% error rate, which dropped significantly to 0.1% after 4 examples. In another sample trial, the error rate dropped from 2.4% with 2 training examples to 0 after 6 examples.
Discussion
This paper presents a new approach to dealing with gestures when we need to train a system on the fly. The method is straightforward and is pretty accurate with a small amount of online training. I liked their approach, in which they tried to find a relationship between the HMMs modeled for speech and those for gestures. I believe it makes sense considering that both are temporal in nature. However, in the spectral domain there is a good chance of added noise, which may lead to the wrong frequencies being selected during sampling. Windowing also brings another problem, called leakage. Ideally, after the FFT, the spectrum of a pure tone at frequency ω should be zero everywhere except at ±ω; however, windowing causes the spectrum to have non-zero values of significant magnitude at frequencies close to ±ω, and also small non-zero values at frequencies farther away from ±ω. This leads to unwanted interference that may distort the waveform and the spectral information. I am not sure whether this effect is going to affect the formation of the input vector obtained from the preprocessor.
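The leakage concern can be checked numerically. The following is a small sketch under my own assumptions (a 64-sample frame, a pure sinusoid, and a comparison of a rectangular window against a Hann window); I do not know which window the authors actually used, so this only illustrates the general effect.

```python
import numpy as np

N = 64                      # frame length (illustrative)
n = np.arange(N)

def spectrum(cycles, window):
    """Magnitude spectrum of a windowed sinusoid with `cycles` periods per frame."""
    x = np.sin(2 * np.pi * cycles * n / N)
    return np.abs(np.fft.rfft(x * window))

rect = np.ones(N)           # plain truncation of the signal
hann = np.hanning(N)        # tapered window

on_bin = spectrum(10.0, rect)         # tone exactly on an FFT bin: no leakage
off_bin_rect = spectrum(10.5, rect)   # tone between bins: energy spreads out
off_bin_hann = spectrum(10.5, hann)   # same tone with a tapered window

far = slice(20, 33)                   # bins well away from the tone
leak_rect = off_bin_rect[far].sum()
leak_hann = off_bin_hann[far].sum()
```

With the tone between bins, the rectangular window spreads energy across distant bins, while the Hann window confines most of it to a wider main lobe around the tone. If the dominant frequencies are picked per window, the choice of window could therefore change which bins end up in the feature vector.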
Also, I believe the introduction of an acceleration-based method to segment the gestures is a nice way of dealing with the complexities of natural interaction, as it would be very inconvenient and unnatural for people to pause between gestures.
Overall, I liked the approach and the insight provided, as I was not aware of spectral HMMs being used in any domain other than speech.
(FYI: A good reference for spectral analysis of speech (Click here))