Tuesday, February 19, 2008

3D Visual Detection of Correct NGT Production

This paper presents a vision-based technique to recognize gestures of NGT (Dutch Sign Language). Many NGT signs contain periodic motion, e.g., repetitive rotation or moving the hands up and down, left and right, or back and forth. As such, the motion must be captured in 3D. For this purpose, the user is confined to a specific region in front of a pair of stereo cameras with wide-angle lenses, with no obstruction between the cameras and the hands, so that gestures are captured clearly. The paper illustrates the complete setup and architecture of the system in a figure.
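For intuition on why a stereo pair is needed: depth is recovered from the disparity between the two views. Here is a minimal sketch of the standard pinhole-stereo relation Z = f * B / d (the paper does not give its calibration details, so the parameter names below are my own):

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Standard pinhole stereo: a point that shifts by `disparity_px` pixels
    between the left and right images of cameras separated by `baseline_m`
    meters lies at depth Z = f * B / d."""
    return focal_px * baseline_m / disparity_px

# e.g., 800 px focal length, 10 cm baseline, 20 px disparity -> 4 m depth
print(depth_from_disparity(800, 0.10, 20))
```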

Since the hands are the region of interest, it is important that their motion be tracked. The authors propose a segmentation scheme that separates skin from the image frames using skin models trained on positive and negative skin examples specified by the users. Skin color is modeled by a 2D Gaussian perpendicular to the main direction of the distribution of positive skin samples in RGB space, which is found by a sampling-consensus method (RANSAC). To compensate for uncertainty in the chrominance direction, the Mahalanobis distance of a pixel's color to this brightness axis is measured and divided by the pixel's intensity. This yields a kind of normalized skin score that handles very bright regions on the skin caused by varied light sources and their different directions.
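To make this segmentation step concrete, here is a rough NumPy sketch of the idea as I read it (this is not the authors' code; the function names, the RANSAC inlier threshold, and the exact intensity normalization are my own assumptions):

```python
import numpy as np

def fit_brightness_axis_ransac(skin_rgb, n_iters=500, inlier_thresh=10.0, seed=0):
    """Fit the main (brightness) direction of positive skin samples in RGB
    space with a simple RANSAC line fit through the sample mean."""
    rng = np.random.default_rng(seed)
    mean = skin_rgb.mean(axis=0)
    centered = skin_rgb - mean
    best_axis, best_inliers = None, -1
    for _ in range(n_iters):
        cand = centered[rng.integers(len(centered))]
        axis = cand / (np.linalg.norm(cand) + 1e-9)
        resid = centered - np.outer(centered @ axis, axis)
        inliers = (np.linalg.norm(resid, axis=1) < inlier_thresh).sum()
        if inliers > best_inliers:
            best_axis, best_inliers = axis, inliers
    return mean, best_axis

def fit_perpendicular_gaussian(skin_rgb, mean, axis):
    """Model the 2D residuals perpendicular to the brightness axis with a
    Gaussian (a skin axis is roughly (1,1,1), so the cross product below
    is safely non-degenerate)."""
    u = np.cross(axis, [0.0, 0.0, 1.0]); u /= np.linalg.norm(u)
    v = np.cross(axis, u)
    basis = np.stack([u, v], axis=1)                 # 3 x 2
    coords = (skin_rgb - mean) @ basis               # N x 2 chrominance coords
    return basis, np.linalg.inv(np.cov(coords, rowvar=False))

def skin_score(pixels, mean, basis, cov_inv):
    """Mahalanobis distance in the chrominance plane, divided by pixel
    intensity so very bright skin regions are not over-penalized."""
    coords = (pixels - mean) @ basis
    d2 = np.einsum('ni,ij,nj->n', coords, cov_inv, coords)
    intensity = pixels.mean(axis=1) + 1e-9
    return np.sqrt(d2) / intensity                   # lower = more skin-like
```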

In order to recognize gestures, it is important to follow the detected blobs (head and hands) and their movement. Frames are captured at a fixed rate to record the sequence, and the blobs are tracked across frames using a best-template-match approach. It is important that the right blob be recognized as right and the left blob as left. Occlusion may also disrupt recognition; to prevent this, the blobs are reinitialized if occlusion lasts for more than 20 frames. For reference, a synchronization point is selected for each gesture, and it is not used for training. Various features, such as 2D angles, the tangent of the displacement between consecutive frames, and upper, lower, and side angles, are extracted from each frame. The time signal formed by these feature measurements is warped onto the reference by Dynamic Time Warping (DTW), which yields a list of time correspondences between the new gesture and the reference sign. For classification, a Bayesian classifier based on the independent-feature assumption is trained for each gesture. The classifiers are trained on a dataset of 120 different NGT signs performed by 70 different persons, using 7-fold cross-validation. The system achieved 95% true positives, and its response time was just 50 ms, which makes real-time use possible.
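To make the matching and classification steps concrete, here is a minimal sketch of DTW producing time correspondences, plus a per-gesture Gaussian model under the independent-feature assumption (the Euclidean frame cost and all names are my illustrative choices, not necessarily the paper's exact formulation):

```python
import numpy as np

def dtw_correspondences(ref, sig):
    """Align a new signal to a reference gesture (each an array of per-frame
    feature vectors). Returns the total cost and a list of
    (ref_frame, new_frame) time correspondences."""
    n, m = len(ref), len(sig)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(ref[i - 1] - sig[j - 1])
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # backtrack from (n, m) to recover the warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]

class GestureModel:
    """Per-gesture Gaussian likelihood under the independent-feature
    (naive Bayes) assumption; X holds time-warped feature vectors."""
    def fit(self, X):
        self.mu = X.mean(axis=0)
        self.var = X.var(axis=0) + 1e-6   # floor variance for stability
        return self
    def log_likelihood(self, x):
        return -0.5 * np.sum(np.log(2 * np.pi * self.var)
                             + (x - self.mu) ** 2 / self.var)
```

Classification would then pick the gesture model with the highest log-likelihood on the warped features.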

Discussion:

This is a vision-based paper with a lot of image processing, and classification is basically recognition of the patterns created by the hands and head observed in the image frames. Since they are tracking blobs, I believe they have to specify the synchronization point for each person who is going to use the system. Also, if the head moves while gestures are performed, it might affect certain gestures, as the head blob may be confused with a hand blob. There should also be no object in the background that resembles a blob, as it could cause misrecognition. Furthermore, the system requires that the person be constrained in an arrangement which may not be comfortable. I wonder if they ran any user study while collecting data. As far as the image processing is concerned, they have done a pretty good job of handling chromaticity changes by training the system under various illumination conditions and using a normalized chromaticity value for each color.

1 comment:

Paul Taele said...

I wasn't a fan of them specifying a synchronization point for their system, since it feels a bit like cheating. But your comment about how it handles various states of illumination does seem nice. It feels a bit moot, though, when the system seems to lack robustness and was only evaluated in an ideal environment. Even though we're not doing vision-based techniques in our projects, I don't feel like I'm missing much.