Friday, August 10, 2018

Recognize Gestures of NGT (Dutch Sign Language)

This paper presents a vision-based technique to recognize gestures of NGT (Dutch Sign Language). Many NGT signs contain periodic motion, e.g. repetitive rotation, or moving the hands up and down, left and right, or back and forth. As such, the motion must be captured in 3D. For this purpose, the user is confined to a specific region in front of a pair of stereo cameras with wide-angle lenses, with no obstruction between the cameras and the hands, so that gestures are captured clearly. The complete setup and architecture of such a system is shown in the figure below.

Since the hands are the region of interest, it is very important that their motion be tracked. The authors therefore propose a segmentation scheme that segments skin from the image frames using skin models trained on positive and negative skin examples specified by the users. Skin color is modeled by a 2D Gaussian perpendicular to the main direction of the distribution of the positive skin samples in RGB space, which is obtained by a sampling-consensus method (RANSAC). To compensate for uncertainty in the chrominance direction, the Mahalanobis distance of the color is measured and divided by the color intensity. This provides a kind of normalized pixel measure that handles very bright regions on the skin caused by varied light sources and their different directions.
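
As a rough sketch of how such a brightness-normalized skin test might be computed (the Gaussian parameters, the brightness proxy, and the threshold below are my own assumptions, not values from the paper):

```python
import numpy as np

def skin_distance(pixels_rgb, mean_rgb, cov_inv):
    """Sketch: Mahalanobis-style distance of each pixel to a trained skin-color
    Gaussian, divided by pixel brightness so that strongly lit skin regions are
    not rejected purely for being bright."""
    diff = pixels_rgb.astype(np.float64) - mean_rgb            # offset from skin mean
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)         # squared Mahalanobis distance
    brightness = pixels_rgb.sum(axis=1).astype(np.float64) + 1e-6
    return np.sqrt(d2) / brightness                            # lower value => more skin-like

# Hypothetical usage, with the threshold tuned on labeled skin / non-skin pixels:
# mask = skin_distance(frame.reshape(-1, 3), mean_rgb, cov_inv) < threshold
```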

In order to track the gestures, it is important to follow the detected blobs (head and hands) and their movement. Frames are captured at a fixed rate to record the sequence, and the blobs are tracked in each frame using a best-template-match approach. It is important, though, that the right blob be recognized as right and the left blob as left. Occlusion may also disrupt gesture recognition, so if a blob is occluded for more than 20 frames, the blobs are reinitialized. For reference, a synchronization point is selected for each sign, which is not considered for training. Features such as 2D angles, the tangent of the displacement between consecutive frames, and upper, lower, and side angles are extracted from each frame. The time signal formed by these feature measurements is warped onto the reference by Dynamic Time Warping, which yields a list of time correspondences between the new gesture and the reference sign. For classification, a Bayesian classifier based on the independent-feature assumption is trained for each gesture. The classifiers were trained on data covering 120 different NGT signs performed by 70 different people, using 7-fold cross-validation. Their system achieved 95% true positives with a response time of just 50 ms, which makes real-time use feasible.
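
For illustration, a minimal dynamic-time-warping routine over two feature sequences might look like the sketch below; the frame-to-frame distance and the feature layout are placeholders, not the authors' exact formulation:

```python
import numpy as np

def dtw_distance(ref, obs):
    """Classic DTW over two feature sequences (rows = frames).
    Returns the cumulative alignment cost between a reference sign
    and a newly observed gesture."""
    n, m = len(ref), len(obs)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(ref[i - 1] - obs[j - 1])     # frame-to-frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```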

Discussion:

This is a vision-based paper that leans heavily on image processing; classification is basically recognition of the patterns created by the hand and head as observed in the image frames. Since they are tracking blobs, I believe they have to specify the synchronization point for each person who is going to use the system. Also, if the head moves while performing gestures, it might affect certain gestures, as the head may be confused with a hand blob. Likewise, there should be no object in the background that resembles a blob, as it could cause misrecognition. The system also requires that the person be constrained in an arrangement that may not be comfortable. I wonder whether they ran any user study while obtaining data. As far as the image processing is concerned, they have done a pretty good job of handling chromaticity changes by training the system under various illumination conditions and using a normalized chromaticity value for each color.

Wednesday, April 23, 2008

Toward Natural Gesture/Speech HCI: A Case Study of Weather Narration

Summary:

This paper presents a gesture recognition system that captures the motive behind a gesture rather than a precise description of the gesture. This means that if the user is pointing back, he can point back in any intuitive way he likes. The system consists of an HMM trained on global features of the gesture, taking the head as the reference (as shown in the diagram below). Color segmentation was used to extract the skin, which was then tracked using predictive Kalman filtering. The authors also suggest that in certain domains particular speech accompanies particular gestures, which can be used as another feature to remove ambiguity in recognition. They observed that in their weather-speaker domain, 85% of the time a meaningful gesture is accompanied by a related spoken word. Using this knowledge, correctness improved from 63% to 92% and accuracy from 62% to 75%.
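
As a concrete illustration of the tracking step, a constant-velocity Kalman predictor for a skin blob could be sketched as follows; the state layout, frame rate, and noise covariances are illustrative assumptions, not values from the paper:

```python
import numpy as np

# State: [x, y, vx, vy]; constant-velocity model for a tracked head/hand blob.
dt = 1.0 / 30.0                                    # assumed frame interval
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)         # state transition
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)          # we only observe blob position
Q = np.eye(4) * 1e-2                               # process noise (assumed)
R = np.eye(2) * 1.0                                # measurement noise (assumed)

def predict(x, P):
    """Predict the blob state forward one frame."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    """Correct the prediction with the observed blob centroid z = [x, y]."""
    y = z - H @ x                                  # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
    return x + K @ y, (np.eye(4) - K @ H) @ P
```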



Discussion:


I liked that they have pointed out the correlation between words and the accompanying gestures. I think a system that uses context (derived from words) would be a state-of-the-art development in the field of gesture recognition. Their results (though I am not convinced by their approach) demonstrate that context can improve accuracy. However, I am not happy with their overly simplistic approach, since a lot of ambiguity can be introduced in such a system because of the 3D nature of the problem and recognition using angles in a 2D plane. For a simple weather example this may work to some extent, but the results show that there is not much to expect from such a system in more complex situations.

3D Object Modeling Using Spatial and Pictographic Gestures

This paper presents the merger of graphics and gestural interaction to create objects in an augmented reality setup. The authors give a brief introduction to superquadrics (superellipsoids) and how they are used to create various shapes in graphics. In the paper, hand gestures are mapped to certain operations like tapering, twisting, and bending, which are applied to the graphical object through various functional mappings. Using self-organizing maps, the system allows users to specify their own gestures for each operation. In the example, the creation of primitive shapes is the first step; the primitive shapes are then used to build up complex shapes like a teapot and a vase. The system was also tested to see whether non-existing objects can be created by hand and how precisely real objects can be modeled.
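
For reference, a superellipsoid can be sampled from its standard parametric form, and a global deformation such as tapering applied on top; the sketch below uses the textbook formulation with made-up coefficients rather than the paper's exact implementation:

```python
import numpy as np

def fexp(base, p):
    """Signed exponentiation used in the superellipsoid parametric form."""
    return np.sign(base) * np.abs(base) ** p

def superellipsoid(a1, a2, a3, e1, e2, n=50):
    """Sample points on a superellipsoid (a superquadric without deformation)."""
    eta = np.linspace(-np.pi / 2, np.pi / 2, n)
    omega = np.linspace(-np.pi, np.pi, n)
    eta, omega = np.meshgrid(eta, omega)
    x = a1 * fexp(np.cos(eta), e1) * fexp(np.cos(omega), e2)
    y = a2 * fexp(np.cos(eta), e1) * fexp(np.sin(omega), e2)
    z = a3 * fexp(np.sin(eta), e1)
    return x, y, z

def taper(x, y, z, kx=0.5, ky=0.5):
    """Simple linear tapering along z, the kind of global deformation a gesture
    could be mapped to (coefficients here are illustrative)."""
    zmax = np.abs(z).max()
    return x * (1 + kx * z / zmax), y * (1 + ky * z / zmax), z
```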

The system consists of two CyberGloves, a Polhemus tracker, and a large 200-inch projection screen where the objects are projected. Using special LCD shutter glasses, the user can see the objects and manipulate them with his hands using pre-specified gestures.


Discussion:

I am not sure how easy or difficult it is to manipulate objects in augmented reality. I believe that if there had been some haptic feedback, it would have felt more natural. However, I liked the paper because it was different and had a very practical application. Such a system could also be used by a multi-user design team to create designs in a virtual world by interacting with the object together.

Device Independence and Extensibility in Gesture Recognition

In this paper, a multi-layer recognizer is presented that is claimed to be a device-independent recognition system able to work with different gloves. The data from the gloves is converted into posture predicates (e.g., is-bent, is-open), temporal predicates (changes in the shape of the posture over time), and gestural predicates (motion of the postures). The predicates basically contain information about the states of the fingers. A 32-dimensional boolean vector is thus obtained through a feed-forward neural network and matched against templates. The template matcher works by computing the Euclidean distance between the observed predicate vector and every known template. The template at the shortest distance from the observed predicate vector is selected as the gesture, with a certain confidence value. Confidence values are basically weights used to weigh certain predicates so that the information from all predicates is utilized evenly (as some predicates may give less information while others give more).
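
A minimal weighted template matcher over boolean predicate vectors might look like the following sketch (the predicate layout, weights, and templates are invented for illustration):

```python
import numpy as np

def classify(predicates, templates, weights):
    """predicates: boolean vector of posture/temporal/gestural predicates.
    templates:  dict mapping gesture name -> boolean template vector.
    weights:    per-predicate weights (confidence values) so that no single
                predicate dominates the distance."""
    best_name, best_dist = None, np.inf
    for name, tmpl in templates.items():
        diff = (predicates.astype(float) - tmpl.astype(float)) * weights
        dist = np.linalg.norm(diff)                # weighted Euclidean distance
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name, best_dist
```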

They tested their system on ASL (excluding the letters J and Z) and report accuracies in the high 50s to mid 60s as they vary which sensors of the 22-sensor CyberGlove are available. They also report an increase in accuracy when a bigram context model was added to the recognition.



Discussion:

There is nothing great in this paper. Using predicates is an easy way out, but there is always a scalability issue when the users change, so this method will not be suitable for a multi-user environment. Even a single user would not be able to get high accuracy out of this system because of the intrinsic variability of human gestures; that is why the accuracy is so low. Also, using less sensor data from the same glove does not make it a different glove. As I understand it, a different glove means altogether different sensors and architecture, so I do not agree with the claim of device independence as projected in the paper.

Wednesday, April 16, 2008

Discourse Topic and Gestural Form

In this paper, the authors try to find out to what extent gestures depend on the discourse topic (i.e., are speaker independent) and to what extent they depend on the individual speaker. The presented framework is a Bayesian framework that uses an unsupervised technique to quantify this extent, built on a vision-based approach.

Their approach uses visual features that describe motion based on so-called spatiotemporal interest points, which are high-contrast image regions such as corners and edges that undergo complex motion. From the detected points, visual, spatial, and kinematic characteristics are extracted to form a large feature vector, to which PCA is applied to reduce the dimensionality. The reduced feature vectors are used to fit a mixture model from which a codebook is obtained. The dataset consists of 33 short videos (about 3 minutes each) of dialogues involving 15 speakers describing one of five predetermined topics. The speakers ranged in age from 18 to 32 and were native English speakers.
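
The dimensionality-reduction and codebook step could be sketched roughly as below with scikit-learn; the component counts and covariance type are guesses, not the paper's settings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def build_codebook(features, n_components=20, n_words=50):
    """features: (num_interest_points, dim) visual/spatial/kinematic descriptors.
    Reduce dimensionality with PCA, then fit a Gaussian mixture whose
    components serve as codebook entries."""
    reduced = PCA(n_components=n_components).fit_transform(features)
    gmm = GaussianMixture(n_components=n_words, covariance_type='diag').fit(reduced)
    codes = gmm.predict(reduced)          # codeword index for each interest point
    return gmm, codes
```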

In the experiment, each user talked to another speaker, though they were not explicitly asked to make gestures. The scenarios involve describing "Tom and Jerry" and mechanical devices (a piston, candy, a pinball machine, and a toy). Of the recorded gestures, 12% were classified as topic-specific with correct topic labels, and with corrupted labels this value dropped to 3%. This indicates that there exists a connection between discourse topic and gestural form that is independent of the speaker.

Discussion:

12% of gestures are classified as topic-specific when given correct labels, and 3% with corrupted labels. And to convey this message, instead of using a simple human judge to take notes, a complex machine learning and vision-based approach was used, which they admit must be somewhat corrupted by computer vision errors. I am not very impressed with the paper.

Monday, April 14, 2008

Glove-TalkII - A Neural-Network Interface which Maps Gestures to Parallel Formant Speech Synthesizer Controls

This paper presents a gesture-based speech synthesizer called Glove-TalkII that can be used for communication. The idea behind the work is that the subtasks involved in speech generation (tongue motion, alphabet generation, the sound used) can each be mapped to suitable actions on the sensor devices. The device consists of an 18-sensor CyberGlove for gestures, a Polhemus tracker for emulating tongue motion (up/down) for consonants, and a foot pedal for sound variation. The input from the three sensors is fed to three different neural networks, each trained on data for its particular subtask.

Three networks are used: a vowel/consonant network, a consonant network, and a vowel network. The vowel/consonant network is responsible for deciding whether to emit a vowel or a consonant sound based on the configuration of the user's hand. The authors claim the network can interpolate between hand configurations to produce smooth but rapid transitions between vowels and consonants. The trained network is a 10-5-1 feed-forward network with sigmoid activation functions, trained on 2600 examples of consonants and 700 examples of vowels. Training data was collected from an expert user; the vowel data was collected from the same user by having him move his hand up and down. The test set consists of 1614 examples, and the MSE was reduced to 10^-4 after training.
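
For concreteness, a 10-5-1 feed-forward network with sigmoid units can be written in a few lines of NumPy; the weight initialization and the meaning of the inputs here are illustrative, not the paper's training setup:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (10, 5)), np.zeros(5)   # 10 glove inputs -> 5 hidden units
W2, b2 = rng.normal(0, 0.1, (5, 1)), np.zeros(1)    # 5 hidden units -> 1 vowel/consonant output

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x):
    """x: (batch, 10) hand-configuration features; returns a value in (0, 1)
    interpreted as the vowel-vs-consonant decision."""
    h = sigmoid(x @ W1 + b1)
    return sigmoid(h @ W2 + b2)
```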

The vowel network is a 2-11-8 feed-forward network whose hidden units are RBFs, each centered to respond to one of the cardinal vowel configurations. The outputs are 8 sigmoid units representing 8 synthesizer control parameters. The network was trained on 1100 examples with some added noise and tested on 550 examples, giving an MSE of 0.0016 after training.

The consonant network is a 10-14-9 network whose 14 hidden units are normalized RBF units. Each RBF unit is centered at a hand configuration determined from the training data. The 9 output units are responsible for the nine control parameters of the formant synthesizer. The network is trained on 350 approximants, 1510 fricatives, and 700 nasals (scaled data). The test data consists of 255 approximants, 960 fricatives, and 165 nasals, giving an MSE of 0.005.
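
A normalized-RBF hidden layer of the kind described for the consonant network could be sketched as follows; the centers, width, and readout weights would come from training data and are placeholders here:

```python
import numpy as np

def normalized_rbf_layer(x, centers, width):
    """x: (batch, d) hand configurations; centers: (k, d) RBF centers.
    Returns RBF activations normalized to sum to 1 across hidden units, so the
    output interpolates between the stored hand configurations."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    act = np.exp(-d2 / (2 * width ** 2))
    return act / (act.sum(axis=1, keepdims=True) + 1e-12)

def rbf_network(x, centers, width, W_out, b_out):
    """Sigmoid readout to the formant-synthesizer control parameters."""
    h = normalized_rbf_layer(x, centers, width)
    return 1.0 / (1.0 + np.exp(-(h @ W_out + b_out)))
```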

They tested the system on an expert pianist; with 100 hours of training, his speech with Glove-TalkII was found to be intelligible, though still with errors in polysyllabic words.




Discussion:

I just wonder how difficult or easy it is to control your speech with a foot pedal, a data glove, and a Polhemus tracker, all of which need to move in synchronization for correct pronunciation. Even after 100 hours of training the results were merely intelligible, and I doubt this system is comfortable at all. But for a paper ten years old, it is a decent first step and a different approach.

Sunday, April 13, 2008

Feature selection for grasp recognition from optical markers

This paper presents a feature selection methodology for choosing the relevant features from a large feature set to improve recognition results. The aim of the project is to obtain a grasp class for an input using the minimum number of features. The authors place several markers on the back of the hand and use calibrated cameras to track them in a controlled environment. They argue for this method because it does not interfere with the subject's natural grasping or with natural contact with the object.
In their experiment, they placed markers on the back of the hand and used a local coordinate system that is invariant to pose.

To classify the grasps, they use a linear logistic regression classifier, which is also used for subset selection in a supervised manner. They treat three markers at a time as a single feature vector and then use cross-validation with subset selection over the number of features to arrive at the best feature set. They tried both the backward and the forward approaches to subset selection and observed that the error between the two approaches differs by just 0.5%.
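
A greedy forward subset-selection loop over marker-triplet feature groups, scored by cross-validated logistic regression, might look roughly like this (the grouping, fold count, and stopping rule are schematic assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(feature_groups, X, y, max_groups=5, cv=2):
    """feature_groups: list of column-index arrays, one per marker triplet.
    Greedily add the group that most improves cross-validated accuracy."""
    selected, remaining = [], list(range(len(feature_groups)))
    while remaining and len(selected) < max_groups:
        scores = []
        for g in remaining:
            cols = np.concatenate([feature_groups[i] for i in selected + [g]])
            acc = cross_val_score(LogisticRegression(max_iter=1000),
                                  X[:, cols], y, cv=cv).mean()
            scores.append((acc, g))
        best_acc, best_g = max(scores)       # candidate with highest CV accuracy
        selected.append(best_g)
        remaining.remove(best_g)
    return selected
```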

In their experiment, they use a 90-dimensional vector per hand pose, representing the 30 markers on the hand. Their domain for the experiment is the daily functional grasps shown in the figure below.

They collected data using 46 objects, each grasped in multiple ways. The objects were divided into two sets, A and B: A contains 38 objects with 88 object-grasp pairs, and B contains 8 objects with 19 object-grasp pairs. They collected data from 3 subjects for set B and 2 subjects for set A. Using 2-fold cross-validation, they obtained an accuracy of 91.5% with the full 30-marker data, while with the 5 markers selected by subset selection they achieved 86%.

They also evaluated their feature subset on classifiers trained on different data, using both the 5-marker and 30-marker sets. In total they trained four classifiers: two with data from subject 1 and subject 2 respectively on object set A, a third with the combined data of subjects 1 and 2 on object set A, and a fourth on the combined set A+B from subjects 1 and 2.

They observed that accuracy was sensitive to whether data from the test subject was included in training. With such data included, accuracy for the user was 80-93% in the reduced space (5 markers) and 92-97% with 30 markers. For a completely new user tested on a classifier trained on a single other user's data, accuracy was an abysmal 21-65% in the reduced marker space. However, the retention of accuracy (the reduced set relative to the full marker set) was over 100% in all cases.

In their analysis they also observed that, for all three subjects, the classifier did well on cylindrical and pinch grasps, while spherical and lateral tripod grasps performed poorly because of the similarity among the three-finger precision grasps.


Discussion:

This paper has nothing new except the linear logistic regression classifier. Their analysis is also based on a small user set and hence cannot be generalized to most cases. I think it would have been a better paper with more users with different hand sizes. Also, I don't understand why many papers claim that accuracy increases when samples from the test user are included in the training set; that is a simple and easy-to-digest fact that needs no explanation. It would also have been nice if they had mapped the relationships between the grasping patterns of the users, which might have been used to build a more generalized feature set for a group of users sharing similar patterns. The work is very similar to our paper on sketch.