Sunday, February 17, 2008

A Survey of Hand Posture and Gesture Recognition Techniques and Technology

This paper is the survey paper dealing with the survey of hand posture and gesture recognition techniques used in the literature. Chapters 3, deals with the survey of various algorithmic techniques that have been used over the years for the purpose. Various approaches in the literature have been classified by the author into 3 major categories:

  1. Feature Extraction, Statistical models.
  2. Learning Algorithms.
  3. Miscellaneous Algorithms.

Feature Extraction, Statistical Models:

This category of methods deals with the extraction of the features in form of mathematical quantities from the available data which is captured through sensors (gloves) or images. The method is further classified into sub categories like:

Template Based: In this approach the data obtained is compared against some reference data and using the thresholds, the data is categorized into one of the gestures available in the reference data. This is a simple approach with little calibration but suffers from noise and doesn’t work with overlapping gestures.

Feature extraction: This approach deals with the extraction of the low level information from the data and combine the information to produce high level semantic feature information which can be used to classify gestures/ postures. The methods in this sub category usually deal with capturing the changes and measuring certain qualities during those changes. The collection of these values is used to label a posture which can be subsequently extended to a gesture. This method however suffers from heavy computational cost and also there should be a specific sequence that should frame a gesture else this method fails.

Active shape Models:

Active shape models deal with the image based gesture recognition systems in which they place a contour on the image which is roughly the shape of the feature to be tracked. The contour is then manipulated by moving it iteratively towards nearby edges that deforms the contour to fit the feature. This suffers from the drawback that it can capture those gestures that can be performed by postures requiring open hand. Also there is very little work in this direction. However with limited gestures meeting the open hand criterion, this method has been found to work in real-time. Also stereo cameras cannot be used.

Principle component analysis:

This method is basically the dimension reduction method in which the significant eigen vectors (based on the eigen-values) are used to project the data. This approach captures the significant variability in the data and thus can be used to identify the gestures and postures in the vision based system. Though this method can be exploited for glove based approaches also, but till that time (1999) only vision based techniques have been exploited. This method suffers from a drawback that there should be variance in the at least one direction. If variance is uniformly distributed in the data, it will not yield the relevant Principle vectors also, if there is noise, PCA would consider it as a significant bias too. Besides this method suffers from scaling in hand size and position, which can be taken care by normalization. Even then, this method is user dependent.

Linear finger Tip model:

This method requires special markers on the finger tips and then segmenting the finger tip motion form the scene image .This motion is analyzed for the possible gesture. This method works well for simple gestures and deals with the initial and final position with good recognition. However, the system cannot work in real time and recognizes a small set due to limited possible finger motions. Also curvilinear motions are not taken into account.

Cause analysis

This method is based on the interaction of humans with the environment and capturing the body kinematics and dynamics. This also suffers from limited gesture sets and no orientation information can be used. Besides, this system cannot work in real-time.

Learning Algorithms

These are machine learning algorithms that deal with the learning of the gesture based on the data manipulation and weight assignment. The popular techniques in this sub category are:

Neural Networks:

This method is based on modeling of the human nervous system element called neuron and its interaction with the other neurons to transfer the information. Each node (neuron0, consists of and the input function which computes the weighted sum and the activation function to generate the response based on the weighted sum. There are two types of NN, feed forward and recurrent. The methods based on this approach deal with the problem of heavy training and computation cost involved offline for training. Also for complex systems, such a model could be very complex. Also addition of each new gesture/posture requires complete retraining of the network.

Hidden Markov Model:

This method has been widely exploited for temporal gesture recognition. An HMM consists of states and state transitions with observation probabilities. For watch gesture a separate HMM is trained and the recognition of the gesture is based on the generation of maximum probability by a particular HMM. This method also suffers from training time involved and complex working nature as the results are unpredicted because of the hidden nature. For the gesture recognition, baki’s HMM is commonly used.

Instance based learning:

Instance is the vector of features of the entity to be classified. These techniques involve the computation of the distance between given data vector and instances in the database. This method is very expensive and for instance recognition, we need to maintain a large database and computation has to be performed for each instance to be recognized even when a given instance is provided for re-recognition.

Miscellaneous Techniques:

The Linguistic Approach

This method uses the formal grammar to represent the hand gestures and postures however limited. This method involves simple gestures requiring the fingers to be extended in various configurations which are mapped to the formal grammar specified by specific tokens and rules. The system involves tracker and glove. This system has poor accuracy and very limited gesture set.

Appearance based models:

These models are based on the observation that the humans are able to recognize the fine actions from the very low resolution images with little or no information about the 3D nature of the scene. These methods involve measurement of the regional statistics of the particular region of the image based on intensity values of the region. This method is simple but is unable to capture fine details in the gesture.

Spatio-temporal vector Analysis

This method is used to track the movement of the hand in the images of the scene and track the motion in the sequence of image. The information about the motion is obtained by the derivatives and it is assumed that under static background, hand motion is the fastest changing object of the scene. Then using the refinement and variance constraint flow field is refined. This flow field is captures the characteristics of the given gesture.

Application of the method:

This sections deal with introduction to the application of the posture and gesture in various domains like:

Sign language: where high accuracy of 90 % have been obtained under some constraints.

Gesture to speech: in this which hand gestures are converted to speech.

Presentations: Hand motion and gestures are used to generate presentations.

Multimodal interaction: Hand gesture and motion is incorporated along with speech to generate better user interfaces.

Human Robot interaction: Hand gestures are used as natural mode to control robots:

Other domains include, Virtual environment interaction, 3D modeling in virtual environment, television control.


This paper presents survey of the various approaches used for gesture recognition with beautiful classification into three approaches for recognition which have been further elaborated with methods involved in each approach. Most of the paper deals with the works done prior to 1999 and the advancements in the past 9 years are definitely worth exploring considering advancements in computational power, better sensors , gloves, vision capturing devices and human touch interfaces. It was interesting to note that the problems about the techniques that we discuss in the class have been known to the research community since last 9 years but still only few have been resolved and that too partially. This is mainly because of the complexities involved in the gestures and also absence of any robust segmentation approach.


Brandon said...

yeah, i too would have liked to have seen a segmentation section. my guess is that segmentation wasn't an issue pre-1999 since recognition of pre-segmented gestures and postures still wasn't great.

Joane said...

From this survey, u think which approach is better ? which techniques is best for extract the hand region and the recognition?