Wednesday, February 27, 2008
American Sign Language recognition in Game Development for Deaf children
According to the authors, no ASL recognition engine existed to test the setup, so they ran their experiments as a Wizard of Oz study, with a human wizard who would eventually be replaced by the computer. Since the system is at a nascent stage, they limited the ASL to single- and double-handed gestures, with no facial or other expressions. The vocabulary was chosen to fit the system constraints while matching what is taught in a real class. The system follows a push-to-sign approach: the user pushes a button to activate the recognition system and then performs the gesture to be recognized.
The system consists of colored gloves and wireless accelerometers, which capture the motion of the hands from the effect of gravity on the X, Y and Z axes. All the related hardware was developed in house. Because the gloves are colored, they use a color segmentation approach in which the discriminating information between the background and the glove color, based on the HSV histogram, is used for segmentation. The data from the vision and sensor based approaches is provided to the trained HMMs, which recognize the sign and trigger the mapped action in the game. The HMM toolkit proposed for the system is GT2k, developed at Georgia Tech. A human observer prunes the responses and labels them as correct or incorrect. The system architecture is shown in the figure to the left.
They report results for user-dependent and user-independent models. For the user-dependent models, they obtained an accuracy of 93.39% by training on 90% of the data and testing on the remaining 10%, repeated 100 times. For the user-independent models, they obtained an accuracy of 86.28%.
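The 90/10 evaluation described above can be sketched as repeated random hold-out. This is only an illustration of the procedure, not the authors' code: the function names, the seed and the `majority_baseline` stand-in classifier are my own assumptions.

```python
import random


def repeated_holdout(samples, labels, train_and_score,
                     test_fraction=0.1, repeats=100, seed=0):
    """Average accuracy over `repeats` random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    n_test = max(1, round(len(samples) * test_fraction))
    scores = []
    for _ in range(repeats):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        train = [(samples[i], labels[i]) for i in train_idx]
        test = [(samples[i], labels[i]) for i in test_idx]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)


def majority_baseline(train, test):
    """Hypothetical stand-in for a real recognizer: always predicts
    the most common training label and returns the test accuracy."""
    train_labels = [label for _, label in train]
    guess = max(set(train_labels), key=train_labels.count)
    return sum(1 for _, label in test if label == guess) / len(test)
```

In a real experiment `train_and_score` would train the HMMs on `train` and return the fraction of `test` signs recognized correctly.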
They report an average success rate of 92.96% over all samples at the word level. At the sentence level, however, they report lower accuracy, because words can be deleted or inserted.
Discussion:
This paper presented a nice system that can teach children with hearing disabilities ASL interactively through a game, which makes it more exciting than boring classes. The Wizard of Oz study shows how children would like to interact with the system, and thus gave an idea of what the system should look like and how it should interact. The use of vision together with simple Bluetooth wireless adapters was interesting, as it frees the users from wires that would make things messy. Overall it was a nice paper with a nice practical application, but I still cannot find GT2k anywhere online!
A Method for Recognizing a Sequence of Sign Language Words Represented in a Japanese Sign Language
To identify the borders between words, they propose two measures. One measures the change in velocity, and the other measures the dynamic change in the direction of hand movement. A segmentation point is registered when a measure exceeds a certain threshold. Since the hand-movement measure can produce spurious borders because of noise, they take as the border the movement-based candidate that is closest to the border detected from the change in velocity. Using this information, gestures can be segmented from the stream of gestures and sent to the recognition system for identification.
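The border detection described above can be sketched roughly as follows. This is my own simplified reading, not the paper's formulas: the 2D points, the frame-to-frame speed difference, the turning angle and both thresholds are assumptions made for illustration.

```python
import math


def velocity_borders(points, threshold):
    """Frames where the speed changes by more than `threshold`."""
    speed = [math.dist(a, b) for a, b in zip(points, points[1:])]
    # speed[i] is the step from frame i to i + 1, so a jump between
    # speed[i - 1] and speed[i] is located at frame i.
    return [i for i in range(1, len(speed))
            if abs(speed[i] - speed[i - 1]) > threshold]


def direction_borders(points, threshold_rad):
    """Frames where the movement direction turns by more than the threshold."""
    borders = []
    for i in range(1, len(points) - 1):
        ax, ay = points[i][0] - points[i - 1][0], points[i][1] - points[i - 1][1]
        bx, by = points[i + 1][0] - points[i][0], points[i + 1][1] - points[i][1]
        if (ax, ay) == (0, 0) or (bx, by) == (0, 0):
            continue  # direction is undefined while the hand is still
        angle = abs(math.atan2(ax * by - ay * bx, ax * bx + ay * by))
        if angle > threshold_rad:
            borders.append(i)
    return borders


def combine_borders(velocity, direction):
    """For each velocity border, keep the closest direction border."""
    if not direction:
        return []
    return sorted({min(direction, key=lambda d: abs(d - v)) for v in velocity})
```

Noisy direction candidates that are far from every velocity border are dropped by `combine_borders`, which is the effect the summary attributes to the combined measure.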
To identify which hand is used in a gesture, they calculate several parameters representing the difference between the movements of the right and the left hand. These are essentially the ratio of the hands' velocities and the difference of the squared velocities, normalized by the duration of the gesture. The parameters are calculated separately for the left and the right hand; if the value is below a certain threshold for both hands, then both hands are being used, otherwise only one hand is being used. They use another measure, based only on the relative velocity of the hands, to determine which hand is used in the gesture.
Sequence candidates are generated by evaluating a measure that considers whether each identified segment is a transition or a word, and keeping only the words. The segments identified as words are combined with the segments identified as transitions using a weighted sum, which is used to evaluate the sentence.
They evaluated this system on 100 JSL samples, which included 575 word segments and 571 transition segments. Of the transitions, 458 (80.2%) were correctly recognized and 64 (11.2%) were misjudged as words.
Discussion:
This paper presented an approach similar to what we have been using in sketch recognition, i.e. taking cues about the stroke from speed and direction. I am not sure we can use the velocity cues with much accuracy, as gestures are normally made very fast. However, I liked that they also used the change in hand movement and then used both cues to identify the border. On the other hand, I did not like the flow of the paper, and I was confused by the language too. It was not very clear how they obtained the thresholds, or whether people with different styles and speeds can use the system with the same thresholds. They also admit that some of the gestures were misidentified because their system does not use any spatial information, which may cause errors in many gestures.
Monday, February 25, 2008
Computer Vision based gesture recognition for an augmented reality interface
The paper states that to interact in a virtual environment, the user should be able to select an object by pointing at it and clicking. For any such system, it is therefore important to have these two as the primary features. To make the pointing gesture intuitive, they use the index finger as the pointer. They also use some basic gestures, very different from each other, shown in the figure below:
By constraining the users to perform the gestures in one plane, they restrict the problem to 2D although it is 3D in nature. It is stated that after some trials users were able to adapt to the constraint without much difficulty.
As a first step, the system segments the fingers so that they can be distinguished from the placeholder objects. For this they use a color cue to separate skin from the other objects. To deal with intensity and illumination changes, they use a color space that is invariant to illumination changes, i.e. the normalized RGB space (chromaticity). In this space different objects form different clusters. Each cluster is described by a confidence ellipse built from its mean and covariance matrix, and the distance of a pixel's chromaticity from the cluster is measured as the Mahalanobis distance, which gives a label for each pixel according to its chromaticity. A blob of the predetermined size is then labeled as the hand, and small blobs, which are actually misclassified objects, are discarded. Missing pixels in the hand blob are filled in using morphological operators. To handle dynamic range issues, only pixels with a certain minimum intensity are considered. At the upper end, pixels with at least one channel at 255 are discarded.
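The chromaticity test described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the squared-distance cutoff of 9.0 and the minimum intensity of 30 are assumptions I picked for the example.

```python
def chromaticity(rgb):
    """Normalized (r, g) coordinates, which drop overall brightness."""
    r, g, b = rgb
    total = r + g + b
    if total == 0:
        return (0.0, 0.0)
    return (r / total, g / total)


def fit_cluster(training_pixels):
    """Mean and 2x2 covariance of the training pixels in chromaticity space."""
    points = [chromaticity(p) for p in training_pixels]
    n = len(points)
    mean_r = sum(p[0] for p in points) / n
    mean_g = sum(p[1] for p in points) / n
    var_r = sum((p[0] - mean_r) ** 2 for p in points) / (n - 1)
    var_g = sum((p[1] - mean_g) ** 2 for p in points) / (n - 1)
    cov_rg = sum((p[0] - mean_r) * (p[1] - mean_g) for p in points) / (n - 1)
    return (mean_r, mean_g), (var_r, cov_rg, var_g)


def mahalanobis_squared(rgb, mean, cov):
    """Squared Mahalanobis distance of a pixel from the cluster."""
    r, g = chromaticity(rgb)
    dr, dg = r - mean[0], g - mean[1]
    var_r, cov_rg, var_g = cov
    det = var_r * var_g - cov_rg ** 2
    # Closed-form inverse of the 2x2 covariance matrix.
    return (var_g * dr * dr - 2 * cov_rg * dr * dg + var_r * dg * dg) / det


def is_skin(rgb, mean, cov, max_dist_sq=9.0, min_intensity=30):
    """Reject saturated or too-dark pixels, then test the confidence ellipse."""
    if max(rgb) >= 255 or sum(rgb) / 3 < min_intensity:
        return False
    return mahalanobis_squared(rgb, mean, cov) <= max_dist_sq
```

Because only the ratios between channels are used, a pixel and a darker copy of it land on the same chromaticity point, which is what makes the test tolerant of illumination changes.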
Since each gesture can be recognized by the number of fingers, they use a polar transformation and a number of concentric circles to count the fingers crossing each radius. The click gesture is a movement of the thumb, and they determine whether the thumb has moved by comparing the bounding boxes over a series of frames.
Discussion:
I chose this paper because I thought it would be nice to talk about the role of gestures in augmented reality. This was a very simple paper, and the gestures they recognize are very simple. The good parts are their hand segmentation approach and some new ideas for an augmented reality office, which have come up in some of our discussions. I did not like that, although they claim to recognize 3D gestures, they reduced the problem to a simpler 2D one by constraining users to move in a plane. I believe their recognition approach, based on counting fingers, cannot work in 3D, as occlusion between the fingers would give ambiguous recognition results. However, I liked the approach they presented, since with such a system a design meeting would be much more interactive and less confusing.
Sunday, February 24, 2008
Georgia Tech Gesture Toolkit: Supporting Experiments in
This toolkit provides tools for preparation, training, validation and recognition using HMMs. Preparation requires the user to design appropriate models, determine an appropriate grammar and provide labeled examples of the gestures to be performed. All these steps require some analysis of the available data and the gestures involved. The validation step evaluates the potential performance of the overall system. Validation approaches such as cross-validation and leave-one-out are used in the paper. In cross-validation, a portion of the data is used for training and the other part for testing, whereas in leave-one-out, one data sample is always held out for testing and the model is iteratively trained on the rest. Training uses the information from the preparation stage to train a model for each gesture, and recognition, based on the HMMs, classifies new data using the trained models.
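The two validation schemes above differ only in how the data is partitioned, which a small sketch makes concrete. This is not GT2k's interface; the function names are hypothetical and the folds are simple contiguous blocks.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples)
                 if i < start or i >= start + size]
        yield train, test
        start += size


def leave_one_out_splits(n_samples):
    """Leave-one-out is k-fold with one sample per fold."""
    return k_fold_splits(n_samples, n_samples)
```

Leave-one-out uses the data most thoroughly but needs one training run per sample, which can be slow for HMMs trained on long gesture sequences.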
Since any practical application needs the system to understand continuous gestures, the authors propose the use of a rule-based grammar. With such a grammar, complex gestures can be described by a set of simple rules.
This toolkit has been used in various projects at Georgia Tech, and the authors give brief details of them. The first application is a gesture-based system for changing radio stations by performing certain gestures. The data is obtained from LED sensors. As a gesture is made, some of the LEDs are occluded, which provides the information about the gesture. This information is used to train the model, which can then be used for recognition. The authors claim correct classification of 249 out of 251 gestures with this approach.
Another project is secure entry based on patterned eye blinks. In this project, face recognition is coupled with eye blinking to build a person recognition model. Optical flow from the images was used to capture the blinking pattern. It was observed that a 9-state left-to-right HMM was able to model the blinking pattern. This model achieved an accuracy of 89.6%.
Another project integrates computer vision with sensing devices such as accelerometers and other mobile sensors for sensing motion. For correct recognition, differently colored gloves are used. The features obtained by both techniques are integrated into a combined feature vector representing a given gesture for the recognition process. For this project, they use a 5-state left-to-right HMM with self transitions and two skip states. This project had not been implemented when the paper was published, so no results are available.
The authors also mention the use of an HMM-based approach to recognize the actions of a workman in a workshop. They use vision and sensors to obtain the features that are used to recognize the gestures. Since workmen are supposed to perform a series of gestures in order, the model keeps track of their moves and reports an error if they miss a gesture. According to the authors, this system achieved an accuracy of 93.33%.
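The left-to-right topology mentioned for these projects can be illustrated by the shape of its transition matrix. This is a sketch under my own assumptions: the function name and the `max_skip` parameter are hypothetical, the probabilities are uniform starting values that training would re-estimate, and I read "skip" as allowing a jump over one intermediate state.

```python
def left_to_right_transitions(n_states, max_skip=1):
    """Initial transition matrix for a left-to-right HMM.

    From state i the model may stay in i (self transition), move to
    i + 1, or skip ahead over up to `max_skip` states. Backward
    transitions are forbidden.
    """
    matrix = []
    for i in range(n_states):
        allowed = range(i, min(n_states, i + max_skip + 2))
        p = 1.0 / len(allowed)
        matrix.append([p if j in allowed else 0.0 for j in range(n_states)])
    return matrix


# Print the structure of a 5-state model, one row per source state.
for row in left_to_right_transitions(5):
    print(["%.2f" % p for p in row])
```

The zeros below the diagonal are what make the model suit gestures: a gesture progresses through its phases in order and never goes back.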
Discussion:
This paper just provided an overview of a new toolkit for gesture recognition being developed at Georgia Tech on top of HTK, which was developed at Cambridge University. Although after reading the paper I was happy that they have something ready for gestures, I was disappointed to find no code on the project webpage, which according to the paper should have been made available as early as 2003. The projects where they apply the technique also don't look very attractive: with Bluetooth and wireless remotes, changing channels is much easier than making gestures, and it is quite possible that a slight unintended occlusion could trigger a channel change. I also believe that voice technology is now far superior for the purpose. The blinking-eye entry system was also something I did not like. It is not very difficult to copy a blinking pattern, and making and remembering complex blinking patterns is not an easy task. (It is torture to the eyes if you have an eye infection :).) With fingerprint biometrics and retinal signatures, establishments are more secure. I have experience dealing with people in a workshop, and I know their motions are very much mechanized and measured to meet fast manufacturing requirements, but there are still many unintended motions (after all they are humans), which the presented system could interpret as gestures and raise an alarm. Also, it would be really troublesome to work in a workshop with accelerometers on your body, which could even affect efficiency.
Well, there is not much more to say about the paper. If this toolkit is available for download somewhere, it would be helpful and good to have a look at it. Maybe it can save us from some hardcore programming.
Wednesday, February 20, 2008
Television Control by Hand Gestures
To detect the hand in the image, they use a normalized correlation measure between the hand template and the image frame. The location of the hand in the image frame is the region of maximum correlation. Either orientation information or pixel information can be used to measure the correlation, but they found that orientation information works a little better. For measuring speed, they use derivative information. The stationary background is removed by simply taking a running average of the image frames of the scene and subtracting it from the frames, which removes stationary objects from the image frame. To end the control mode, the user just has to close his hand.
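The template search and the running-average background described above can be sketched in plain Python. This is an illustration only: I use the zero-mean form of normalized correlation on raw pixel values (the paper prefers orientation information), and the function names and `alpha` are my own assumptions.

```python
import math


def normalized_correlation(patch, template):
    """Zero-mean normalized correlation between two equally sized 2D lists."""
    p = [v for row in patch for v in row]
    t = [v for row in template for v in row]
    mean_p, mean_t = sum(p) / len(p), sum(t) / len(t)
    num = sum((a - mean_p) * (b - mean_t) for a, b in zip(p, t))
    den = math.sqrt(sum((a - mean_p) ** 2 for a in p)
                    * sum((b - mean_t) ** 2 for b in t))
    return num / den if den else 0.0


def best_match(image, template):
    """Slide the template over the image; return (score, (row, col))."""
    th, tw = len(template), len(template[0])
    best_score, best_pos = -2.0, None
    for y in range(len(image) - th + 1):
        for x in range(len(image[0]) - tw + 1):
            patch = [row[x:x + tw] for row in image[y:y + th]]
            score = normalized_correlation(patch, template)
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_score, best_pos


def update_background(background, frame, alpha=0.05):
    """Running average of the scene; subtracting it removes static objects."""
    return [[(1 - alpha) * b + alpha * f for b, f in zip(brow, frow)]
            for brow, frow in zip(background, frame)]
```

The normalization makes the score insensitive to uniform brightness and contrast changes, so a hand that is lit differently from the template can still give a high match.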
To avoid the computational cost of searching the complete image frame for every template, the system finds the position of the best match of the current filter and then searches only the local region around it with the other templates, to find the position and value of the filter giving the best correlation match.
To avoid accidentally pressing a trigger, they use a threshold based on the time of inactivity. Their system is limited by its field of view, which is 25 degrees when searching for the trigger and 15 degrees when tracking.
Discussion:
This is a simple vision based approach that tracks the hand and uses its motion to control the volume of the television set. Since they are using orientation, I believe that if someone holds the hand slanted relative to the principal axis of the camera, template matching will be difficult, because the template has to match the shape of the object. I also believe that, considering the complexity of current television controls, such a remote control would be difficult to use, as it might be very tiring. Finally, I am unable to tell whether the distance from the television affects the recognition, since the apparent size of the hand would change.
Tuesday, February 19, 2008
3D Visual Detection of Correct NGT Production
This paper presents a vision based technique to recognize gestures of NGT (Dutch Sign Language). Many NGT signs contain periodic motion, e.g. repetitive rotation, or moving the hands up and down, left and right, or back and forth. The motion therefore has to be captured in 3D. For this purpose, the user is confined to a specific region in front of a pair of stereo cameras with wide angle lenses, with no obstruction between the cameras and the hands, so that the gestures are captured clearly. The complete setup and architecture of such a system is shown in the figure below.
Since the hands are the region of interest, it is very important that their motion be tracked. The authors propose a segmentation scheme that separates skin from the image frames using skin models trained on positive and negative skin examples specified by the users. Skin color is modeled by a 2D Gaussian perpendicular to the main direction of the distribution of the positive skin samples in RGB space, which is obtained by a sampling consensus method called RANSAC. To compensate for the uncertainty in the chrominance direction, the Mahalanobis distance of the color to the brightness is measured and divided by the color intensity. This gives a kind of normalized pixel intensity that handles very bright regions on the skin caused by varied light sources and their different directions.
To track the gestures, it is important to follow the detected blobs (head and hands) and their movement. Several frames per second are captured to record the sequence, and the blobs are tracked in each frame using a best template match approach. It is important, though, that the right blob be recognized as right and the left blob as left. Occlusion may also disrupt the gesture recognition; to prevent this, the blobs are reinitialized if there is occlusion for more than 20 frames. For reference, a synchronization point is selected for each sign, which is not used for training. Various features such as 2D angles, the tangent of the displacement between consecutive frames, and upper, lower and side angles are extracted from each frame. The time signal formed by these feature measurements is warped onto the reference by Dynamic Time Warping, which yields a list of time correspondences between the new gesture and the reference sign. For classification, a Bayesian classifier based on the independent feature assumption is trained for each gesture. The classifier is trained on data of 120 different NGT signs performed by 70 different persons, using 7-fold cross-validation. Their system achieved an accuracy of 95% true positives, and its response time was just 50 ms, which allows real-time use.
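The warping step above can be sketched with the textbook dynamic programming form of Dynamic Time Warping. This is a generic version on one-dimensional signals with an absolute-difference cost, which is my simplification; the paper works with multi-feature frames.

```python
def dtw(reference, signal):
    """Return (total cost, list of (reference_index, signal_index) pairs)."""
    n, m = len(reference), len(signal)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(reference[i - 1] - signal[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # reference advances
                                 cost[i][j - 1],      # signal advances
                                 cost[i - 1][j - 1])  # both advance
    # Walk back from the end to recover the time correspondences.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path
```

A sign performed more slowly than the reference still aligns at zero cost as long as it passes through the same values in the same order, which is why the features can be compared after warping.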
Discussion:
This is a vision based paper that throws in a lot of image processing, and the classification is basically recognition of the patterns created by the hands and head as observed in the image frames. Since they are tracking blobs, I believe they have to specify the synchronization point for each person who is going to use the system. Also, if the head moves while performing gestures, it might affect certain gestures, as the head may be confused with a hand blob. There should also be no object in the background that resembles a blob, as it could cause misrecognition. In addition, the system requires the person to be constrained in an arrangement that may not be comfortable. I wonder if they did any user study while obtaining their data. As far as the image processing is concerned, they have done a pretty good job of handling chromaticity changes by training the system under various illumination conditions and using a normalized chromaticity value for each color.
Real-time Locomotion Control by Sensing Gloves
Since such a system requires calibration to match the finger movements to the walking pattern of the character, a calibration process forms the first stage. In the calibration process, the user mimics the motion of the on-screen character with the fingers. After this, the autocorrelation between the trajectories is calculated, which maps the topology of the finger movement to the movement of the virtual character. It is assumed that the character performs simple movements such as walking, hopping, running and trotting.
After calibration, the next step is the control process, in which the player performs a new movement with the hand. The corresponding motion of the character is then generated from the mapping function obtained during the calibration stage and displayed in real time. The user can change the movement of the fingers and the wrist to produce similar but different motions.
The system was tested using a CyberGlove and a Flock of Birds tracker, because the P5 glove is sensitive to external noise and its sensor for 3D location is not very sensitive when the hand is away from the tower. In their test, users were given 30 minutes to familiarize themselves with the system and were then asked to do certain tasks using the keyboard and the sensing gloves. It was observed that users had fewer collisions with the sensing gloves, although they took more time.
Discussion:
This is a simple mapping of finger movements to the movements of a simple stick-type character in the virtual world. I can understand that the users took more time with the sensing gloves, as most of us are more familiar with the keyboard and would obviously be faster with it than with gloves. Working with gloves can also be tiresome: with a keyboard, pressing one key does the task, while here we have to mimic each and every step with hand movements. In zigzag, maze-like movements requiring sudden turns, it could be really painful. Overall, however, it was a nice application for small games. It could also be useful for small virtual tours in 3D space.
Sunday, February 17, 2008
A Survey of Hand Posture and Gesture Recognition Techniques and Technology
This is a survey paper on the hand posture and gesture recognition techniques used in the literature. Chapter 3 surveys the various algorithmic techniques that have been used over the years for the purpose. The author classifies the approaches in the literature into 3 major categories:
- Feature Extraction, Statistical models.
- Learning Algorithms.
- Miscellaneous Algorithms.
Feature Extraction, Statistical Models:
This category of methods deals with the extraction of features, in the form of mathematical quantities, from the available data, which is captured through sensors (gloves) or images. The methods are further classified into sub categories like:
Template Based: In this approach the data obtained is compared against some reference data and using the thresholds, the data is categorized into one of the gestures available in the reference data. This is a simple approach with little calibration but suffers from noise and doesn’t work with overlapping gestures.
Feature extraction: This approach extracts low level information from the data and combines it to produce high level semantic features that can be used to classify gestures and postures. The methods in this sub category usually capture changes and measure certain quantities during those changes. The collection of these values is used to label a posture, which can subsequently be extended to a gesture. This method, however, suffers from heavy computational cost, and it fails unless a gesture is formed by a specific sequence.
Active shape Models:
Active shape models are used in image based gesture recognition systems. A contour that is roughly the shape of the feature to be tracked is placed on the image. The contour is then moved iteratively towards nearby edges, which deforms it to fit the feature. The drawback is that it can capture only those gestures that can be performed with an open hand. There is also very little work in this direction. However, with a limited set of gestures meeting the open hand criterion, this method has been found to work in real time. Stereo cameras cannot be used.
Principal component analysis:
This is basically a dimension reduction method in which the significant eigenvectors (chosen by their eigenvalues) are used to project the data. This approach captures the significant variability in the data and can thus be used to identify gestures and postures in a vision based system. Although the method could also be applied to glove based approaches, at that time (1999) only vision based techniques had been explored. The method has the drawback that there must be a dominant variance in at least one direction. If the variance is uniformly distributed in the data, it will not yield relevant principal vectors, and if there is noise, PCA will treat it as significant variation too. The method is also sensitive to the scale of the hand size and to position, which can be handled by normalization. Even then, the method is user dependent.
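The projection idea above is easy to see in two dimensions, where the eigenvectors of the covariance matrix have a closed form. This is a toy sketch with made-up function names; real systems work with many more dimensions and a numerical eigen-solver.

```python
import math


def first_principal_axis(points):
    """Return (mean, unit axis of largest variance, (eigenvalue1, eigenvalue2))."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Orientation of the major axis of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    axis = (math.cos(theta), math.sin(theta))
    half_trace = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    return (mx, my), axis, (half_trace + spread, half_trace - spread)


def project(points, mean, axis):
    """Reduce each 2D point to its coordinate along the principal axis."""
    return [(x - mean[0]) * axis[0] + (y - mean[1]) * axis[1]
            for x, y in points]
```

When the two eigenvalues are nearly equal there is no preferred axis, which is the uniform-variance failure case mentioned above, and noise along any direction inflates that direction's eigenvalue just like genuine variation would.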
Linear finger Tip model:
This method requires special markers on the fingertips, and the fingertip motion is then segmented from the scene image. This motion is analyzed for the possible gesture. The method works well for simple gestures and recognizes the initial and final positions well. However, the system cannot work in real time and recognizes only a small set of gestures because of the limited possible finger motions. Curvilinear motions are also not taken into account.
Causal analysis
This method is based on the interaction of humans with the environment and on capturing the body kinematics and dynamics. It also suffers from limited gesture sets, and no orientation information can be used. Besides, this system cannot work in real time.
Learning Algorithms
These are machine learning algorithms that deal with the learning of the gesture based on the data manipulation and weight assignment. The popular techniques in this sub category are:
Neural Networks:
This method is based on modeling the element of the human nervous system called the neuron and its interaction with other neurons to transfer information. Each node (neuron) consists of an input function, which computes the weighted sum of its inputs, and an activation function, which generates the response based on that weighted sum. There are two types of neural network, feed-forward and recurrent. Methods based on this approach have to deal with heavy training and the computational cost of training offline. For complex systems, such a model can also become very complex. In addition, adding each new gesture or posture requires complete retraining of the network.
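The node described above, an input function followed by an activation function, can be sketched in a few lines. The sigmoid activation, the bias term and the layer representation are my own assumptions; the survey does not fix them.

```python
import math


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs passed through a sigmoid activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))


def feed_forward(inputs, layers):
    """Each layer is a list of (weights, bias) pairs, one per neuron.
    The outputs of one layer become the inputs of the next."""
    activations = inputs
    for layer in layers:
        activations = [neuron(activations, w, b) for w, b in layer]
    return activations
```

Adding a gesture class means adding an output neuron, and since every weight contributes to every output, the whole network typically has to be retrained, which is the drawback noted above.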
Hidden Markov Model:
This method has been widely used for temporal gesture recognition. An HMM consists of states and state transitions with observation probabilities. A separate HMM is trained for each gesture, and a gesture is recognized as the one whose HMM generates the maximum probability. This method also suffers from the training time involved and from its complex working, as the results are hard to predict because of the hidden states. For gesture recognition, the Bakis HMM is commonly used.
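The "one HMM per gesture, pick the most probable" scheme above can be sketched with the forward algorithm on discrete observations. The two toy models, their names and their probabilities are invented for illustration; real systems use continuous features and log probabilities to avoid underflow.

```python
def forward_likelihood(observations, start, trans, emit):
    """P(observations | model), computed by the forward algorithm."""
    n = len(start)
    alpha = [start[s] * emit[s][observations[0]] for s in range(n)]
    for obs in observations[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][obs]
                 for s in range(n)]
    return sum(alpha)


def classify(observations, models):
    """Return the name of the model that best explains the observations."""
    return max(models,
               key=lambda name: forward_likelihood(observations, *models[name]))


# Two hypothetical 2-state Bakis (left-to-right) models over the
# symbols 0 = "hand moving up" and 1 = "hand moving down".
bakis = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "rise_then_fall": ([1.0, 0.0], bakis, [[0.9, 0.1], [0.1, 0.9]]),
    "fall_then_rise": ([1.0, 0.0], bakis, [[0.1, 0.9], [0.9, 0.1]]),
}
print(classify([0, 0, 1, 1], models))
```

Both models share the same topology and differ only in what each state tends to emit, so the order of the observations decides which one wins.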
Instance based learning:
An instance is the vector of features of the entity to be classified. These techniques compute the distance between a given data vector and the instances in a database. The method is very expensive: a large database has to be maintained, and the computation has to be performed for each instance to be recognized, even when the same instance is presented again for recognition.
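A nearest neighbor lookup is the simplest example of this family and shows where the cost comes from. The function name and the Euclidean distance are my choices; the survey covers instance based learning more generally.

```python
import math


def nearest_neighbor(query, database):
    """Return (label, distance) of the stored instance closest to `query`.

    `database` is a list of (feature_vector, label) pairs. Every call
    scans the whole database, which is the cost noted above.
    """
    best_label, best_dist = None, float("inf")
    for features, label in database:
        d = math.dist(query, features)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label, best_dist
```

Nothing is learned ahead of time, so adding a new gesture is just a matter of storing more instances, but both memory use and lookup time grow with the database.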
Miscellaneous Techniques:
The Linguistic Approach
This method uses a formal grammar to represent a limited set of hand gestures and postures. It involves simple gestures in which the fingers are extended in various configurations, which are mapped to the formal grammar specified by specific tokens and rules. The system uses a tracker and a glove. It has poor accuracy and a very limited gesture set.
Appearance based models:
These models are based on the observation that humans are able to recognize fine actions from very low resolution images, with little or no information about the 3D nature of the scene. The methods measure regional statistics of a particular region of the image based on its intensity values. This approach is simple but is unable to capture fine details in the gesture.
Spatio-temporal vector Analysis
This method tracks the movement of the hand through the sequence of images of the scene. The information about the motion is obtained from derivatives, and it is assumed that, under a static background, the hand is the fastest changing object in the scene. The flow field is then refined using refinement and variance constraints. This flow field captures the characteristics of the given gesture.
Application of the method:
This section introduces applications of postures and gestures in various domains, such as:
Sign language: high accuracies of 90% have been obtained under some constraints.
Gesture to speech: hand gestures are converted to speech.
Presentations: hand motion and gestures are used to generate presentations.
Multimodal interaction: hand gestures and motion are incorporated along with speech to produce better user interfaces.
Human robot interaction: hand gestures are used as a natural mode to control robots.
Other domains include virtual environment interaction, 3D modeling in virtual environments and television control.
Discussion:
This paper presents a survey of the various approaches used for gesture recognition, with a beautiful classification into three approaches, each further elaborated with the methods it involves. Most of the paper deals with work done prior to 1999, and the advancements of the past 9 years are definitely worth exploring, considering the progress in computational power, sensors, gloves, vision capturing devices and human touch interfaces. It was interesting to note that the problems with the techniques that we discuss in class have been known to the research community for the last 9 years, yet only a few have been resolved, and those only partially. This is mainly because of the complexities involved in gestures and the absence of any robust segmentation approach.
Dynamic Gesture Recognition System for Korean Sign language
In this paper the authors describe a method based on their proposed fuzzy min-max neural networks. Considering the variety of possible gestures in KSL, the authors reduce the number of gestures to 25, which according to them are the most common and basic ones. The sensory information about the gestures is obtained from a data glove, which generates 16 responses; these are further reduced to obtain only the directional changes in the postures. Based on their data, they reduce the data to 10 basic direction types, which capture the directional change information in the postures. To reduce the data processing time and filter effectively, the x and y ranges are divided into 8 regions (based on their observation of the deviation in these directions), which form their local coordinate system. The directional information is stored in 5 cascading registers. The directional information for each time unit is recorded in these registers as "+" for right/upper motion, "-" for left/lower motion and "x" for the don't-care position. The change in direction is observed from the previous position and the new position, as given by the values in these registers.
According to the authors, the 25 gestures contain 14 postures, which are recognized by the so-called fuzzy min-max neural networks. The fuzzy min-max neural network requires no prior learning of the postures and can adapt online.
For recognition, the gestures generate the data which is inputted to the system which transforms this raw data set into asset containing small number of data which is then used to identify the direction class. After the direction class is recognized, posture recognition method is used to identify the gesture.
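To make the posture recognition step a little more concrete, here is a minimal sketch of a fuzzy min-max classifier. It assumes the network follows the general hyperbox scheme usually associated with fuzzy min-max networks (each class owns one or more boxes, and membership decays with distance outside a box); the paper's exact variant may differ. The overlap test and contraction step are left out, features are assumed to be scaled to [0, 1], and the class name, the parameters `gamma` and `theta`, and the posture labels are all made up for illustration.

```python
import numpy as np

class FuzzyMinMax:
    """Toy fuzzy min-max classifier: one or more hyperboxes per class."""

    def __init__(self, gamma=4.0, theta=0.3):
        self.gamma = gamma   # how fast membership decays outside a box (assumed value)
        self.theta = theta   # maximum average side length of a box (assumed value)
        self.boxes = []      # each entry: [min_point, max_point, label]

    def membership(self, a, v, w):
        """Degree (0..1) to which point a belongs to the box [v, w]."""
        a = np.asarray(a, dtype=float)
        above = np.maximum(0, 1 - np.maximum(0, self.gamma * np.minimum(1, a - w)))
        below = np.maximum(0, 1 - np.maximum(0, self.gamma * np.minimum(1, v - a)))
        return float((above + below).sum() / (2 * a.size))

    def learn(self, a, label):
        """Online update: expand a same-class box if allowed, else add a new one."""
        a = np.asarray(a, dtype=float)
        best = None
        for box in self.boxes:
            v, w, lab = box
            if lab != label:
                continue
            grown = np.maximum(w, a) - np.minimum(v, a)
            if grown.sum() <= a.size * self.theta:   # expansion criterion
                m = self.membership(a, v, w)
                if best is None or m > best[0]:
                    best = (m, box)
        if best is None:
            self.boxes.append([a.copy(), a.copy(), label])   # new point-sized box
        else:
            box = best[1]
            box[0] = np.minimum(box[0], a)
            box[1] = np.maximum(box[1], a)

    def classify(self, a):
        """Return the label of the box with the highest membership."""
        scores = [self.membership(a, v, w) for v, w, _ in self.boxes]
        return self.boxes[int(np.argmax(scores))][2]

# Hypothetical two-feature posture data, scaled to [0, 1].
net = FuzzyMinMax()
net.learn([0.10, 0.10], "fist")
net.learn([0.15, 0.12], "fist")   # expands the existing "fist" box
net.learn([0.90, 0.90], "open")   # different class, so a new box is created
print(net.classify([0.12, 0.11]))
```

Each training sample is absorbed in a single pass, which is one way to read the claim that no pre-learning is needed and that the network can adapt online.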
The complete system is represented by the figure shown below:
Discussion:
I do not have much to say about this paper because I have no clue what the authors wanted to convey by presenting the fuzzy min-max neural network system, which looks like a simple template matching system based on the direction segmentation of the postures. Except for the direction values used to classify the gestures, nothing is impressive. Also, they have considered 25 gestures with 14 postures which look simple, with just finger movements (no yaw and pitch). Another flaw is that there is no mention of a user study: the min-max values derived from the data may be very user specific, and the complete setup might require retuning with new data. As far as the results are concerned, reporting "about 85%" is very vague, as it is not clear whether the figure is slightly more or slightly less than 85%. It would have been much better if they had conducted several experiments and reported an exact average classification rate. It would also have been nice if they had spent some time explaining their min-max NN, which looks very confusing to me in the diagram presented.
Wednesday, February 13, 2008
Shape your Imagination: Iconic Gestural Based Interaction
This paper is basically an observational study in which computer scientists observe the iconic gestures of people from different educational backgrounds to test the hypothesis that iconic gestures can be employed as a natural and intuitive HCI technique for conveying spatial information. The subjects, 5 males and 7 females aged 21 to 31, were drawn from a variety of educational domains: languages, science, politics, nutrition and health, and library.
Subjects were seated in a quiet room, presented with the name of a shape or an object, and asked to convey it non-verbally using hand gestures. The primitive shapes chosen were circle, triangle, square, cube, cylinder, sphere and pyramid. The complex shapes chosen were football, chair, French baguette, table, vase, car, house and table lamp. The shapes were presented to the subjects in order.
The study found that subjects preferred to use two hands to draw a virtual description (i.e. boundary tracing etc.) of the primitive shapes as well as some simple 2D shapes. For the complex shapes, they tended to use iconic gestures. Subjects also used pantomimic, deictic and body gestures, sometimes in conjunction with and sometimes in place of the iconic gesture. Some of the items (mostly the complex 3D ones) were found to be too complex for the users. Complex items also took more time to produce representative gestures for.
Discussion:
The paper is pretty straightforward. This is an observational study in which some commonly understood and observed facts were studied at some expense. It would have been interesting if the subjects had been asked to communicate ideas or sentences using gestures rather than just objects.
Simultaneous Gesture Segmentation and Recognition based on Forward Spotting Accumulative HMMs
This paper presents a method based on a forward spotting scheme that performs gesture segmentation and recognition together, using a sliding window and moving-average HMMs. The sliding window computes the observation probability of gesture or non-gesture from a number of consecutive observations within the window, and thus captures the dynamics of the gesture without requiring an abrupt start or end. Using the forward procedure, the averaged conditional probability is calculated for each observation window, and the HMMs are trained using these probability measures and transitions. After the required number of gesture models have been trained, a non-gesture model is also trained, which accounts for everything that is not classified as a gesture. This model is trained in the same way as the gesture models but with non-gesture data as input.
To segment the gesture, the authors argue that since a gesture consists of a number of postures, we can segment it by dividing it into postures and then looking at the cumulative probability over the window. As per the authors, for a sequence of gestural postures the cumulative gesture probability will be higher than the non-gesture probability, and for a sequence of postures from a non-gesture it will be lower. Since the sliding windows are small, we can locate the point at which the gesture probability drops below the non-gesture probability and use it for segmentation.
The complete system has a hierarchical structure in which all the gesture HMMs are grouped in one class and there is a separate class for the non-gesture HMM. An input posture sequence is fed to both the gesture class and the non-gesture class. The complete structure is shown in the figure below.
It is similar to a standard HMM setup except that a non-gesture HMM is trained as well, which gives a higher probability for non-gestures, whereas each trained gesture HMM gives a higher probability for its respective gesture.
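The spotting idea can be sketched in a few lines, under heavy simplifying assumptions: postures are already quantized into discrete symbols, the HMM parameters are hand-set toy values rather than trained ones, and the label of a segment is chosen by a simple vote over the windows inside it. None of the names or numbers below come from the paper; this is only meant to show how comparing gesture and non-gesture likelihoods over a sliding window yields start and end points.

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence (scaled forward procedure)."""
    loglik = 0.0
    alpha = pi * B[:, obs[0]]
    for t in range(len(obs)):
        if t > 0:
            alpha = (alpha @ A) * B[:, obs[t]]
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha = alpha / scale
    return loglik

def spot_gestures(obs, gesture_models, non_gesture_model, window=3):
    """Return (start, end_exclusive, label) for every spotted gesture segment."""
    segments, start, votes = [], None, {}
    for t in range(len(obs)):
        win = obs[max(0, t - window + 1): t + 1]
        scores = {name: forward_loglik(win, *m) for name, m in gesture_models.items()}
        best = max(scores, key=scores.get)
        # Best gesture model versus the non-gesture model on the same window.
        diff = scores[best] - forward_loglik(win, *non_gesture_model)
        if diff > 0:
            if start is None:            # the difference turned positive: start point
                start, votes = t, {}
            votes[best] = votes.get(best, 0) + 1
        elif start is not None:          # the difference turned negative: end point
            segments.append((start, t, max(votes, key=votes.get)))
            start = None
    if start is not None:                # stream ended in the middle of a gesture
        segments.append((start, len(obs), max(votes, key=votes.get)))
    return segments

# Toy models over 5 posture symbols: 0 = rest, 1-2 used by "wave", 3-4 by "circle".
pi2 = np.array([0.5, 0.5])
A2 = np.array([[0.7, 0.3], [0.1, 0.9]])
gestures = {
    "wave": (pi2, A2, np.array([[0.05, 0.80, 0.05, 0.05, 0.05],
                                [0.05, 0.05, 0.80, 0.05, 0.05]])),
    "circle": (pi2, A2, np.array([[0.05, 0.05, 0.05, 0.80, 0.05],
                                  [0.05, 0.05, 0.05, 0.05, 0.80]])),
}
non_gesture = (np.array([1.0]), np.array([[1.0]]),
               np.array([[0.80, 0.05, 0.05, 0.05, 0.05]]))

stream = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 0, 0, 0]
print(spot_gestures(stream, gestures, non_gesture))
```

On this toy stream the sketch reports a single "wave" segment. The detected boundaries lag the true ones by about one frame because each decision is based on a window of past observations.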
Discussion:
I liked the approach of dividing gestures into postures and then using the sequence of postures over small windows to determine the segmentation point. However, I was disappointed by their simple observation sequences, as each of their gestures was far more distinct and simple than what we are aiming at. Computational time is also questionable: this does not look like a real-time approach to me if we have to look at fine postures to frame a gesture. For very distinct gestures like theirs, though, I believe low-resolution images could be used, which may work close to real time. I also believe the system is very user dependent, as non-gestural movements vary from person to person, and some non-gestural movements of one person may correspond to gestural movements of another (it is not uncommon for people to do even stretching in different ways).
A Survey of POMDP Applications
The POMDP (Partially Observable Markov Decision Process) model takes into account the uncertainty in a decision process and extends the already prevalent MDP (Markov Decision Process). The POMDP model consists of the following:
- a finite set of states, S
- a finite set of actions, A
- a finite set of observations, Z
- a state transition function, τ
- an observation function, o
- an immediate reward function, r
The states represent all the possible underlying states of the process, which may not even be directly visible. The state transition function accounts for uncertainty by assigning a probability to each transition. The action set contains all the control choices available at a particular time, whereas the observation set is the set of all possible observations available in a particular state. The reward function gives the immediate utility of performing an action in each of the underlying process states.
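Since the underlying state is not directly visible, a POMDP controller usually works with a belief, i.e. a probability distribution over states that is revised after every action and observation. The following is a minimal sketch of that standard belief update, using a made-up two-state machine maintenance example (one of the application areas named in the survey); the array layout, function names and all probabilities are assumptions for illustration only.

```python
import numpy as np

def belief_update(belief, action, observation, T, O):
    """Bayes update of the belief after taking `action` and seeing `observation`.

    T[action][s, s2] = P(s2 | s, action)        (state transition function)
    O[action][s2, z] = P(z | s2, action)        (observation function)
    """
    predicted = belief @ T[action]                       # push belief through the transition
    unnormalized = predicted * O[action][:, observation]  # weight by observation likelihood
    total = unnormalized.sum()
    if total == 0:
        raise ValueError("observation has zero probability under this belief")
    return unnormalized / total

def expected_reward(belief, action, R):
    """Immediate reward averaged over the belief; R[s, action] is the reward function."""
    return float(belief @ R[:, action])

# Hypothetical machine: states 0 = good, 1 = faulty; single action 0 = operate.
T = {0: np.array([[0.9, 0.1],     # a good machine may become faulty
                  [0.0, 1.0]])}   # a faulty machine stays faulty
O = {0: np.array([[0.9, 0.1],     # good machine: mostly OK parts (z = 0)
                  [0.3, 0.7]])}   # faulty machine: mostly defective parts (z = 1)
R = np.array([[1.0], [-1.0]])     # operating pays off only if the machine is good

b0 = np.array([0.5, 0.5])
b1 = belief_update(b0, action=0, observation=1, T=T, O=O)   # saw a defective part
print(b1, expected_reward(b1, 0, R))
```

After one defective part the belief shifts strongly toward the faulty state, and the expected reward of continuing to operate turns negative, which is the kind of trade-off a control policy has to weigh.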
The main objective of the model is to derive a control policy that uses the minimum possible information to yield the optimal decision. This is important because in many cases complete information about the process is either not available or very expensive to obtain. Thus, this approach minimizes the cost associated with the decision and also its computational complexity. The author demonstrates this objective with examples of POMDP applications in various domains: structural inspection, elevator control policies, the fishery industry, autonomous robots, network troubleshooting, behavioral ecology, machine vision, distributed database queries, marketing, questionnaire design, machine maintenance, weapon allocation, corporate policy, moving target search, search and rescue, target identification, education and medical diagnosis.
Since the model depends heavily on detailed partial information about the process, the author warns that it may not be very useful if that partial information is not available. In other words, the model requires us to specify every possible observation and the immediate reward for each state, action and observation. This information may not always be available for every application domain. The other important problem highlighted by the author is the user interface issue, which deals with how all this information can be provided to a system. Apart from this, there are also issues like computational complexity.
Discussion:
This paper provided a clear motivating case for the use of POMDPs in a variety of domains, some of which seemed interesting. I was impressed by the way the author has highlighted all the information available from a given process and framed it in the form of requirements for the POMDP. As far as implementation details are concerned, there was almost nothing in the paper, which was a bit disappointing, but keeping in mind the many non-engineers dealing with uncertainty issues (like ecologists and fishery people), this is a nice overview. I believe POMDPs have good applications in many search systems and limited-information navigation systems, as the system is capable of deciding based on the available information. Keeping in mind our goals, I believe the approach could be useful if we have a limited posture sequence forming a distinct gesture and the library of such gestures is small. Otherwise, it is computationally very expensive and we may not be able to provide all the information required to model a decision.
Wednesday, February 6, 2008
A Similarity Measure for Motion Stream Segmentation and Recognition
This paper presents an SVD (Singular Value Decomposition) approach for segmenting out similar gestures, based on a similarity measure introduced in the paper.
Any motion capture device records sensory information sampled over time. The readings from the respective sensors occupy the columns of the data matrix, whereas the time samples occupy the rows. Even for nearly identical gestures it is very difficult to obtain the same number of time samples, because similar motions still vary. As a result, the matrices for the same motion usually do not have the same dimensions. Also, given a large gesture data matrix, it is not obvious how to arrive at a measure that captures the structure of the data and can be exploited to distinguish one gesture from others.
This paper presents an approach based on principal component analysis, which has been widely used in the machine learning and pattern recognition domains to capture the components of maximum variance in the data. These components of variation are popularly known as the principal components (PCs), and by projecting the data using them as a basis, we can obtain the discriminatory structure of the data. Usually only the first few PCs are enough for discrimination purposes. Ideally, the principal components of similar gestures should be nearly parallel, but because of the variations they are not. It is argued that for similar gestures, corresponding principal components (eigenvectors) should contribute equally to the parallelness. The authors also suggest that since the eigenvalues are related to the variances along the principal components (eigenvectors), they can be used to give different weights to the different eigenvector pairs.
Based on their observations, they framed a similarity measure called k-Weighted Angular Similarity, or k-WAS, which captures the similarity between two data matrices based on a kind of average of the normalized eigenvalues of the two matrices and the contribution of each eigenvector pair to the parallelness. Numerically, it is given by
Ψ(Q, P) = (1/2) Σ_{i=1..k} [ (σ_i / Σ_{j=1..n} σ_j + λ_i / Σ_{j=1..n} λ_j) |u_i · v_i| ]
where σ_i and λ_i are the i-th eigenvalues obtained from the two data matrices Q and P, u_i and v_i are the corresponding eigenvectors, and n is the number of sensors (columns).
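A measure of this kind can be sketched with NumPy as follows. This is my own minimal reading of the description, not the authors' code: it assumes rows are time samples and columns are sensors, takes the eigenvectors of the (uncentered) matrix MᵀM for each motion, and weights the absolute cosine between the i-th eigenvector pair by the average of the two normalized eigenvalues. The function names and the synthetic data are invented for illustration.

```python
import numpy as np

def _eig_desc(M):
    """Eigenvalues and eigenvectors of M^T M, sorted by decreasing eigenvalue."""
    vals, vecs = np.linalg.eigh(M.T @ M)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

def kwas(Q, P, k):
    """Weighted angular similarity of two motion matrices using k eigenvector pairs.

    Q and P may have different numbers of rows (time samples) but must have
    the same number of columns (sensors).
    """
    sig, U = _eig_desc(np.asarray(Q, dtype=float))
    lam, V = _eig_desc(np.asarray(P, dtype=float))
    weights = sig[:k] / sig.sum() + lam[:k] / lam.sum()
    cosines = np.abs(np.sum(U[:, :k] * V[:, :k], axis=0))   # |u_i . v_i| per pair
    return 0.5 * float(np.sum(weights * cosines))

# Synthetic 3-sensor motions: Q and P are the same movement sampled at different
# rates (50 vs 80 rows); R moves mainly along a different direction.
t50 = np.linspace(0, 2 * np.pi, 50)
t80 = np.linspace(0, 2 * np.pi, 80)
Q = np.column_stack([np.sin(t50), 2 * np.sin(t50), 0.1 * np.cos(t50)])
P = np.column_stack([np.sin(t80), 2 * np.sin(t80), 0.1 * np.cos(t80)])
R = np.column_stack([0.1 * np.cos(t50), np.sin(t50), -2 * np.sin(t50)])

print(kwas(Q, P, k=1), kwas(Q, R, k=1))
```

Because only the sensor-by-sensor matrices MᵀM are compared, the two motions do not need the same number of time samples, which addresses the dimension mismatch noted earlier. The same property is why the temporal order of the samples plays no role in the score.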
To evaluate their similarity measure, the authors tried it on data from two capture methods: the CyberGlove and a video-based capture system called VICON. With the CyberGlove, just 3 eigenvector pairs (out of 22 possible) were sufficient for accuracy, whereas for the motion capture system 5 out of a possible 54 were sufficient.
For comparison, the authors presented their results alongside existing measures such as MAS and Eros, both of which were outperformed by k-WAS.
Discussion:
I liked the paper as it was pretty straightforward with good results. I liked the fact that they were able to capture the variance using a weighted PCA approach, which is a popular and simple approach to data analysis. Their weighted PCA was similar to many existing works in other domains, though it differed from many works in the field of motion recognition, which normally rely on complex HMMs and a lot of training. Apart from this, their system is unsupervised, which makes it even more attractive. However, since they work with just the sensory information, they have removed the temporal nature of the gestures from consideration. Hence it is quite possible that two gestures which produce the same sensor values at different times would be considered similar. I was impressed that they identified this problem with their approach and suggested a dynamic time warping approach, though I believe they have not implemented it.
As far as my knowledge of PCA is concerned, I believe it is very prone to noise, and noisy data may cause great variation between similar-looking gestures, affecting a similarity measure based on PCA. Also, because the variance information is exploited, if the gestures don't vary a lot (which means the information is not contained in the variance), PCA may not be helpful. I believe using PCA together with some temporal approach would be a nice contribution.
Cyber Composer: Hand Gesture-Driven Intelligent Music Composition and Generation
This paper introduces a novel way of generating music from the hand motions and gestures of the user when real instruments are not available. To stay in sync with existing musical theories, the system has these theories embedded in it and is controlled by the movements of motion-sensing gloves. The paper provides a nice overview of the musical theories and of how they have been incorporated into the system.
The system consists of a pair of CyberGloves with a 3D Polhemus sensor attached to each glove to track its 3D position, a background music generation module, a melody generation module and the main program. Keeping in mind the music standard in industry, authors have used
The sensory information from the gloves is sent to the Cyber Composer main program. The main program transforms the signals into motion triggers, which are used as inputs to the various modules of the system. These inputs trigger the corresponding events in the modules, which are routed to the Musical interface connected to the standard
The authors have tried to keep the hand movements very intuitive, to suit a layperson as well as an expert musician. To make the system smooth to use, musical components were mapped to particular motions keeping in mind usage, frequency and flexibility. For example:
- Rhythm is mapped to wrist motion in the right or left direction.
- Pitch is mapped to the height of the right hand above the ground, relative to the last note.
- Dynamics of the melody note are controlled by the flexion of the right hand fingers.
- Volume is controlled by the extent of extension of the right hand fingers.
- Cadence (end of the music) is controlled by bending the left hand fingers completely.
To add another instrument, the left hand is lifted higher than the right hand, which enables the dual instrument mode, and the second instrument starts playing in harmony with the first.
Discussion:
The paper started with many expectations but later turned out to be a bit of a disappointment. There is no mention of implementation details. As per their description, the signals from the motions are used to trigger events, and any motion of the part of the hand corresponding to a musical component will trigger an event. For example, even if the wrist is moved a bit unintentionally, an event will be triggered. There is no particular gesture associated with each component, just motion of the parts, which is not a good approach. There is no training involved, so I don't understand how the motions of different users will correspond to the same rhythm and harmony. The gesture for pitch increase is also confusing: I am not able to understand how the musician will keep track of the relative position, and even if he does by some means, how he can control it in dual mode when both hands are involved. Besides, I don't understand whether this can be useful to musicians, as they often use many instruments to compose music and I don't see any way to add more than two instruments. Also, during some compositions side music is added which may not be in exact harmony with the major notes, and I don't see any way to incorporate that effect. There are many flaws in the approach which may end up in another paper, so I am stopping here. I believe the approach could be refined by using machine learning models and some easier, better gestures which are not conflicting and confusing.