Wednesday, April 23, 2008
Toward Natural Gesture/Speech HCI: A Case Study of Weather Narration
This paper presents a gesture recognition system that captures the intent behind a gesture rather than its exact form. For example, a user who wants to point back can do so in whatever way feels intuitive. The system uses an HMM trained on global features of the gesture, taking the head as the reference (as shown in the diagram below). Color segmentation extracts the skin regions, which are tracked with a predictive Kalman filter. The authors also suggest that in certain domains particular spoken words accompany particular gestures, and that this can serve as an additional feature to remove ambiguity in recognition. In their weather-narration domain, they observed that 85% of the time a meaningful gesture is accompanied by a related spoken word. Using this knowledge, correctness improved from 63% to 92% and accuracy from 62% to 75%.
Discussion:
I liked that they have pointed out the correlation between words and the accompanying gestures. I think a system that uses context (derived from words) would be a state-of-the-art development in the field of gesture recognition. Their results (though I am not convinced by their approach) demonstrate that context can improve accuracy. However, I am not happy with their overly simplistic approach: recognizing with angles in a 2D plane can introduce a lot of ambiguity, given the 3D nature of the problem. For the simple weather example this may work to some extent, but the results suggest that not much can be expected from such a system in more complex situations.
3D Object Modeling Using Spatial and Pictographic Gestures
The system consists of two CyberGloves, a Polhemus tracker and a large 200-inch projection screen on which the objects are displayed. Wearing special LCD shutter glasses, the user can see the objects and manipulate them using his hands and pre-specified gestures.
Discussion:
I am not sure how easy or difficult it is to manipulate objects in augmented reality. I believe some haptic feedback would have made it more natural. However, I liked the paper because it was different and had a very practical application. Such a system could also be used by a multi-user design team to create designs in the virtual world by interacting with an object together.
Device Independence and Extensibility in Gesture Recognition
They tested their system on ASL (excluding the letters J and Z) and reported accuracy ranging from the high 50s to the mid 60s (percent) as they varied which sensors of the 22-sensor CyberGlove were available. They also reported an increase in accuracy when a bigram context model was added to the recognition.
Discussion:
There is nothing great in the paper. Using predicates is an easy way, but there is always a scalability issue when users change. This method will not be suitable for a multi-user environment. Even a single user would not be able to get higher accuracy out of this system because of the intrinsic variability in human gestures. That is why the accuracy is so low. Also, using fewer sensors from the same glove does not make it a different glove. As I understand it, a different glove means altogether different sensors and architecture. I do not agree with the claim of device independence as they have presented it in the paper.
Wednesday, April 16, 2008
Discourse Topic and Gestural Form
In this paper the authors try to find out to what extent gestures depend on the topic (i.e. are speaker independent) and to what extent they depend on the speaker. They present a Bayesian framework that uses an unsupervised technique to quantify this, with a vision-based approach for capturing the gestures.
Their approach uses visual features that describe motion based on so-called spatiotemporal interest points, which are high-contrast image regions such as corners and edges that undergo complex motion. From the detected points, visual, spatial and kinematic characteristics are extracted to form a large feature vector, to which PCA is applied to reduce the dimensionality. The reduced feature vectors are used to fit a mixture model, from which a codebook is obtained. The dataset consists of 33 short videos (3 minutes each) of dialogues involving 15 speakers, each describing one of five predetermined topics. The speakers were aged 18 to 32 and were native English speakers.
In the experiment each speaker talked to another speaker; they were not asked to make gestures. The scenarios involve describing “Tom and Jerry” and mechanical devices (piston, candy, pinball machine and a toy). Of the recorded gestures, 12% were classified as topic-specific with correct topic labels; with corrupted labels this value dropped to 3%. This indicates that there is a connection between discourse topic and gestural form that is independent of the speaker.
Discussion:
With correct labels, 12% of the gestures are classified as topic-specific; with corrupted labels, 3% are. And to convey this message, instead of simply having a human judge take notes, they used complex machine learning with a vision-based approach, which they admit must be somewhat corrupted by computer vision errors. I am not very impressed with the paper.
Monday, April 14, 2008
Glove-TalkII - A Neural-Network Interface which Maps Gestures to Parallel Formant Speech Synthesizer Controls
Three networks are used: a Vowel/Consonant network, a Consonant network and a Vowel network. The Vowel/Consonant network decides whether to emit a vowel or a consonant sound based on the configuration of the user's hand. The authors claim the network can interpolate between hand configurations to produce smooth but rapid transitions between vowels and consonants. It is a 10-5-1 feed-forward network with sigmoid activation functions, trained on 2600 examples of consonants and 700 examples of vowels. Training data was collected from an expert user; the vowel data was collected from the same user by having him move his hand up and down. The test set consists of 1614 examples, and the MSE was reduced to 10^-4 after training.
The vowel network is a 2-11-8 feed-forward network whose hidden units are RBFs, each centered to respond to one of the cardinal vowels. The outputs are 8 sigmoid units representing 8 synthesizer control parameters. The network was trained on 1100 examples with some noise added. It was tested on 550 examples and gave an MSE of 0.0016 after training.
The consonant network is a 10-14-9 network whose 14 hidden units are normalized RBF units. Each RBF unit is centered at a hand configuration determined from the training data. The 9 output units drive the nine control parameters of the formant synthesizer. The network is trained on scaled data: 350 approximants, 1510 fricatives and 700 nasals. The test data consists of 255 approximants, 960 fricatives and 165 nasals, giving an MSE of 0.005.
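To make the idea of a normalized RBF hidden layer concrete, here is a minimal sketch. It assumes Gaussian basis functions with a single shared width, which the summary above does not specify; the function name `normalized_rbf` and the `width` parameter are made up for illustration and are not taken from the paper.

```python
import math

def normalized_rbf(x, centers, width=1.0):
    """Activations of a normalized RBF layer (assumed Gaussian units).

    Each unit responds most strongly when the input x is close to its
    center; dividing by the sum makes the activations add up to 1.
    """
    sq_dists = [sum((a - c) ** 2 for a, c in zip(x, ctr)) for ctr in centers]
    nearest = min(sq_dists)  # subtracting this avoids underflow to all zeros
    acts = [math.exp(-(d - nearest) / (2.0 * width ** 2)) for d in sq_dists]
    total = sum(acts)
    return [a / total for a in acts]

# Two hypothetical hand-configuration centers in a 2-D input space.
centers = [(0.0, 0.0), (1.0, 1.0)]
activations = normalized_rbf((0.1, 0.0), centers)
```

Because of the normalization, an input lying between two centers produces a blend of both units, which is one way to picture the smooth interpolation between hand configurations that the authors describe.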
They tested their system with an expert pianist; after 100 hours of training, his speech with Glove-TalkII was found to be intelligible, though still with errors in polysyllabic words.
Discussion:
I just wonder how difficult or easy it is to control your speech with a foot pedal, a data glove and a Polhemus tracker, all of which need to move in synchronization for correct pronunciation. Also, it took 100 hours of training for the results to become intelligible, and I doubt this system is comfortable at all. But for a paper 10 years old, it is a decent first step and a different approach.
Sunday, April 13, 2008
Feature selection for grasp recognition from optical markers
In their experiment they have placed markers on the back of the hand and used the local coordinate system which is invariant to the pose.
To classify the grasps they use a linear logistic regression classifier, which also drives subset selection in a supervised manner. They use three markers at a time as a single feature vector and then use cross validation with subset selection over the number of features to arrive at the best feature set. They tried both backward and forward subset selection and observed that the error of the two approaches differs by just 0.5%.
In their experiment, they have used a 90 dimensional vector for a hand pose representing the 30 markers on the hand. Their domain for the experiment is the daily functional grasps shown in the figure below.
They collected their data using 46 objects, which were grasped in multiple ways. The objects were divided into two sets, A and B: A contains 38 objects with 88 object-grasp pairs, and B contains 8 objects with 19 object-grasp pairs. They collected data from 3 subjects for set B and 2 subjects for set A. Using 2-fold cross validation, they obtained an accuracy of 91.5% with the full 30-marker data and 86% with the 5 markers chosen by subset selection.
They also evaluated their feature subset on classifiers trained on different data, using both the 5 and 30 marker sets. They trained 4 classifiers in total: two on the data of subject 1 and subject 2 respectively from object set A, a third on the combined data of subjects 1 and 2 from object set A, and a fourth on the combined set A+B from subjects 1 and 2.
They observed that the accuracy was sensitive to whether data from the test subject was used for training. With that subject's data included, accuracy was 80-93% in the reduced space (5 markers) and 92-97% with 30 markers. For a totally new user, tested on a classifier trained on a single user's data, accuracy was an abysmal 21-65% in the reduced marker space. However, the retention of accuracy was over 100% in all cases.
In their analysis they also observed that, for all three subjects, cylindrical and pinch grasps were recognized well, whereas spherical and lateral tripod grasps performed poorly because of the similarity between three-finger precision grasps.
Discussion:
This paper has nothing new except the complex linear logistic regression classifier. Their analysis is also based on a small set of users and hence cannot be generalized to most cases. I think with more users with different hand sizes it would have been a better paper. Also, I don’t understand why so many papers make a point of claiming that accuracy increases when samples from the user are included in the training set. I think it is a very simple and easy-to-digest fact that needs no explanation. Also, it would have been nice if they had mapped the relation between the grasping patterns of the users, which might have been used to build a more generalized set of features for a given set of users sharing similar patterns. The work is very similar to our paper on sketch.
RFID-enabled Target Tracking and Following with a Mobile Robot Using Direction Finding Antennas
Summary:
This paper presents a novel application of RFID for robot target tracking and following. The authors use cheap though effective RFID technology for the purpose. The antennas receive the signal from the RFID transponder, which induces a voltage that depends on the angle of orientation of the antenna with respect to the transponder (the sine of the angle). This variation in voltage can be used to find the direction of the transponder. To find the direction correctly, they use dual antennas with a 90 degree phase difference. The ratio between the two voltages can be calibrated to the angle of the transponder, since v1/v2 = tan(theta), where theta is the angle between the transponder direction and the line of sight of the transponder with respect to the vertex of the angle formed by the two antennas.
The set of antennas is connected to an actuator, controlled by a micro controller, which is programmed to rotate the antennas such that the ratio (v1/v2) remains 1. This ensures that the transponder is always in the direction given by the angle bisector of the two antennas. The change in the direction triggered by the actuator is mapped to the actual direction of motion of the robot. Thus the robot can be made to track either another robot or reach some static destination.
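The relation v1/v2 = tan(theta) and the "keep the ratio at 1" control rule can be sketched in a few lines. This is only an illustration under assumptions: it treats v1 and v2 as non-negative amplitudes proportional to sin(theta) and cos(theta), and the function names are hypothetical, not from the paper.

```python
import math

def bearing_from_voltages(v1, v2):
    """Angle theta (radians) satisfying v1 / v2 = tan(theta).

    atan2 is used instead of atan(v1 / v2) so that v2 == 0 is handled.
    """
    return math.atan2(v1, v2)

def steering_error(v1, v2):
    """Offset of the transponder from the bisector of the two antennas.

    The ratio v1 / v2 equals 1 at theta = 45 degrees, so the error is the
    difference from pi / 4. The actuator would rotate the antenna pair to
    drive this value towards zero (sign convention is an assumption).
    """
    return bearing_from_voltages(v1, v2) - math.pi / 4

# Transponder at 60 degrees: the pair must rotate by about 15 degrees.
error_deg = math.degrees(steering_error(math.sin(math.radians(60)),
                                        math.cos(math.radians(60))))
```

In a real system the mapping from voltage ratio to angle would come from the calibration the authors mention rather than from an ideal tangent curve.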
Discussion:
As everyone discussed, this paper is more relevant to robotic navigation than to RFID. Still, it gives a good cue about using voltage ratios for direction tracking. Maybe we can combine this information with data gloves to get 3D information about the hand position. I think if we used a similar setup (orthogonal antennas) as a receiver and some RFID transmitters (tags) on the gloves, the ratio could be used to track the hand in 3D space with respect to the vertex of the receiver. Overall I liked this simple and straightforward approach.
Gesture Recognition Using an Acceleration Sensor and Its Application to Musical Performance Control
Using the temporal information in the time-series acceleration data projected onto the planes, 11 parameters (1 intensity component, 1 rotational component, 1 main motion component and 7 direction distribution components, computed by measuring density) are computed for each plane, giving 33 parameters for a gesture. These 33 parameters are then used to recognize the gesture.
For recognition, gesture samples are collected from the user to form standard patterns. For an input gesture, the difference between its acceleration parameters and the mean of each standard pattern is computed and normalized by the standard deviation, which gives a weighted error. The gesture is recognized as the standard pattern with the minimum weighted error.
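The minimum weighted error rule can be sketched as follows. The summary does not say how the per-parameter differences are combined, so summing the squared normalized differences is an assumption here, and all function names are made up for illustration.

```python
from statistics import mean, pstdev

def standard_pattern(samples):
    """Per-parameter mean and standard deviation over training samples."""
    columns = list(zip(*samples))
    return [mean(c) for c in columns], [pstdev(c) for c in columns]

def weighted_error(features, pattern_mean, pattern_std, eps=1e-9):
    """Sum of squared differences, each normalized by the standard deviation.

    eps guards against a zero standard deviation (an added assumption).
    """
    return sum(((f - m) / (s + eps)) ** 2
               for f, m, s in zip(features, pattern_mean, pattern_std))

def recognize(features, patterns):
    """Name of the standard pattern with the minimum weighted error."""
    return min(patterns,
               key=lambda name: weighted_error(features, *patterns[name]))

# Toy example with 2 parameters per gesture instead of the paper's 33.
patterns = {
    "down": standard_pattern([[1.0, 0.0], [0.8, 0.2]]),
    "up": standard_pattern([[0.0, 1.0], [0.2, 0.8]]),
}
label = recognize([0.85, 0.1], patterns)
```

Normalizing by the standard deviation means parameters that vary a lot between repetitions of the same gesture count for less than parameters the user reproduces consistently.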
They tested their approach by generating music from the gestures of a conductor. They also used dynamic linear prediction to predict the tempo from the previous tempo. According to them, this gives real-time results compared with image-processing-based approaches.
They claim 100% accuracy with the same user, with performance declining for a different user.
Discussion:
This paper presents a simple though interesting use of acceleration data for gesture recognition. However, I think a given user can perform the same gesture quite differently (in terms of acceleration data) depending on energy and enthusiasm, so their approach may fail even for the same user the system was trained on. I guess they use one threshold to determine the start of the gesture and another to mark the end. I believe the start will show an abrupt acceleration, which then becomes roughly constant for the rest of the gesture; a change of gesture will produce an abrupt change in acceleration (direction and magnitude), followed by a decrease in acceleration leading to the final stop. These changing values of acceleration can be used for segmenting the gesture.
I am not sure if they used a series of gestures or just one gesture at a time in their experiment. Also, it would be interesting to use (normalized) data from multiple users for the training set and then check the accuracy across multiple users.
Thursday, April 3, 2008
Activity Recognition using Visual Tracking and RFID
In this paper, the authors combine computer vision and RFID information to capture the activity of a subject. Using standard methods based on skin-color segmentation, they obtain the skin regions (hands and face). Using the area of the bounding box, they recognize the hands and track them.
Every object carries an RFID tag, which is read through an RFID reader. The reader has an antenna that emits energy, which the RFID tag uses to charge itself. The antenna is periodically switched off, during which the capacitor in the tag loses charge and communicates the ID of the tag back to the reader by modulating the energy. The authors modified this mechanism so that the reader has the additional capability of capturing the voltage values emitted by the RFID tags.
Since the tag receives maximum signal when it is normal to the antenna field, the variation in the energy received and emitted can be used to capture orientation information about the object being manipulated by the user.
In a nutshell:
- The RFID tag is used to capture whether an object is being manipulated by the user and which object it is.
- Computer vision is used to obtain the position of the hands. If they are close to the object being manipulated (measured through RFID), the user is said to be approaching the object.
- Tracking the hands, together with the orientation information from the RFID tags, is used to recognize the activity. As an example, they show the activity of a subject in a retail store environment.
Discussion:
This is a really nice and novel method of recognizing activity, and maybe it can also be helpful in Dr Hammond's favorite "workshop saw" activity. But my concerns are:
1. Whether RFID can provide a good estimate of the distance.
2. Whether the tags are unaffected by the presence of other magnetic devices in the proximity.
Overall it is a nice approach and, as discussed in class, a nice and cheap direction to look at for future research.
Gestures without Libraries, Toolkits or Training: A $1 Recognizer for User Interface Prototypes
The authors present a simple yet robust method of recognizing single-stroke gestures (sketch gestures) without the use of any complex machine learning algorithm. The idea behind the complete $1 recognizer is simple:
- Capture different ways of drawing a shape.
- Re-sample the data points so that the overall shape is preserved.
- Rotate the sampled points by an indicative angle, which they define as the angle between the gesture's first point and the centroid. According to them, this indicative-angle rotation helps in finding the best match.
- Then compare the sampled points against the available templates in the database.
- The closest matching template is the recognized result.
Since the sampled points depend on the drawing style of the user, two similar-looking shapes may match different templates because of variability in drawing style. This is addressed by providing the database with samples of all possible drawing styles.
They compared their algorithm with the Rubine algorithm and reported better accuracy than Rubine. With 1 template per gesture they obtained an accuracy of 97%, which improved to 99% with 3 templates per gesture.
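The steps listed above can be sketched roughly as below. This is a simplified illustration, not the authors' code: it only resamples, moves the centroid to the origin, rotates by the indicative angle and compares point by point. It leaves out any further normalization or search for the best matching angle, and the function names and the choice of 32 points are assumptions.

```python
import math

def resample(points, n=32):
    """Resample a stroke to n points spaced equally along its path."""
    pts = list(points)
    total = sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))
    interval = total / (n - 1)
    out, acc, i = [pts[0]], 0.0, 1
    while i < len(pts):
        d = math.dist(pts[i - 1], pts[i])
        if d > 0 and acc + d >= interval:
            t = (interval - acc) / d
            q = (pts[i - 1][0] + t * (pts[i][0] - pts[i - 1][0]),
                 pts[i - 1][1] + t * (pts[i][1] - pts[i - 1][1]))
            out.append(q)
            pts.insert(i, q)  # q starts the next segment
            acc = 0.0
        else:
            acc += d
        i += 1
    while len(out) < n:  # rounding can leave the last point out
        out.append(pts[-1])
    return out[:n]

def normalize(points, n=32):
    """Resample, centre on the centroid, rotate by the indicative angle."""
    pts = resample(points, n)
    cx = sum(p[0] for p in pts) / n
    cy = sum(p[1] for p in pts) / n
    angle = math.atan2(cy - pts[0][1], cx - pts[0][0])  # first point -> centroid
    c, s = math.cos(-angle), math.sin(-angle)
    return [((x - cx) * c - (y - cy) * s, (x - cx) * s + (y - cy) * c)
            for x, y in pts]

def recognize(stroke, templates, n=32):
    """Name of the template with the smallest average point-to-point distance."""
    cand = normalize(stroke, n)
    return min(templates, key=lambda name: sum(
        math.dist(a, b) for a, b in zip(cand, templates[name])) / n)

templates = {
    "line": normalize([(0, 0), (10, 0)]),
    "vee": normalize([(0, 0), (5, 10), (10, 0)]),
}
result = recognize([(2, 3), (2, 13)], templates)  # a vertical line
```

The vertical stroke still matches the horizontal "line" template because the rotation step aligns both before comparison. The loop over all templates in `recognize` is also exactly the full database search on every query that a larger template set would make slow.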
Discussion:
I have the same thoughts as Brandon. Being an instance-based algorithm, it would not yield fast results if the database is too large. When we present a gesture to the system, it does all the pre-processing and searches through the whole data set. Even if we repeat the same gesture, it searches the complete data set again. It does not keep any record of expected gestures, even when a gesture is repeated. In other words, it is an instance-based algorithm whose memory is erased after each search is completed. Maybe with some bookkeeping we can improve the algorithm. Otherwise it is a beautiful and simple algorithm.
Enabling Fast and Effortless Customisation in Accelerometer Based Gesture Interaction
1. Using the sensors to obtain the accelerometer data.
2. Resampling the data and normalizing it to equal length and amplitude. The data is reduced to 40 sample points per gesture.
3. Sending the data for a gesture to the vector quantizer to reduce the dimensionality of the data to 1-D (assigning labels).
4. The vector-quantized data is then used to train the HMM, which is then used for recognition.
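Step 2 of the list above can be illustrated with a small sketch. It assumes plain linear interpolation for the resampling and scaling by the peak absolute value for the amplitude normalization; the paper may do either differently, and the function names are hypothetical.

```python
def resample_linear(samples, n=40):
    """Linearly interpolate a sequence of (x, y, z) samples to n samples."""
    if len(samples) == 1:
        return [tuple(samples[0])] * n
    out = []
    for k in range(n):
        pos = k * (len(samples) - 1) / (n - 1)  # fractional source index
        i = min(int(pos), len(samples) - 2)
        t = pos - i
        out.append(tuple(a + t * (b - a)
                         for a, b in zip(samples[i], samples[i + 1])))
    return out

def normalize_amplitude(samples):
    """Scale all axes by the same factor so the largest magnitude is 1."""
    peak = max(abs(v) for s in samples for v in s) or 1.0
    return [tuple(v / peak for v in s) for s in samples]

# A short made-up gesture recording of three accelerometer samples.
gesture = [(0.0, 0.0, 0.0), (2.0, 0.0, -1.0), (4.0, 1.0, 0.0)]
prepared = normalize_amplitude(resample_linear(gesture, 40))
```

After this step every gesture has the same length and scale, so a fast and a slow performance of the same movement produce comparable sequences for the vector quantizer.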
For training, they add artificial noise to introduce variability and thus enlarge their training data set. They found that Gaussian noise with SNR=3 gave the best accuracy. They collected 30 distinct 3D acceleration vectors from one person and selected 8 gestures for the control. They then added noise to introduce variability and obtain additional data, and used the data for cross validation to obtain the best training/testing set. They got an accuracy of 97.2% with SNR=3.
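A sketch of this kind of noise-based augmentation is given below. How the paper defines SNR is not stated in the summary, so the code assumes it is the ratio of the signal's RMS amplitude to the noise standard deviation; the function names, the number of copies and the seed are all illustrative choices.

```python
import math
import random

def add_gaussian_noise(signal, snr, rng):
    """Return a noisy copy of a 1-D signal.

    Assumption: snr = (RMS of signal) / (standard deviation of noise).
    """
    rms = math.sqrt(sum(v * v for v in signal) / len(signal))
    sigma = rms / snr
    return [v + rng.gauss(0.0, sigma) for v in signal]

def augment(signal, snr=3.0, copies=5, seed=0):
    """Generate several noisy variants of one recorded signal."""
    rng = random.Random(seed)
    return [add_gaussian_noise(signal, snr, rng) for _ in range(copies)]

# One axis of a made-up 40-sample gesture.
axis = [math.sin(2 * math.pi * k / 40) for k in range(40)]
variants = augment(axis, snr=3.0, copies=5)
```

Every variant is still the same recording with jitter added on top, which is why this cannot stand in for the differences between real users.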
Discussion:
This paper was strange for me. If I understand correctly, they use the same data for vector quantization (after adding noise, too) and then test on part of that data. And after using vector quantization, why are they using HMMs? Also, adding noise cannot introduce the variability that different users would introduce, so their accuracy means nothing to me.
Sunday, March 30, 2008
SPIDAR G&G: A Two-Handed Haptic Interface for Bimanual VR Interaction
This paper presents a haptic device called SPIDAR, which is used to interact with a virtual world. The device consists of a ball in the center that is attached to different pulleys through strong nylon threads. The system provides the user with 6 DOF; a 7th, the grasp, is provided by a pressure sensor on the ball.
The interaction with the virtual world happens through the movement of the SPIDAR ball within the restricted space provided by the arrangement (each SPIDAR corresponds to a single object). Moving the ball in a given direction creates tension in some of the strings. This tension drives the pulleys against the resistance provided by the motors. This motion is captured and used as the input for the motion of the associated object. The authors state that such a system can be beneficial in tele-operation, medical operations, molecular simulations, etc.
The system of two SPIDARs was tested on three users. Each user was asked to control a sphere in a virtual world with one hand (one SPIDAR) and use the other hand (the other SPIDAR) to touch a pointer to marks on the sphere. They observed that people preferred SPIDAR G&G (the two-handed version) to SPIDAR-G (the single-handed version) because the two-handed version seemed much more intuitive. They also found that users performed better when provided with haptic feedback.
Discussion:
This device was developed by the authors themselves and nicely combines the mechanics of strings with computer manipulation of the data. Though the device gives good feedback, the movement of the ball is restricted by the strings, which may interfere with each other. Also, we have to apply a balanced force to interact with the system: the system is not fixed and may fall over with too much force, while too little force may not give the desired result.
However, the cost involved in such a system is a limiting factor, and no new work has been reported, which limits my knowledge about the current state of the system. Also, since there is one SPIDAR per object, the interaction is very limited. Maybe with some kind of switch a single SPIDAR could be used to interact with other objects at the press of a button. It would also be interesting if similar objects could be grouped together so that a single SPIDAR could manipulate them in the virtual world.
Since I have personally used the system, I believe it is one of the stand-out applications and very useful in terms of the interaction response provided by the device.
Gesture Recognition with a Wii Controller
- Get a Wii controller and obtain the acceleration values.
- Filter the values and remove some of the redundant values
- Use Vector Quantization (K-means to form clusters)
- Feed to HMM with Bayes classifier
- Get the results (90%)
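The vector quantization step in this pipeline can be sketched with a small k-means implementation. This is an illustration only: the codebook size, iteration count, random initialization and function names are assumptions, and the HMM and Bayes classifier stages are not shown.

```python
import math
import random

def nearest(point, codebook):
    """Index of the codebook vector closest to point."""
    return min(range(len(codebook)),
               key=lambda i: math.dist(point, codebook[i]))

def kmeans_codebook(points, k, iters=20, seed=0):
    """Learn k codebook vectors from acceleration samples with k-means."""
    rng = random.Random(seed)
    codebook = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, codebook)].append(p)
        # Move each code vector to its cluster mean; keep it if the cluster is empty.
        codebook = [
            tuple(sum(axis) / len(cl) for axis in zip(*cl)) if cl else codebook[i]
            for i, cl in enumerate(clusters)
        ]
    return codebook

def quantize(sequence, codebook):
    """Replace each 3-D sample by the index of its nearest code vector."""
    return [nearest(p, codebook) for p in sequence]

samples = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
           (5.0, 5.0, 5.0), (5.1, 5.0, 5.0), (5.0, 5.1, 5.0)]
codebook = kmeans_codebook(samples, k=2)
symbols = quantize(samples, codebook)
```

The resulting list of integer symbols is the kind of discrete observation sequence that a discrete HMM can be trained on.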
Discussion:
Nothing to discuss, as I just see it as an application paper.
Taiwan Sign Language (TSL) Recognition Based on 3D Data and Neural Networks
This paper presents a neural-network-based approach to recognizing 20 static Taiwanese Sign Language gestures. They propose a back-propagation neural network for recognizing the gestures, which are captured with a vision-based capture device called VICON. Using markers on the dorsal surface of the hand, they capture the features for a given gesture. The features are distance measures of the marker positions relative to a reference. The distances are normalized to account for different hand sizes and then used as the feature inputs to the neural network. The network is trained on such data obtained from the users. The authors report that they used data from 10 students, each of whom repeated each of the 20 gestures 15 times, providing 3000 data samples in all. Of these 3000 samples, 212 were reported to have missing values and were not used. Of the remaining 2788, 1350 samples were used for training and 1438 for testing.
Their NN architecture consists of 15 input neurons and 20 output neurons along with 2 hidden layers. With 250x250 neurons in the hidden layers they report an accuracy of 94.5% on the test data, while on the training data it was 98.5% (not important). The 15 input neurons correspond to the 15 features used, and the 20 output neurons give the output probability of each gesture.
Discussion:
This paper was simple and straightforward. The way of obtaining the features was simple, since only the distances between the markers were used and the chosen gestures had no occlusion. Considering the static gestures, I think this is a problem in 2D rather than in 3D, as the 3rd dimension is (or can always be) constant for static gestures. The training and testing data were obtained with much care, so recognition may suffer if input data is taken from outside users without much instruction. There is not much of a take-home message from this paper except their way of obtaining the distance metrics that can be used as features.
Hand Gesture Modelling and Recognition Involving Changing Shapes and Trajectories, Using a Predictive EigenTracker
This paper presents an approach to gesture recognition using vision-based techniques. The technique boasts that, unlike HMMs, it involves no training, hence faster adaptability to new gestures. The only requirement (which is not so nice) is that the gestures should be well distinguishable. The algorithm obtains the affine transforms of the image frames and projects the images into the eigenspace. In the eigenspace, since only the hand is moving while the background is stationary, the first few PCA components capture the maximum variance, i.e. in effect the motion of the hand in each frame.
This method is inspired by a similar method called EigenTracker, but differs from it through the added predictive modality. The predictive nature of the proposed method makes it versatile enough to track hand motion on the fly, without requiring an offline description of the orientation and physical dimensions of the object as the earlier EigenTracker method does. The prediction is obtained by using skin color to segment out the hand and then a particle filter to track it. The position of the hand is obtained from the significant eigenvectors in the eigenspace, as only the hand is in motion while the rest of the background is stationary. Variations in the motion direction are detected when the error between the prediction and the actual track exceeds a certain threshold. The tracked hand, together with the information about changes in the motion track (captured by the error between predicted and actual position), gives the structure of the gesture (assuming linear motion between changes), which can be matched against the available gesture set (which is decided offline).
Discussion:
The nice part of the paper is the use of a particle filter to obtain the segments of the gesture by measuring the error between the predicted and the actual motion, and the use of an affine eigenspace for capturing the hand motion. The proposed method is quite different from most of the papers we have read, though the results are not as impressive. As many in the class suggested, 100% doesn't make sense, as the test data is 80% the same as the training data (64/80). Also, they use the PCA space to capture the maximum variance, which is not robust to noise. I agree that with their simple, steady background and black arms it would work, but in many real-life situations this may not be feasible. I would have been happier with some complex gestures at slightly lower accuracy than with 100% accuracy (which I am not impressed with) on very distinct, simple gestures.
However, I like vision-based techniques as they provide much more freedom and space for gestures than gloves do, and as such I would add this paper to my favorites of the semester.
With the advancement in digital imaging techniques and capture devices, it is possible to replace the background with some other stationary background. So if the captured relative motion of the hand is faster than the change in the background, we should be able to capture the hand motion and blend it with an artificial background. With such an approach, we could tackle the noise associated with a changing background in PCA-based techniques.
Friday, March 21, 2008
Wiizards: 3D Gesture Recognition for Game Play Input
Training involved 7 different players, who were asked to perform a given gesture over 40 times. The HMM reached a maximum accuracy of 93% with 15 states and 90% with 10 states, using test data from the same users as used for training. With a new user there is a drastic drop to 50%.
Discussion:
Not at all impressive, though something different. I am tired of explaining HMMs, but most of the results they presented were quite obvious, as all training-based algorithms improve as more data becomes available. Nothing much to say.
TIKL: Development of a Wearable Vibrotactile Feedback Suit for Improved Human Motor Learning
This kind of feedback, supplementing conventional auditory and visual feedback, would definitely help learners learn better: vibratory feedback at the relevant points tells them about errors in their body kinematics. The system uses a Vicon motion capture system to track motion, and the suit, which needs to be worn during training, contains the vibrotactile actuators that provide the feedback.
Possible applications suggested include sports training, dance and other similar activities. They also conducted a user study with around 40 participants, of whom 20 were provided with the suit and the other 20 were trained without it. The participants with the suit performed better because of the vibratory response: they showed a 27% improvement in accuracy and a 23% faster learning rate than their non-suited counterparts under similar test conditions.
Discussion:
It was a nice paper, well written with good explanations. I liked the approach; however, it would have been nice if feedback could also be provided for the legs and the upper body, as wrong placement of the three parts of the body can cause injuries. Such a system also provides an excellent basis for a remote teaching school, since an instructor may not be physically available and less skilled students can be trained online. I did some similar work in my undergrad, when I used an ANN to teach less skilled drivers real-time steering control for driving.
Spatio temporal extension to isomap nonlinear dimension reduction
The approach involves:
1) Windowing of the input data into temporal blocks, which basically serve as a history of each data point.
2) Computation of a sparse distance matrix D from the local neighborhood using Euclidean distance.
3) Using the distance matrix obtained in the above step to obtain the common temporal neighbors (CTN), which are either local segmented CTN or K-nearest non-trivial neighbors.
4) The above measure is used to reduce the distance between points with common and adjacent temporal relationships.
5) The above metric is then used to obtain the shortest-path distances between all pairs of points using Dijkstra's algorithm.
6) Classical MDS is applied to preserve the spacing.
Steps 1, 3, and 4 are the contribution of the paper. These steps introduce the temporal information into Isomap.
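Steps 1 to 5 above can be sketched roughly as follows. This is only a minimal illustration under my own assumptions, not the authors' implementation: it treats only consecutive windows as temporal neighbors (it skips the full CTN detection), the scaling constant `c_adjacent` and all function names are made up for the example, and the final MDS step is left out.

```python
import heapq
import math


def window(series, size):
    """Step 1: stack each point with its size - 1 predecessors."""
    return [sum((list(p) for p in series[i - size + 1:i + 1]), [])
            for i in range(size - 1, len(series))]


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def st_graph(points, k=2, c_adjacent=10.0):
    """Steps 2-4: sparse k-nearest-neighbor graph in which edges between
    temporally adjacent points are shortened by the (assumed) factor
    c_adjacent."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: euclid(points[i], points[j]))[:k]
        for j in nearest:
            d = euclid(points[i], points[j])
            graph[i][j] = graph[j][i] = d
    for i in range(n - 1):  # consecutive windows are temporal neighbors
        d = euclid(points[i], points[i + 1])
        graph[i][i + 1] = graph[i + 1][i] = d / c_adjacent
    return graph


def shortest_paths(graph, source):
    """Step 5: Dijkstra's algorithm from a single source node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

Running `shortest_paths` once per node gives the full distance matrix that classical MDS (step 6) would then embed.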
This paper presents an approach for applying ST-Isomap to continuous data, where the K-nearest non-trivial neighbor metric is used to find the best-matching neighbor from each individual trajectory, removing any redundancy in the selection of neighbors. Local segmented common temporal neighbors, on the other hand, are used to measure the distance metric for non-continuous data. The segmented common temporal neighborhood approach is based on the logic that a pair of points is spatio-temporally similar if the points are spatially similar and the points they transition to are also spatially similar.
They applied their method to a tele-operated NASA robot grasping wrenches placed at various locations, using kinematic motion data obtained from human subjects. They also gave a comparison with PCA and standard Isomap, and showed some comparison with HMMs.
Discussion:
They have added another important dimension to the data, one that can actually capture motions that repeat in time and may be structurally similar even though they differ temporally, for example a spiral motion. Adding temporal information makes a lot of sense, as many dimensionality reduction techniques may give false results because they cannot recognize repetition in the data as a temporal characteristic and instead treat it as redundancy, which it is not (I mean, not all of the data is redundant).
I can now see that for gestures and motion capture we cannot use PCA if our gesture contains repeated motion; for such cases ST-Isomap is the solution for capturing the embedded motion that characterizes the gesture.
Wednesday, March 19, 2008
Articulated Hand Tracking by PCA-ICA Approach
They represented the hand motions by building a hand model in OpenGL and then using information about the degrees of freedom of the fingers to obtain the various possible combinations in which fingers touch the palm.
They used data gloves to obtain the joint information for the 31 possible combinations. They then used that data to obtain the model parameters for the various combinations generated by the OpenGL model over a time span, and obtained a roughly 2000-dimensional vector for each posture.
With PCA they reduced the dimensionality of the problem and were able to locate the position of maximum variance in the image frame. Then, using an ICA model in which each basis represents the motion of a particular finger, they obtained the hand pose for a given time frame. They used particle filtering to track the hands in accordance with Bayes' theorem.
They employed an edge and silhouette model to match the hand frame against the OpenGL model and then estimated the closest match between the hand image and the OpenGL models. By superimposing the OpenGL model on the hand image, they were able to recognize the posture.
Discussion:
I liked that this paper takes a somewhat different approach, though I don't agree that the fingers are statistically independent for all hand postures. Considering the simplicity of their gestures, it might work. I liked that they used PCA for global hand tracking, but PCA requires the background to be stable, with only the hand moving, in order to track the variance. If there is some change in the background (say the user moved a bit), PCA may give erroneous global results. For a limited region, though, it may be a feasible and very simple approach.
I would like to think more about the feasibility of ICA for intrinsic finger tracking, though at present I believe it is not possible to track fingers with this approach for the kind of complex motions we are aiming at.
The 3D Tractus: A Three-Dimensional Drawing Board
In order to capture the 3D data, they used a simple potentiometer whose resistance varies with the up and down movement of the mechanical structure. This reading is calibrated to obtain the Z value using an analog-to-digital converter and is provided to the PC via a USB connection.
The user interface of the system consists of a 2D drawing pad and a window which displays a view of the object being drawn by the user. In order to provide a depth cue in 2D, the authors tried different color cues but found them confusing. They also tried varying thickness cues but found those non-intuitive as well. Finally they decided to simply tell users that all thin strokes shown are actually below the current level and that they are drawing the top strokes. Similarly, the authors found that perspective views were more intuitive and helpful than orthographic projections, so the window that displays the object being drawn shows a perspective view. Their system also supports deletion so that users can edit their sketches.
They conducted a user study in which they asked art students to get familiar with the system and use it to draw certain sketches. The authors observed that the users liked the system, though every user agreed that it was easier to push the table down than to pull it up. Users also reported that it would have been better if they could tilt the surface in the direction in which they were sketching the object. They also complained about alignment issues, as they found it difficult to match the 3D symmetry of the object being drawn (like the top and bottom of a box).
Discussion:
I think it was a cool idea but a very uncomfortable one, as the user has to push and pull the table, which seems quite unintuitive to me. Also, I would not be surprised if users unintentionally pushed the table while sketching, as some people tend to sketch with a heavy hand.
A Hidden Markov Model Based Sensor Fusion Approach for Recognizing Continuous Human Grasping Sequence
For classification they used the Kamakura grasp taxonomy, which separates grasps into 14 different classes according to their purpose, shape, and contact points. With this taxonomy it is easier to identify the grasps that humans use in everyday life. As per the Kamakura taxonomy, grasps are classified into 4 major categories:
1. Power grasps (5)
2. Intermediate grasps (4)
3. Precision grasps (4)
4. Thumbless grasp (1)
Each grasp is modeled as an HMM and used as a classifier. The data for the HMM is obtained through a CyberGlove and is fused with the data obtained from the tactile sensors for the particular grasp. This is done because the information from the data glove alone may not be sufficient, as the shape of the hand may not change significantly between two grasps. They obtained 16 feature values from the glove and 16 sensor values, along with the single maximum sensor value, to form the feature vector for a particular grasp. Their system takes in both inputs simultaneously and learns to weigh their importance by adjusting parameters during training.
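As a small sketch of the fusion described above (16 glove values, 16 tactile values, and the maximum tactile value, giving 33 features per frame), assuming both inputs are plain lists of numbers sampled at the same instant. The function name is mine, not the paper's.

```python
def fuse_features(glove, tactile):
    """Concatenate glove and tactile readings into one feature vector.

    The last entry is the largest tactile reading, as in the summary above.
    """
    if len(glove) != 16 or len(tactile) != 16:
        raise ValueError("expected 16 glove values and 16 tactile values")
    return list(glove) + list(tactile) + [max(tactile)]
```

One such vector per time step would form the observation sequence fed to each grasp HMM.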
Their model uses 9 states per HMM, and the HMMs are trained offline for each gesture. Along with the 14 HMMs for the grasp classes, a junk class for garbage collection was also trained. They made the simple assumption that each grasp must be followed by a release. This ensures segmentation of the gestures, and the maximum of the tactile sensor values gives a cue about grasp versus non-grasp.
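The grasp/release cue can be illustrated with a simple threshold on the maximum tactile value. This is my own guess at how such a cue could be used; the threshold value and the function name are hypothetical, and the paper may do something more elaborate.

```python
def segment_grasps(max_tactile, threshold):
    """Return (start, end) frame index pairs where the maximum tactile
    reading stays at or above the threshold; end is exclusive."""
    segments = []
    start = None
    for t, value in enumerate(max_tactile):
        if value >= threshold and start is None:
            start = t                      # contact begins: grasp starts
        elif value < threshold and start is not None:
            segments.append((start, t))    # contact lost: release
            start = None
    if start is not None:                  # sequence ended mid-grasp
        segments.append((start, len(max_tactile)))
    return segments
```

For example, `segment_grasps([0, 0, 5, 6, 0, 0, 7, 0], 3)` yields two grasp segments, each followed by a release.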
For modeling the HMMs they used the Georgia Tech HMM toolkit and collected 112 training samples and 112 testing samples from 4 different users. For the single-user model (trained on one user's data and tested on the same user's data) they reported a maximum accuracy of 92.2% and a minimum of 76.8%; for the multi-user system (trained on all users and tested on a given user's data) they reported a maximum accuracy of 92.2% and a minimum of 89.9%. They suggested that with more user data the single-user model may become better than the multi-user model. They also claimed that most of the recognition errors came from a small set of grasps for which the system relied solely on tactile information to distinguish the grasps. They believe that improved sensor technology may improve the results.
Discussion:
This paper presented a new method which utilizes tactile information for gesture recognition. It made a lot of sense to me, as common gestures for day-to-day tasks carry a lot of tactile information which can be exploited to differentiate between two similar-looking gestures. For example, a tight fist and a hollow fist (sometimes used to show an O) may look similar to the glove, but including the tactile information can distinguish between the two.
Also, it would be interesting to see whether vision could be used to add more flexibility to the system, as a similar-looking gesture (based on the tactile sensors and the CyberGlove) at a different position in space may convey a different meaning (though this could be done using the Flock of Birds too). This paper also gave me another approach to the segmentation problem, based on utilizing the tactile information.
Monday, March 3, 2008
Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series
This paper is basically a part of a much more detailed thesis dealing with Australian Sign Language. They use two gloves to capture the data and analyze the recognition rate. One of the gloves is the Nintendo glove and the other is a device called the Flock. The Nintendo is a low-cost glove with cheap sensors, and the Flock is the more complicated, superior device. The data obtained is fed to their classifier, called TClass, which looks like a decision-tree type of classifier, and results were obtained using different parameters of TClass.
Since the data obtained from the Nintendo glove is noisy, they used smoothing to tune their results, which helped to get better accuracy. In the end they used a voting methodology to combine the best learners, similar to AdaBoost, to improve the accuracy and decrease the error.
With the Flock they did the same thing; however, they found that smoothing actually hurts the recognition results, as the sensor data is already quite refined with almost no noise.
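The paper summary does not say which smoothing filter was used, so as an illustration only, here is a centered moving average (my assumption). It shows why smoothing can both help and hurt: noise is averaged out, but a sharp, genuine peak is flattened as well.

```python
def moving_average(signal, width=3):
    """Centered moving average; the window shrinks at the edges."""
    half = width // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed
```

With a clean signal such as `[0, 0, 9, 0, 0]`, the single peak of 9 is spread out into three samples of 3, which is the kind of loss of distinguishing features that would hurt on already clean Flock data.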
Discussion:
They have shown that their classifier, TClass, was able to provide a low error rate by using the ensemble. They tested it on the Nintendo and Flock data, and smoothing worked with the Nintendo but not with the Flock. I believe this is because the data from the Nintendo is so noisy that the distinguishing features are suppressed by the noise, whereas with the Flock data, smoothing the already refined data drops the accuracy because the distinguishing features themselves get smoothed away. It would have been nice to read what TClass actually is and what it does. I believe the only good thing in the paper was the TClass classifier, which is some kind of decision-tree-based classifier that needs to be investigated.
Using Ultrasonic Hand Tracking to Augment Motion Analysis Based Recognition of Manipulative Gestures
This paper deals with the introduction of ultrasonic sensors alongside accelerometers and gyroscopes for capturing gestures in order to recognize the activity. The claimed contribution of the paper is the use of ultrasonics for motion analysis and the combination of the information from the sensors to refine the results. The information obtained from the different sensors is processed by a classifier so that the motion can be recognized. The classifiers used are HMMs, C4.5, and KNN. To capture the sensor data, three ultrasonic beacons are placed on the ceiling and the listeners are placed on the arms of the users. It is reported that ultrasonics suffer from reflection, occlusion, and low temporal resolution (low sampling rate), so the information provided by the ultrasonic sensors alone is not reliable and there are many false responses. Apart from this, there is noise in the sensor input over time which cannot be smoothed using Kalman filters, as the sampling rate is much lower than the frequency of hand movement.
In their experiment they used bicycle repair as the example and chose gestures involving pumping, screwing in or unscrewing screws, different pedal turnings, assembly of parts, wheel spinning, and removing or placing the carrier object.
They have tried approaches that can be classified into two categories:
- Model Based Approaches
- Frame based approaches.
In the model based approach they use HMMs on the sensor information obtained from the 2 gyroscopes and 3 accelerometers on the user's right hand and from the same set of sensors on the upper right arm.
In the frame based approach the feature vectors are obtained during each time frame and used for either training or testing of the classifier. The features extracted are the mean, standard deviation, and median of the raw sensor data. This approach captures the local characteristics of the signal, which can be exploited for training or testing the classifier. In their approach they use no overlap between adjacent windows, and they use part of the feature vectors from the frames for training and part for testing. For comparison they test the frame based approach with the KNN and C4.5 classifiers.
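A minimal sketch of the frame based feature extraction for one sensor channel, assuming non-overlapping frames of a fixed length and dropping any incomplete trailing frame. The frame length and function name are my own choices, and I use the population standard deviation since the summary does not say which variant was used.

```python
import statistics


def frame_features(samples, frame_len):
    """Return one (mean, std, median) tuple per non-overlapping frame."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        features.append((statistics.mean(frame),
                         statistics.pstdev(frame),
                         statistics.median(frame)))
    return features
```

For several channels the per-channel tuples would simply be concatenated into one vector per frame before going to KNN or C4.5.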
They also proposed the use of plausibility analysis for classification, which means restricting the search space and using the information from both the frame based and the model based approaches. For example, the gesture recognized by the HMMs is checked against the imposed constraint; if the constraint is satisfied, the gesture is selected, otherwise the next best gesture that satisfies both constraints is selected.
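The selection rule above can be sketched as walking down the HMM ranking until a candidate passes the constraint check. Everything here is illustrative: the gesture names, the scores, and the `is_plausible` callback (standing in for whatever location based constraint is used) are hypothetical.

```python
def pick_plausible(ranked, is_plausible):
    """Return the highest-scoring gesture that passes the constraint,
    or None if no candidate is plausible.

    ranked: iterable of (gesture_name, hmm_score) pairs.
    is_plausible: function mapping a gesture name to True/False.
    """
    for gesture, _score in sorted(ranked, key=lambda g: g[1], reverse=True):
        if is_plausible(gesture):
            return gesture
    return None
```

For instance, if the HMMs rank "pumping" first but the hand position rules it out, the next best candidate that fits the position is returned instead.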
In the results they claimed that ultrasonics alone gave about 58.7% classification with the C4.5 classifier and 60.3% with KNN. They argued that the results suffered because most of their gestures are not distinguishable using hand locations alone. They then used a kind of ensemble to merge in the inputs from the accelerometers and gyroscopes and obtained high classification rates (in the 90% range). They argued that for certain gestures they achieved almost 100%, while for some gestures, which are ambiguous and can be confused with other gestures, there was a drop in classification.
Discussion:
They have simply merged the ultrasonic data with the accelerometer data, combining the information about the local movement of a body part with the global position of that part. This method is definitely going to yield better results, as we have two layers of classification (global and then local) and we vote for the gesture that meets the requirements of both layers, so it is no surprise. The method is intuitive but needs a specialized room, as ultrasonic waves are reflected by metallic objects and occlusion also affects the response. They have not addressed the occlusion issues that matter for our requirements: we are dealing with finger and hand movements, which cannot be kept free of occlusion using ceiling-mounted beacons. Maybe we could use an array of beacons on the ground, the ceiling, and the sides to capture the responses, but that would restrict the domain of application for the user.
Also, I feel that instead of wired accelerometers and gyroscopes, it would be nice to use the wireless sensors described in last week’s paper on “ASL for game development”. They would be really lightweight and more practical to use.
Wednesday, February 27, 2008
American Sign Language Recognition in Game Development for Deaf Children
According to the authors, there is no existing ASL engine with which to test the setup, so they conducted their experiments as a Wizard of Oz study using a human wizard, who would eventually be replaced by the computer later. Since the system is at a nascent stage, they have limited the ASL to single- and double-handed gestures with no facial or other expressions. The vocabulary was chosen to fit the system constraints as well as the standard of what is taught in a real class. The system follows a push-to-sign approach, which means that the user has to push a button to activate the recognition system and then perform the gesture to be recognized.
This system consists of colored gloves and wireless accelerometers which capture the motion data of the hands using the gravitational effects on the X, Y, and Z axes. All the related hardware was developed in house. Since they use differently colored gloves, they use a color segmentation approach in which the discriminatory information between the background and the glove color, based on the HSV histogram, is used for segmentation. The data received from the vision and sensor based approaches is provided to the trained HMM, which then recognizes the sign and triggers the mapped action in the game. The HMM toolkit proposed for the system is GT2k, developed at Georgia Tech. They use a human observer to prune the responses and label them as correct or incorrect. The system architecture is shown in the figure to the left.
They reported their results for user dependent and user independent models. For the user dependent models, they obtained an accuracy of 93.39% by training on 90% of the data and testing on the remaining 10%, repeating this 100 times. For the user independent models, they obtained an accuracy of 86.28%.
They reported an average success rate of 92.96% over all samples at the word level; however, they reported that at the sentence level their system was less accurate, because words can be deleted and inserted, which lowers the accuracy.
Discussion:
This paper presented a nice system which can teach children with hearing disabilities ASL in an interactive way through a game, which makes it more exciting than boring classes. A Wizard of Oz study ensures that it is understood how children would like to interact with the system, and thus gives an idea of what the system should look like and how it should interact. The use of vision with simple Bluetooth wireless adapters was interesting, as it frees them from wires that may make things messy. Overall it was a nice paper with a nice practical application, but I still cannot find GT2k anywhere online!!!
A Method for Recognizing a Sequence of Sign Language Words Represented in a Japanese Sign Language
In order to identify the borders they proposed two measures. One is a measure of the change in velocity and the other is a measure of the change in direction of the hand movement, computed dynamically. A segmentation point is registered if the measure exceeds a certain threshold. Since the hand movement measure can produce several different borders because of noise, they proposed taking as the border the hand-movement candidate that lies closest to the border detected from the change in velocity. Using this information, gestures can be segmented from the stream of gestures and sent to the recognition system for identification.
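As a rough sketch of this border selection, assume each measure is already available as one number per frame. The thresholds and function names are hypothetical; the paper's actual measures are more involved than a plain per-frame threshold.

```python
def candidate_borders(measure, threshold):
    """Frame indices where a measure exceeds its threshold."""
    return [t for t, m in enumerate(measure) if m > threshold]


def snap_to_velocity(movement_borders, velocity_borders):
    """For each velocity border, keep the movement border closest to it."""
    return [min(movement_borders, key=lambda b: abs(b - v))
            for v in velocity_borders]
```

So noisy extra candidates from the movement measure are ignored unless they happen to be the nearest one to a velocity border.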
In order to identify which hand is used in the gesture, they calculate several parameters representing the difference between the movements of the right and the left hand. These measures are basically the velocity ratio of the hands relative to each other and the difference of the squared velocities, normalized by the duration of the gesture. These parameters are calculated separately for the left and the right hand; if the value is less than a certain threshold for both hands then both hands are being used, otherwise one hand is being used. They used another measure, based on just the relative velocity of the hands, to determine which hand is used in the gesture.
The sequence candidates are generated by evaluating a measure which takes into consideration whether the identified segment is a transition or a word, and only the words are considered. The segments identified as words are combined with the segments identified as transitions using a weighted sum, and this is used to score a sentence.
They evaluated this system using 100 samples from JSL, which included 575 word segments and 571 transition segments, out of which 458 (80.2%) of the transitions were correctly recognized and 64 (11.2%) were misjudged as words.
Discussion:
This paper presented an approach similar to what we have been using in sketch recognition, i.e. taking cues about the stroke from its speed and direction. I am not sure we can use the velocity cues with much accuracy, as gestures are normally made very fast; however, I liked that they also used the change in hand movement and then used both cues to identify the border. However, I did not like the flow of the paper and I was confused by the language too. It was not very clear how they got the thresholds, or whether people with different styles and speeds can use the system with the same thresholds. They have also admitted that some of the gestures were misidentified because their system does not use any spatial information, which may be the cause of the errors in many gestures.
Monday, February 25, 2008
Computer Vision Based Gesture Recognition for an Augmented Reality Interface
It is stated that to interact in the virtual environment, the system should be able to select an object by pointing towards it and clicking. As such, it is important for any such system to have these two features as its primary features. To make the pointing gesture intuitive, they use the index finger as the pointer. They also use some basic gestures, very different from each other, shown in the figure below:
By constraining the users to perform the gestures in one plane they restrict the problem to 2D, though it is 3D in nature, and it is stated that after some trials users were able to adapt to the constraint without much difficulty.
As a first step, the system segments the fingers so that they can be distinguished from the placeholder objects. For this they used a color cue to segment the skin from the other objects. In order to deal with intensity and illumination changes, they use a color space which is invariant to illumination changes, i.e. the normalized RGB space (chromaticity). In this space different objects form different clusters. These clusters are used to form confidence ellipses by measuring the mean and the covariance matrix, and the distance of a pixel's chromaticity is measured in terms of the Mahalanobis distance, so that pixels with different chromaticity values receive different labels. A blob of the predetermined size is then labeled as the hand, and small blobs, which are actually misclassified objects, are discarded. The missing pixels in the hand blob are filled in using morphological operators. To take care of dynamic range issues, only pixels with a certain minimum intensity are considered. At the high end, pixels which have at least one channel at 255 are discarded.
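The chromaticity and Mahalanobis steps can be sketched as below. This is a minimal illustration, not the authors' code: the cluster mean and covariance would come from training pixels, and the minimum-intensity value of 30 is a placeholder I made up.

```python
def chromaticity(r, g, b):
    """Normalized RGB: scaling all channels leaves the result unchanged."""
    total = r + g + b
    if total == 0:
        return (0.0, 0.0)
    return (r / total, g / total)


def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance of a 2D point from a cluster,
    using the closed-form inverse of a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det


def is_usable(r, g, b, min_intensity=30):
    """Reject dark pixels and pixels with a saturated channel."""
    if max(r, g, b) >= 255:
        return False
    return (r + g + b) / 3 >= min_intensity
```

A pixel would be labeled as skin when its squared distance to the skin cluster falls inside the chosen confidence ellipse; note that `chromaticity(100, 50, 50)` and `chromaticity(200, 100, 100)` give the same point, which is the illumination invariance being relied on.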
Since each gesture can be recognized by the number of fingers, they used a polar transformation and a number of concentric circles to count the fingers crossing each radius. The click gesture is a movement of the thumb, and they determine whether the thumb has moved by comparing the bounding boxes of the hand over a series of frames.
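To make the finger counting idea concrete, here is a simplified sketch that samples a binary hand mask along a single circle around the palm center and counts the separate runs of hand pixels. The paper uses several concentric circles after a polar transformation; the single radius, the sampling density, and the function name are my simplifications.

```python
import math


def count_fingers(mask, center, radius, samples=360):
    """Count runs of hand pixels (1s) met while walking once around a
    circle of the given radius centered at (row, col) in a binary mask."""
    cy, cx = center
    ring = []
    for k in range(samples):
        angle = 2 * math.pi * k / samples
        y = int(round(cy + radius * math.sin(angle)))
        x = int(round(cx + radius * math.cos(angle)))
        inside = 0 <= y < len(mask) and 0 <= x < len(mask[0])
        ring.append(1 if inside and mask[y][x] else 0)
    # A run starts wherever a 1 follows a 0; ring[k - 1] wraps around at k = 0.
    return sum(1 for k in range(samples) if ring[k] and not ring[k - 1])
```

Repeating this at several radii and taking the most frequent count would be closer to the concentric-circle scheme described above. The wrist also crosses the circle, so a real system would have to discount it.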
Discussion:
I chose this paper as I thought it would be nice to talk about the role of gestures in augmented reality. This was a very simple paper, and the gestures they recognize are very simple. The good part is their hand segmentation approach and some new ideas about the augmented reality office, which have come up in some of our discussions. I did not like that, although they claimed to recognize 3D gestures, they forced the problem into simpler 2D by constraining users to move in a plane. I believe that their recognition approach, based on counting fingers, cannot work in 3D, as occlusion between the fingers will give ambiguous recognition results. However, I liked the approach they presented, since with such a system a design meeting would be much more interactive and less confusing.
Sunday, February 24, 2008
Georgia Tech Gesture Toolkit: Supporting Experiments in
This toolkit provides users with tools for preparation, training, validation, and recognition using HMMs. Preparation requires the user to design the appropriate models, determine the appropriate grammar, and provide labeled examples of the gestures to be performed. All these steps require some analysis of the available data and the gestures involved. The validation step evaluates the potential performance of the overall system. Validation approaches such as cross-validation and leave-one-out are used in the paper. In cross-validation, a portion of the data is used for training and the other part is used for testing, whereas in leave-one-out, one data sample is always kept out for testing and the model is iteratively trained on the rest. Training uses the information from the preparation stage to train the models for each gesture, and recognition, based on the HMMs, classifies new data using the trained models.
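The two validation schemes differ only in how the sample indices are split, which a short sketch makes clear. This is generic index bookkeeping with my own function names, not the toolkit's interface.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for k-fold cross-validation
    over n samples, assigning samples to folds round-robin."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def leave_one_out_splits(n):
    """Leave-one-out is k-fold with one sample per fold."""
    return k_fold_splits(n, n)
```

Each split would be used to train the gesture models on `train` and score them on `test`, with the accuracies averaged over all splits.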
Since any practical application needs the system to understand the relevant continuous gestures, the authors have proposed the use of a rule-based grammar for this. With such a grammar, complex gestures can be described with a set of simple rules.
This toolkit has been used in various projects under way at Georgia Tech, and the authors give brief details of them. The first application is a gesture-based system for changing radio stations by performing certain gestures. The data is obtained from LED sensors. As a gesture is made, some of the LEDs are occluded, which provides information about the gesture. This information is used to train the model, which can then be used for recognition. The authors claim correct classification of 249 out of 251 gestures with this approach.
Another project introduced is secure entry based on patterned eye blinks. In this project, face recognition is coupled with eye blinking to produce a person recognition model. Optical flow from the images was used to capture the blinking pattern. It was observed that 9 states in a left-to-right HMM were able to model the blinking pattern. This model achieved an accuracy of 89.6%.
Another project deals with the integration of computer vision with sensing devices such as accelerometers and other mobile sensors for sensing motion. For correct recognition, differently colored gloves are used. The features obtained by both techniques are integrated into a combined feature vector representing a given gesture for the recognition process. For this project, they used a 5-state left-to-right HMM with self-transitions and two skip states. This project had not been implemented when the paper was published, so no results are available.
The authors have also mentioned the use of an HMM-based approach to recognize the actions of workmen in a workshop. They use vision and sensors to obtain the features which are used to recognize the gestures. Since the workmen are supposed to perform a series of gestures in order, the model keeps track of their moves and reports an error if they miss a gesture. According to them, this system achieved an accuracy of 93.33%.
Discussion:
This paper just provided an overview of a new toolkit for gesture recognition that was being developed at Georgia Tech on top of HTK, developed by Cambridge University. Though I was happy after reading the paper that they have something ready for gestures, I was disappointed to find no code on the project webpage, which as per the paper should have been made available as early as 2003. The projects where they apply the technique also don’t look very attractive: with Bluetooth and wireless remotes, changing channels is much easier than making gestures. It is quite possible that a slight unintended occlusion could trigger a channel change. I also believe that voice technology is now far superior for the purpose. The project on blinking-based entry was also something that I did not like. It is not very difficult to copy a blinking pattern, and making and remembering complex blinking patterns is not an easy task. (It is torture for the eyes if you have an eye infection.) With fingerprint biometrics and retinal signatures, establishments are more secure. I have experience of dealing with people in a workshop, and I know their motions are very much mechanized and measured to meet fast manufacturing requirements, but there are still many unintended motions (after all, they are human) which the presented system could interpret as gestures and raise the alarm. Also, it would be really troublesome to work in a workshop with accelerometers on your body, which could even affect efficiency.
Well, there is nothing much to say about the paper. If this toolkit is available for download somewhere, it will be helpful and good to have a look at it. Maybe it can save us from some hard-core programming.
Wednesday, February 20, 2008
Television Control by Hand Gestures
To detect the hand in the image, they used a normalized correlation measure between the hand template and the image frame. The location of the hand in the image frame is the region of maximum correlation. To measure the correlation, either orientation information or pixel information can be used; however, they found that the orientation information works a little better. For measuring speed, they used derivative information. The stationary background is removed by simply taking a running average of the image frames of the scene and subtracting it from the frames, which removes stationary objects from the image frame. To end the control mode, the user just has to close his hand.
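A one-dimensional sketch of template matching by normalized correlation follows. It works on plain lists of pixel values rather than on orientation maps, and it uses the zero-mean form of normalized correlation, which is my assumption; the 2D case slides the template over rows and columns in the same way.

```python
import math


def normalized_correlation(patch, template):
    """Zero-mean normalized correlation, in the range [-1, 1]."""
    mp = sum(patch) / len(patch)
    mt = sum(template) / len(template)
    num = sum((p - mp) * (t - mt) for p, t in zip(patch, template))
    den = math.sqrt(sum((p - mp) ** 2 for p in patch)
                    * sum((t - mt) ** 2 for t in template))
    return num / den if den else 0.0


def best_match(row, template):
    """Offset in row where the template correlates best."""
    width = len(template)
    scores = [normalized_correlation(row[i:i + width], template)
              for i in range(len(row) - width + 1)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because the score is normalized, a uniformly brighter or darker copy of the template still matches, but a rotated or rescaled hand would not, which is relevant to the concerns in the discussion below.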
In order to save the computational cost of searching for the template in the complete image frame, the system finds the position of the best match of the current filter and then searches in the local region with the other templates to find the position and value of the filter giving the best correlation match.
In order to avoid accidentally pressing a trigger, they used a threshold based on the time of inactivity. Their system is limited by its field of view, which is 25 degrees for searching for the trigger and 15 degrees for tracking.
Discussion:
This is a simple vision based approach for tracking the hand and using its motion to control the volume of the television set. Since they use orientation, I believe that if someone holds the hand at a slant to the principal axis of the camera, it will be difficult to use template matching, as the template must match the shape of the object. Also, I believe that, considering the complex nature of current television controls, it would be difficult to use such a remote control, as it might be very tiring. Also, I am unable to tell whether the distance from the television affects the recognition, as the size of the hand in the image would change.
Tuesday, February 19, 2008
3D Visual Detection of Correct NGT Production
This paper presents a vision-based technique to recognize gestures of NGT (Dutch Sign Language). Many NGT signs contain periodic motion, e.g. repetitive rotation or moving the hands up and down, left and right, or back and forth. As such, the motion must be captured in 3D. For this purpose, the user is confined to a specific region in front of a pair of stereo cameras with wide-angle lenses, with no obstruction between the cameras and the hands, so that the gestures are captured clearly. The complete setup and architecture of such a system is shown in figure below.
Since the hands are the region of interest, it is very important that their motion be tracked. The authors therefore propose a scheme that segments the skin out of the image frames using skin models trained on positive and negative skin examples specified by the users. Skin color is modeled by a 2D Gaussian perpendicular to the main direction of the distribution of the positive skin samples in RGB space, which is obtained by a sample consensus method called RANSAC. In order to compensate for the uncertainty in the chrominance direction, the Mahalanobis distance of the color to the brightness is measured and divided by the color intensity. This provides a kind of normalized pixel intensity which takes care of very bright regions on the skin caused by varied light sources and their different directions.
In order to track the gestures, it is important to follow the detected blobs (head and hands) and their movement. To do this, several frames per second are captured to record the sequence, and the blobs are tracked in each frame using a best-template-match approach. It is important that the right blob be recognized as right and the left blob as left. Occlusion may also disrupt the gesture recognition; to prevent this, the blobs are reinitialized if there is occlusion for more than 20 frames. For reference, a synchronization point is selected for each sign, and it is not considered for training. Various features, such as 2D angles, the tangent of the displacement between consecutive frames, and upper, lower, and side angles, are extracted from each frame. The time signal formed by the feature measurements is warped onto the reference point by Dynamic Time Warping, which yields a list of time correspondences between the new gesture and the reference sign. For classification, a Bayesian classifier based on the independent-feature assumption is trained for each gesture. The classifier is trained on data of 120 different NGT signs performed by 70 different persons, using 7-fold cross-validation. Their system achieved 95% true positives, and its response time was just 50 ms, which makes real-time use possible.
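The Dynamic Time Warping step is easier to follow with a small example. The sketch below is the textbook DTW on a single 1D feature, not the authors' implementation (which works on several features per frame); the function name `dtw` and the toy signals are assumptions for illustration.

```python
def dtw(reference, query):
    """Return (total cost, list of (reference index, query index) pairs)."""
    n, m = len(reference), len(query)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of reference[:i] with query[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(reference[i - 1] - query[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # both signals advance
                cost[i - 1][j],      # only the reference advances
                cost[i][j - 1],      # only the query advances
            )
    # Walk back from the end to recover the time correspondences.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# The query is the same up-down movement as the reference, performed slower.
reference = [0, 1, 2, 1, 0]
query = [0, 0, 1, 2, 2, 1, 0]
distance, correspondences = dtw(reference, query)
```

Because the query is just a stretched copy of the reference, the warped distance is zero even though the two signals have different lengths; the `correspondences` list is the kind of frame-to-frame mapping the classifier can then use.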
Discussion:
This is a vision-based paper that throws in a lot of image processing, and the classification basically amounts to recognizing the patterns created by the hands and head as observed in the image frames. Since they are tracking blobs, I believe they have to specify the synchronization point for each person who is going to use the system. Also, if the head moves while gestures are performed, it might affect certain gestures, as the head may be confused with a hand blob. There should also be no object in the background which resembles a blob, as it could cause misrecognition. In addition, the system requires that the person be constrained in an arrangement which may not be comfortable. I wonder if they did any user study for their system while obtaining data. As far as the image processing is concerned, they have done a pretty good job of taking care of chromaticity changes by training the system under various illumination conditions and using a normalized chromaticity value for each color.
Real-time Locomotion Control by Sensing Gloves
Since such a system requires calibration to match the finger movements with the walking pattern of the character, a calibration process forms the first stage. In the calibration process, the user mimics the motion of the on-screen character with the fingers. After this, the autocorrelation between the trajectories is calculated, which maps the topology of the finger movement to the movement of the virtual character. It is assumed that the character performs simple movements like walking, hopping, running and trotting.
After the calibration, the next step is the control process, in which the player performs a new movement with the hand; the corresponding motion of the character is then generated from the mapping function obtained during the calibration stage and displayed in real time. The user can change the movement of the fingers and the wrist to reproduce similar but different motions.
The system was tested using the cybergloves and the Flock of Birds at the testing stage, as the P-5 is sensitive to external noise and its sensor for 3D location is not very sensitive if the hand is away from the tower. In their test, users were given 30 minutes to get familiar with the system and were then asked to do certain tasks using the keyboard and the sensing gloves. It was observed that users had fewer collisions with the sensing gloves, though the time taken was longer.
Discussion:
This is a simple mapping of finger movements to simple stick-type character movements in the virtual world. I can understand that the users took more time with the sensing gloves, as most of us are more familiar with the keyboard and would obviously be faster using it than the gloves. Also, working with gloves can be tiresome: with the keyboard, pressing one key does the task, while here we have to mimic each and every step by hand movements. In zigzag, maze-like movements requiring sudden turns, it could be really painful. However, overall it was a nice application for small games. It could also be useful for small virtual tours in 3D space.
Sunday, February 17, 2008
A Survey of Hand Posture and Gesture Recognition Techniques and Technology
This is a survey paper on the hand posture and gesture recognition techniques used in the literature. Chapter 3 deals with the various algorithmic techniques that have been used over the years for this purpose. The author classifies the approaches in the literature into 3 major categories:
- Feature Extraction, Statistical models.
- Learning Algorithms.
- Miscellaneous Algorithms.
Feature Extraction, Statistical Models:
This category of methods deals with extracting features, in the form of mathematical quantities, from the available data, which is captured through sensors (gloves) or images. The category is further divided into subcategories like:
Template Based: In this approach the data obtained is compared against some reference data and using the thresholds, the data is categorized into one of the gestures available in the reference data. This is a simple approach with little calibration but suffers from noise and doesn’t work with overlapping gestures.
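A template-based recognizer of this kind fits in a few lines. The following is a minimal sketch under my own assumptions: postures are vectors of five finger-bend angles in degrees, the reference values and the threshold are invented, and `classify_posture` is a hypothetical name.

```python
import math


def classify_posture(sample, templates, threshold):
    """Return the label of the closest template, or None if nothing is close."""
    best_label, best_dist = None, float("inf")
    for label, reference in templates.items():
        dist = math.dist(sample, reference)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= threshold else None


# Assumed reference data: bend angle of each finger, thumb first.
templates = {
    "open": [0, 0, 0, 0, 0],
    "fist": [90, 90, 90, 90, 90],
    "point": [90, 0, 90, 90, 90],
}

noisy_point = classify_posture([85, 5, 88, 92, 90], templates, threshold=20)
half_closed = classify_posture([45, 45, 45, 45, 45], templates, threshold=20)
```

The slightly noisy pointing posture is still accepted, while the half-closed hand is rejected because it is far from every template. This also shows the weakness mentioned above: with enough noise or with overlapping templates, the threshold can no longer separate the classes.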
Feature extraction: This approach extracts low-level information from the data and combines it to produce high-level semantic feature information which can be used to classify gestures and postures. The methods in this subcategory usually capture changes and measure certain quantities during those changes. The collection of these values is used to label a posture, which can subsequently be extended to a gesture. However, this method suffers from heavy computational cost, and it fails unless a gesture is framed by a specific sequence.
Active shape Models:
Active shape models are used in image-based gesture recognition systems: a contour that is roughly the shape of the feature to be tracked is placed on the image. The contour is then manipulated by moving it iteratively towards nearby edges, which deforms it to fit the feature. The drawback is that this can capture only those gestures that are performed with open-hand postures. There is also very little work in this direction. However, with a limited set of gestures meeting the open-hand criterion, this method has been found to work in real time. Also, stereo cameras cannot be used.
Principal component analysis:
This is basically a dimension reduction method in which the significant eigenvectors (chosen by their eigenvalues) are used to project the data. The approach captures the significant variability in the data and can thus be used to identify gestures and postures in vision-based systems. Although the method could also be exploited for glove-based approaches, up to that time (1999) only vision-based techniques had used it. The method has the drawback that the variance must be concentrated in at least one direction: if the variance is uniformly distributed in the data, it will not yield relevant principal vectors. Also, if there is noise, PCA would treat it as significant too. Besides, the method is sensitive to scaling in hand size and position, which can be taken care of by normalization. Even then, the method is user dependent.
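To illustrate the idea, here is a minimal PCA sketch for 2D points only, using the closed-form eigen-decomposition of a 2x2 covariance matrix. Real systems apply a library routine to high-dimensional image vectors; the function name `pca_2d` and the toy data are assumptions for illustration.

```python
import math


def pca_2d(points):
    """Return (first principal direction, its share of the total variance)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    c = sum((y - my) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    half_trace = (a + c) / 2
    root = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = half_trace + root, half_trace - root
    # Eigenvector belonging to the larger eigenvalue.
    if b != 0:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)
    explained = lam1 / (lam1 + lam2) if (lam1 + lam2) > 0 else 0.0
    return direction, explained


# Toy data lying along the line y = 2x: all variance is in one direction.
points = [(1, 2), (2, 4), (3, 6), (4, 8)]
direction, explained = pca_2d(points)

# Project each centred point onto the first component (2D reduced to 1D).
mean_x = sum(x for x, _ in points) / len(points)
mean_y = sum(y for _, y in points) / len(points)
projected = [
    (x - mean_x) * direction[0] + (y - mean_y) * direction[1] for x, y in points
]
```

Here one component explains essentially all of the variance. If the points were spread equally in every direction, the two eigenvalues would be about the same and no direction would stand out, which is the drawback described above.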
Linear finger Tip model:
This method requires special markers on the fingertips and then segments the fingertip motion from the scene image. This motion is analyzed for the possible gesture. The method works well for simple gestures and recognizes the initial and final positions well. However, the system cannot work in real time and recognizes only a small set of gestures due to the limited possible finger motions. Also, curvilinear motions are not taken into account.
Causal analysis:
This method is based on the interaction of humans with the environment and capturing the body kinematics and dynamics. This also suffers from limited gesture sets and no orientation information can be used. Besides, this system cannot work in real-time.
Learning Algorithms
These are machine learning algorithms that deal with the learning of the gesture based on the data manipulation and weight assignment. The popular techniques in this sub category are:
Neural Networks:
This method is based on modeling the element of the human nervous system called the neuron and its interaction with other neurons to transfer information. Each node (neuron) consists of an input function, which computes the weighted sum of its inputs, and an activation function, which generates the response based on that weighted sum. There are two types of NN: feed forward and recurrent. Methods based on this approach have to deal with heavy offline training and computation cost. For complex systems, such a model could also become very complex. In addition, adding each new gesture or posture requires complete retraining of the network.
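A single node as described above can be written down directly. This is a minimal sketch: the sigmoid activation and the bias term are my assumptions (the survey summary only mentions a weighted sum and an activation function), and the names `neuron` and `feed_forward_layer` are illustrative.

```python
import math


def neuron(inputs, weights, bias=0.0):
    """Input function (weighted sum) followed by a sigmoid activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))


def feed_forward_layer(inputs, layer):
    """Evaluate every neuron of one layer on the same inputs."""
    return [neuron(inputs, weights, bias) for weights, bias in layer]


# One neuron whose weighted sum is exactly zero gives the sigmoid midpoint.
midpoint = neuron([1.0, 0.0], [0.5, -0.5], bias=-0.5)

# A tiny layer of two neurons with made-up weights.
layer = [([2.0, 2.0], -1.0), ([-2.0, -2.0], 1.0)]
outputs = feed_forward_layer([1.0, 1.0], layer)
```

In a feed-forward network the `outputs` of one layer simply become the inputs of the next; training consists of adjusting the weights, which is where the heavy offline cost comes from.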
Hidden Markov Model:
This method has been widely exploited for temporal gesture recognition. An HMM consists of states and state transitions with observation probabilities. A separate HMM is trained for each gesture, and recognition is based on which HMM generates the maximum probability. This method also suffers from the training time involved and from its complex working nature, as the results are unpredictable because of the hidden nature. For gesture recognition, the Bakis HMM is commonly used.
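The "one HMM per gesture, pick the most probable" scheme can be sketched with the standard forward algorithm. Everything concrete here is assumed for illustration: the two toy gestures, the discrete observation symbols (0 = hand moving up, 1 = hand moving down), the hand-set probabilities, and the names `forward_likelihood` and `recognize`. A real system would learn the parameters from training data.

```python
def forward_likelihood(observations, start, trans, emit):
    """Probability of the observation sequence under one HMM."""
    states = range(len(start))
    alpha = [start[s] * emit[s][observations[0]] for s in states]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in states) * emit[s][obs]
            for s in states
        ]
    return sum(alpha)


def recognize(observations, models):
    """Return the gesture whose HMM gives the highest likelihood."""
    return max(
        models,
        key=lambda name: forward_likelihood(observations, *models[name]),
    )


# Two-state left-to-right (Bakis-style) topology: stay, or move forward.
start = [1.0, 0.0]
trans = [[0.6, 0.4], [0.0, 1.0]]

# Each model is (start, transitions, emissions); emissions[state][symbol].
models = {
    "up_then_down": (start, trans, [[0.9, 0.1], [0.1, 0.9]]),
    "down_then_up": (start, trans, [[0.1, 0.9], [0.9, 0.1]]),
}

first = recognize([0, 0, 1, 1], models)
second = recognize([1, 1, 0, 0], models)
```

The left-to-right transition matrix never goes back to an earlier state, which is why this topology suits gestures that unfold in a fixed order over time.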
Instance based learning:
An instance is the vector of features of the entity to be classified. These techniques compute the distance between a given data vector and the instances in the database. This method is very expensive: a large database has to be maintained, and the computation has to be performed for each instance to be recognized, even when a previously seen instance is presented again.
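A k-nearest-neighbour classifier is one common example of instance-based learning, so I use it here as a stand-in; the summary above does not name a specific algorithm. The feature vectors, labels, value of `k`, and the name `knn_classify` are all assumptions for illustration.

```python
import math
from collections import Counter


def knn_classify(query, instances, k=3):
    """Label the query by majority vote among its k closest stored instances."""
    # Every stored instance is compared against the query on each call.
    neighbors = sorted(instances, key=lambda inst: math.dist(query, inst[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Stored database of (feature vector, label) pairs with made-up values.
instances = [
    ([0, 0], "open"), ([1, 0], "open"), ([0, 1], "open"),
    ([9, 9], "fist"), ([10, 9], "fist"), ([9, 10], "fist"),
]

near_open = knn_classify([1, 1], instances)
near_fist = knn_classify([8, 9], instances)
```

The cost problem is visible in the code: there is no training step at all, but each query is measured against the whole database, so recognition time grows with the number of stored instances.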
Miscellaneous Techniques:
The Linguistic Approach
This method uses a formal grammar to represent hand gestures and postures, although only a limited set of them. It involves simple gestures requiring the fingers to be extended in various configurations, which are mapped to the formal grammar through specific tokens and rules. The system involves a tracker and a glove. It has poor accuracy and a very limited gesture set.
Appearance based models:
These models are based on the observation that the humans are able to recognize the fine actions from the very low resolution images with little or no information about the 3D nature of the scene. These methods involve measurement of the regional statistics of the particular region of the image based on intensity values of the region. This method is simple but is unable to capture fine details in the gesture.
Spatio-temporal vector Analysis
This method tracks the movement of the hand through the sequence of images of the scene. Information about the motion is obtained from derivatives, and it is assumed that, under a static background, the hand is the fastest changing object in the scene. The flow field is then refined using the refinement and variance constraints. This flow field captures the characteristics of the given gesture.
Application of the method:
This section introduces the application of postures and gestures in various domains like:
Sign language: where a high accuracy of 90% has been obtained under some constraints.
Gesture to speech: in which hand gestures are converted to speech.
Presentations: Hand motion and gestures are used to generate presentations.
Multimodal interaction: Hand gesture and motion is incorporated along with speech to generate better user interfaces.
Human Robot interaction: Hand gestures are used as a natural mode to control robots.
Other domains include virtual environment interaction, 3D modeling in virtual environments, and television control.
Discussion:
This paper presents a survey of the various approaches used for gesture recognition, with a neat classification into three approaches which are further elaborated with the methods involved in each. Most of the paper deals with work done prior to 1999, and the advancements of the past 9 years are definitely worth exploring, considering the progress in computational power, sensors, gloves, vision capturing devices and human touch interfaces. It was interesting to note that the problems with the techniques we discuss in class have been known to the research community for the last 9 years, but only a few have been resolved, and those only partially. This is mainly because of the complexities involved in the gestures and the absence of any robust segmentation approach.