Wednesday, January 30, 2008

Online Interactive Learning of Gestures for Human/Robot Interfaces

In this paper the authors present an interesting gesture recognition system that can be trained on the fly. Keeping in mind the ability of humans to learn new gestures by watching a teacher, the authors try to develop a similar system, where a human teacher can make the robot learn a new gesture on the fly rather than depending on time-consuming offline training. According to the authors, even 2 examples were sufficient to teach a new gesture to the recognition system, which is quite close to the human way of teaching and interacting. The approach is demonstrated by building a gesture recognition system that can recognize 14 letters of the sign language alphabet. For this system, they used an 18-sensor CyberGlove as the feature capture device. They chose the 14 letters such that there is little ambiguity between them, and they avoided the use of 6D Polhemus sensors, which account for orientation and position information.

Keeping in mind the strength of temporal learning exhibited by HMMs and the highly stochastic behavior of human gestures, the authors use HMMs as classifiers for the gestures. For computational simplicity they use discrete HMMs, and for that they apply a preprocessing step in which the continuous stream of data is discretized using vector quantization of a series of short-time FFTs, an approach that is popular in the speech recognition community.


This approach is basically a sampling technique which requires a windowing function to capture the prominent (dominant) frequencies in each window and use them as features for training the HMMs. The vector quantizer then encodes each feature vector against a codebook, where every vector is represented by a single index. As a new gesture is provided, its features are matched against the codebook entries and assigned to the entry with the lowest distance in the least-squares sense. The interesting part of such clustering (in the spectral domain) is that it is not task specific and can easily be applied to many other recognition domains where features are consistent over time, such as handwriting recognition or face recognition.
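To make the preprocessing step concrete, here is a minimal sketch (my own illustration, not the authors' code) of windowed FFT features being quantized against a codebook; the window length, hop size, and codebook contents are arbitrary placeholders.

```python
import numpy as np

def stft_features(signal, win_len=32, hop=16):
    """Split a 1-D sensor stream into windows and keep FFT magnitudes."""
    window = np.hamming(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = signal[start:start + win_len] * window
        # Keep the magnitude of the positive frequencies as the feature vector.
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

def quantize(features, codebook):
    """Map each feature vector to the index of the nearest codebook entry
    (lowest distance in the least-squares sense)."""
    # d[i, j] = squared distance between feature i and codebook entry j
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)            # one discrete symbol per window

# Usage: the discrete symbol sequence is what the discrete HMM is trained on.
sensor_stream = np.random.randn(400)         # stand-in for one glove sensor stream
codebook = np.random.randn(16, 17)           # 16 entries; rfft of 32 samples gives 17 bins
symbols = quantize(stft_features(sensor_stream), codebook)
```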

As far as the implementation is concerned, they use a restricted form of HMM called a Bakis HMM, which can move from a given state only to the same state or to a state within the next two states. This reflects the assumption that the gestures to be classified are simple sequences of motions and non-cyclical in nature.
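As an illustration of that topology, here is a small sketch (my own, with an arbitrary five-state size) of what a Bakis transition matrix looks like before training.

```python
import numpy as np

def bakis_transition_matrix(n_states=5):
    """Left-to-right (Bakis) topology: from state i the model may stay in i
    or jump at most two states ahead; every other transition gets probability 0."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        allowed = range(i, min(i + 3, n_states))   # i, i+1, i+2 (clipped at the end)
        for j in allowed:
            A[i, j] = 1.0 / len(allowed)           # uniform starting estimate
    return A

print(bakis_transition_matrix())
```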

In order to verify their classification rates, the authors use a confidence measure that estimates the misclassification rate. Their results are very impressive: in one sample trial, with just 2 examples they had a 1% error rate, which dropped to 0.1% after 4 examples. In another sample trial, the error rate dropped to 0 after 6 examples, from 2.4% with 2 training examples.

Discussion

This paper presents a new approach for dealing with gestures when we need to train a system on the fly. The method is straightforward and is fairly accurate with only a small amount of online training. I liked how the authors draw a relationship between the HMMs modeled for speech and those for gestures; it makes sense considering that both are temporal processes. However, in the spectral domain we do have a good chance of noise being added, which may lead to the wrong frequencies being selected during sampling. Windowing also brings another problem, called leakage. Ideally, after the FFT, the spectrum of a pure tone should be zero everywhere except at ±ω; however, windowing causes the waveform to have non-zero values (of significant magnitude) at frequencies close to ±ω and also small non-zero values at frequencies far from ±ω. This leads to unwanted interference that may affect the waveform and the spectral information. I am not sure whether this effect is going to affect the formation of the input vector obtained from the preprocessor.
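To make the leakage concern concrete, here is a small sketch (not from the paper) comparing the spectrum of an off-bin sine tone under a rectangular window and a Hamming window; the tone frequency and window length are arbitrary.

```python
import numpy as np

fs, n = 256, 256                        # sample rate and window length (arbitrary)
t = np.arange(n) / fs
tone = np.sin(2 * np.pi * 10.3 * t)     # 10.3 Hz: deliberately not on an FFT bin

rect_spectrum = np.abs(np.fft.rfft(tone))                  # rectangular (no) window
hamm_spectrum = np.abs(np.fft.rfft(tone * np.hamming(n)))  # Hamming window

# Energy well away from the 10.3 Hz peak: a rough measure of leakage.
far_bins = slice(40, 120)
print("leakage, rectangular:", rect_spectrum[far_bins].sum())
print("leakage, Hamming:    ", hamm_spectrum[far_bins].sum())
```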

Also, I believe the introduction of an acceleration-based method to segment the gestures is a nice way of dealing with the complexities of natural interaction, as it would be very inconvenient and unnatural for people to pause between gestures.

Overall I liked the approach and the insight provided, as I was not aware of spectral HMMs being used in any domain other than speech.




Monday, January 28, 2008

HoloSketch: A Virtual Reality Sketching/Animation Tool

This paper, considering the technology of 1995 when it was published, presents a nice method of extending 2D drawing into 3D. Users manipulate the virtual world, which they see on a large CRT monitor, using a hand-held mouse-like device that the authors call a wand. The wand has three top buttons and one side button, along with a protruding rod that acts as the cursor. To make objects in the virtual world look close to naturally viewed objects, the computer calculates a separate viewing matrix for each eye, and the refresh rate is kept at roughly 13 Hz or above. This ensures that a high-resolution, close-to-reality image is observed in the virtual world.
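As an illustration of the per-eye viewing matrices (my own sketch, not the paper's code), one common way is to offset the tracked head position by half the interocular distance along the "right" axis; all numbers below are placeholders.

```python
import numpy as np

def look_at(eye, target, up=np.array([0.0, 1.0, 0.0])):
    """Build a right-handed view matrix for a camera at `eye` looking at `target`."""
    f = target - eye; f /= np.linalg.norm(f)
    s = np.cross(f, up); s /= np.linalg.norm(s)
    u = np.cross(s, f)
    m = np.eye(4)
    m[0, :3], m[1, :3], m[2, :3] = s, u, -f
    m[:3, 3] = -m[:3, :3] @ eye
    return m

def stereo_view_matrices(head_pos, gaze_target, ipd=0.064):
    """One view matrix per eye: shift the head position by +/- ipd/2 along
    the 'right' axis so each eye gets a slightly different projection."""
    forward = gaze_target - head_pos
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, np.array([0.0, 1.0, 0.0]))
    right /= np.linalg.norm(right)
    offset = right * (ipd / 2.0)
    return look_at(head_pos - offset, gaze_target), look_at(head_pos + offset, gaze_target)

left_eye, right_eye = stereo_view_matrices(np.array([0.0, 1.6, 0.5]), np.array([0.0, 1.5, -1.0]))
```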

In their effort to extend 2D menus to 3D in the virtual world, certain modifications had to be made to prevent occlusion and also to respect Fitts's law in 3D, since the wand has to travel a greater distance in 3D than a cursor does in 2D. To cope with these problems, the authors propose a large pop-up menu with fade-up submenus that give way to other menus, all of it manipulated using the wand.

HoloSketch has a lot of features for users to explore. There is a drawing mode in which users can draw objects like cylinders, cones, rings, etc., and even create another instance of an object using the left button; keeping that button pressed, the object can be manipulated in 3D space. An object can be selected for editing by placing the wand tip on it and pressing the middle button. To make movement in 3D more intuitive, the side button gives the user a grasp-like feeling while working with an object in 3D. In order to prevent accidental button presses, the user must simultaneously press additional keyboard keys. HoloSketch supports three kinds of movement, rotation, orientation change, and rotation plus orientation change, each bound to designated keyboard keys. The authors highlight the need to deal with jitter and the noise it adds, and propose a 10X reduction mode which provides finer control. The most interesting part of HoloSketch is its elementary animation, such as rotation, angle changes, movement between positions, scaling, and oscillations.
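The paper does not detail how the 10X reduction mode works, but a minimal sketch of my reading of the idea, scaling the wand's displacement down so hand jitter maps to much smaller object motion, might look like this.

```python
def reduced_motion(prev_wand_pos, new_wand_pos, obj_pos, reduction=10.0):
    """Apply only 1/reduction of the wand's displacement to the object,
    so hand jitter is attenuated and fine positioning becomes easier."""
    delta = [(n - p) / reduction for n, p in zip(new_wand_pos, prev_wand_pos)]
    return [o + d for o, d in zip(obj_pos, delta)]

# Usage: a 10 cm hand movement nudges the object by only 1 cm.
print(reduced_motion((0.0, 0.0, 0.0), (0.10, 0.0, 0.0), (1.0, 1.0, 1.0)))
```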

Discussion:

This paper presents a nice method for sketching in 3D and also adding some animation to the creations. It is interesting that the system gives close to real-time feedback, since it recalculates the viewing matrices for every head movement. I believe that with 1995 hardware this is indeed a good step; however, the use of a CRT monitor does not address using the system in real-world scenarios. Virtual 3D projection technology along with augmented reality glasses would be a nice addition to the interface. Also, I believe there should be a kind of erase mode where the user can erase a small part of the sketch in 3D, as doing so will be challenging considering occlusion cases. I also wondered what exactly the 10X mode is; there is actually no reference for it in the paper. Still, for a device like the wand, taking care of jitter was a good idea, even though I was unable to fully understand how it is done.

Sunday, January 27, 2008

An Architecture for Gesture-Based Control of Mobile Robots

This paper presents an interesting method for controlling a mobile robot using hand gestures. The authors stress that it is very important that the robot be able to interpret the meaning of an action rather than just imitate it. Hand gestures are described in the paper as a natural and rich input modality for interacting with the robot.

The complete system consists of a mobile robot, a CyberGlove, a Polhemus 6DOF position sensor, and a geo-location sensor that tracks the position and orientation of the mobile robot. Apart from this, there are two servers, the geo-location server and the gesture server, that communicate with each other. The task of the geo-location server is to keep track of the position and direction of the mobile robot in a universal coordinate system. The role of the gesture recognition server is to interpret the gestures of the user, which are captured using the CyberGlove and the Polhemus sensors, and then provide the interpretation to the robot so that it can act on the input. All the components of the system are integrated within CyberRAVE, a multi-architecture robot positioning system for distributed robots, and the servers communicate using the CyberRAVE interface. In order to recognize the gestures, the authors use an HMM approach so as to take advantage of the temporal nature of gestures. Instead of using the sensory information from all 18 sensors, they condense the feature vector from 18 dimensions to 10 features by linearly combining certain responses. This feature vector is then augmented with its first derivative to obtain a 20-dimensional column vector, which is reduced to a single discrete codeword using vector quantization (a sketch of this feature pipeline appears after the gesture list below). After examining the level of detail required for correct interpretation of the actions, the authors chose 32 codewords. The codebook is trained offline: they experimented with 5000 measurements which, according to them, captured all the possible samples of gestures and non-gestures covering the entire span of the hand space. This set is then partitioned into 32 final clusters, and the centroid of each cluster forms the final codeword for the gestures in that cluster. These 32 codewords are then used to define the 6 final gestures as sequences of codewords. The selected gestures are:

Opening: Moving from close fist to open hand.

Opened: Flat opened hand.

Closing: Moving from a flat open hand to a closed fist.

Pointing: Moving from the flat open hand to index pointing, or from a closed fist to index finger pointing.

Waving Left: Fingers extended and waving left.

Waving Right: Fingers extended and waving right.
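A minimal sketch of the feature pipeline described above (my own illustration): the 18-to-10 reduction matrix and the 32-entry codebook below are random stand-ins for the quantities the authors actually learned offline.

```python
import numpy as np

# Hypothetical placeholders: the paper's reduction weights and codebook are
# designed/learned; these random arrays are only here to show the shapes.
REDUCE = np.random.randn(10, 18)      # linearly combines 18 sensors into 10 features
CODEBOOK = np.random.randn(32, 20)    # 32 codewords over the 20-D feature space

def to_codeword(raw_t, raw_prev):
    """18 raw glove readings -> 10-D feature, augmented with its first
    derivative -> 20-D vector -> index of the nearest of 32 codewords."""
    feat_t = REDUCE @ raw_t
    feat_prev = REDUCE @ raw_prev
    vec = np.concatenate([feat_t, feat_t - feat_prev])    # value + derivative
    return int(np.argmin(((CODEBOOK - vec) ** 2).sum(axis=1)))

# Usage: a gesture becomes a sequence of codewords, one per glove sample.
stream = np.random.randn(50, 18)
codewords = [to_codeword(stream[t], stream[t - 1]) for t in range(1, len(stream))]
```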

HMMs, being learning-based models, are bound to converge to some gesture interpretation even if the input does not correspond to any gesture. In order to prevent this, an additional state called the wait state is introduced, which acts as a hub state with equal transition probability to every gesture model and to itself. As each observation arrives, the probability of being in each model is updated and normalized to sum to 1. This ensures that for all non-gesture inputs, the probability of being in the wait state stays the highest, so unintended movements are prevented from being recognized as one of the selected gestures. For a correct gesture, the model that represents that gesture yields the highest probability and is therefore selected as the interpretation.
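Here is a minimal sketch of how such a wait state can suppress non-gestures: keep a normalized probability over the six gesture models plus "wait", and only report a gesture once it overtakes the wait state. The likelihood numbers below are made up; in the paper they would come from each gesture's HMM.

```python
def update_beliefs(beliefs, likelihoods):
    """beliefs: dict model -> probability (includes 'wait').
    likelihoods: dict model -> P(observation | model); models not listed
    get a low default. Multiply, then renormalize so everything sums to 1."""
    updated = {m: p * likelihoods.get(m, 0.1) for m, p in beliefs.items()}
    total = sum(updated.values())
    return {m: p / total for m, p in updated.items()}

def recognize(beliefs):
    """Report a gesture only if it beats the wait state, otherwise None."""
    best = max(beliefs, key=beliefs.get)
    return best if best != "wait" else None

# Usage with made-up numbers: the 'opening' model gradually overtakes the wait state.
models = ["opening", "opened", "closing", "pointing", "wave_left", "wave_right", "wait"]
beliefs = {m: 1.0 / len(models) for m in models}
for likelihoods in [{"opening": 0.9, "wait": 0.5}, {"opening": 0.95, "wait": 0.4}]:
    beliefs = update_beliefs(beliefs, likelihoods)
print(recognize(beliefs))    # -> 'opening'
```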

Discussion:

This paper presents a beautiful approach of using hand gestures as the mode of communicating with a mobile robot. We humans use our hands to communicate when words fail to convey the meaning, and building on the same idea, the authors have developed an elegant way of controlling the actions of the robot. Though the communication happens in a controlled environment, I consider it a good step towards more complex systems. I believe that instead of using a CCD camera for geo-location, GPS could be used along with an on-board magnetometer and gyroscope to convey the position and orientation information to the server, which can then communicate with the gesture interpretation server. This would give the robot more mobility as well as independence from the controlled environment. Also, on-board stereo cameras, along with IR and ultrasonic sensors, could be used for controlling the local motion of the robot. An additional joystick could be used with the other hand to switch between the two modes when required and also to control the orientation of the proposed on-board stereo cameras.

Wednesday, January 23, 2008

An Introduction to Hidden Markov Models

The author presents a beautiful short tutorial on one of the best time-series analysis models used in machine learning, the hidden Markov model. Hidden Markov models try to explain the process that generated a certain sequence of responses without direct knowledge of the underlying procedure, just by looking at the observations. In other words, in an HMM the underlying process that generated a response is not known, but by looking at the sequence of observations the process is inferred. It is a stochastic process which, like a Markov chain, involves state transition probabilities; however, unlike a Markov chain, the state is hidden, and each state is capable of generating any of the possible observations with a certain probability. The ability of an HMM to model the observation probability (which may be discrete or continuous) gives it the additional flexibility to generate any possible observation at any point in the process.

An HMM is characterized by:

T = length (duration) of the observation sequence
O = observation sequence of T observations
M = number of distinct observation symbols
N = number of states
Q = set of N states
V = set of M possible observation symbols
A = state transition probability distribution
B = observation symbol probability distribution, where b_j(k) is the probability of observing symbol k in state j
π = initial state distribution

The model is usually written compactly as λ = (A, B, π).

After the basic introduction to HMMs, the author introduces the three canonical problems of HMMs. The first problem is finding the probability of a given observation sequence given the model λ, written P(O|λ); this is solved efficiently with the forward (or backward) procedure, which is essentially an induction over time. The second problem is estimating the optimum state sequence for an observation sequence, which is done by the Viterbi algorithm, a dynamic-programming procedure that keeps track of the best partial state sequence up to each time t. The third problem is the estimation of the model parameters, which is done by the Baum-Welch algorithm, a form of the Expectation Maximization algorithm. (FYI: the third problem is considered to be the hardest of the three.)
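For concreteness, here is a minimal sketch of the forward procedure for problem 1 in its standard textbook form; the two-state, two-symbol model below is made up.

```python
import numpy as np

def forward_probability(A, B, pi, obs):
    """P(O | lambda) by the forward procedure.
    A[i, j] = transition prob i->j, B[i, k] = prob of symbol k in state i,
    pi[i] = initial prob of state i, obs = list of symbol indices."""
    alpha = pi * B[:, obs[0]]             # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # induction step
    return alpha.sum()                    # termination

# Toy 2-state, 2-symbol model (made-up numbers).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
print(forward_probability(A, B, pi, [0, 1, 0]))
```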

After discussing these problems, the author introduces the different structures of HMMs, such as the ergodic model, in which every state can be reached from every other state, and non-ergodic models, where constraints are put on the transitions. A good example of a non-ergodic model is the left-right model, in which transitions only move forward (to the right) through the states.

The author also highlights the issues involved in training HMMs and warns that if a symbol never occurs in the training set, it will get a zero probability and the model may fail on sequences containing it. Sufficient training data should therefore be made available to deal with this issue.

Discussion:

This is a classic paper on HMMs and a shorter form of the much more detailed 'Tutorial on HMMs', which is one of my favorite papers. I do not find any drawbacks in the method, except that HMMs depend heavily on training data, like most machine learning algorithms; still, they have proven to perform better than most other methods on time-series analysis problems.

Environmental Technology: Making the Real World Virtual

Reading this paper brings to my mind some sci-fi movies where machines and humans interact with each other, making the world more comfortable and machines more useful. The author talks in similar terms about integrating computers and their power into the daily life of human beings. He suggests that interaction between machine and human should be natural and comfortable, and he mentions Sutherland's head-mounted display, which he rejected because it was not comfortable. To demonstrate his intention, he worked on a framework where the environment responded to the user based on his actions. He also describes another of his systems, in which two geographically separated individuals were able to communicate naturally by superimposing the image of one user over computer graphics controlled by a data tablet. In another form of such a communication system, the hand images of the users were superimposed on computer graphics so that they could use their hands to communicate as if they were sitting together. The author also mentions the use of projection displays to create a virtual world in which the subject can fly by leaning in the direction he would like to fly. Another interesting application in the paper is VIDEOPLACE, which enables users to interact with 2D projections of images in a special room. VIDEOPLACE was later extended to VIDEODESK, where interaction occurs in three dimensions, which could be helpful for 3D sculpting. The paper also describes a system where children were animated into a graphical scene so they could learn concepts while experiencing them.


Discussion


This paper presents various efforts made by the author during his research career to bring computers into the daily life of humans through natural interaction. I believe this is the common goal in human-computer interaction: to use existing technologies, improve them, and make them more usable for the masses. For me, this paper is a motivating example of how simple concepts and existing technology can be shaped, keeping in mind the psychology and perceptual abilities of the human brain, to build more useful computer technology.

American Sign Language Finger Spelling Recognition System

This paper presents a simple way of recognizing finger-spelling gestures which, once recognized, can be used as input to a speech engine or text editing software for speaking or displaying the letter. The authors use an 18-sensor CyberGlove to capture the sensor response for the sign language gesture of each letter and train a neural network (perceptron) to recognize the letter corresponding to the gesture. The input to the neural network is an 18x24 matrix representing the sensor responses for 24 letters (all except J and Z), with a 24x24 identity matrix as the target output. This data is used to train the network using a MATLAB toolbox, and the trained network is then integrated with LabVIEW (a product developed by National Instruments for real-time applications) for real-time use.

To recognize a gesture, the user makes the sign language gesture corresponding to a letter, which is then fed to the neural network framework running in LabVIEW. The framework responds with a 1x24 vector containing a '1' at the position of the letter that the network believes corresponds to the gesture.
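A minimal sketch of this kind of setup (my own illustration, not the authors' MATLAB/LabVIEW code): 18-dimensional glove samples, 24 one-hot letter targets, and a single-layer softmax network trained by gradient descent on toy random data.

```python
import numpy as np

# Hypothetical data: one 18-sensor glove reading per letter (24 letters, J and Z omitted).
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 18))          # rows: training samples, columns: glove sensors
Y = np.eye(24)                         # one-hot targets: one column of the identity per letter

# Single-layer network (perceptron with softmax output), trained by gradient descent.
W = np.zeros((18, 24))
b = np.zeros(24)
for _ in range(2000):
    logits = X @ W + b
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    grad = probs - Y                   # gradient of cross-entropy w.r.t. the logits
    W -= 0.5 * X.T @ grad / len(X)
    b -= 0.5 * grad.mean(axis=0)

def recognize(glove_sample):
    """Return a 1x24 one-hot vector with a 1 at the predicted letter's position."""
    out = np.zeros(24)
    out[np.argmax(glove_sample @ W + b)] = 1
    return out

print(recognize(X[3]))                 # ideally lights up position 3 for this toy sample
```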

Discussion:

This is a pretty straightforward and simple approach to recognizing finger spelling. However, the limitation is that it is very user dependent: it works only if the network is trained on data obtained from the same user who is intended to use the system. Another drawback is the omission of 'J' and 'Z', which makes the alphabet incomplete; it would have been better if some alternative gestures were defined for J and Z and taught to the intended users. Also, neural networks are quite sensitive to noise, so it would have been better if the authors had added artificial noise during training, which would have made the network more robust and might have increased accuracy. I also feel that, besides the letters, there should be gestures for deletion and for a break: the system could then recognize letters that form words, the completion of a word could be signaled by the break gesture, and the word could be fed to a speech generation system to be spoken. Before signaling the break, the word could be edited using the delete gesture, which would remove a letter.

Flexible Gesture Recognition for Immersive Virtual Environments.

This paper presents a gesture recognition method for interacting with a virtual environment in 3D. Highlighting the shortcomings of current means of gesture recognition, the authors emphasize that it is best to have an interaction system that is natural and does not require special clothing, backgrounds, or other location-based constraints for recognizing gestures. As a solution, they propose using an inexpensive data glove to interact with the virtual environment, emphasizing the role the hands play even when interacting with the natural environment. The authors note that although data gloves are good for interaction, the problem lies in measuring their orientation and position in space. With the cheaper P5 glove (which the authors use), the position in space cannot be found accurately, because its position sensing relies on reflected infrared radiation; the professional glove, on the other hand, has no built-in location sensor and requires an additional electromagnetic tracker (Flock of Birds) to locate the glove in space. Apart from this, professional gloves require an additional wire connecting the glove to the device to send back the sensor information. Though such an advanced setup gives the best position estimates, it is cumbersome to use and is affected by metallic surfaces in the vicinity of the usage area. Keeping these problems in mind, the authors base their gestures on finger flexion information, incorporate additional orientation information in the gesture definitions, and eliminate gestures that require hand motion through space over time. Since between gestures there are always some unintended movements that are not gestures, they add a time constraint to deal with them: only postures that are held static for some predetermined time are recorded, and the remaining non-gesture movements are discarded. This also makes it possible to construct complex gestures from multiple smaller ones. Based on this, the authors define a gesture as a sequence of succeeding postures.
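A minimal sketch of the hold-time constraint (my own illustration): a posture only counts once it stays nearly constant for a minimum number of consecutive samples; the tolerance and duration are arbitrary.

```python
def segment_postures(samples, tolerance=0.05, min_hold=10):
    """Yield the postures that stay nearly constant for at least `min_hold`
    consecutive samples; transient movements between postures are discarded."""
    held, count = None, 0
    for sample in samples:
        if held is not None and max(abs(a - b) for a, b in zip(sample, held)) < tolerance:
            count += 1
            if count == min_hold:
                yield held                    # posture held long enough: report it
        else:
            held, count = sample, 1           # posture changed: restart the timer

# Usage: a flat hand held for a while, a quick transition, then a fist held for a while.
flat, fist = (0.1,) * 5, (0.9,) * 5
stream = [flat] * 15 + [(0.5,) * 5] * 3 + [fist] * 15
print(list(segment_postures(stream)))         # -> [flat posture, fist posture]
```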

Gestures are recorded in the form of a 5D vector of finger flexions, where each dimension corresponds to a sensor value received from the P5 glove, plus the orientation information and another value indicating whether orientation is relevant. For recognition they build a gesture manager which holds a template for each gesture defined by the authors. A template is a 5D vector of flexion values from the P5 finger sensors corresponding to the particular gesture. To deal with variability between repetitions of a gesture (even by the same person), the template is an average over several repetitions of the gesture by that person. Each gesture corresponds to an identity which can trigger an event. For recognition, the input values obtained from the glove are compared against the templates using a distance metric. If there is a gesture in the library whose distance is minimal and within the defined threshold, the orientation is then compared, and if that too is within its threshold, the input gesture is recognized and the identity associated with it triggers the associated event. If there is no match, no gesture is returned.
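A minimal sketch of this template matching (my own illustration): the paper does not say which distance metric it uses, so Euclidean distance and the thresholds below are assumptions.

```python
import math

def recognize(flexion, orientation, templates, flex_threshold=0.3, orient_threshold=0.5):
    """templates: list of (name, template_flexion, template_orientation, orientation_relevant).
    Return the name of the closest matching gesture, or None if nothing matches."""
    best_name, best_dist = None, float("inf")
    for name, t_flex, t_orient, orient_relevant in templates:
        dist = math.dist(flexion, t_flex)                  # the metric itself is an assumption
        if dist < best_dist:
            best_name, best_dist, best_orient, best_rel = name, dist, t_orient, orient_relevant
    if best_name is None or best_dist > flex_threshold:
        return None                                        # no template close enough
    if best_rel and abs(orientation - best_orient) > orient_threshold:
        return None                                        # orientation check failed
    return best_name

# Usage with made-up templates: 'fist' is all fingers bent, 'point' has the index extended.
templates = [("fist", (0.9,) * 5, 0.0, False), ("point", (0.9, 0.1, 0.9, 0.9, 0.9), 0.0, True)]
print(recognize((0.85, 0.15, 0.9, 0.88, 0.92), 0.1, templates))   # -> 'point'
```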

Discussion:

This is a simple paper presenting a nice way of interacting with a virtual environment. I liked the way the authors explain the previous work on image-based gesture recognition and its problems, as well as the problems with glove-based gesture recognition. By relying only on flexion values and orientation they have eased the complexity of the problem, yet the solution is still quite effective for simple interactions. But the method has several shortcomings. First of all, since we are dealing with just orientations and flexion values, the infrared receiver tower and the hand must be in comparable positions for both training and testing; a change in either position would add errors to the input, since different positions (usually at an angle to the receiver) change the reflected infrared (IR) signal and thus a different response will be recorded. We also have to be fairly close to the tower to get orientation feedback, as the IR response is weak at larger distances. Secondly, I believe there are only a few gestures that look very different from each other when using just the fingers, so there is a good chance of misclassification among similar-looking gestures. Another drawback of the paper is that the authors do not mention which distance metric they use. I believe a simple Euclidean distance is not a good measure of similarity, since a large response from even a single sensor may lead to a large overall distance even though all the other sensors are close; a normalized measure of distance might provide a better solution. Apart from this, the biggest drawback is that there is no information about performance results, about the gestures they actually tried, or about whether some gestures caused ambiguity. I would also be interested to know whether we could build a totally user-independent way to interact with the virtual environment by using normalized sensor responses as input.