Wednesday, April 23, 2008

Toward Natural Gesture/Speech HCI: A Case Study of Weather Narration

Summary:

This paper presents a gesture recognition system that tries to capture the intent behind a gesture rather than a precise description of the gesture itself. This means that if the user wants to point back, he can point back in whatever way feels intuitive to him. The system consists of an HMM trained on global features of the gesture, taking the head as the reference point (as shown in the diagram below). Color segmentation was used to extract the skin regions, which were then tracked using predictive Kalman filtering. The authors also suggest that in certain domains, particular speech tends to accompany a particular gesture, which can be used as another feature to remove ambiguity in recognition. They observed that in their weather-speaker domain, a meaningful gesture is accompanied by a related spoken word 85% of the time. Using this knowledge, correctness improved from 63% to 92% and accuracy from 62% to 75%.
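Just to make the tracking part concrete, here is a rough sketch of skin-color segmentation plus a predictive Kalman filter, in the spirit of what the paper describes. The HSV thresholds and the constant-velocity model are my own assumptions, not values from the paper.

```python
# Rough sketch: skin-color segmentation + constant-velocity Kalman tracking.
import numpy as np
import cv2

SKIN_LOW, SKIN_HIGH = np.array([0, 40, 60]), np.array([25, 180, 255])  # assumed HSV range

def skin_centroid(frame_bgr):
    """Return (x, y) centroid of the skin-colored pixels, or None if none found."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, SKIN_LOW, SKIN_HIGH)
    m = cv2.moments(mask)
    if m["m00"] < 1e-3:
        return None
    return np.array([m["m10"] / m["m00"], m["m01"] / m["m00"]])

class ConstantVelocityKalman:
    """Minimal 2-D constant-velocity Kalman filter (state = x, y, vx, vy)."""
    def __init__(self, q=1e-2, r=5.0):
        self.x = np.zeros(4)                                   # state estimate
        self.P = np.eye(4) * 100.0                             # state covariance
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = 1.0  # dynamics (dt = 1)
        self.H = np.eye(2, 4)                                  # we only measure position
        self.Q = np.eye(4) * q                                 # process noise
        self.R = np.eye(2) * r                                 # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
```

Each frame one would call predict(), then update() whenever a skin centroid is found, so the predicted position bridges frames where segmentation fails.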



Discussion:


I liked that they have pointed out the correlation between the spoken words and the accompanying gestures. I think a system that uses context derived from the words would be a state-of-the-art development in the field of gesture recognition. Their results (though I am not convinced about their approach) demonstrate that context can improve accuracy. However, I am not happy with their overly simplistic approach, because a lot of ambiguity can be introduced in such a system by the 3D nature of the problem when recognition is done using angles in a 2D plane. For a simple weather-narration example this may work to some extent, but the results suggest there is not much to expect from such a system in more complex situations.

3D Object Modeling Using Spatial and Pictographic Gestures

This paper presents a merger of graphics and hand gestures to create objects in an augmented reality setup. The authors give a brief introduction to superquadrics (superellipsoids) and how they are used to create various shapes in graphics. In the paper, hand gestures are mapped to operations like tapering, twisting, and bending, which are applied to the graphic object through various functional mappings. Using self-organizing maps, the system allows users to specify their own gestures for each operation. In the example workflow, the creation of primitive shapes is the first step; the primitives are then combined to build up complex shapes like a teapot and a vase. The system was also tested to see whether non-existent objects can be created by hand and how precisely real objects can be modeled.
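For reference, here is a small sketch of the superellipsoid surface the paper builds shapes from, plus one global deformation (tapering) of the kind the gestures are mapped to. Parameter names and the taper formula are mine, not the paper's.

```python
# Superellipsoid sampling plus a simple tapering deformation (illustrative only).
import numpy as np

def superellipsoid(a=(1.0, 1.0, 1.0), eps=(1.0, 1.0), n=40):
    """Sample points on a superellipsoid with radii a and shape exponents eps."""
    eta = np.linspace(-np.pi / 2, np.pi / 2, n)      # latitude parameter
    omega = np.linspace(-np.pi, np.pi, n)            # longitude parameter
    eta, omega = np.meshgrid(eta, omega)
    f = lambda w, e: np.sign(w) * np.abs(w) ** e     # signed power keeps the symmetry
    x = a[0] * f(np.cos(eta), eps[0]) * f(np.cos(omega), eps[1])
    y = a[1] * f(np.cos(eta), eps[0]) * f(np.sin(omega), eps[1])
    z = a[2] * f(np.sin(eta), eps[0])
    return x, y, z

def taper(x, y, z, k=0.5):
    """Linear tapering along z: cross-sections shrink or grow with height."""
    s = 1.0 + k * z / (np.abs(z).max() + 1e-9)
    return x * s, y * s, z
```

Small exponents give boxy primitives and exponents near 1 give ellipsoids, which is how a handful of parameters can cover the primitive shapes mentioned above.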

The system consists of two CyberGloves, a Polhemus tracker, and a large 200-inch projection screen onto which the objects are projected. Using special LCD shutter glasses, the user can see the objects and manipulate them with his hands using pre-specified gestures.


Discussion:

I am not sure how easy or difficult it is to manipulate objects in augmented reality. I believe that if there had been some haptic feedback, it would have felt more natural. However, I liked the paper because it was different and had a very practical application. Such a system could also be used by a multi-user design team to create designs in the virtual world by interacting with the object together.

Device Independence and Extensibility in Gesture Recognition

In this paper a multi-layer recognizer is presented, which is claimed to be a device-independent recognition system that can work with different gloves. The data from the gloves is converted into posture predicates (e.g. is-bent, is-open), temporal predicates (changes in the shape of the posture over time), and gestural predicates (motion of the postures). The predicates basically contain information about the states of the fingers. A 32-dimensional boolean vector is obtained through a feed-forward neural network and matched against templates. The template matcher works by computing the Euclidean distance between the observed predicate vector and every known template; the template with the shortest distance from the observed predicate vector is selected as the gesture, with a certain confidence value. Confidence values are basically weights used to weigh each predicate so that the information from all predicates is utilized equally (some predicates carry less information while others carry more).
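The template-matching step is simple enough to write down. This is a hedged sketch of the idea as described above (nearest template by weighted Euclidean distance over the predicate vector); the weights and names are illustrative, not the paper's.

```python
# Weighted nearest-template matching over a boolean predicate vector.
import numpy as np

def match_gesture(predicates, templates, weights):
    """
    predicates : (32,) 0/1 vector produced by the predicate layer
    templates  : dict mapping gesture name -> (32,) template vector
    weights    : (32,) per-predicate weights (the paper's confidence values)
    Returns (best_gesture, distance).
    """
    p = np.asarray(predicates, dtype=float)
    best, best_d = None, np.inf
    for name, t in templates.items():
        d = np.sqrt(np.sum(weights * (p - np.asarray(t, dtype=float)) ** 2))
        if d < best_d:
            best, best_d = name, d
    return best, best_d
```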

They have tested their system on ASL (excluding the letters J and Z) and report accuracies in the high 50s to mid 60s, obtained by varying which sensors of the 22-sensor CyberGlove are available. They also report an increase in accuracy when a bigram context model was added to the recognizer.



Discussion:

There is nothing great in this paper. Using predicates is an easy way out, but there is always a scalability issue when users change. This method will not be suitable for a multi-user environment, and even a single user would struggle to get high accuracy out of this system because of the intrinsic variability in human gestures; that is why the accuracy is so low. Also, using less sensor data from the same glove does not make it a different glove. As I understand it, a different glove means altogether different sensors and architecture, so I do not agree with their claim of device independence as projected in the paper.

Wednesday, April 16, 2008

Discourse Topic and Gestural Form

In this paper the authors have tried to find out the extent to which gestures depend on the discourse topic (i.e. are speaker-independent) and the extent to which they depend on the individual speaker. The presented framework is Bayesian and uses an unsupervised technique, based on a vision approach, to quantify this.

Their approach uses visual features that describe motion based on so-called spatiotemporal interest points, which are high-contrast image regions (like corners and edges) that undergo complex motion. From the detected points, visual, spatial, and kinematic characteristics are extracted to form a large feature vector, to which PCA is applied to reduce the dimensionality. The reduced feature vectors are used to fit a mixture model and obtain a codebook. The dataset consists of 33 short videos (about 3 minutes each) of dialogues involving 15 speakers describing one of five predetermined topics. The speakers were native English speakers aged 18 to 32.
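To make the PCA-plus-codebook step concrete, here is a rough sketch of that part of the pipeline using scikit-learn. The descriptor dimensionality, number of PCA components, and codebook size are assumptions for illustration, not the paper's values.

```python
# Sketch: reduce interest-point descriptors with PCA, then quantize with a mixture model.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(5000, 200))        # stand-in for interest-point features

pca = PCA(n_components=20).fit(descriptors)       # reduce dimensionality
reduced = pca.transform(descriptors)

gmm = GaussianMixture(n_components=50, random_state=0).fit(reduced)
codewords = gmm.predict(reduced)                  # each descriptor -> codebook index

# A video can then be summarized as a histogram over the codewords.
hist = np.bincount(codewords, minlength=50) / len(codewords)
```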

In the experiment, each participant talked to another speaker, though they were not explicitly asked to make gestures. The scenarios involved describing "Tom and Jerry" and mechanical devices (a piston, candy, a pinball machine, and a toy). Of the recorded gestures, 12% were classified as topic-specific when the correct topic labels were used; with corrupted labels this value dropped to 3%. This indicates that there is a connection between discourse topic and gestural form that is independent of the speaker.

Discussion:

So 12% of gestures are classified as topic-specific when given correct labels, and 3% with corrupted labels. To convey this message, instead of simply using a human judge to take notes, a complex machine-learning, vision-based approach was used, which they admit must be somewhat corrupted by computer vision errors. I am not very impressed with the paper.

Monday, April 14, 2008

Glove-TalkII - A Neural-Network Interface which Maps Gestures to Parallel Formant Speech Synthesizer Controls

This paper presents a gesture-based speech synthesizer called Glove-Talk II that can be used for communication. The idea behind the work is that by identifying the subtasks involved in speech generation (tongue motion, letter generation, the sound used), each subtask can be mapped to a suitable action on a sensor device. The device consists of an 18-sensor CyberGlove for gestures, a Polhemus tracker for emulating tongue motion (up/down) for consonants, and a foot pedal for sound variation. The input from the three sensors is fed to three different neural networks, each trained on data for its particular subtask.

Three networks are used: a vowel/consonant network, a consonant network, and a vowel network. The vowel/consonant network is responsible for deciding whether to emit a vowel or a consonant sound based on the configuration of the user's hand. The authors claim the network can interpolate between hand configurations to produce smooth but rapid transitions between vowels and consonants. The trained network is a 10-5-1 feed-forward network with sigmoid activation functions. It was trained on 2600 examples of consonants and 700 examples of vowels; the consonant data was collected from an expert user, while the vowel data was collected from the same user by asking him to move his hand up and down. The test set consists of 1614 examples, and the MSE was reduced to 10^-4 after training.
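As a reference point, here is a minimal numpy sketch of a 10-5-1 sigmoid network of the kind described for the vowel/consonant decision. The architecture matches the paper's description, but the training step shown is a generic MSE/SGD illustration, not their exact procedure.

```python
# Minimal 10-5-1 feed-forward sigmoid network with one SGD step on squared error.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.1, size=(10, 5)), np.zeros(5)
W2, b2 = rng.normal(scale=0.1, size=(5, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def forward(x):
    h = sigmoid(x @ W1 + b1)          # 10 glove inputs -> 5 hidden units
    y = sigmoid(h @ W2 + b2)          # -> 1 output, e.g. P(consonant)
    return h, y

def sgd_step(x, target, lr=0.1):
    global W1, b1, W2, b2
    h, y = forward(x)
    # backpropagate the squared error through the two sigmoid layers
    dy = (y - target) * y * (1 - y)
    dh = (dy @ W2.T) * h * (1 - h)
    W2 -= lr * np.outer(h, dy); b2 -= lr * dy
    W1 -= lr * np.outer(x, dh); b1 -= lr * dh
```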

The vowel network is a 2-11-8 feed-forward network whose hidden units are RBFs, each centered to respond to one of the cardinal vowels. The outputs are 8 sigmoid units representing 8 synthesizer control parameters. The network was trained on 1100 examples with some added noise, and tested on 550 examples, giving an MSE of 0.0016 after training.

The consonant network is a 10-14-9 network whose 14 hidden units are normalized RBF units, each centered at a hand configuration determined from the training data. The 9 output units correspond to the nine control parameters of the formant synthesizer. The network was trained on scaled data consisting of 350 approximants, 1510 fricatives, and 700 nasals; the test data consists of 255 approximants, 960 fricatives, and 165 nasals, giving an MSE of 0.005.

They tested the system with an expert pianist; after about 100 hours of practice, his speech with Glove-Talk II was found to be intelligible, though still with errors in polysyllabic words.




Discussion:

I just wonder how difficult or easy it is to control your speech with a foot pedal, data glove, and Polhemus, all of which need to move in synchronization for correct pronunciation. Also, even after 100 hours of training the results were merely intelligible, so I doubt this system is comfortable at all. But for a paper that is 10 years old, it is a decent first step and a different approach.

Sunday, April 13, 2008

Feature selection for grasp recognition from optical markers

This paper presents a feature selection methodology for choosing the relevant features from a large feature set to get better recognition results. The aim is to assign a grasp class to an input using the minimum number of features. In their approach, the authors place several markers on the back of the hand and use calibrated cameras to track the markers in a controlled environment. They argue for this method because it does not interfere with the subject's natural grasping, nor with natural contact with the object.
In their experiment they placed the markers on the back of the hand and used a local coordinate system that is invariant to the hand pose.

To classify the grasps they use a linear logistic regression classifier, which is also used for subset selection in a supervised manner. They treat three markers at a time as a single feature vector and then use cross-validation with subset selection over the number of features to arrive at the best feature set. They tried both the backward and the forward approaches to subset selection and observed that the error difference between the two approaches is only about 0.5%.
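Here is an illustrative sketch of greedy forward subset selection of the kind described, with scikit-learn's logistic regression standing in for the paper's linear logistic model. For simplicity each candidate feature group here is one marker's three coordinates, which is a simplification of the paper's grouping; the function names and cross-validation settings are assumptions.

```python
# Greedy forward selection of marker groups by cross-validated accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, n_markers, k_keep=5):
    """X has 3 columns per marker; returns the indices of the selected markers."""
    selected = []
    while len(selected) < k_keep:
        best_m, best_score = None, -np.inf
        for m in range(n_markers):
            if m in selected:
                continue
            cols = [3 * j + c for j in selected + [m] for c in range(3)]
            score = cross_val_score(LogisticRegression(max_iter=1000),
                                    X[:, cols], y, cv=2).mean()
            if score > best_score:
                best_m, best_score = m, score
        selected.append(best_m)   # keep the marker that helps the most
    return selected
```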

In their experiment they use a 90-dimensional vector per hand pose, representing the 30 markers on the hand. Their domain for the experiment is the set of daily functional grasps shown in the figure below.

They collected data using 46 objects, each grasped in multiple ways. The objects were divided into two sets, A and B: A containing 38 objects with 88 object-grasp pairs, and B containing 8 objects with 19 object-grasp pairs. They collected data from 3 subjects for set B and 2 subjects for set A. Using 2-fold cross-validation, they obtained 91.5% accuracy with the full 30-marker data, while with the 5 markers chosen by subset selection they achieved 86% accuracy.

They also evaluated their feature subset on classifiers trained on different data, using both the 5- and 30-marker sets. They trained four classifiers in total: two with the data from subject 1 and subject 2 respectively (object set A), a third with the combined data of subjects 1 and 2 from set A, and a fourth on the combined set A+B from subjects 1 and 2.

They observed that accuracy was sensitive to whether data from the test subject was included in training. With the subject's data included, accuracy was 80-93% in the reduced space (5 markers) and 92-97% with all 30 markers. For a completely new user tested on a classifier trained on a single other user's data, accuracy dropped to an abysmal 21-65% in the reduced marker space. However, the retention of accuracy from the full to the reduced marker set was over 100% in all cases.

In their analysis they also observed that all three subjects did well on cylindrical and pinch grasps, whereas spherical and lateral tripod grasps performed poorly because of the similarity between the three-finger precision grasps.


Discussion:

This paper has nothing new in it except the linear logistic regression classifier. Their analysis is also based on a small user set and hence cannot be generalized to most cases; with more users with different hand sizes, it would have been a better paper. Also, I don't understand why many papers claim that accuracy increases when samples from the test user are included in the training set; that is a very simple and easy-to-digest fact which needs no explanation. It would also have been nice if they had mapped the relationships between the grasping patterns of the users, which might have been used to build a more generalized feature set for a given set of users sharing similar patterns. The work is very similar to our paper on sketch.

RFID-enabled Target Tracking and Following with a Mobile Robot Using Direction Finding Antennas

Summary: This paper presents a novel application of RFID for robot target tracking and following. The authors use cheap but effective RFID technology for this purpose. The antennas receive the signal from an RFID transponder; the induced voltage depends on the angle of orientation of the antenna with respect to the transponder (roughly the sine of the angle), and this variation in voltage can be used to find the direction of the transponder. To find the direction reliably, they use dual antennas mounted at a 90-degree offset. The ratio between the two voltages can then be calibrated to the transponder's angle, since v1/v2 = tan(theta), where theta is the angle between the transponder direction and the line of sight measured from the vertex of the angle formed by the two antennas.

The antenna pair is connected to an actuator, controlled by a microcontroller, which is programmed to rotate the antennas such that the ratio v1/v2 stays at 1. This ensures that the transponder always lies along the bisector of the two antennas. The change in direction produced by the actuator is mapped to the actual direction of motion of the robot. Thus the robot can be made to track either another robot or some static destination.
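A toy sketch of this direction-finding relation: the voltage ratio gives the bearing of the transponder, and a simple proportional controller rotates the antenna pair to keep the ratio at 1 (target on the bisector). The gain and function names are my own assumptions, not the paper's controller.

```python
# Direction finding from two orthogonal antenna voltages, plus a proportional steering command.
import math

def bearing_from_voltages(v1, v2):
    """Angle of the transponder relative to the antenna pair, from v1/v2 = tan(theta)."""
    return math.atan2(v1, v2)            # radians; robust when v2 is near zero

def steering_command(v1, v2, gain=0.5):
    """Proportional command that drives the ratio toward 1 (theta toward 45 degrees)."""
    error = bearing_from_voltages(v1, v2) - math.pi / 4
    return gain * error                   # rotate the antenna mount by this amount
```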


Discussion:

This paper, as everyone discussed, is more relevant to robotic navigation than to RFID. Still, it gives a good cue about using voltage ratios for direction tracking. Maybe we can combine this idea with data gloves to get 3D information about hand position: if we use a similar setup of orthogonal antennas as a receiver and put RFID tags on the gloves, the voltage ratio could be used to track the hand in 3D space with respect to the vertex of the receiver. Overall I liked this simple and straightforward approach.

Gesture Recognition Using an Acceleration Sensor and Its Application to Musical Performance Control

This paper presents a method for recognizing gestures from acceleration data and applies it to musical performance control. The authors argue that emotion and expression depend very much on the force with which a gesture is performed, not just on the gesture itself. They therefore use an accelerometer, placed on the back of the hand, to capture the force of the gesture. The same information is used to recognize the gesture by analyzing the components of the acceleration vector in the three planes (x-y, y-z, z-x) and capturing the temporal change in acceleration.
Using the temporal information in the time-series acceleration data projected onto these planes, 11 parameters (including an intensity component, a rotational component, a main-motion component, and 7 direction-distribution components computed by measuring density) are obtained for each plane, giving 33 parameters per gesture. These 33 parameters are then used for recognition.

For recognition, sample gestures are first collected from the user to form standard patterns. For a new gesture, the difference between its parameters and the mean of each standard pattern, normalized by the standard deviation, is computed to give a weighted error. The gesture is recognized as belonging to the standard pattern that gives the minimum weighted error.
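The matching rule is short enough to write out. This is a sketch of the idea as described above (distance from each standard pattern's mean, normalized by its standard deviation, smallest error wins); the array shapes and names are illustrative.

```python
# Nearest standard pattern by variance-normalized (weighted) error.
import numpy as np

def classify(params, pattern_means, pattern_stds):
    """
    params        : (33,) parameters of the observed gesture
    pattern_means : dict gesture -> (33,) mean of its training samples
    pattern_stds  : dict gesture -> (33,) std  of its training samples
    """
    best, best_err = None, np.inf
    for name in pattern_means:
        err = np.sum(((params - pattern_means[name]) / (pattern_stds[name] + 1e-9)) ** 2)
        if err < best_err:
            best, best_err = name, err
    return best, best_err
```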

They tested their approach by generating music based on the gestures of a conductor. They also used dynamic linear prediction to predict the tempo from the previous tempo, which according to them gives real-time results compared to image-processing-based approaches.

They claim results reaching 100% accuracy with the same user, while performance declines with different users.

Discussion:

This paper presents a simple though interesting use of acceleration data for gesture recognition. However, I think a given user can perform the same gesture quite differently (in terms of acceleration) depending on their energy and enthusiasm, so the approach may fail even for the same user the system was trained on. I guess they use some threshold to determine the start of a gesture and another to mark the end. I believe the start will show an abrupt acceleration which then becomes roughly constant for the rest of the gesture; a change of gesture will show an abrupt change in acceleration (direction and magnitude), finally decreasing to a stop. These changing acceleration values could be used for segmenting gestures.

I am not sure whether they used a series of gestures or just one gesture at a time in their experiments. It would also be interesting if they used normalized data from multiple users in the training set and then checked the accuracy across multiple users.

Thursday, April 3, 2008

Activity Recognition using Visual Tracking and RFID

Summary:

In this paper, the authors present the use of computer vision and RFID information for capturing the activity of a subject. Using standard methods based on skin-color segmentation, they obtain the skin regions (hands and face); by using the area of the bounding box, they identify the hands and track them.

All objects carry an RFID tag, which is read through an RFID reader. The reader consists of an antenna which emits energy that charges the RFID tag. The antenna is periodically switched off, during which the capacitor in the tag loses charge; the tag communicates its ID back to the reader by modulating this energy. The authors modify this mechanism by giving the reader the additional capability of capturing the voltage values returned by the RFID tags.

Since a tag receives the maximum signal when it is normal to the antenna field, the variation in the energy received and re-emitted can be used to capture the orientation of the object being manipulated by the user.

In a nutshell:

The RFID tag is used to determine whether an object is being manipulated by the user, and which object it is.

Computer vision is used to obtain the position of the hands. If the hands are close to the object being manipulated (as detected through RFID), the user is said to be approaching the object.


Tracking the hands together with the orientation information from the RFID tags is used to recognize the activity (a minimal sketch of this fusion logic is given below). As an example, they show the activity of a subject in a retail store environment.
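A minimal rule-based sketch of this fusion logic, assuming the RFID layer says which tagged object is active and the vision layer says where the hands are. The distance threshold and the activity labels are my assumptions for illustration, not the paper's.

```python
# Toy fusion of hand position (vision) and tag activity (RFID) into an activity label.
import math

def label_activity(hand_xy, object_xy, tag_active, near_px=80):
    """hand_xy/object_xy are image coordinates; tag_active means the reader
    observes the object's tag being moved or occluded."""
    dist = math.dist(hand_xy, object_xy)
    if tag_active and dist < near_px:
        return "manipulating object"
    if dist < near_px:
        return "approaching object"
    return "idle"
```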


Discussion:

This is a really nice and novel method of recognizing activity, and maybe it could be helpful for Dr. Hammond's favorite "workshop saw" activity as well. But my concerns are:

1. Whether RFID can provide a good estimate of the distance.
2. Whether the tags are unaffected by the presence of other magnetic devices in the proximity.


Overall it is a nice approach and, as discussed in class, a cheap direction to look at for future research.

Gestures without Libraries, Toolkits or Training: A $1 Recognizer for User Interface Prototypes

Summary:

The authors present a simple yet robust method for recognizing single-stroke gestures (sketch gestures) without using any complex machine learning algorithms. The idea behind the $1 recognizer is simple:

  • Capture different ways of drawing a shape.
  • Re-sample the data points so that the overall shape is preserved.
  • Rotate the resampled points by an indicative angle, which they define as the angle between the gesture's first point and the centroid. According to them, this indicative-angle rotation helps in finding the best match.
  • The sampled points are then compared against the available templates in the database.
  • The closest matching template is the recognized result (a rough sketch of these steps is given after this list).
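Here is a rough sketch of the resample / rotate / match pipeline listed above. Scaling to a reference square and the golden-section angle search of the full $1 algorithm are omitted for brevity, so this is an illustration of the steps, not a complete reimplementation.

```python
# $1-style single-stroke matching: resample, rotate by the indicative angle, compare.
import math

def resample(points, n=64):
    """Resample a stroke to n points evenly spaced along its arc length."""
    path_len = sum(math.dist(points[i - 1], points[i]) for i in range(1, len(points)))
    step, acc, out = path_len / (n - 1), 0.0, [points[0]]
    pts = list(points)
    i = 1
    while i < len(pts):
        d = math.dist(pts[i - 1], pts[i])
        if acc + d >= step:
            t = (step - acc) / d
            q = (pts[i - 1][0] + t * (pts[i][0] - pts[i - 1][0]),
                 pts[i - 1][1] + t * (pts[i][1] - pts[i - 1][1]))
            out.append(q)
            pts.insert(i, q)     # continue measuring from the new point
            acc = 0.0
        else:
            acc += d
        i += 1
    while len(out) < n:          # guard against floating-point shortfall
        out.append(pts[-1])
    return out[:n]

def rotate_to_indicative_angle(points):
    """Rotate about the centroid so the angle to the first point is zero."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    theta = -math.atan2(points[0][1] - cy, points[0][0] - cx)
    return [((p[0] - cx) * math.cos(theta) - (p[1] - cy) * math.sin(theta),
             (p[0] - cx) * math.sin(theta) + (p[1] - cy) * math.cos(theta))
            for p in points]

def path_distance(a, b):
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def recognize(stroke, templates):
    """templates: dict name -> point list preprocessed with the same two steps."""
    candidate = rotate_to_indicative_angle(resample(stroke))
    return min(templates, key=lambda name: path_distance(candidate, templates[name]))
```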

Since the sampled points depend on the drawing style of the user, two similar-looking shapes may match different templates because of variability in drawing style. This is handled by providing the database with samples of all the expected drawing styles.

They compared their algorithm with Rubine's algorithm and report better accuracy. With 1 template per gesture they obtained 97% accuracy, which improved to 99% with 3 templates per gesture.

Discussion:

I have the same thoughts as Brandon. Being an instance-based algorithm, if the database is too large it will not yield fast results. When we present a gesture to the system, it does all the preprocessing and searches through the whole dataset; even if we repeat the same gesture, it searches the complete dataset again. It does not keep any record of the expected gesture when a gesture is repeated. In other words, it is an instance-based algorithm whose memory is erased after each search. Maybe with some bookkeeping we could improve the algorithm; otherwise it is a beautiful and simple algorithm.

Enabling Fast and Effortless Customisation in Accelerometer Based Gesture Interaction

This paper presents the use of acceleration data for gesture recognition using HMMs; the gestures are used to control a DVD player. Accelerometers capture the acceleration values, and according to the authors these signal patterns can be used to build models that allow gesture recognition with an HMM. The process involves:

1. Using the sensors to obtain the accelerometer data.
2. Resampling the data and normalizing it to equal length and amplitude; the data is reduced to 40 sample points per gesture.
3. Sending the data for a gesture to a vector quantizer to reduce its dimensionality to 1-D (assigning labels).
4. Using the vector-quantized label sequences to train the HMM, which is then used for recognition (a rough sketch of steps 2-3 is given after this list).
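Here is a sketch of steps 2-3: resample each gesture's 3-D acceleration stream to a fixed length, normalize it, and vector-quantize the samples with a k-means codebook so each gesture becomes a 1-D label sequence ready for a discrete HMM. The codebook size and the use of k-means are my assumptions; the paper only says a vector quantizer is used.

```python
# Resample/normalize acceleration streams and quantize them into 1-D symbol sequences.
import numpy as np
from sklearn.cluster import KMeans

def resample_and_normalize(acc, n=40):
    """acc: (T, 3) raw acceleration; returns (n, 3) resampled, unit-amplitude."""
    acc = np.asarray(acc, dtype=float)
    t_old = np.linspace(0, 1, len(acc))
    t_new = np.linspace(0, 1, n)
    res = np.column_stack([np.interp(t_new, t_old, acc[:, k]) for k in range(3)])
    return res / (np.abs(res).max() + 1e-9)

def build_codebook(gestures, k=8):
    """Fit a k-means codebook on all samples pooled from the training gestures."""
    pooled = np.vstack([resample_and_normalize(g) for g in gestures])
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(pooled)

def quantize(gesture, codebook):
    """Map one gesture to the 1-D symbol sequence that would be fed to a discrete HMM."""
    return codebook.predict(resample_and_normalize(gesture))
```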

For training, they add artificial noise to introduce variability and thus enlarge the training set; they found that Gaussian noise at SNR = 3 gave the best accuracy. They collected 30 distinct 3D acceleration vectors from one person and selected 8 gestures for the controls, then added noise to obtain additional data. They used cross-validation to obtain the best training/testing split and achieved an accuracy of 97.2% with SNR = 3.


Discussion:

This paper was strange to me. If I understand correctly, they are using the same data (after adding noise) for building the vector quantizer and then testing on part of that same data. And after using vector quantization, why are they using HMMs? Also, adding noise cannot introduce the kind of variability that different users would introduce, so their accuracy numbers mean little to me.