Sunday, March 30, 2008

SPIDAR G&G: A Two-Handed Haptic Interface for Bimanual VR Interaction

Summary:

This paper presents a haptic device called SPIDAR, which is used to interact with a virtual world. The device consists of a ball at the center, attached via strong nylon threads to different pulleys. The system provides the user with 6 DOF, and a 7th, grasp, is provided by a pressure sensor on the ball.

Interaction with the virtual world happens through movement of the SPIDAR ball within the workspace bounded by the string arrangement (each SPIDAR corresponds to a single object). Moving the ball in a given direction creates tension in some of the strings; this tension drives the corresponding pulleys against the resistance provided by the motors. The motion is captured and used as the input for the motion of the associated object. The authors state that such a system can be beneficial for tele-operation, medical operations, molecular simulations, etc.
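
To make the mechanism concrete, here is a minimal sketch of how a ball position could be recovered from measured string lengths by nonlinear least squares; the cube-corner anchor layout, function names, and solver choice are my own illustration, not the authors' implementation:

    import numpy as np
    from scipy.optimize import least_squares

    # Hypothetical pulley (anchor) positions at the corners of a unit-cube frame.
    anchors = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                       dtype=float)

    def ball_position(lengths, x0=np.zeros(3)):
        """Find the point whose distances to the anchors best match the
        measured string lengths (nonlinear least squares)."""
        residual = lambda p: np.linalg.norm(anchors - p, axis=1) - lengths
        return least_squares(residual, x0).x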

The system of two SPIDARs was tested on three users; each user was assigned the task of controlling a sphere in a virtual world with one hand (one SPIDAR) while using the other hand (the second SPIDAR) to touch a pointer to marks on the sphere. The authors observed that people preferred SPIDAR-G&G (the bimanual version) to SPIDAR-G (the single-handed version) because the bimanual version felt much more intuitive. They also found that users performed better when provided with haptic feedback.


Discussion:

This device was developed by the authors themselves and is a nice combination of string mechanics and computer manipulation of the data. Though the device gives good feedback, movement of the ball is restricted by the strings, as they may interfere with each other. Also, the user has to apply a balanced force to interact with the system: the frame is not fixed, so too much force may knock it over, while too little may not give the desired result.

However, the cost involved in such a system is a limiting factor, and no new work has been reported, which limits my knowledge about the current state of the system. Also, since there is one SPIDAR per object, the interaction is very limited. Maybe, with some kind of switch, a single SPIDAR could be used to interact with other objects at the press of a button. It would also be interesting if similar objects could be grouped together so that a single SPIDAR could manipulate them as a group in the virtual world.

Since I have personally used the system, I believe it is one of the standout applications, and very useful in terms of the interaction response the device provides.

Gesture Recognition with a Wii Controller

Summary:

  1. Get a Wii controller and obtain the acceleration values.
  2. Filter the values and remove some of the redundant ones.
  3. Use vector quantization (K-means to form clusters).
  4. Feed the quantized symbols to HMMs with a Bayes classifier.
  5. Get the results (90% recognition); a rough sketch of this pipeline follows below.
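
As a concrete illustration of steps 1-4, here is a minimal sketch in Python of the filtering and K-means quantization stages, with the HMM stage only indicated in comments; the function names, thresholds, and cluster counts here are illustrative assumptions, not the paper's values:

    import numpy as np

    def filter_samples(samples, min_delta=0.1):
        """Drop acceleration vectors too similar to the previous kept one
        (a crude version of the paper's redundancy filter)."""
        samples = np.asarray(samples, dtype=float)
        kept = [samples[0]]
        for s in samples[1:]:
            if np.linalg.norm(s - kept[-1]) > min_delta:
                kept.append(s)
        return np.array(kept)

    def kmeans_quantize(samples, k=8, iters=50, seed=0):
        """Vector-quantize 3-D acceleration vectors into k discrete symbols."""
        samples = np.asarray(samples, dtype=float)
        rng = np.random.default_rng(seed)
        centers = samples[rng.choice(len(samples), size=k, replace=False)].copy()
        for _ in range(iters):
            # Assign each sample to its nearest centroid...
            d = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # ...then move each centroid to the mean of its cluster.
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = samples[labels == j].mean(axis=0)
        return labels  # discrete observation sequence, one symbol per sample

    # Each gesture class would get its own discrete HMM trained on such symbol
    # sequences; at test time, the class whose HMM assigns the highest
    # likelihood to the sequence wins (the Bayes-classifier step).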

Discussion:

Nothing much to discuss, as I just see it as an application paper.

Taiwan Sign Language (TSL) Recognition Based on 3D Data and Neural Networks

Summary:

This paper presents a neural-network-based approach to recognizing 20 static Taiwanese Sign Language gestures. The authors propose a back-propagation neural network for recognizing the 20 static gestures, which are captured using a vision-based motion capture system called VICON. Using markers on the dorsal surface of the hand, they capture the features for a given gesture. The gesture features are actually distance measures of the marker positions relative to some reference. The distances are normalized to account for variation in hand size and then used as the feature inputs to the neural network. The network is trained on similar data obtained from the users. The authors report using data from 10 students, each repeating each of the 20 gestures 15 times, providing 3000 data samples in all. Of these, 212 were reported to have missing values and were not used. Of the remaining 2788, 1350 samples were used for training and 1438 for testing.

Their NN architecture consists of 15 input neurons and 20 output neurons, along with 2 hidden layers. With 250×250 neurons in the hidden layers, they report an accuracy of 94.5% on the test data, while on the training data it was 98.5% (not important). The 15 input neurons correspond to the 15 features used, and the 20 output neurons provide the output probabilities for each gesture.
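
For concreteness, a minimal sketch of the reported 15-250-250-20 topology, written with PyTorch; the choice of sigmoid activations and softmax outputs is my assumption, as the paper's exact training setup is not reproduced here:

    import torch.nn as nn

    tsl_net = nn.Sequential(
        nn.Linear(15, 250),    # 15 normalized marker-distance features
        nn.Sigmoid(),
        nn.Linear(250, 250),   # second hidden layer
        nn.Sigmoid(),
        nn.Linear(250, 20),    # one output per static TSL gesture
        nn.Softmax(dim=-1),    # per-gesture output "probabilities"
    )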


Discussion:


This paper was simple and straightforward. The way of obtaining the features was simple, considering that only the distances between the markers were chosen and the gestures chosen had no occlusion. Considering the static gestures, I think this is a problem in 2D rather than 3D, as the third dimension is, or can be made, constant for static gestures. The training and testing data were obtained with great care, which may hurt recognition if input data is taken from users outside the study without much instruction. There is not much of a take-home message from this paper except their way of obtaining the distance metrics that can be used as features.

Hand Gesture Modelling and Recognition Involving Changing Shapes and Trajectories, Using a Predictive EigenTracker

Summary:

This paper presents a vision-based approach to gesture recognition. The technique boasts no training phase, unlike HMMs, and hence faster adaptability to new gestures. The only requirement (which is kind of not nice) is that the gestures should be well distinguishable. The algorithm obtains affine transforms of the image frames and projects the images into an eigenspace. Since only the hand is moving while the background is stationary, the first few PCA components capture the maximum variance, i.e., in effect the motion of the hand in each frame.

The method is inspired by a similar method, the EigenTracker, but differs because of an added predictive modality. The predictive nature of the proposed method makes it versatile enough to track hand motion on the fly, without requiring an offline description of the orientation and physical dimensions of the object as the earlier EigenTracker did. This predictive nature comes from using skin color to segment out the hand and then using a particle filter to track it. Information about the hand's position is obtained from the significant eigenvectors in the eigenspace, as only the hand is in motion while the rest of the background is stationary. Changes in motion direction are detected when the error between the prediction and the actual track exceeds a certain threshold. Tracking the hand, together with the information about changes in the motion track (captured by the error between predicted and actual position), gives the structure of the gesture (assuming linear motion between change points), which can then be matched against the available gesture set (decided offline).
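
A minimal sketch of the prediction-error cue described above, assuming a constant-velocity (linear) predictor over tracked hand centroids; the function name and threshold are illustrative, not taken from the paper:

    import numpy as np

    def segment_breaks(centroids, threshold=20.0):
        """Flag frames where the tracked hand deviates from a constant-velocity
        prediction, i.e. where the gesture's motion direction changes."""
        centroids = np.asarray(centroids, dtype=float)
        breaks = []
        for t in range(2, len(centroids)):
            predicted = 2 * centroids[t - 1] - centroids[t - 2]  # linear extrapolation
            if np.linalg.norm(centroids[t] - predicted) > threshold:
                breaks.append(t)
        return breaks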


Discussion:


The nice parts of the paper are the use of a particle filter for obtaining information about the segments of the gesture, by measuring the error between the prediction and the actual motion, and the use of an affine eigenspace for capturing the hand motion. The method proposed is quite different from most of the papers we have read, though the results are not as impressive. As suggested by many in class, 100% doesn't make sense when the test data is 80% similar to the training data (64/80). Also, they used the PCA space to obtain the maximum variance, which is not robust to noise. I agree that with their simple steady background and black arms it would work, though in many real-life situations this may not be feasible. I would have been happier with some complex gestures at slightly lower accuracy rather than 100% accuracy (which I am not impressed with) on very distinct, simple gestures.

However, I like vision-based techniques, as they provide much more freedom and space for gestures than gloves do, and as such I would add this paper to my favorites of the semester.

With advancements in digital imaging techniques and capturing devices, it is possible to substitute the background with some other stationary background. So if the captured relative motion of the hand is faster than the change in the background, we should be able to capture the hand motion and blend it with an artificial background. With such an approach, we could tackle the noise problems associated with changing backgrounds in PCA-based techniques.

Friday, March 21, 2008

Wiizards: 3D Gesture Recognition for Game Play Input

This paper presents modern Wii controllers as wands for casting spells in a game. The data from the accelerometers is used to obtain x-y-z coordinates, which frame the gesture. There are certain gestures that can be used to cast a spell on an opponent. The gestures are classified as actions, modifiers, and blockers, and HMMs are used for recognition (another HMM paper, though without data gloves).

Training involved 7 different players who were asked to perform a given gesture over 40 times. The HMM achieved a maximum accuracy of 93% with 15 states and 90% with 10 states, using test data from the same users as in training. With a new user there is a drastic drop to 50%.

Discussion:

Not at all impressive, though something different. I am tired of explaining HMMs, and most of the results they presented were quite obvious, as all training-based algorithms improve as more data becomes available. Nothing much to say.

TIKL: Development of a Wearable Vibrotactile Feedback Suit for Improved Human Motor Learning

This paper makes the case that an external real-time vibrotactile feedback suit can provide better training when teaching subjects to mimic the skills of an expert. This is because subjects may not be able to mimic the very minute details of the teacher's motions and joint angles, as these may either not be observable or may not convey the relative positions of the various joints. Their system provides real-time responses at the joints where subjects are not meeting the criterion; as such, subjects learn all the positions, relative to other positions, where they went wrong, and can correct them appropriately.

This kind of feedback, supplementing conventional auditory and visual feedback, would definitely help learners learn better by flagging errors in their body kinematics with vibratory feedback at the relevant points. The system uses a Vicon motion capture system to track motion, and the suit contains the vibrotactile actuators that provide the feedback and must be worn during training.

Possible applications suggested include sports training, dance, and other similar activities. They also conducted a user study with around 40 participants, of whom 20 were provided with the suits while the other 20 were trained without them. It was observed that the participants with the suit performed better because of the vibratory response: users with the suit showed a 27% improvement in accuracy and a 23% faster learning rate over their non-suited counterparts under otherwise similar conditions.


Discussion:

It was a nice paper, well written with good explanations. I liked the approach; however, it would have been nice if feedback for the legs and upper body could also be provided, as wrong placement of those parts of the body can cause injuries. Such a system also provides an excellent way to run a remote teaching school when an instructor cannot be physically available, so less-skilled students can be trained online. I did some similar work in my undergrad, when I used an ANN to teach less-skilled drivers real-time steering control for driving.

Spatio-Temporal Extension to Isomap Nonlinear Dimension Reduction

This paper presents spatio-temporal Isomap, basically an extension of the conventional Isomap proposed by Tenenbaum in 2000. The presented method captures temporal relationships within neighborhoods that can be propagated globally via a shortest-path mechanism. ST-Isomap aims to deal with proximal disambiguation, which means distinguishing spatially close data in the input space that is structurally different, and with finding distal correspondence, which means finding common structure across the space. By finding these measures, one can recover the spatio-temporal structure of the data. As per the paper, proximal disambiguation and distal correspondence are pairwise concepts, and as such the existing pairwise dimension-reduction machinery needs to be augmented to include spatio-temporal relationships.

The approach involves:

1. Windowing of the input data into temporal blocks, which basically serve as a history of each data point.

2. Computation of a sparse distance matrix D over the local neighborhoods using Euclidean distance.

3. Using the distance matrix from the previous step to obtain the common temporal neighbors (CTN), which are either local temporal CTNs or K-nearest non-trivial neighbors.

4. Using this measure to reduce the distance between points with common and adjacent temporal relationships.

5. Using the adjusted metric to obtain shortest-path distances with Dijkstra's algorithm.

6. Applying classical MDS to preserve the spacing.

Steps 1, 3, and 4 are the contribution of the paper; they introduce the temporal information into Isomap.
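
The following is a compact, simplified sketch of this recipe for a single trajectory; the neighbor handling in steps 3-4 is a strong simplification of the paper's CTN machinery, and the window size, neighbor count, and shrink factor are illustrative choices of mine:

    import numpy as np
    from scipy.sparse.csgraph import dijkstra
    from scipy.spatial.distance import cdist

    def st_isomap(X, window=3, k=5, ctn_factor=0.1, out_dim=2):
        X = np.asarray(X, dtype=float)
        T = len(X) - window + 1
        # Step 1: windowed points carry a short temporal history.
        W = np.array([X[t:t + window].ravel() for t in range(T)])
        D = cdist(W, W)                       # step 2: pairwise distances
        G = np.full((T, T), np.inf)           # sparse neighborhood graph
        for i in range(T):
            nn = np.argsort(D[i])[1:k + 1]    # k nearest spatial neighbors
            G[i, nn] = D[i, nn]
        # Steps 3-4 (simplified): connect temporally adjacent windows and
        # shrink those distances so shared temporal structure pulls points
        # together.
        for i in range(T - 1):
            G[i, i + 1] = D[i, i + 1] * ctn_factor
        # Step 5: geodesic (shortest-path) distances via Dijkstra.
        geo = dijkstra(G, directed=False)
        # Step 6: classical MDS on the geodesic distances.
        H = np.eye(T) - np.ones((T, T)) / T
        B = -0.5 * H @ (geo ** 2) @ H
        vals, vecs = np.linalg.eigh(B)
        order = np.argsort(vals)[::-1][:out_dim]
        return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))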


The paper describes how to apply ST-Isomap to continuous data, where the K-nearest non-trivial neighbors metric is used to find the best-matching neighbor from each individual trajectory while removing any redundancy in neighbor selection, whereas the local segmented common temporal neighborhood is used for the distance metric on non-continuous data. The segmented common temporal neighborhood approach is based on the logic that a pair of points is spatio-temporally similar if they are spatially similar and the points they transition to are also spatially similar.

They applied their method to a tele-operated NASA robot grasping wrenches placed at various locations, using kinematic motion data obtained from human subjects. They also give a comparison with PCA and standard Isomap, and show some comparison with HMMs.


Discussion:

They have added another important dimension to the data, one that can actually capture motions that repeat in time and may be structurally similar even though temporally different, for example a spiral motion. Adding temporal information makes a lot of sense: many dimensionality-reduction techniques may give false results because they cannot recognize repetition of data as a temporal characteristic and will instead treat it as redundancy, which in fact it is not (I mean, not all repeated data is redundant).

I can now see that for gestures and motion capture we cannot use PCA if our gesture contains repeated motions, and for such cases ST-Isomap is the solution for capturing the embedded motion that characterizes the gesture.

Wednesday, March 19, 2008

Articulated Hand Tracking by PCA-ICA Approach

This paper presents a vision-based approach for capturing hand postures and recognizing them. The authors suggest using PCA to locate the hand in the image frame and then using ICA to obtain the intrinsic features of the hand to identify hand motions. ICA is described by the authors as a way to obtain a linear, non-orthogonal coordinate system for any multivariate data. The goal of ICA is to perform a linear transformation that makes the resulting variables as statistically independent of each other as possible.

They represent hand motions by building a hand model in OpenGL and then using information about the degrees of freedom of the fingers to enumerate the various possible configurations in which fingers touch the palm.

They used data gloves to obtain the joint information for the 31 possible configurations. They then used that data to obtain the model parameters for the various configurations generated by the OpenGL model over a time span, obtaining a roughly 2000-dimensional vector for each posture.


With PCA they reduced the dimensionality of the problem and were able to locate the position of maximum variance in the image frame. Then, using the ICA model, where each basis represents the motion of a particular finger, they obtained the hand pose for a given time frame. They used particle filtering to track the hands in accordance with Bayes' theorem.
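
A hedged sketch of the PCA-then-ICA idea using scikit-learn; the component counts are illustrative guesses and the data is a random placeholder, since the paper's exact preprocessing is not reproduced here:

    import numpy as np
    from sklearn.decomposition import PCA, FastICA

    # X: one row per time frame, columns describing the hand configuration
    # (random placeholder data purely for illustration).
    X = np.random.rand(500, 2000)

    pca = PCA(n_components=30)                     # global dimensionality cut
    X_low = pca.fit_transform(X)

    ica = FastICA(n_components=5, random_state=0)  # ~one basis per finger motion
    S = ica.fit_transform(X_low)                   # statistically independent components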

They employed edge and silhouette cues to match each hand frame with the OpenGL model and estimated the closest match between the hand image and the OpenGL renderings. By superimposing the OpenGL model on the hand image, they were able to recognize the posture.


Discussion:

I liked seeing a different approach in this paper, though I don't agree that the statistically independent nature of the fingers holds for all hand postures. But considering the simplicity of their gestures, it might work. I liked that they used PCA for global hand tracking, but PCA requires the background to be stable, with only the hand moving, for the variance to track the hand. If there is some change in the background (like the user moving a bit), PCA may give erroneous global results. For a limited region, though, it may be the most feasible and simplest approach.

I would like to think more about the feasibility of ICA for intrinsic finger tracking, though at present I believe it is not possible to track fingers with this approach for the kind of complex motions we are aiming at.

The 3D Tractus: A Three-Dimensional Drawing Board

This paper presents a drawing system for sketching shapes in 3D. It consists of a Tablet PC mounted on top of a mechanical structure that can move up or down using a counterweight and a push from the user. The authors believe that by providing mechanical motion in the Z direction, it becomes more intuitive for users to sketch in 3D. The mechanical device is constructed to minimize the human effort needed to push or pull it.

To capture the third dimension, they use a simple potentiometer whose resistance varies as the mechanical structure moves up and down; this reading is digitized with an analog-to-digital converter, calibrated to a Z value, and provided to the PC via a USB connection.
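
Something like the following linear calibration would map raw ADC counts to a height value; the constants here are hypothetical, purely to illustrate the step:

    # Hypothetical calibration constants: raw ADC counts at the lowest and
    # highest table positions, and an assumed physical travel of the surface.
    ADC_MIN, ADC_MAX = 120, 3980
    Z_RANGE_CM = 48.0

    def adc_to_z(adc_value):
        """Linearly map potentiometer ADC counts to table height in cm."""
        frac = (adc_value - ADC_MIN) / (ADC_MAX - ADC_MIN)
        return min(max(frac, 0.0), 1.0) * Z_RANGE_CM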

The user interface consists of a 2D drawing pad and a window that displays a view of the object being drawn. To provide depth cues in 2D, the authors tried different color cues but found them confusing. They also tried varying stroke-thickness cues but found those unintuitive as well. Finally, they settled on simply conveying that all thin strokes shown lie below, while the user draws the top strokes. Similarly, the authors found perspective views more intuitive and helpful than orthographic projections, so the display window shows a perspective view. The system also provides a deletion facility so users can edit their sketches.

They conducted a user study in which art students were asked to get familiar with the system and use it to draw certain sketches. The authors observed that the users liked the system, though every user agreed that it was easier to push the table down than to pull it up. Users also reported that it would have been better if they could tilt the surface in the direction they were sketching. They also complained about alignment issues, finding it difficult to match the 3D symmetry of the object being drawn (like the top and bottom of a box).



Discussion:

I think it was a cool idea, but very uncomfortable, as the user has to push and pull the table, which seems quite unintuitive to me. Also, I would not be surprised if a user unintentionally pushed the table while sketching, as some people tend to sketch with a heavy hand.

A Hidden Markov Model Based Sensor Fusion Approach for Recognizing Continuous Human Grasping Sequences

This paper presents a method aimed at teaching robots human grasping sequences by observing human grasping. Though the long-term aim is to use vision alone for this purpose, the authors use the present approach as a faster alternative. They argue that grasping postures, together with tactile sensor feedback, can be used to capture the grasping sequence. As such, they use an 18-sensor CyberGlove along with tactile sensors sewn under the glove at positions on the hand identified as the spots with the maximum chance of detecting contact using a small number of sensors.

For classification they use the Kamakura grasp taxonomy, which separates grasps into 14 different classes according to their purpose, shape, and contact points. With this taxonomy it is easier to identify the grasps humans use in everyday life. As per the Kamakura taxonomy, the grasps fall into 4 major categories:

1. 5 power grasps
2. 4 intermediate grasps
3. 4 precision grasps
4. 1 thumbless grasp

Each grasp is modeled as an HMM and used as a classifier. The data for the HMMs comes from the CyberGlove, fused with the readings from the tactile sensors for the particular grasp. This is done because using only the glove data may not suffice: the shape of the hand may not change significantly between two grasps. They obtain 16 feature values from the glove and 16 tactile sensor values, plus the single maximum tactile value, to frame the feature vector for a grasp. Their system takes in both inputs simultaneously and learns to weigh their importance by adjusting parameters during training.
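
The fused feature vector can be pictured as a simple concatenation; the function and argument names below are mine, assuming 16 glove channels and 16 tactile channels as described above:

    import numpy as np

    def grasp_features(glove_joints, tactile):
        """Concatenate the 16 glove channels, 16 tactile channels, and the
        peak tactile reading into a single 33-D feature vector."""
        glove_joints, tactile = np.asarray(glove_joints), np.asarray(tactile)
        assert glove_joints.shape == (16,) and tactile.shape == (16,)
        return np.concatenate([glove_joints, tactile, [tactile.max()]])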

Their model uses 9 states per HMM, and the HMMs are trained offline for each grasp. Along with the 14 HMMs for the grasp classes, a junk class was also trained to absorb garbage input. They make the simple assumption that each grasp must be followed by a release; this ensures segmentation of the gesture, and the peak of the grasp signal gives a cue for separating grasp from non-grasp.

For modeling the HMMs they used the Georgia Tech HMM toolkit and collected 112 training samples and 112 testing samples from 4 different users. They report a maximum accuracy for the single-user model (trained and tested on the same user's data) of 92.2%, with a minimum of 76.8%; for the multiple-user system (trained on all users and tested on a given user's data) they report a maximum accuracy of 92.2% and a minimum of 89.9%. They suggest that with more user data, the single-user model may become better than the multi-user model. They also state that most of the recognition errors came from a small set of grasps for which the system relied solely on tactile information to distinguish the grasps, and they believe improved sensor technology may improve the results.


Discussion:

This paper presented a new method that utilizes tactile information for gesture recognition. It made a lot of sense to me, as common gestures for day-to-day tasks carry much tactile information that can be exploited to differentiate between two similar-looking gestures. For example, a tight fist and a hollow fist (sometimes used to show an O) may look similar to the glove, but including the tactile information can distinguish between the two.

It would also be interesting if vision could be used to add more flexibility to the system, as a similar-looking gesture (based on tactile and CyberGlove data) at a different position in space may convey a different meaning (though this could be done using the Flock of Birds too). This paper also gave me another approach to the segmentation problem, based on utilizing the tactile information.

Monday, March 3, 2008

Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series

This paper is basically part of a much more detailed thesis dealing with Australian Sign Language. The authors use two gloves to capture the data and analyze the recognition rate. One of the gloves is a Nintendo glove, a low-cost glove with cheap sensors; the other is a more sophisticated, superior device called the Flock. The data obtained is fed into their classifier, called TClass, which looks like a decision-tree-type classifier, and results were obtained using different TClass parameter settings.

Since the data obtained from the Nintendo glove is noisy, they used smoothing to tune their results, which helped achieve better accuracy. In the end they used a voting methodology over the best learners, similar to AdaBoost, to improve accuracy and decrease error.
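
A plausible stand-in for the smoothing step would be a simple moving-average filter over each glove channel; the thesis' actual filter may well differ:

    import numpy as np

    def smooth(channel, width=5):
        """Moving-average filter for one noisy glove channel."""
        kernel = np.ones(width) / width
        return np.convolve(channel, kernel, mode="same")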

With the Flock they did the same thing; however, they found that smoothing actually hurt the recognition results, as the sensor data was already quite refined, with almost no noise.

Discussion:

They have shown that their classifier, TClass, was able to provide a low error rate by using an ensemble. They tested their data on the Nintendo glove and the Flock, and smoothing helped with the Nintendo but not with the Flock. I believe this is because the data from the Nintendo glove is so noisy that the distinguishing features are suppressed by the noise, while in the case of the Flock data, smoothing the already refined data drops accuracy because the distinguishing features themselves get smoothed away. It would have been nice to read what TClass actually is and what it does; I believe the only good thing in the paper was the TClass classifier itself, some kind of decision-tree-based classifier that needs to be investigated.

Using Ultrasonic Hand Tracking to Augment Motion Analysis Based Recognition of Manipulative Gestures

This paper introduces ultrasonic sensors, combined with accelerometers and gyroscopes, for capturing gestures to recognize activity. The claimed contributions are the use of ultrasonics for motion analysis and the combination of information from the different sensors to refine the results. The information obtained from the sensors is processed by a classifier so that the motion can be recognized; the classifiers used are HMMs, C4.5, and KNN. To capture the sensor data, three ultrasonic beacons are mounted on the ceiling and listeners are placed on the users' arms. It is reported that the ultrasonic modality suffers from reflection, occlusion, and poor temporal resolution (a low sampling rate); hence the information provided by the ultrasonic sensors alone is not reliable, with many false responses. Apart from this, there is noise in the sensor input over time frames that cannot be smoothed using Kalman filters, as the sampling rate is much lower than the frequency of hand movement.

In their experiment they took the example of bicycle repair and chose gestures involving pumping, screwing and unscrewing screws, different pedal turnings, assembly of parts, wheel spinning, and removing/placing a carrier object.

They have tried approaches that can be classified into two categories:

  1. Model-based approaches
  2. Frame-based approaches

In the model-based approach they apply HMMs to the sensory information obtained from 2 gyroscopes and 3 accelerometers on the user's right hand, plus the same set of sensors on the upper right arm.

In the frame-based approach, feature vectors are computed for each time frame and used for either training or testing the classifier. The features extracted are the mean, standard deviation, and median of the raw sensor data. This approach captures local characteristics that can be exploited for training or testing the classifier. In their setup they use no overlap between adjacent windows, using part of the feature vectors from a frame for training and part for testing. For comparison, they use K-means and the C4.5 classifier to evaluate the frame-based approach.
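
Since the features are stated exactly (mean, standard deviation, median over non-overlapping windows), a direct sketch is easy; the frame length below is an illustrative assumption:

    import numpy as np

    def frame_features(data, frame_len=64):
        """Mean, standard deviation and median per channel for each
        non-overlapping frame of the raw sensor stream."""
        data = np.asarray(data, dtype=float)   # shape (n_samples, n_channels)
        feats = []
        for start in range(0, len(data) - frame_len + 1, frame_len):
            frame = data[start:start + frame_len]
            feats.append(np.concatenate([frame.mean(axis=0),
                                         frame.std(axis=0),
                                         np.median(frame, axis=0)]))
        return np.array(feats)  # one feature vector per frame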

They also propose a plausibility analysis for classification, which effectively means restricting the search space for the vectors and using information from both the frame-based and model-based approaches. For example, the gesture recognition result obtained by the HMMs is compared against the imposed constraints; if it satisfies them, the gesture is selected, else the next-best gesture satisfying both constraints is selected.
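
A toy sketch of this two-layer idea: rank the gestures by HMM likelihood, then keep the first candidate consistent with a location constraint; the names and the constraint test are illustrative, not the paper's:

    def plausible_gesture(hmm_ranking, hand_location, allowed_regions):
        """Walk the HMM candidates from most to least likely and keep the
        first one consistent with the location constraint."""
        for gesture in hmm_ranking:
            if hand_location in allowed_regions.get(gesture, ()):
                return gesture
        return hmm_ranking[0]  # fall back to the best motion-only guess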

In the results they report that the ultrasonic data with the C4.5 classifier produced about 58.7% classification accuracy, and with K-means about 60.3%. They argue that since most of their gestures are not distinguishable from hand locations alone, the results were affected. They then used a kind of ensemble to merge the inputs from the accelerometers and gyroscopes and obtained high classification rates (in the 90% range). They note that for certain gestures they achieved nearly 100%, while for some gestures that are ambiguous and can be confused with others, there was a drop in classification.

Discussion:

They have merged the ultrasonic data with the accelerometer data to combine the local movement of a body part with its global position. This method will naturally yield better results, since we get two layers of classification (global, then local) and vote for the gesture that meets the requirements of both layers, so the improvement is not surprising. The method is intuitive but needs a specialized room, as ultrasonic waves are reflected by metallic objects and occlusion affects the response. They have not addressed the occlusion issues relevant to our requirements: we are dealing with finger and hand movements, which cannot be shielded from occlusion using top-mounted beacons. Maybe we could use arrays of beacons on the ground, top, and sides to capture the responses, but that would restrict the user's domain of application.

Also, I feel that instead of wired accelerometers and gyroscopes, it would be nice to use the wireless sensors described in last week's paper on "ASL for game development". They would be really lightweight and more practical to use.