Monday, April 14, 2008

Glove-TalkII - A Neural-Network Interface which Maps Gestures to Parallel Formant Speech Synthesizer Controls

This paper presents a gesture-based speech synthesizer called Glove-TalkII that can be used for communication. The idea behind the work is that by decomposing speech production into subtasks (tongue motion, letter formation, sound used), each subtask can be mapped to a suitable sensor device. The device consists of a CyberGlove (18 sensors) for gestures, a Polhemus tracker for emulating up-and-down tongue motion for consonants, and a foot pedal for sound variation. The input from the three sensors is fed to three different neural networks, each trained on data for its particular subtask.
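A minimal sketch of that data flow might look like the following, with trivial stand-ins for the three trained networks and a pedal acting as a loudness scale. All function names and the thresholding logic here are my own illustration, not from the paper:

```python
def vc_decision(glove):
    # Stand-in for the vowel/consonant network: 1.0 = pure vowel, 0.0 = pure consonant.
    return sum(glove) / len(glove)

def vowel_controls(glove):
    # Stand-in for the vowel network: 8 synthesizer control parameters.
    return [0.5] * 8

def consonant_controls(glove):
    # Stand-in for the consonant network: 9 synthesizer control parameters.
    return [0.5] * 9

def synth_step(glove, pedal):
    """One control-cycle: pick vowel or consonant controls from the glove
    reading, then let the foot pedal scale the synthesizer parameters."""
    v = vc_decision(glove)
    controls = vowel_controls(glove) if v > 0.5 else consonant_controls(glove)
    return [pedal * c for c in controls]
```

In the real system the decision is soft (the V/C network interpolates rather than switching), which is what makes smooth vowel-consonant transitions possible.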

Three networks are used: a vowel/consonant (V/C) network, a consonant network, and a vowel network. The V/C network is responsible for deciding whether to emit a vowel or a consonant sound based on the configuration of the user's hand. The authors claim the network can interpolate between hand configurations to produce smooth but rapid transitions between vowels and consonants. It is a 10-5-1 feedforward network with sigmoid activation functions, trained on 2600 examples of consonants and 700 examples of vowels. Consonant data was collected from the expert user, while vowel data was collected from the same user by having him move his hand up and down. The test set consists of 1614 examples, and the MSE was reduced to 10^-4 after training.
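The forward pass of such a 10-5-1 sigmoid network is straightforward. This is a sketch with random, untrained weights (the paper learns them from the ~3300 labelled examples above); the variable names are mine:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 10))   # 10 glove features -> 5 hidden units
b1 = np.zeros(5)
W2 = rng.normal(size=(1, 5))    # 5 hidden units -> 1 V/C output
b2 = np.zeros(1)

def vc_network(glove_features):
    """10-5-1 feedforward net with sigmoid units; an output near 1 can be
    read as 'vowel', near 0 as 'consonant', and values in between blend."""
    h = sigmoid(W1 @ glove_features + b1)
    return sigmoid(W2 @ h + b2)[0]
```

Because the output is a continuous sigmoid value rather than a hard label, intermediate hand shapes yield intermediate outputs, which is the interpolation property the authors rely on.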

The vowel network is a 2-11-8 feedforward network whose hidden units are RBFs, each centered to respond to one of the cardinal vowels. The outputs are 8 sigmoid units representing 8 synthesizer control parameters. The network was trained on 1100 examples with some noise added, and tested on 550 examples, giving an MSE of 0.0016 after training.
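A sketch of that 2-11-8 architecture, assuming Gaussian RBF hidden units over the 2-D hand position. The centre positions, RBF width, and output weights here are placeholders; in the paper the centres correspond to the cardinal vowels:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
centres = rng.uniform(0.0, 1.0, size=(11, 2))  # one centre per cardinal vowel (illustrative)
sigma = 0.2                                     # RBF width (assumed)
W = rng.normal(size=(8, 11))                    # RBF activations -> 8 control outputs

def vowel_network(xy):
    """2-11-8 net: Gaussian RBF hidden layer over the 2-D hand position,
    followed by 8 sigmoid units for the synthesizer control parameters."""
    d2 = ((centres - xy) ** 2).sum(axis=1)
    h = np.exp(-d2 / (2.0 * sigma ** 2))
    return sigmoid(W @ h)
```

RBF units are a natural fit here: hand positions near a cardinal-vowel centre produce that vowel's controls, and positions between centres blend smoothly.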

The consonant network is a 10-14-9 network whose 14 hidden units are normalized RBF units. Each RBF unit is centered at a hand configuration determined from the training data. The 9 output units produce the nine control parameters of the formant synthesizer. The network is trained on scaled data comprising 350 approximants, 1510 fricatives and 700 nasals. The test data consists of 255 approximants, 960 fricatives and 165 nasals, giving an MSE of 0.005.
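The distinguishing detail is the *normalization* of the RBF activations, which makes them sum to 1 so the output interpolates between the consonants stored at the centres. A sketch with placeholder centres and weights (the paper fixes the centres from the expert's training data):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
centres = rng.normal(size=(14, 10))  # one centre per trained hand configuration
W = rng.normal(size=(9, 14))         # -> 9 formant-synthesizer control parameters

def consonant_network(glove):
    """10-14-9 net with normalized RBF hidden units: activations form a
    convex combination, so outputs blend between the stored consonants."""
    d2 = ((centres - glove) ** 2).sum(axis=1)
    a = np.exp(-d2)
    h = a / a.sum()                  # the normalization step
    return sigmoid(W @ h)
```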

They tested the system on an expert pianist; after 100 hours of training, his speech with Glove-TalkII was found to be intelligible, though still with errors in polysyllabic words.




Discussion:

I just wonder how difficult or easy it is to control your speech with a foot pedal, a data glove and a Polhemus tracker, all of which need to move in synchronization for correct pronunciation. Also, even after 100 hours of training the results were merely intelligible, so I doubt this system is comfortable at all. But for a paper that is 10 years old, it is a decent first step and a different approach.
