Expressive Speech Synthesis for Speech-Generating Devices

Period of Performance: 04/01/2015 - 09/30/2015


DESCRIPTION (provided by applicant): The ability to express emotion through speech is an essential element of human communication, yet individuals with speech impairments who rely on speech-generating devices to communicate are often deprived of this ability. Current speech-generating devices offer at most a limited number of expressive speech modes (e.g., a "happy" voice or a "sad" voice), and even when such modes are provided, listeners often do not perceive them as intended. The proposed project aims to remedy this situation by enhancing the rule-based Synfonica text-to-speech system so that it can accurately convey a broad range of emotional states. The approach to be tested centers on a set of expressive speech meta-parameters that collectively describe the "higher-level" characteristics of the speech output needed to convey the intended emotion for a given utterance. Examples of such meta-parameters include speaking rate, pitch range, and degree of breathiness. The text-to-speech rules will implement the complex mapping from meta-parameter values to the relevant "lower-level" acoustic parameter values that a synthesizer uses to produce the final speech waveform. Examples of such acoustic parameters include the fundamental frequency of voicing, formant frequencies, and voicing amplitude. The Phase I project aims to test the feasibility of this approach by enhancing the Synfonica system to convey four emotions (elation, sadness, and two flavors of anger, "hot" and "cold") on a restricted set of utterance types. A set of perceptual tests will be conducted to determine the extent to which listeners recognize the intended emotions in the synthesized speech. If successful, the Phase I project will lay the groundwork for giving users of speech-generating devices the means to convey their emotional states effectively through speech. It will also further scientific knowledge of the perceptual cues that listeners use to gauge a speaker's emotional state.
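The two-level design described above (emotion-oriented meta-parameters driving rules that set lower-level acoustic parameters) can be illustrated with a minimal sketch. It assumes a simple frame-based representation in which each segment carries a duration, an F0 target, a voicing amplitude, and an aspiration (breathiness noise) amplitude. The names `MetaParams`, `AcousticFrame`, and `apply_meta_params`, the baseline F0, and the scaling constants are all hypothetical choices for illustration; they are not the Synfonica system's actual rules or interface, and the real mapping is described as considerably more complex.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical meta-parameters; names and ranges are illustrative assumptions.
@dataclass
class MetaParams:
    speaking_rate: float = 1.0  # 1.0 = neutral; >1.0 = faster speech
    pitch_range: float = 1.0    # multiplier on F0 excursions about a baseline
    breathiness: float = 0.0    # 0.0 = modal voice .. 1.0 = very breathy

# Hypothetical per-segment acoustic parameters fed to a synthesizer.
@dataclass
class AcousticFrame:
    duration_ms: float
    f0_hz: float
    voicing_amp_db: float
    aspiration_amp_db: float

def apply_meta_params(frames: List[AcousticFrame], meta: MetaParams,
                      baseline_f0: float = 120.0) -> List[AcousticFrame]:
    """Map meta-parameter values onto acoustic parameters (toy rules)."""
    if meta.speaking_rate <= 0:
        raise ValueError("speaking_rate must be positive")
    out = []
    for fr in frames:
        # Shorten or lengthen each segment inversely to speaking rate.
        dur = fr.duration_ms / meta.speaking_rate
        # Expand or compress the F0 excursion around the baseline.
        f0 = baseline_f0 + (fr.f0_hz - baseline_f0) * meta.pitch_range
        # Trade voicing amplitude for aspiration noise as breathiness rises
        # (the 6 dB and 20 dB constants are arbitrary placeholders).
        av = fr.voicing_amp_db - 6.0 * meta.breathiness
        ah = fr.aspiration_amp_db + 20.0 * meta.breathiness
        out.append(AcousticFrame(dur, f0, av, ah))
    return out

if __name__ == "__main__":
    neutral = [AcousticFrame(100.0, 150.0, 60.0, 0.0),
               AcousticFrame(80.0, 110.0, 58.0, 0.0)]
    # Placeholder setting chosen only to show the direction of change
    # (slower, narrower pitch range, breathier); not project data.
    subdued = MetaParams(speaking_rate=0.8, pitch_range=0.6, breathiness=0.4)
    for frame in apply_meta_params(neutral, subdued):
        print(frame)
```

Keeping the meta-parameters separate from the acoustic parameters means an emotion can be specified with a handful of interpretable values, while the rules decide how those values spread across many synthesizer controls.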