Research (R)
Emily Keller
University of Louisville
Louisville, Kentucky
Disclosure(s): No financial or nonfinancial relationships to disclose.
Yonghee Oh, PhD
Assistant Professor
Department of Otolaryngology, HNS and Communicative Disorders, School of Medicine, University of Louisville
University of Louisville
Louisville, Kentucky
Disclosure(s): No financial or nonfinancial relationships to disclose.
Rationale/Purpose
Listening to speech in loud background noise is not a simple task. Competing sounds in the environment can overwhelm the target speech, a phenomenon known as masking. Listeners experience this masking effect in everyday situations such as conversations in restaurants, classrooms, parties, and many other noisy places. In recent years, several studies have investigated the effectiveness of integrating other sensory cues, specifically visual and tactile cues, to enhance listeners' speech perception in complex listening environments. Many of these studies have shown that multisensory integration can enhance speech perception in audio-visual (AV) and audio-tactile (AT) conditions, using non-articulatory information (i.e., abstract visual images) as the visual cue and a sensory substitution device (i.e., a device that transforms low-frequency speech signals into tactile vibrations) as the tactile cue (e.g., Oh et al., 2022; 2023).
However, little research has demonstrated listeners' benefit from real-time multisensory speech processing. Understanding the differences in processing time across sensory modalities provides important information for the development of real-time processors that utilize multisensory information. The purpose of this study was to explore the temporal characteristics of real-time multisensory speech processing.
Methods
Twenty young adults participated in simultaneity judgment (SJ) experiments using the method of constant stimuli: two cross-modal stimulus pairs (AV: auditory-visual; AT: auditory-tactile) were presented at various stimulus onset asynchronies (SOAs; -1000 to +1000 msec). The auditory stimulus was a speech-shaped noise presented at a fixed, comfortable level. The visual stimulus was a sphere whose radius varied between 0 and 20 cm, presented at a fixed intensity of 79 cd/m². The tactile stimulus was a vibration whose force varied between 0 and 1 G, with the speed fixed at 12,000 rpm. All stimuli were presented at various durations (100, 200, 400, and 800 msec). Two multisensory temporal coherence measures were estimated from the SJ functions: the temporal binding window (TBW), defined as the SOA range between the 50% points of the SJ function, and the point of subjective simultaneity (PSS), defined as the SOA at the peak of the SJ function.
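To illustrate how TBW and PSS might be derived from SJ data, the sketch below fits a Gaussian to the proportion of "simultaneous" responses across SOAs and reads off the PSS as the peak location and the TBW as the width between the 50%-of-peak points. This is a minimal illustration only; the study does not state its fitting procedure, and the function names and example data here are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

def sj_gaussian(soa, amplitude, pss, sigma):
    """Gaussian model of the simultaneity judgment (SJ) function."""
    return amplitude * np.exp(-0.5 * ((soa - pss) / sigma) ** 2)

def estimate_tbw_pss(soas_ms, prop_simultaneous):
    """Fit the SJ function and return (TBW, PSS) in msec.

    PSS: SOA at the peak of the fitted SJ function.
    TBW: SOA range between the 50%-of-peak points of the fitted function.
    """
    p0 = [prop_simultaneous.max(), 0.0, 200.0]  # initial guesses: peak, PSS, width
    params, _ = curve_fit(sj_gaussian, soas_ms, prop_simultaneous, p0=p0)
    amplitude, pss, sigma = params
    # For a Gaussian, the 50%-of-peak points lie at pss +/- sigma*sqrt(2*ln 2),
    # so the TBW equals the full width at half maximum.
    tbw = 2.0 * abs(sigma) * np.sqrt(2.0 * np.log(2.0))
    return tbw, pss

# Hypothetical example: SOAs from -1000 to +1000 msec with noisy responses
soas = np.linspace(-1000, 1000, 21)
props = sj_gaussian(soas, 0.9, 50.0, 150.0) + np.random.normal(0, 0.02, soas.size)
tbw, pss = estimate_tbw_pss(soas, np.clip(props, 0, 1))
print(f"TBW ~ {tbw:.0f} msec, PSS ~ {pss:.0f} msec")
```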
Results/Conclusions
Results showed that the average TBWs in the auditory-visual and auditory-tactile domains broadened with increasing stimulus duration, reaching maxima of 350 and 250 msec, respectively. In contrast, the PSSs remained constant across stimulus durations. These findings suggest that listeners can benefit from real-time multisensory cues only when the cues are presented within the optimal temporal window; for example, auditory-visual cues should be presented within a 350-msec window to benefit speech perception. Further testing should include listeners with hearing loss to determine how multisensory temporal integration in this population differs from that of normal-hearing listeners. The findings may inform future rehabilitation approaches, such as auditory training programs to enhance speech perception in noise, and have implications for technological enhancements to speech perception with real-time multisensory hearing-assistive devices.