Author: Chen, Xinyi
Title: An investigation of human listeners’ processing of disfluent robot speech
Advisors: Yao, Yao (CBS)
Lee, Yat Mei Sophia (CBS)
Degree: Ph.D.
Year: 2024
Subject: Speech processing systems
Human-robot interaction
Human-computer interaction
Oral communication
Hong Kong Polytechnic University -- Dissertations
Department: Department of Chinese and Bilingual Studies
Pages: 244 pages : color illustrations
Language: English
Abstract: One major area of interest in speech communication research is how people adjust their speaking and listening patterns based on their conversation partner. This adaptation manifests in changes in speech production, such as increased loudness and clarity towards non-native speakers and more animated, affectionate expressions towards infants and pets. Moreover, listener perception adjusts based on presuppositions about the speaker’s demographics, affecting phoneme discrimination. This thesis investigates human-robot interaction, specifically how disfluencies produced by robots are perceived and processed by humans, in contrast to disfluencies in human-human interactions.
In an era where interactions with machine-generated speech are becoming commonplace, understanding how humans perceive such speech compared with natural speech is crucial, yet underexplored. As speech technology advances, machine speech increasingly mimics the naturalness of human conversation, raising questions about our perceptual distinctions between the two. This dissertation probes whether the inclusion of disfluencies (like “um” and “uh”), common in human speech but rare in machine speech, influences our perception of machine speech’s naturalness. Utilizing the Furhat talking robot system to simulate machine speech, the study specifically examines our reactions to these speech patterns, questioning whether they make machine speech seem more human-like or whether we still perceive a clear divide between machine and natural speech.
Specifically, the focus is on the perception of machine speech containing disfluencies (filled pauses such as “um” and “uh”), which are prevalent in natural speech but not yet common in machine speech. A talking robot system, Furhat, is used to generate or embody machine speech.
This dissertation reports on two studies. Study 1 explored whether filled pauses in machine or natural speech could improve listener information retention. It employed a memory test in which participants listened to short stories and were later assessed on their recall of plot details. Participants were divided into two groups: a “baseline” group that received auditory stimuli via computer in a self-paced setting, and a “robot-interaction” group that interacted with Furhat, a robot that provided instructions and narrated stories. The key focus was on the effect of disfluencies (filled pauses) in the stories. The study incorporated two memory assessment methods (multiple-choice questions and story retelling), two types of voices (pre-recorded human and text-to-speech synthesized), and, in Experiments 1c and 1d, an additional type of disfluency (silent pauses). Overall, Study 1 found no significant effect of disfluency presence on memory retention in either the baseline or the robot-interaction group.
Study 2 investigated the pragmatic interpretations of filled pauses in conversational contexts, whether in machine or natural speech. The methodology involved presenting participants with dialogues in which the final statement might suggest the speaker’s attempt to avoid conveying an unwelcome fact or opinion, with this statement potentially preceded by a filled pause. After listening to or watching these dialogues, participants selected the statements that matched their interpretations. The hypothesis was that a filled pause before the final turn increases the likelihood of that turn being perceived as an attempt to dodge an unwelcome message. This was examined through two experimental setups: Experiment 2a, an audio-only condition serving as the baseline, and Experiment 2b, an audiovisual condition featuring Furhat the robot delivering the final conversational turn. Results in both conditions confirmed the hypothesis, showing a consistent interpretation of disfluency as indicative of avoidance, regardless of the medium.
The findings from the two studies presented in this dissertation demonstrate that disfluencies in machine speech impact perception similarly to disfluencies in natural speech. These results have significant implications for human-robot interaction models, such as the CASA (Computers as Social Actors) paradigm, highlighting how elements like humanlikeness and voice naturalness influence interaction patterns.
Rights: All rights reserved
Access: open access

Files in This Item:
File: 7506.pdf
Description: For All Users
Size: 7.6 MB
Format: Adobe PDF


Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/13054