Author: Chen, Xi
Title: Co-production of speech and facial gestures: a study of focus sentences in Mandarin under natural and manipulated conditions
Advisors: Chen, Si (LST)
Yao, Yao (LST)
Degree: Ph.D.
Year: 2025
Department: Department of Language Science and Technology
Pages: 351 pages: color illustrations
Language: English
Abstract: Speech prosody and facial gestures are tightly integrated in human communication, yet their coordination under altered feedback remains poorly understood. This dissertation investigates the spatial and temporal coordination between vocal prosody and facial gestures in Mandarin Chinese, a tonal language in which pitch is lexically contrastive. Four controlled experiments were conducted, culminating in a comprehensive analysis in Chapter 6 that examines all six feedback perturbation conditions (NA, NV, GA, DA, GV, DV). These conditions systematically manipulated auditory and visual feedback, pairing normal feedback in each modality with perturbed variants, to challenge the multimodal speech production system.
Methodologically, acoustic measures of prosody (fundamental frequency, intensity, and duration) were recorded alongside high-resolution facial gesture data (head movements, eyebrow raises, and jaw displacements) from native Mandarin speakers. The analysis integrates acoustic and facial prosody using linear mixed-effects modeling and is framed by dynamic systems theory (DST) to assess how the two modalities function as a coupled system. I tested competing hypotheses about multimodal coordination: a Trade-off hypothesis, which predicts that if one modality's feedback is degraded speakers will compensate by enhancing the other modality, versus a Hand-in-hand (synergy) hypothesis, which posits that vocal and facial prosody are augmented in tandem to convey emphasis.
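The linear mixed-effects approach described above can be sketched as follows; this is a minimal illustration only, assuming hypothetical variable names (f0_peak, condition, speaker) and synthetic data, not the dissertation's actual dataset or model specification.

```python
# Sketch: a linear mixed-effects model of an acoustic prosody measure,
# with feedback condition as a fixed effect and speaker as a random intercept.
# Data and column names are hypothetical illustrations.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 240
df = pd.DataFrame({
    "condition": rng.choice(["NA", "DA", "DV"], size=n),   # feedback condition labels
    "speaker": rng.choice([f"s{i}" for i in range(12)], size=n),
})
# Simulate an F0 peak (Hz) with a small condition effect and speaker noise.
df["f0_peak"] = 200 + (df["condition"] == "DA") * 8 + rng.normal(0, 5, n)

# Fixed effect of condition; random intercept grouped by speaker.
model = smf.mixedlm("f0_peak ~ C(condition)", df, groups=df["speaker"])
result = model.fit()
print(result.summary())
```

In practice each dependent variable (F0, intensity, duration, gesture magnitude) would be modeled separately, typically with richer random-effects structures (e.g., by-speaker random slopes for condition).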
Key findings reveal clear patterns of modality-specific compensation and flexible integration. Perturbations in auditory feedback elicited significant adjustments in both voice and facial gesture: for example, when auditory feedback was delayed or masked, speakers produced more pronounced facial expressions (larger head and eyebrow movements) and often lengthened syllables and raised pitch and intensity, consistent with cross-modal compensation. Likewise, under visual feedback perturbations (e.g., obscured or delayed visual cues), speakers enhanced acoustic prosodic features such as F0 range and intensity to ensure critical tonal and emphatic information was conveyed. In some conditions, feedback alterations led to prosodic enhancement (exaggerated pitch and loudness or emphatic facial gestures), while other conditions caused prosodic suppression (reduced variability in F0 or gesture magnitude), indicating that feedback can modulate how energetically prosody is expressed. Importantly, the timing alignment between facial gesture apexes and the corresponding F0 peaks was affected by feedback changes: under normal conditions these events were tightly synchronized, whereas certain perturbations introduced measurable asynchrony, reflecting a reorganization of coordination timing in the multimodal system.
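The apex-to-peak asynchrony described above can be illustrated with a simple lag measure between time-aligned contours; the signals and sampling rate below are synthetic assumptions, not the dissertation's recordings.

```python
# Sketch: measuring asynchrony between an eyebrow-raise apex and the F0 peak,
# assuming hypothetical time-aligned contours sampled at 100 Hz.
import numpy as np

fs = 100  # sampling rate in Hz
t = np.arange(0, 1.0, 1 / fs)

# Synthetic contours: F0 peaks at 0.40 s; eyebrow displacement peaks at 0.46 s.
f0 = 200 + 20 * np.exp(-((t - 0.40) ** 2) / (2 * 0.05 ** 2))
brow = 5 * np.exp(-((t - 0.46) ** 2) / (2 * 0.08 ** 2))

# Positive lag means the gesture apex trails the F0 peak.
lag = (np.argmax(brow) - np.argmax(f0)) / fs
print(f"apex-to-peak lag: {lag * 1000:.0f} ms")  # → apex-to-peak lag: 60 ms
```

A near-zero lag would correspond to the tight synchrony reported under normal feedback, while perturbation conditions would shift the distribution of lags away from zero.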
Overall, the results support both the Trade-off and Hand-in-hand hypotheses in complementary ways. Even when one channel's feedback was disrupted, speakers maintained communicative efficacy by boosting signals in the other channel (supporting a two-way compensatory Trade-off). At the same time, vocal and facial modalities generally worked Hand-in-hand, rising and falling together to mark prosodic focus when conditions allowed, underscoring their synergy in prosodic communication. Viewed through a DST framework, these findings suggest that speech prosody and facial gesture form an integrated dynamical system that can reorganize its coordination to compensate for perturbation. This research advances theoretical understanding of multimodal speech production, demonstrating how Mandarin speakers dynamically balance and synchronize auditory and visual prosodic features. The dissertation's insights shed light on the resilience and flexibility of prosodic coordination in a tonal language, highlighting the compensatory coupling of voice and facial gestures in conveying meaning under both normal and altered feedback conditions.
Rights: All rights reserved
Access: open access

Files in This Item:
File: 8623.pdf | Description: For All Users | Size: 41.5 MB | Format: Adobe PDF



Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/14168