Full metadata record
dc.contributor: Department of Language Science and Technology (en_US)
dc.contributor.advisor: Chen, Si (LST) (en_US)
dc.contributor.advisor: Yao, Yao (LST) (en_US)
dc.creator: Chen, Xi
dc.identifier.uri: https://theses.lib.polyu.edu.hk/handle/200/14168
dc.language: English (en_US)
dc.publisher: Hong Kong Polytechnic University (en_US)
dc.rights: All rights reserved (en_US)
dc.title: Co-production of speech and facial gestures : a study of focus sentences in Mandarin under natural and manipulated conditions (en_US)
dcterms.abstract: Speech prosody and facial gestures are tightly integrated in human communication, yet their coordination under altered feedback remains poorly understood. This dissertation investigates the spatial and temporal coordination between vocal prosody and facial gestures in Mandarin Chinese, a tonal language where pitch is lexically significant. Four controlled experiments were conducted, culminating in a comprehensive analysis in Chapter 6 that examines all six feedback perturbation conditions (NA, NV, GA, DA, GV, DV). These conditions systematically manipulated auditory and visual feedback — including normal versus perturbed auditory feedback and normal versus perturbed visual feedback — to challenge the multimodal speech production system. (en_US)
dcterms.abstract: Methodologically, acoustic measures of prosody (fundamental frequency, intensity, and duration) were recorded alongside high-resolution facial gesture data (head movements, eyebrow raises, and jaw displacements) from native Mandarin speakers. The analysis integrates acoustic and facial prosody using linear mixed-effects modeling and is framed by dynamic systems theory (DST) to assess how the two modalities function as a coupled system. I tested competing hypotheses about multimodal coordination: a Trade-off hypothesis, which predicts that if one modality's feedback is degraded speakers will compensate by enhancing the other modality, versus a Hand-in-hand (synergy) hypothesis, which posits that vocal and facial prosody are augmented in tandem to convey emphasis. (en_US)
dcterms.abstract: Key findings reveal clear patterns of modality-specific compensation and flexible integration. Perturbations in auditory feedback elicited significant adjustments in both voice and facial gesture: for example, when auditory feedback was delayed or masked, speakers produced more pronounced facial expressions (larger head and eyebrow movements) and often increased syllable duration, pitch, and intensity, consistent with cross-modal compensation. Likewise, under visual feedback perturbations (e.g., obscured or delayed visual cues), speakers enhanced acoustic prosodic features such as F0 range and intensity to ensure critical tonal and emphatic information was conveyed. In some conditions, feedback alterations led to prosodic enhancement (exaggerated pitch and loudness or emphatic facial gestures), while other conditions caused prosodic suppression (reduced variability in F0 or gesture magnitude), indicating that feedback can modulate how energetically prosody is expressed. Importantly, the timing alignment between facial gesture apexes and the corresponding F0 peaks was affected by feedback changes: under normal conditions these events were tightly synchronized, whereas certain perturbations introduced measurable asynchrony, reflecting a reorganization of coordination timing in the multimodal system. (en_US)
dcterms.abstract: Overall, the results support both the Trade-off and Hand-in-hand hypotheses in complementary ways. Even when one channel's feedback was disrupted, speakers maintained communicative efficacy by boosting signals in the other channel (supporting a two-way compensatory Trade-off). At the same time, vocal and facial modalities generally worked Hand-in-hand, rising and falling together to mark prosodic focus when conditions allowed, underscoring their synergy in prosodic communication. Viewed through a DST framework, these findings suggest that speech prosody and facial gesture form an integrated dynamical system that can re-coordinate itself compensatorily under perturbation. This research advances theoretical understanding of multimodal speech production, demonstrating how Mandarin speakers dynamically balance and synchronize auditory and visual prosodic features. The dissertation's insights shed light on the resilience and flexibility of prosodic coordination in a tonal language, highlighting the compensatory coupling of voice and facial gestures in conveying meaning under both normal and altered feedback conditions. (en_US)
dcterms.extent: 351 pages : color illustrations (en_US)
dcterms.isPartOf: PolyU Electronic Theses (en_US)
dcterms.issued: 2025 (en_US)
dcterms.educationalLevel: Ph.D. (en_US)
dcterms.educationalLevel: All Doctorate (en_US)
dcterms.LCSH: Mandarin dialects -- Phonology (en_US)
dcterms.LCSH: Prosodic analysis (Linguistics) (en_US)
dcterms.LCSH: Speech (en_US)
dcterms.LCSH: Facial expression (en_US)
dcterms.LCSH: Hong Kong Polytechnic University -- Dissertations (en_US)
dcterms.accessRights: open access (en_US)

Files in This Item:
File: 8623.pdf | Description: For All Users | Size: 41.5 MB | Format: Adobe PDF


Copyright Undertaking

As a bona fide Library user, I declare that:

  1. I will abide by the rules and legal ordinances governing copyright regarding the use of the Database.
  2. I will use the Database for the purpose of my research or private study only and not for circulation or further reproduction or any other purpose.
  3. I agree to indemnify and hold the University harmless from and against any loss, damage, cost, liability or expenses arising from copyright infringement or unauthorized usage.

By downloading any item(s) listed above, you acknowledge that you have read and understood the copyright undertaking as stated above, and agree to be bound by all of its terms.


Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/14168