Full metadata record
DC Field: Value [Language]
dc.contributor: Department of Electrical and Electronic Engineering [en_US]
dc.contributor.advisor: Mak, M. W. (EEE) [en_US]
dc.creator: Huang, Zilong
dc.identifier.uri: https://theses.lib.polyu.edu.hk/handle/200/13910
dc.language: English [en_US]
dc.publisher: Hong Kong Polytechnic University [en_US]
dc.rights: All rights reserved [en_US]
dc.title: GAM-NodeFormer : graph-attention multi-modal emotion recognition in conversation with node transformer [en_US]
dcterms.abstract: Emotion Recognition in Conversation (ERC) has great prospects in areas such as human-computer interaction and medical counseling. In dialogue videos, a speaker's emotion can be expressed through different modalities, including text, speech, and visual cues. For multimodal ERC, the fusion of these modalities is crucial. Existing multimodal ERC approaches often concatenate multimodal features without considering the differences in the emotion information carried by individual modalities. In particular, little attention has been paid to balancing the contributions of the dominant and auxiliary modalities, leading to suboptimal multimodal fusion. [en_US]
dcterms.abstract: To address these issues, we propose a multimodal network called GAM-NodeFormer for conversational emotion recognition. The network leverages the features at different stages of a transformer encoder and performs feature fusion at multiple stages. Specifically, in the early fusion stage, we introduce a NodeFormer module for multimodal feature fusion. The module uses a Transformer-based fusion mechanism to combine emotion features extracted from the visual, audio, and textual modalities, exploiting the advantages of the dominant modality and enhancing the complementarity between modalities. The fused features are then updated by a graph neural network to model the dialogue context. For the late fusion stage, we design a graph attention module that refines the multimodal features before and after the graph network update, thereby improving the quality of the final fused features. [en_US]
dcterms.abstract: To evaluate the proposed model, we conducted extensive experiments on two public benchmark datasets: MELD and IEMOCAP. The results show that the proposed model achieves new state-of-the-art performance in ERC, demonstrating its effectiveness and superiority. [en_US]
dcterms.extent: vi, 42 pages : color illustrations [en_US]
dcterms.isPartOf: PolyU Electronic Theses [en_US]
dcterms.issued: 2023 [en_US]
dcterms.educationalLevel: M.Sc. [en_US]
dcterms.educationalLevel: All Master [en_US]
dcterms.accessRights: restricted access [en_US]
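
The abstract above describes a three-stage pipeline: Transformer-based early fusion of the textual, acoustic, and visual utterance features (the NodeFormer module), a graph neural network update that models the dialogue context, and a graph-attention late-fusion step over the pre- and post-graph features. The sketch below is a minimal, hypothetical PyTorch illustration of that general pattern only; it is not the thesis code, and the class names, dimensions, and the simplified message-passing step are assumptions made for illustration.

```python
# Hypothetical sketch of the multi-stage fusion pattern described in the abstract.
# Not the thesis implementation; names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class EarlyFusionNodeFormer(nn.Module):
    """Transformer-based early fusion of text, audio, and visual utterance features."""

    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(3 * d_model, d_model)

    def forward(self, text, audio, visual):
        # Stack the three modalities as a length-3 "token" sequence per utterance,
        # let self-attention weigh dominant vs. auxiliary modalities, then flatten.
        x = torch.stack([text, audio, visual], dim=1)   # (N, 3, d)
        x = self.encoder(x)                             # (N, 3, d)
        return self.proj(x.flatten(1))                  # (N, d)


class GraphUpdate(nn.Module):
    """One round of degree-normalised message passing over a conversation graph."""

    def __init__(self, d_model=256):
        super().__init__()
        self.lin = nn.Linear(d_model, d_model)

    def forward(self, h, adj):
        # adj: (N, N) binary adjacency linking related utterances in the dialogue.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.lin(adj @ h / deg))


class LateFusionGraphAttention(nn.Module):
    """Attention-based refinement of features before and after the graph update."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, pre, post):
        # Query with post-graph features; attend over the (pre, post) pair per utterance.
        kv = torch.stack([pre, post], dim=1)            # (N, 2, d)
        out, _ = self.attn(post.unsqueeze(1), kv, kv)
        return out.squeeze(1)                           # (N, d)


if __name__ == "__main__":
    n, d = 6, 256                                       # 6 utterances in one dialogue
    text, audio, visual = (torch.randn(n, d) for _ in range(3))
    adj = (torch.rand(n, n) > 0.5).float()
    adj.fill_diagonal_(1.0)

    early = EarlyFusionNodeFormer(d)
    gnn = GraphUpdate(d)
    late = LateFusionGraphAttention(d)

    pre = early(text, audio, visual)                    # early fusion
    post = gnn(pre, adj)                                # dialogue-context update
    fused = late(pre, post)                             # late fusion
    print(fused.shape)                                  # torch.Size([6, 256])
```

Treating the three modalities as a short token sequence lets self-attention learn how much weight to give the dominant modality relative to the auxiliary ones, which is the balancing issue the abstract highlights.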

Files in This Item:
File: 8263.pdf
Description: For All Users (off-campus access for PolyU Staff & Students only)
Size: 1.28 MB
Format: Adobe PDF




Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/13910