Author: Yi, Lu
Title: Adversarial learning for speaker verification and speech emotion recognition
Advisors: Mak, M. W. (EEE)
Degree: Ph.D.
Year: 2024
Subject: Automatic speech recognition
Emotion recognition
Biometric identification
Deep learning (Machine learning)
Machine learning
Hong Kong Polytechnic University -- Dissertations
Department: Department of Electrical and Electronic Engineering
Pages: 1 volume (various pagings) : color illustrations
Language: English
Abstract: Deep learning uses optimization algorithms to train neural networks to learn from data. Despite its remarkable success, training deep learning models remains challenging. For instance, collecting data can be costly, and insufficient training data may impair a model's ability to generalize to unseen data. Additionally, deploying a model trained on labeled data from one domain in another domain can lead to domain mismatch. This dissertation addresses the data sparsity and domain mismatch problems in speaker verification and speech emotion recognition.
Speaker verification, a biometric authentication method that uses one's voice to verify a claimed identity, suffers performance degradation when applied to unseen domains. This thesis proposes several domain adaptation frameworks to mitigate this issue. One such framework is the adversarial separation and adaptation network (ADSAN), which disentangles domain-specific and shared components of speaker embeddings to achieve domain-invariant speaker representations. Moreover, a mutual information neural estimator (MINE) is integrated into the ADSAN to better preserve speaker-discriminative information. Another proposed framework, the infomax domain separation and adaptation network (InfoMax-DSAN), applies domain adaptation directly to the speaker feature extractor, achieving an equal error rate (EER) of 5.69% on the VOiCES Challenge 2019.
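As an illustration of the MINE component, the sketch below implements the Donsker-Varadhan lower bound on mutual information that MINE maximizes. The statistics-network architecture, layer sizes, and names are assumptions made for illustration, not the thesis's implementation.

    import math
    import torch
    import torch.nn as nn

    class StatisticsNetwork(nn.Module):
        # Small MLP T(x, z) whose output scores joint vs. marginal pairs.
        def __init__(self, x_dim, z_dim, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, x, z):
            return self.net(torch.cat([x, z], dim=-1))

    def mine_lower_bound(T, x, z):
        # Donsker-Varadhan bound: E_joint[T] - log E_marginal[exp(T)].
        # Joint pairs are (x_i, z_i); marginal pairs shuffle z in the batch.
        joint = T(x, z).mean()
        z_shuffled = z[torch.randperm(z.size(0))]
        marginal = torch.logsumexp(T(x, z_shuffled), dim=0) - math.log(z.size(0))
        return joint - marginal  # maximize this to tighten the MI estimate

In a setup like ADSAN's, maximizing such a bound between the shared embedding and the speaker-related signal would encourage the embedding to retain speaker information while the adversarial branches remove domain information.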
Conventional domain adaptation methods assume a common set of speakers across domains, which is impractical for speaker verification. To address this limitation, this thesis proposes incorporating intra-speaker and between-speaker similarity distribution alignment into DSANs. While effective at reducing language mismatch, this framework is constrained to lightweight models. To enhance flexibility and scalability, a novel disentanglement approach for domain-specific features is introduced. It uses a shared frame-level feature extractor that diverges into a domain classification branch and a speaker classification branch, and it prevents the gradients from the domain branch from interfering with the shared layers (see the sketch below). Experimental results demonstrate improved performance on CN-Celeb1 and feasibility with more complex models, such as residual networks.
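One simple way to realize the gradient constraint described above is to detach the shared features before the domain branch, so domain-classification gradients stop at the branch point. The module below is a minimal sketch with illustrative dimensions, not the thesis's architecture.

    import torch
    import torch.nn as nn

    class TwoBranchNetwork(nn.Module):
        # Shared extractor feeding a speaker head and a domain head.
        def __init__(self, feat_dim=80, emb_dim=256, n_speakers=1000, n_domains=2):
            super().__init__()
            self.shared = nn.Sequential(
                nn.Linear(feat_dim, emb_dim), nn.ReLU(),
                nn.Linear(emb_dim, emb_dim), nn.ReLU(),
            )
            self.speaker_head = nn.Linear(emb_dim, n_speakers)
            self.domain_head = nn.Linear(emb_dim, n_domains)

        def forward(self, x):
            h = self.shared(x)
            spk_logits = self.speaker_head(h)           # gradients reach shared layers
            dom_logits = self.domain_head(h.detach())   # gradients stop at the detach
            return spk_logits, dom_logits

With the detach in place, the domain head can still be trained on its own cross-entropy loss, but only the speaker loss shapes the shared layers.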
In speech emotion recognition, acquiring labeled data for training emotion classifiers is challenging because an utterance may convey multiple emotions, making its label ambiguous. This data scarcity leads to overfitting. To tackle this issue, this thesis introduces a new data augmentation network called the adversarial data augmentation network (ADAN). By forcing synthetic and real samples to share a common representation in the latent space, ADAN alleviates the gradient vanishing problem that often occurs in generative adversarial networks. Experimental results on the EmoDB and IEMOCAP datasets demonstrate the effectiveness of ADAN in generating emotion-rich augmented data, yielding emotion classifiers competitive with state-of-the-art systems.
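The toy sketch below illustrates the latent-matching idea the abstract attributes to ADAN: the generator is updated so that its synthetic samples, once encoded, match the latent statistics of real samples. The encoder, shapes, and first-moment-matching loss are assumptions for illustration only.

    import torch
    import torch.nn as nn

    feat_dim, latent_dim, noise_dim = 120, 64, 32   # illustrative sizes
    encoder = nn.Sequential(nn.Linear(feat_dim, latent_dim), nn.Tanh())
    generator = nn.Sequential(nn.Linear(noise_dim, feat_dim), nn.Tanh())

    def augmentation_step(real_batch, opt_g):
        # Pull synthetic samples toward the real samples' latent statistics.
        noise = torch.randn(real_batch.size(0), noise_dim)
        fake_batch = generator(noise)
        z_real = encoder(real_batch).mean(dim=0).detach()  # target statistics
        z_fake = encoder(fake_batch).mean(dim=0)
        loss = (z_real - z_fake).pow(2).sum()
        opt_g.zero_grad()
        loss.backward()
        opt_g.step()
        return loss.item()

    # Example update with stand-in features:
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
    real = torch.randn(16, feat_dim)
    augmentation_step(real, opt_g)

Because the generator receives a distance signal in latent space rather than a saturating discriminator output, its gradients remain informative even when real and synthetic samples are easy to tell apart, which is the gradient vanishing problem the abstract mentions.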
Rights: All rights reserved
Access: open access

Files in This Item:
File: 7829.pdf
Description: For All Users
Size: 14.52 MB
Format: Adobe PDF


Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/13408