Author: Yi, Lu
Title: Adversarial learning for speaker verification and speech emotion recognition
Advisors: Mak, M. W. (EEE)
Degree: Ph.D.
Year: 2024
Subject: Automatic speech recognition
Emotion recognition
Biometric identification
Deep learning (Machine learning)
Machine learning
Hong Kong Polytechnic University -- Dissertations
Department: Department of Electrical and Electronic Engineering
Pages: 1 volume (various pagings) : color illustrations
Language: English
Abstract: Deep learning uses optimization algorithms to train neural networks to learn from data. Despite its remarkable success, training deep learning models remains challenging. For instance, collecting data can be costly, and insufficient training data may impair a model's ability to generalize to unseen data. Additionally, deploying a model trained on labeled data from one domain in another domain can lead to domain mismatch. This dissertation addresses the data sparsity and domain mismatch problems in speaker verification and speech emotion recognition.
Speaker verification, a biometric authentication method that uses one's voice to verify a claimed identity, suffers performance degradation when applied to unseen domains. This thesis proposes several domain adaptation frameworks to mitigate this issue. One such framework is the adversarial separation and adaptation network (ADSAN), which disentangles domain-specific and shared components of speaker embeddings to achieve domain-invariant speaker representations. Moreover, a mutual information neural estimator (MINE) is integrated into the ADSAN to better preserve speaker-discriminative information. Another proposed framework, the infomax domain separation and adaptation network (InfoMax-DSAN), applies domain adaptation directly to the speaker feature extractor, achieving an equal error rate (EER) of 5.69% on the VOiCES Challenge 2019.
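As an illustration of the MINE component, the sketch below implements the Donsker-Varadhan lower bound on mutual information that MINE maximizes. The statistics-network architecture, layer sizes, and names are assumptions made for illustration, not the thesis's implementation.

    import math
    import torch
    import torch.nn as nn

    class StatisticsNetwork(nn.Module):
        # Small MLP T(x, z) whose output scores joint vs. marginal pairs.
        def __init__(self, x_dim, z_dim, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, x, z):
            return self.net(torch.cat([x, z], dim=-1))

    def mine_lower_bound(T, x, z):
        # Donsker-Varadhan bound: E_joint[T] - log E_marginal[exp(T)].
        # Joint pairs are (x_i, z_i); marginal pairs shuffle z in the batch.
        joint = T(x, z).mean()
        z_shuffled = z[torch.randperm(z.size(0))]
        marginal = torch.logsumexp(T(x, z_shuffled), dim=0) - math.log(z.size(0))
        return joint - marginal  # maximize this to tighten the MI estimate

In a setup like ADSAN's, maximizing such a bound between the shared embedding and the speaker-related signal would encourage the embedding to retain speaker information while the adversarial branches remove domain information.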
Conventional domain adaptation methods assume a common set of speakers across domains, which is impractical for speaker verification. To address this limitation, this thesis proposes incorporating intra-speaker and between-speaker similarity distribution alignment into DSANs. While effective at reducing language mismatch, this framework is constrained to lightweight models. To enhance flexibility and scalability, a novel disentanglement approach for domain-specific features is introduced. It uses a shared frame-level feature extractor that diverges into a domain classification branch and a speaker classification branch, and it prevents the gradients from the domain branch from interfering with the shared layers (see the sketch below). Experimental results demonstrate improved performance on CN-Celeb1 and feasibility with more complex models, such as residual networks.
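One simple way to realize the gradient constraint described above is to detach the shared features before the domain branch, so domain-classification gradients stop at the branch point. The module below is a minimal sketch with illustrative dimensions, not the thesis's architecture.

    import torch
    import torch.nn as nn

    class TwoBranchNetwork(nn.Module):
        # Shared extractor feeding a speaker head and a domain head.
        def __init__(self, feat_dim=80, emb_dim=256, n_speakers=1000, n_domains=2):
            super().__init__()
            self.shared = nn.Sequential(
                nn.Linear(feat_dim, emb_dim), nn.ReLU(),
                nn.Linear(emb_dim, emb_dim), nn.ReLU(),
            )
            self.speaker_head = nn.Linear(emb_dim, n_speakers)
            self.domain_head = nn.Linear(emb_dim, n_domains)

        def forward(self, x):
            h = self.shared(x)
            spk_logits = self.speaker_head(h)           # gradients reach shared layers
            dom_logits = self.domain_head(h.detach())   # gradients stop at the detach
            return spk_logits, dom_logits

With the detach in place, the domain head can still be trained on its own cross-entropy loss, but only the speaker loss shapes the shared layers.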
In speech emotion recognition, acquiring labeled data for training emotion classifiers is challenging because an utterance may convey multiple emotions, making its label ambiguous. This data scarcity leads to overfitting. To tackle this issue, this thesis introduces a new data augmentation network called the adversarial data augmentation network (ADAN). By forcing synthetic and real samples to share a common representation in the latent space, ADAN alleviates the gradient vanishing problem that often occurs in generative adversarial networks. Experimental results on the EmoDB and IEMOCAP datasets demonstrate the effectiveness of ADAN in generating emotion-rich augmented data, yielding emotion classifiers competitive with state-of-the-art systems.
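The toy sketch below illustrates the latent-matching idea the abstract attributes to ADAN: the generator is updated so that its synthetic samples, once encoded, match the latent statistics of real samples. The encoder, shapes, and first-moment-matching loss are assumptions for illustration only.

    import torch
    import torch.nn as nn

    feat_dim, latent_dim, noise_dim = 120, 64, 32   # illustrative sizes
    encoder = nn.Sequential(nn.Linear(feat_dim, latent_dim), nn.Tanh())
    generator = nn.Sequential(nn.Linear(noise_dim, feat_dim), nn.Tanh())

    def augmentation_step(real_batch, opt_g):
        # Pull synthetic samples toward the real samples' latent statistics.
        noise = torch.randn(real_batch.size(0), noise_dim)
        fake_batch = generator(noise)
        z_real = encoder(real_batch).mean(dim=0).detach()  # target statistics
        z_fake = encoder(fake_batch).mean(dim=0)
        loss = (z_real - z_fake).pow(2).sum()
        opt_g.zero_grad()
        loss.backward()
        opt_g.step()
        return loss.item()

    # Example update with stand-in features:
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
    real = torch.randn(16, feat_dim)
    augmentation_step(real, opt_g)

Because the generator receives a distance signal in latent space rather than a saturating discriminator output, its gradients remain informative even when real and synthetic samples are easy to tell apart, which is the gradient vanishing problem the abstract mentions.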
Rights: All rights reserved
Access: open access

Files in This Item:
File: 7829.pdf
Description: For All Users
Size: 14.52 MB
Format: Adobe PDF


Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/13408