Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor | Department of Electrical and Electronic Engineering | en_US |
| dc.contributor.advisor | Mak, Man-wai (EEE) | en_US |
| dc.creator | Li, Zhe | - |
| dc.identifier.uri | https://theses.lib.polyu.edu.hk/handle/200/14169 | - |
| dc.language | English | en_US |
| dc.publisher | Hong Kong Polytechnic University | en_US |
| dc.rights | All rights reserved | en_US |
| dc.title | Maximal speaker separability via robust speaker representation learning | en_US |
| dcterms.abstract | Speaker representation learning aims to extract compact, discriminative embeddings that encapsulate unique vocal characteristics regardless of linguistic content or environmental conditions. The objective is to learn an embedding space with two key properties: same-class compactness, where embeddings from the same speaker are closely clustered, and different-class dispersion, where embeddings from different speakers are well separated. However, existing methods face several challenges. First, conventional speaker verification methods treat the task as a classification problem, relying on softmax-based loss functions to maximize inter-class differences; these loss functions, however, often struggle to reduce intra-class variation. Second, directly applying a pre-trained model to speaker verification achieves only sub-optimal performance because the pre-trained model is not tailored to extracting task-specific features, which limits transferability; full fine-tuning of these models, meanwhile, incurs significant computational and storage costs while risking catastrophic forgetting. Third, although pre-trained speech models offer robust feature representations, their effectiveness rests on an unrealistic assumption: that the speaker identity information and the linguistic content in the representations can be easily disentangled. | en_US |
| dcterms.abstract | To address these challenges, we propose three key solutions in this thesis. First, we propose a supervised contrastive learning framework incorporating an additive angular margin to effectively reduce intra-class variation. By maximizing the mutual information between frame-level features and speaker representations, our method preserves non-shared speaker information across diverse augmentations. Extensive evaluations on the CN-Celeb, VoxCeleb, and CU-MARVEL datasets demonstrate that the resulting ECAPA-TDNN embedding space exhibits robust inter-speaker separability and intra-speaker consistency. Second, we investigate parameter-efficient fine-tuning strategies for pre-trained Transformer models in speaker verification. By integrating dynamic prompt tuning, where prompts are clustered based on speaker-specific traits, and incorporating spectral information into a LoRA-based adaptation process, our approach efficiently captures task-relevant features while significantly reducing memory and computational overhead. Third, we introduce a diffusion-based approach within a variational autoencoder framework to disentangle speaker timbre from spoken content. Leveraging a conditional diffusion model in the latent space, our method yields content-invariant speaker embeddings that are resilient to language mismatches, outperforming traditional sequential VAE techniques. Experiments on the VoxCeleb and CN-Celeb datasets demonstrate that our method effectively isolates speaker features from speech content using pre-trained speech representations. | en_US |
| dcterms.extent | xix, 117 pages : color illustrations | en_US |
| dcterms.isPartOf | PolyU Electronic Theses | en_US |
| dcterms.issued | 2025 | en_US |
| dcterms.educationalLevel | Ph.D. | en_US |
| dcterms.educationalLevel | All Doctorate | en_US |
| dcterms.LCSH | Speaker recognition | en_US |
| dcterms.LCSH | Machine learning | en_US |
| dcterms.LCSH | Hong Kong Polytechnic University -- Dissertations | en_US |
| dcterms.accessRights | open access | en_US |
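The additive angular margin named in the abstract is the ArcFace-style idea of penalizing the angle between an embedding and its true-class centre before softmax scaling, which tightens same-speaker clusters and widens inter-speaker gaps. A minimal NumPy sketch of that logit computation follows; it is an illustration of the general technique, not the thesis's implementation, and all names (`aam_logits`, `margin`, `scale`) are hypothetical:

```python
import numpy as np

def aam_logits(embeddings, class_centres, labels, margin=0.2, scale=30.0):
    """ArcFace-style additive-angular-margin logits (illustrative sketch).

    embeddings    : (N, D) speaker embeddings
    class_centres : (C, D) one weight vector per speaker class
    labels        : (N,)   ground-truth class index per embedding
    """
    # L2-normalise so dot products become cosine similarities
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_centres / np.linalg.norm(class_centres, axis=1, keepdims=True)
    cos = e @ w.T                                  # (N, C) cosines
    theta = np.arccos(np.clip(cos, -1.0, 1.0))     # angles in radians
    # Add the margin only at each sample's true-class position
    target = np.zeros_like(cos, dtype=bool)
    target[np.arange(len(labels)), labels] = True
    cos_margined = np.where(target, np.cos(theta + margin), cos)
    return scale * cos_margined                    # feed to cross-entropy
```

Because the margin shrinks the true-class cosine, the loss can only be lowered by pulling embeddings closer to their own class centre than plain softmax would require, which is the intra-class compactness property the abstract emphasizes.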
Copyright Undertaking
As a bona fide Library user, I declare that:
- I will abide by the rules and legal ordinances governing copyright regarding the use of the Database.
- I will use the Database for the purpose of my research or private study only and not for circulation or further reproduction or any other purpose.
- I agree to indemnify and hold the University harmless from and against any loss, damage, cost, liability or expenses arising from copyright infringement or unauthorized usage.
By downloading any item(s) listed above, you acknowledge that you have read and understood the copyright undertaking as stated above, and agree to be bound by all of its terms.
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/14169

