Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor | Department of Electrical and Electronic Engineering | en_US |
| dc.contributor.advisor | Mak, Man-wai (EEE) | en_US |
| dc.creator | Li, Zhe | - |
| dc.identifier.uri | https://theses.lib.polyu.edu.hk/handle/200/14169 | - |
| dc.language | English | en_US |
| dc.publisher | Hong Kong Polytechnic University | en_US |
| dc.rights | All rights reserved | en_US |
| dc.title | Maximal speaker separability via robust speaker representation learning | en_US |
| dcterms.abstract | Speaker representation learning aims to extract compact, discriminative embeddings that encapsulate unique vocal characteristics regardless of linguistic content or environmental conditions. The objective is to learn an embedding space with two key properties: same-class compactness, where embeddings from the same speaker are closely clustered, and different-class dispersion, where embeddings from different speakers are well separated. However, existing methods face several challenges. First, conventional speaker verification methods treat the task as a classification problem, relying on softmax-based loss functions to maximize inter-class differences; these loss functions, however, often struggle to reduce intra-class variation. Second, directly applying a pre-trained model to speaker verification achieves only sub-optimal performance because the pre-trained model is not tailored to extracting task-specific features, which limits transferability; full fine-tuning of these models, meanwhile, incurs significant computational and storage costs while risking catastrophic forgetting. Third, although pre-trained speech models offer robust feature representations, their effectiveness rests on an unrealistic assumption: that the speaker identity information and the linguistic content in the representations can be easily disentangled. | en_US |
| dcterms.abstract | To address these challenges, we propose three key solutions in this thesis. First, we propose a supervised contrastive learning framework incorporating an additive angular margin to effectively reduce intra-class variation. By maximizing the mutual information between frame-level features and speaker representations, our method preserves non-shared speaker information across diverse augmentations. Extensive evaluations on the CN-Celeb, VoxCeleb, and CU-MARVEL datasets demonstrate that the resulting ECAPA-TDNN embedding space exhibits robust inter-speaker separability and intra-speaker consistency. Second, we investigate parameter-efficient fine-tuning strategies for pre-trained Transformer models in speaker verification. By integrating dynamic prompt tuning, where prompts are clustered based on speaker-specific traits, and incorporating spectral information into a LoRA-based adaptation process, our approach efficiently captures task-relevant features while significantly reducing memory and computational overhead. Third, we introduce a diffusion-based approach within a variational autoencoder framework to disentangle speaker timbre from spoken content. Leveraging a conditional diffusion model in the latent space, our method yields content-invariant speaker embeddings that are resilient to language mismatches, outperforming traditional sequential VAE techniques. Experiments on the VoxCeleb and CN-Celeb datasets demonstrate that our method effectively isolates speaker features from speech content using pre-trained speech representations. | en_US |
| dcterms.extent | xix, 117 pages : color illustrations | en_US |
| dcterms.isPartOf | PolyU Electronic Theses | en_US |
| dcterms.issued | 2025 | en_US |
| dcterms.educationalLevel | Ph.D. | en_US |
| dcterms.educationalLevel | All Doctorate | en_US |
| dcterms.LCSH | Speaker recognition | en_US |
| dcterms.LCSH | Machine learning | en_US |
| dcterms.LCSH | Hong Kong Polytechnic University -- Dissertations | en_US |
| dcterms.accessRights | open access | en_US |
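The additive angular margin named in the abstract is the ArcFace-style idea of penalizing the angle between an embedding and its true-class centre before softmax scaling, which tightens same-speaker clusters and widens inter-speaker gaps. A minimal NumPy sketch of that logit computation follows; it is an illustration of the general technique, not the thesis's implementation, and all names (`aam_logits`, `margin`, `scale`) are hypothetical:

```python
import numpy as np

def aam_logits(embeddings, class_centres, labels, margin=0.2, scale=30.0):
    """ArcFace-style additive-angular-margin logits (illustrative sketch).

    embeddings    : (N, D) speaker embeddings
    class_centres : (C, D) one weight vector per speaker class
    labels        : (N,)   ground-truth class index per embedding
    """
    # L2-normalise so dot products become cosine similarities
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_centres / np.linalg.norm(class_centres, axis=1, keepdims=True)
    cos = e @ w.T                                  # (N, C) cosines
    theta = np.arccos(np.clip(cos, -1.0, 1.0))     # angles in radians
    # Add the margin only at each sample's true-class position
    target = np.zeros_like(cos, dtype=bool)
    target[np.arange(len(labels)), labels] = True
    cos_margined = np.where(target, np.cos(theta + margin), cos)
    return scale * cos_margined                    # feed to cross-entropy
```

Because the margin shrinks the true-class cosine, the loss can only be lowered by pulling embeddings closer to their own class centre than plain softmax would require, which is the intra-class compactness property the abstract emphasizes.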
Copyright Undertaking
As a bona fide Library user, I declare that:
- I will abide by the rules and legal ordinances governing copyright regarding the use of the Database.
- I will use the Database for the purpose of my research or private study only and not for circulation or further reproduction or any other purpose.
- I agree to indemnify and hold the University harmless from and against any loss, damage, cost, liability or expenses arising from copyright infringement or unauthorized usage.
By downloading any item(s) listed above, you acknowledge that you have read and understood the copyright undertaking as stated above, and agree to be bound by all of its terms.
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/14169

