Semi-supervised and adversarial domain adaptation for speaker recognition

Li, Longxin

Full metadata record

DC Field	Value	Language
dc.contributor	Department of Electronic and Information Engineering	en_US
dc.contributor.advisor	Mak, Man-wai (EIE)	-
dc.creator	Li, Longxin	-
dc.identifier.uri	https://theses.lib.polyu.edu.hk/handle/200/10498	-
dc.language	English	en_US
dc.publisher	Hong Kong Polytechnic University	-
dc.rights	All rights reserved	en_US
dc.title	Semi-supervised and adversarial domain adaptation for speaker recognition	en_US
dcterms.abstract	The rapid development of technology has driven society into a new era of AI in which speaker recognition is one of the essential techniques. Due to the unique characteristics of voiceprints, speaker recognition has been used to enhance the security of banking and personal security systems. Despite the great convenience provided by speaker recognition technology, some fundamental problems remain unsolved, including (1) insufficient labeled samples from new acoustic environments for training supervised machine learning models and (2) domain mismatch among different acoustic environments. These problems may result in severe performance degradation in speaker recognition systems. We propose two methods to address them. First, to reduce domain mismatch in speaker verification systems, we propose an unsupervised domain adaptation method. Second, to enhance speaker identification performance, we introduce a contrastive adversarial domain adaptation network to create a domain-invariant feature space. The first method addresses the data sparsity issue by applying spectral clustering to in-domain unlabeled data to obtain hypothesized speaker labels for adapting an out-of-domain PLDA mixture model to the target domain. To further refine the target PLDA mixture model, spectral clustering is iteratively applied to the new PLDA score matrix to produce a new set of hypothesized speaker labels. A gender-aware deep neural network (DNN) is trained to produce gender posteriors given an i-vector. The gender posteriors then replace the posterior probabilities of the indicator variables in the PLDA mixture model. Gender-dependent inter-dataset variability compensation (GD-IDVC) is implemented to reduce the mismatch between the i-vectors obtained from the in-domain and out-of-domain datasets.
Evaluations based on NIST 2016 SRE show that at the end of the iterative re-training, the PLDA mixture model becomes fully adapted to the new domain. Results also show that the PLDA scores can be readily incorporated into spectral clustering, resulting in high-quality speaker clusters that could not be achieved by agglomerative hierarchical clustering.	en_US
dcterms.abstract	The second method aims to reduce the mismatch between male and female speakers through adversarial domain adaptation. The method mitigates an intrinsic drawback of the domain adversarial network by splitting the feature extractor into two contrastive branches, with one branch responsible for class dependence in the latent space and the other focusing on domain invariance. The feature extractor achieves these contrastive goals by sharing the first and last hidden layers while decoupling the branches in the middle hidden layers. We adversarially trained the label predictor to produce equal posterior probabilities across all of its outputs instead of producing one-hot outputs to ensure that the feature extractor can produce class-discriminative embedded features. We refer to the resulting domain adaptation network as a contrastive adversarial domain adaptation network (CADAN). We evaluated the domain invariance of the embedded features via a series of speaker identification experiments under both clean and noisy conditions. Results demonstrate that the embedded features produced by CADAN lead to 8.9% and 77.6% improvements in speaker identification accuracy over the conventional DAN under clean and noisy conditions, respectively.	en_US
dcterms.extent	vi, 64 pages : color illustrations	en_US
dcterms.isPartOf	PolyU Electronic Theses	en_US
dcterms.issued	2020	en_US
dcterms.educationalLevel	M.Phil.	en_US
dcterms.educationalLevel	All Master	en_US
dcterms.LCSH	Speech processing systems	en_US
dcterms.LCSH	Pattern recognition systems	en_US
dcterms.LCSH	Hong Kong Polytechnic University -- Dissertations	en_US
dcterms.accessRights	open access	en_US
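
The abstract mentions training the label predictor toward equal posterior probabilities across its outputs rather than one-hot outputs. The thesis's actual loss is not given in this record, so the following is only a minimal sketch under one assumed formulation: cross-entropy between a uniform target and the softmax posteriors. The function names `softmax` and `uniform_target_loss` are hypothetical and are not taken from the thesis.

```python
import math


def softmax(logits):
    """Convert raw scores into posterior probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def uniform_target_loss(logits):
    """Cross-entropy between a uniform target and the softmax posteriors.

    Assumed formulation, not the thesis's stated loss. For K outputs the
    value is smallest, log(K), when all posteriors equal 1/K, and it grows
    as the posteriors approach a one-hot vector.
    """
    probs = softmax(logits)
    k = len(probs)
    return -sum(math.log(p) for p in probs) / k


# Equal logits give equal posteriors, which minimizes the loss.
flat = uniform_target_loss([0.0, 0.0, 0.0, 0.0])
# A strongly peaked (near one-hot) output is penalized more heavily.
peaked = uniform_target_loss([8.0, 0.0, 0.0, 0.0])
```

Under this assumption, minimizing the loss drives the predictor toward equal posteriors, which is the behavior the abstract describes for the adversarial branch.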

Files in This Item:

File	Description	Size	Format
991022385554703411.pdf	For All Users	1.68 MB	Adobe PDF

Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/10498