Full metadata record
DC Field: Value [Language]
dc.contributor: Department of Electronic and Information Engineering [en_US]
dc.contributor.advisor: Mak, Man-wai (EIE) [en_US]
dc.creator: Tu, Youzhi
dc.identifier.uri: https://theses.lib.polyu.edu.hk/handle/200/11735
dc.language: English [en_US]
dc.publisher: Hong Kong Polytechnic University [en_US]
dc.rights: All rights reserved [en_US]
dc.title: Deep speaker embedding for robust speaker verification [en_US]
dcterms.abstract: Speaker verification (SV) aims to determine whether the speaker identity of a test utterance matches that of a target speaker. In SV, the identity of a variable-length utterance is typically represented by a fixed-dimensional vector; this vector, or the process of producing it, is referred to as speaker embedding. Although state-of-the-art deep speaker embedding has achieved outstanding performance, deploying SV systems in adverse acoustic environments still faces a number of challenges. First, today's SV systems rely on the assumption that the training and test data share the same distribution; when this assumption is violated, domain mismatch occurs. The problem is exacerbated when the speaker embeddings violate the Gaussianity constraint. Second, because the temporal feature maps produced by the last frame-level layer are highly non-stationary, it is undesirable to use their global statistics as speaker embeddings. Third, current speaker embedding networks have no mechanism that lets frame-level information flow directly into the embedding layer, causing information loss in the pooling layer. [en_US]
dcterms.abstract: This thesis develops three strategies to address the above challenges. First, to jointly address domain mismatch and the Gaussianity requirement of probabilistic linear discriminant analysis (PLDA) models, the author proposes a variational domain adversarial learning framework with two specialized networks: the variational domain adversarial neural network (VDANN) and the information-maximized VDANN (InfoVDANN). Both networks leverage domain adversarial training to produce speaker-discriminative and domain-invariant embeddings and apply variational autoencoders (VAEs) to perform Gaussian regularization. The InfoVDANN, in particular, avoids posterior collapse in VDANNs by preserving the mutual information (MI) between the domain-invariant embeddings and the speaker embeddings. Second, to mitigate the effect of non-stationarity in the temporal feature maps, the author proposes short-time spectral pooling (STSP) and attentive STSP, which transform the temporal feature maps into the spectral domain through the short-time Fourier transform (STFT). The zero- and low-frequency components are retained to preserve speaker information, and a segment-level attention mechanism is designed to produce spectral representations with fewer variations, resulting in better robustness to the non-stationarity of the feature maps. Third, to allow information in the frame-level layers to flow directly to the speaker embedding layer, the author proposes MI-enhanced training based on a semi-supervised deep InfoMax (DIM) framework. Because the dimensionality of the frame-level features is much larger than that of the speaker embeddings, the frame-level features are squeezed via global pooling before MI estimation. The proposed method, called squeeze-DIM, effectively balances the dimensionality of the frame-level features and the speaker embeddings. (Illustrative sketches of STSP and the squeeze operation follow this record.) [en_US]
dcterms.abstract: We evaluate the proposed methods on VoxCeleb1, VOiCES 2019, SRE16, and SRE18-CMN2. Results show that the VDANN and InfoVDANN outperform the DANN baseline, indicating the effectiveness of Gaussian regularization and MI maximization. We also observe that attentive STSP achieves the largest performance gains, suggesting that applying segment-level attention and leveraging the low-frequency spectral components of the temporal feature maps produce discriminative speaker embeddings. Finally, we demonstrate that squeeze-DIM outperforms DIM regularization, suggesting that the squeeze operation facilitates MI maximization. [en_US]
dcterms.extent: xviii, 128 pages : color illustrations [en_US]
dcterms.isPartOf: PolyU Electronic Theses [en_US]
dcterms.issued: 2022 [en_US]
dcterms.educationalLevel: Ph.D. [en_US]
dcterms.educationalLevel: All Doctorate [en_US]
dcterms.LCSH: Voice computing [en_US]
dcterms.LCSH: Automatic speech recognition [en_US]
dcterms.LCSH: Speech processing systems [en_US]
dcterms.LCSH: Hong Kong Polytechnic University -- Dissertations [en_US]
dcterms.accessRights: open access [en_US]
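
The abstract above outlines short-time spectral pooling (STSP): the temporal feature maps are transformed into the spectral domain with a short-time Fourier transform, and only the zero- and low-frequency components are kept before averaging. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch; the tensor layout, window length, hop size, and number of retained frequency bins are assumptions for illustration, not the thesis's actual configuration.

```python
import torch

def short_time_spectral_pooling(feats, n_fft=64, hop=32, num_low_freq=4):
    """Illustrative STSP-style pooling of a frame-level feature map.

    feats: (batch, channels, frames) output of the last frame-level layer.
    Each channel is treated as a 1-D signal; an STFT is taken along time,
    only the zero- and low-frequency magnitude bins are retained, and the
    retained bins are averaged over segments to give a fixed-length vector.
    """
    b, c, t = feats.shape
    window = torch.hann_window(n_fft, device=feats.device)
    spec = torch.stft(
        feats.reshape(b * c, t),            # (batch * channels, frames)
        n_fft=n_fft,
        hop_length=hop,
        window=window,
        return_complex=True,
    )                                        # (batch * channels, freq_bins, segments)
    low = spec.abs()[:, :num_low_freq, :]    # keep zero- and low-frequency bins
    pooled = low.mean(dim=-1)                # average over the segments
    return pooled.reshape(b, c * num_low_freq)

# Usage sketch: 8 utterances, 256-channel feature maps, 300 frames each.
x = torch.randn(8, 256, 300)
print(short_time_spectral_pooling(x).shape)  # torch.Size([8, 1024])
```

Discarding the higher-frequency bins smooths out rapid fluctuations in the temporal statistics, which is the robustness argument the abstract makes for STSP.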
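The abstract also describes the "squeeze" step of squeeze-DIM: frame-level features are reduced via global pooling before MI estimation so that their dimensionality is comparable to that of the speaker embedding. The sketch below illustrates such a squeeze step under stated assumptions; the mean-and-standard-deviation pooling and the bilinear critic are illustrative choices, not the thesis's exact design.

```python
import torch
import torch.nn as nn

class SqueezedMICritic(nn.Module):
    """Hypothetical critic scoring (squeezed frame-level features,
    speaker embedding) pairs for MI estimation; dimensions are assumed."""

    def __init__(self, feat_dim=256, emb_dim=192):
        super().__init__()
        # Bilinear critic over the squeezed features and the embedding.
        self.critic = nn.Bilinear(2 * feat_dim, emb_dim, 1)

    def forward(self, frame_feats, speaker_emb):
        # frame_feats: (batch, feat_dim, frames); speaker_emb: (batch, emb_dim)
        mean = frame_feats.mean(dim=-1)
        std = frame_feats.std(dim=-1)
        squeezed = torch.cat([mean, std], dim=-1)   # global pooling ("squeeze")
        return self.critic(squeezed, speaker_emb)   # (batch, 1) critic score

# Usage sketch: positive pairs come from the same utterance; during training,
# an MI lower bound would contrast them against shuffled (negative) pairs.
critic = SqueezedMICritic()
feats = torch.randn(8, 256, 300)
emb = torch.randn(8, 192)
print(critic(feats, emb).shape)  # torch.Size([8, 1])
```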

Files in This Item:
File        Description     Size      Format
6246.pdf    For All Users   2.56 MB   Adobe PDF



