Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor | Department of Electronic and Information Engineering | en_US |
dc.contributor.advisor | Mak, Man-wai (EIE) | en_US |
dc.creator | Tu, Youzhi | - |
dc.identifier.uri | https://theses.lib.polyu.edu.hk/handle/200/11735 | - |
dc.language | English | en_US |
dc.publisher | Hong Kong Polytechnic University | en_US |
dc.rights | All rights reserved | en_US |
dc.title | Deep speaker embedding for robust speaker verification | en_US |
dcterms.abstract | Speaker verification (SV) aims to determine whether the speaker identity of a test utterance matches that of a target speaker. In SV, the identity of a variable-length utterance is typically represented by a fixed-dimensional vector. This vector or its modeling process is referred to as speaker embedding. Although state-of-the-art deep speaker embedding has achieved outstanding performance, deploying SV systems to adverse acoustic environments still faces a number of challenges. First, today's SV systems rely on the condition that the training and test data share the same distribution. Once this condition is violated, domain mismatch will occur. The problem will be exacerbated when the speaker embeddings violate the Gaussianity constraint. Second, because the temporal feature maps produced by the last frame-level layer are highly non-stationary, it is not desirable to use their global statistics as speaker embeddings. Third, current speaker embedding networks do not have any mechanisms to let the frame-level information flow directly into the embedding layer, causing information loss in the pooling layer. | en_US |
dcterms.abstract | This thesis develops three strategies to address the above challenges. First, to jointly address domain mismatch and the Gaussianity requirement of probabilistic linear discriminant analysis (PLDA) models, the author proposes a variational domain adversarial learning framework with two specialized networks: variational domain adversarial neural network (VDANN) and information-maximized VDANN (InfoVDANN). Both networks leverage domain adversarial training to produce speaker discriminative and domain-invariant embeddings and apply variational autoencoders (VAEs) to perform Gaussian regularization. The InfoVDANN, in particular, avoids posterior collapse in VDANNs by preserving the mutual information (MI) between the domain-invariant embeddings and the speaker embeddings. Second, to mitigate the effect of non-stationarity in the temporal feature maps, the author proposes short-time spectral pooling (STSP) and attentive STSP to transform the temporal feature maps into the spectral domain through short-time Fourier transform (STFT). The zero- and low-frequency components are retained to preserve speaker information. A segment-level attention mechanism is designed to produce spectral representations with fewer variations, which results in better robustness to the non-stationary effect in the feature maps. Third, to allow information in the frame-level layers to flow directly to the speaker embedding layer, MI-enhanced training based on a semi-supervised deep InfoMax (DIM) framework is proposed. Because the dimensionality of the frame-level features is much larger than that of the speaker embeddings, the author proposes to squeeze the frame-level features via global pooling before MI estimation. The proposed method, called squeeze-DIM, effectively balances the dimension between the frame-level features and the speaker embeddings. | en_US |
dcterms.abstract | We evaluate the proposed methods on VoxCeleb1, VOiCES 2019, SRE16, and SRE18-CMN2. Results show that the VDANN and InfoVDANN outperform the DANN baseline, indicating the effectiveness of Gaussian regularization and MI maximization. We also observe that attentive STSP achieves the largest performance gains, suggesting that applying segment-level attention and leveraging low spectral components of temporal feature maps can produce discriminative speaker embeddings. Finally, we demonstrate that the squeeze-DIM outperforms the DIM regularization, suggesting that the squeeze operation facilitates MI maximization. | en_US |
dcterms.extent | xviii, 128 pages : color illustrations | en_US |
dcterms.isPartOf | PolyU Electronic Theses | en_US |
dcterms.issued | 2022 | en_US |
dcterms.educationalLevel | Ph.D. | en_US |
dcterms.educationalLevel | All Doctorate | en_US |
dcterms.LCSH | Voice computing | en_US |
dcterms.LCSH | Automatic speech recognition | en_US |
dcterms.LCSH | Speech processing systems | en_US |
dcterms.LCSH | Hong Kong Polytechnic University -- Dissertations | en_US |
dcterms.accessRights | open access | en_US |
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/11735