The applications of deep learning in robust speaker recognition

Tan, Zhili

Author:	Tan, Zhili
Title:	The applications of deep learning in robust speaker recognition
Advisors:	Mak, Man-wai (EIE)
Degree:	Ph.D.
Year:	2018
Subject:	Hong Kong Polytechnic University -- Dissertations; Automatic speech recognition; Machine learning
Department:	Department of Electronic and Information Engineering
Pages:	xv, 109 pages : color illustrations
Language:	English
Abstract:	Speaker verification aims to verify whether a test utterance is spoken by a target speaker. Since 2011, the i-vector approach together with probabilistic linear discriminant analysis (PLDA) has dominated this field. Under this framework, each utterance is represented by a low-dimensional i-vector that captures speaker- and channel-dependent characteristics, and the PLDA model aims to separate speaker variability from channel variability in the i-vector space. Meanwhile, deep learning has in recent years achieved great success in many areas, including speech recognition, computer vision, speech synthesis and music recognition. This thesis explores the applications of deep learning in speaker verification, especially under the i-vector/PLDA framework. To address the limitations of hand-crafted acoustic features, this thesis proposes a deep architecture formed by stacking a deep belief network (DBN) on top of a denoising autoencoder (DAE) for noise-robust speaker identification. After backpropagation fine-tuning, the network, referred to as the denoising autoencoder-deep neural network (DAE-DNN), outputs the posterior probabilities of speakers, and its top hidden layer outputs speaker-dependent bottleneck (BN) features. The autoencoder aims to reconstruct the clean spectra of a noisy test utterance, using the noisy spectra and the utterance's SNR as input. With this denoising capability, the output of the bottleneck layer can be considered a low-dimensional representation of the denoised utterance. These frame-based bottleneck features are then used to train an i-vector extractor and a PLDA model for speaker identification. Experimental results on a noise-contaminated YOHO corpus show that the bottleneck features outperform conventional MFCCs under low-SNR conditions and that fusing the two features leads to a further performance gain, suggesting that the two are complementary.
A limitation of the above network is that the BN feature vectors tend to be very similar across the whole utterance, causing numerical difficulty when training the UBM and the i-vector extractor. This problem, however, can be overcome by training the DAE-DNN to produce senone posteriors instead of speaker posteriors. The resulting DAE-DNN produces not only denoised BN features but also senone posteriors, from which a senone i-vector extractor can be trained and senone i-vectors can be extracted. Because the frame-based BN features are now aligned to senone clusters instead of acoustic clusters, the resulting i-vectors characterize how individual speakers pronounce different phones, which allows more precise comparisons between speakers. Through extensive evaluations on NIST 2012 SRE, this thesis demonstrates that senone i-vectors outperform conventional GMM i-vectors. More interestingly, the BN features are not only phonetically discriminative; results suggest that they also contain sufficient speaker information to produce BN-based senone i-vectors that outperform the conventional senone i-vectors. This thesis also shows that DAE training is more beneficial to BN feature extraction than senone posterior estimation.
Although the denoised BN-based senone i-vectors improve noise robustness significantly compared to the MFCC-GMM ones, adverse acoustic conditions and duration variability in utterances can still have a detrimental effect on PLDA scores. This thesis therefore also proposes and investigates several DNN-based PLDA score compensation, transformation and calibration algorithms for enhancing the noise robustness of i-vector/PLDA systems. Unlike conventional calibration methods, in which the required score shift is a linear function of SNR or log-duration, the DNN approach learns the complex relationship between the score shifts and the combination of i-vector pairs and uncalibrated scores. Furthermore, with the flexibility of DNNs, it is possible to explicitly train a DNN to recover the clean scores without having to estimate the score shifts. To alleviate the overfitting problem, multi-task learning is applied to incorporate auxiliary information, such as the SNRs and speaker IDs of training utterances, into the DNN. Experiments on NIST 2012 SRE show that score calibration derived from multi-task DNNs can improve the performance of the conventional score-shift approach significantly, especially under noisy conditions.
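Illustrative note (not part of the thesis): the conventional score-shift calibration that the abstract contrasts with the DNN approach can be sketched as a least-squares fit of the shift against SNR and log-duration. This is only a minimal sketch under stated assumptions: the function names, the synthetic data, the coefficient values, and the choice to use SNR and log-duration together in one linear model are all hypothetical and are not taken from the thesis.

```python
import numpy as np


def design_matrix(snr_db, log_dur):
    """Columns: bias, SNR in dB, log-duration (hypothetical feature choice)."""
    return np.column_stack([np.ones_like(snr_db), snr_db, log_dur])


def fit_linear_score_shift(snr_db, log_dur, clean_scores, noisy_scores):
    """Least-squares fit of: shift = w0 + w1 * snr_db + w2 * log_dur,
    where shift is the amount to add to a noisy score to recover the clean one."""
    shift = clean_scores - noisy_scores
    w, *_ = np.linalg.lstsq(design_matrix(snr_db, log_dur), shift, rcond=None)
    return w


def calibrate(noisy_scores, snr_db, log_dur, w):
    """Add the predicted shift to each uncalibrated score."""
    return noisy_scores + design_matrix(snr_db, log_dur) @ w


# Synthetic trials (assumed values, for illustration only).
rng = np.random.default_rng(0)
n_trials = 200
snr_db = rng.uniform(0.0, 30.0, n_trials)
log_dur = np.log(rng.uniform(5.0, 60.0, n_trials))
clean_scores = rng.normal(0.0, 1.0, n_trials)

# Made-up "true" shift coefficients used to degrade the clean scores.
true_w = np.array([-3.0, 0.08, 0.4])
noisy_scores = clean_scores - design_matrix(snr_db, log_dur) @ true_w

w = fit_linear_score_shift(snr_db, log_dur, clean_scores, noisy_scores)
calibrated = calibrate(noisy_scores, snr_db, log_dur, w)
```

Because the synthetic shift here is exactly linear, the fit recovers it; the abstract's point is that real score shifts depend on the i-vector pairs and uncalibrated scores in a more complex way, which a linear model of this kind cannot capture.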
Rights:	All rights reserved
Access:	open access

Files in This Item:

File	Description	Size	Format
991022163355703411.pdf	For All Users	4.32 MB	Adobe PDF

Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/9625