SNR-invariant deep neural networks using multi-task learning for robust I-vector speaker verification

Yao, Qi

Full metadata record

DC Field	Value	Language
dc.contributor	Department of Electronic and Information Engineering	en_US
dc.contributor.advisor	Mak, M. W. (EIE)	-
dc.creator	Yao, Qi	-
dc.identifier.uri	https://theses.lib.polyu.edu.hk/handle/200/9571	-
dc.language	English	en_US
dc.publisher	Hong Kong Polytechnic University	-
dc.rights	All rights reserved	en_US
dc.title	SNR-invariant deep neural networks using multi-task learning for robust I-vector speaker verification	en_US
dcterms.abstract	Text-independent speaker verification (SV) is a binary classification task that aims to verify the identity of speakers by analyzing and classifying their voices. The i-vector feature representation together with the probabilistic linear discriminant analysis (PLDA) backend has achieved state-of-the-art performance. However, applying the i-vector/PLDA framework to real-world noisy environments remains challenging, because i-vectors represent all kinds of variability in the total variability space. This dissertation shows that i-vectors form clusters according to the SNR level of utterances. In light of this SNR-dependent clustering phenomenon, we propose three deep neural networks (DNNs) to compensate for channel and SNR variabilities directly in the i-vector space: the Regression DNN (RDNN), the Hierarchical Regression DNNs (H-RDNNs), and the Multi-Task DNN (MT-DNN). The RDNN takes noisy i-vectors as input and maps them to speaker-dependent cluster means. The H-RDNNs are formed by stacking a second regression DNN on top of the RDNN; this second stage aims to regularize the outliers that cannot be denoised properly by the RDNN. The MT-DNN uses an extra speaker classification task as an auxiliary task to retain speaker information in the denoised i-vectors. This secondary task is trained together with the primary (regression) task using an alternating-backpropagation algorithm. We found that among all DNN-based denoising models, the MT-DNN achieves the best performance in denoising noisy i-vectors. Experiments based on NIST 2012 SRE suggest that DNN-based approaches together with the PLDA backend significantly outperform the multi-condition PLDA model and the mixture of PLDA models. Furthermore, the MT-DNN achieves considerable improvements, with a 23% reduction in EER and a 9% reduction in minDCF on average in Common Conditions (CC) 4 and 5, even in an SNR mismatch condition.	en_US
dcterms.extent	xvi, 92 pages : color illustrations	en_US
dcterms.isPartOf	PolyU Electronic Theses	en_US
dcterms.issued	2018	en_US
dcterms.educationalLevel	M.Sc.	en_US
dcterms.educationalLevel	All Master	en_US
dcterms.LCSH	Hong Kong Polytechnic University -- Dissertations	en_US
dcterms.LCSH	Automatic speech recognition	en_US
dcterms.LCSH	Speech processing systems	en_US
dcterms.accessRights	restricted access	en_US
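
The abstract states that the RDNN maps noisy i-vectors to speaker-dependent cluster means. As a rough illustration only, the sketch below shows one way such regression targets could be assembled. It assumes each target is the per-speaker average of a reference set of i-vectors; the abstract does not say which i-vectors are averaged, and the function names, speaker labels, and toy 2-dimensional vectors are all hypothetical, not taken from the dissertation.

```python
from collections import defaultdict


def speaker_means(ivectors, speaker_ids):
    """Average the i-vectors of each speaker, dimension by dimension."""
    grouped = defaultdict(list)
    for vec, spk in zip(ivectors, speaker_ids):
        grouped[spk].append(vec)
    return {
        spk: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for spk, vecs in grouped.items()
    }


def regression_pairs(noisy_ivectors, speaker_ids, means):
    """Pair every noisy i-vector with its speaker's mean (the regression target)."""
    return [(vec, means[spk]) for vec, spk in zip(noisy_ivectors, speaker_ids)]


# Toy 2-dimensional stand-ins; real i-vectors have hundreds of dimensions.
# Assumption: targets are means over this hypothetical reference set.
reference = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
reference_spk = ["spk_a", "spk_a", "spk_b"]
noisy = [[1.5, 2.5], [9.0, 1.0]]
noisy_spk = ["spk_a", "spk_b"]

means = speaker_means(reference, reference_spk)
pairs = regression_pairs(noisy, noisy_spk, means)
```

Each element of `pairs` is an (input, target) tuple that a regression network could be trained on; the network itself and the multi-task training described in the abstract are outside the scope of this sketch.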

Files in This Item:

File	Description	Size	Format
991022144624903411.pdf	For All Users (off-campus access for PolyU Staff & Students only)	6.34 MB	Adobe PDF

Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/9571