Full metadata record
DC Field | Value | Language
dc.contributor | Department of Electronic and Information Engineering | en_US
dc.contributor.advisor | Mak, M. W. (EIE) | -
dc.creator | Yao, Qi | -
dc.identifier.uri | https://theses.lib.polyu.edu.hk/handle/200/9571 | -
dc.language | English | en_US
dc.publisher | Hong Kong Polytechnic University | -
dc.rights | All rights reserved | en_US
dc.title | SNR-invariant deep neural networks using multi-task learning for robust I-vector speaker verification | en_US
dcterms.abstract | Text-independent speaker verification (SV) is a binary classification task that aims to verify the identity of a speaker by analyzing and classifying his or her voice. The i-vector feature representation, together with the probabilistic linear discriminant analysis (PLDA) backend, has achieved state-of-the-art performance. However, the i-vector/PLDA framework remains challenging to apply in real-world noisy environments, because i-vectors capture all kinds of variabilities in the total variability space. This dissertation shows that i-vectors form clusters according to the SNR levels of utterances. In light of this SNR-dependent clustering phenomenon, we propose three deep neural networks (DNNs) to compensate for channel and SNR variabilities directly in the i-vector space: the Regression DNN (RDNN), the Hierarchical Regression DNNs (H-RDNNs), and the Multi-Task DNN (MT-DNN). The RDNN takes noisy i-vectors as input and maps them to speaker-dependent cluster means. The H-RDNNs are formed by stacking a second regression DNN on top of the RDNN; this second stage regularizes the outliers that the RDNN cannot denoise properly. The MT-DNN uses speaker classification as an auxiliary task to retain speaker information in the denoised i-vectors, and this auxiliary task is trained together with the primary (regression) task using an alternating-backpropagation algorithm. Among the DNN-based denoising models, the MT-DNN achieves the best denoising performance. Experiments on NIST 2012 SRE suggest that the DNN-based approaches, together with the PLDA backend, significantly outperform the multi-condition PLDA model and mixtures of PLDA models. Furthermore, the MT-DNN achieves considerable improvements, with an average reduction of 23% in EER and 9% in minDCF in Common Conditions (CC) 4 and 5, even under SNR-mismatched conditions. | en_US
dcterms.extent | xvi, 92 pages : color illustrations | en_US
dcterms.isPartOf | PolyU Electronic Theses | en_US
dcterms.issued | 2018 | en_US
dcterms.educationalLevel | M.Sc. | en_US
dcterms.educationalLevel | All Master | en_US
dcterms.LCSH | Hong Kong Polytechnic University -- Dissertations | en_US
dcterms.LCSH | Automatic speech recognition | en_US
dcterms.LCSH | Speech processing systems | en_US
dcterms.accessRights | restricted access | en_US
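
The abstract above describes the MT-DNN as a network that denoises noisy i-vectors (primary regression task) while an auxiliary speaker-classification head retains speaker information, the two tasks being trained with alternating backpropagation. Below is a minimal sketch of that idea, assuming PyTorch; the i-vector dimensionality, number of speakers, layer sizes, optimizer, and loss functions are illustrative assumptions, not details taken from the thesis.

# Minimal sketch (not the authors' code) of a multi-task DNN for i-vector denoising:
# a shared encoder feeds a regression head (denoised i-vector) and an auxiliary
# speaker-classification head, updated in alternation. All sizes are assumptions.
import torch
import torch.nn as nn

IVEC_DIM = 500       # assumed i-vector dimensionality
N_SPEAKERS = 100     # assumed number of training speakers

class MTDNN(nn.Module):
    def __init__(self, ivec_dim=IVEC_DIM, n_speakers=N_SPEAKERS, hidden=1024):
        super().__init__()
        # Hidden layers shared by both tasks.
        self.shared = nn.Sequential(
            nn.Linear(ivec_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Primary task: regress noisy i-vectors to clean, speaker-dependent targets.
        self.regress = nn.Linear(hidden, ivec_dim)
        # Auxiliary task: classify the speaker to retain speaker information.
        self.classify = nn.Linear(hidden, n_speakers)

    def forward(self, x):
        h = self.shared(x)
        return self.regress(h), self.classify(h)

def train_step(model, opt, noisy_ivec, clean_target, spk_label):
    """One alternating step: update on the regression loss, then on the
    classification loss."""
    mse, xent = nn.MSELoss(), nn.CrossEntropyLoss()

    # Primary (regression) pass.
    opt.zero_grad()
    denoised, _ = model(noisy_ivec)
    mse(denoised, clean_target).backward()
    opt.step()

    # Auxiliary (speaker classification) pass.
    opt.zero_grad()
    _, logits = model(noisy_ivec)
    xent(logits, spk_label).backward()
    opt.step()

# Toy usage with random tensors standing in for i-vectors.
model = MTDNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
noisy = torch.randn(32, IVEC_DIM)
clean = torch.randn(32, IVEC_DIM)               # speaker-dependent cluster means in practice
labels = torch.randint(0, N_SPEAKERS, (32,))
train_step(model, opt, noisy, clean, labels)

In this sketch, "alternating backpropagation" is realized as two successive parameter updates per mini-batch, one per task; the thesis may alternate at a different granularity or weight the losses differently.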

Files in This Item:
File | Description | Size | Format
991022144624903411.pdf | For All Users (off-campus access for PolyU Staff & Students only) | 6.34 MB | Adobe PDF


Copyright Undertaking

As a bona fide Library user, I declare that:

  1. I will abide by the rules and legal ordinances governing copyright regarding the use of the Database.
  2. I will use the Database for the purpose of my research or private study only and not for circulation or further reproduction or any other purpose.
  3. I agree to indemnify and hold the University harmless from and against any loss, damage, cost, liability or expenses arising from copyright infringement or unauthorized usage.

By downloading any item(s) listed above, you acknowledge that you have read and understood the copyright undertaking as stated above, and agree to be bound by all of its terms.


Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/9571