Full metadata record
DC Field | Value | Language
dc.contributor | Department of Electronic and Information Engineering | en_US
dc.contributor.advisor | Mak, M. W. (EIE) | en_US
dc.creator | Lin, Weiwei | -
dc.identifier.uri | https://theses.lib.polyu.edu.hk/handle/200/10812 | -
dc.language | English | en_US
dc.publisher | Hong Kong Polytechnic University | en_US
dc.rights | All rights reserved | en_US
dc.title | Robust speaker recognition using deep neural networks | en_US
dcterms.abstract | Speaker recognition refers to recognizing a person from his/her voice. Although state-of-the-art speaker recognition systems have shown remarkable performance, some problems remain unsolved. Firstly, the performance of speaker recognition systems degrades significantly when there is a domain mismatch between the training and test data. Domain mismatch is prevalent and is expected to occur during system deployment, for example when the new environment contains specific noise or involves speakers speaking languages different from those of the training speakers. Directly using an existing system in these situations can result in poor performance. Secondly, the statistics pooling layer in state-of-the-art systems lacks the representational power to capture the complex characteristics of frame-level features: it uses only the mean and standard deviation of the frame-level features, which are insufficient for summarizing a complex distribution. Thirdly, state-of-the-art systems still rely on a PLDA backend, which complicates deployment and limits the potential of the DNN frontend. This thesis proposes several solutions to these problems. To reduce domain mismatch, the thesis proposes adaptation methods for both the DNN frontend and the PLDA backend. The proposed backend adaptation uses an auto-encoder to minimize the domain mismatch between i-vectors, while the frontend adaptation focuses on producing speaker embeddings that are both discriminative and domain-invariant. Using the proposed adaptation framework, we achieve EERs of 8.69% and 7.95% on NIST SRE 2016 and 2018, respectively, which are significantly better than previously proposed DNN adaptation methods. For better frame-level information aggregation in the DNN, the thesis proposes an attention-based statistics pooling method that uses an expectation-maximization (EM)-like algorithm to produce multiple means and standard deviations for summarizing the distribution of frame-level features. Applying the proposed attention mechanism to a 121-layer DenseNet, we achieve an EER of 1.1% on VoxCeleb1 and an EER of 4.77% on the VOiCES 2019 evaluation set. To facilitate end-to-end speaker recognition, the thesis proposes several strategies that eliminate the need for a backend model. Experiments on NIST SRE 2016 and 2018 show that with the proposed strategies, the DNN achieves state-of-the-art performance using simple cosine similarity while requiring only half the computational cost of the x-vector network. | en_US
dcterms.extent | xvii, 113 pages : color illustrations | en_US
dcterms.isPartOf | PolyU Electronic Theses | en_US
dcterms.issued | 2020 | en_US
dcterms.educationalLevel | Ph.D. | en_US
dcterms.educationalLevel | All Doctorate | en_US
dcterms.LCSH | Automatic speech recognition | en_US
dcterms.LCSH | Hong Kong Polytechnic University -- Dissertations | en_US
dcterms.accessRights | open access | en_US
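
Note on the abstract: it contrasts conventional statistics pooling (a single mean and standard deviation computed over frame-level features) with backend-free scoring by cosine similarity. The following Python sketch illustrates only those two standard operations, not the thesis's attention-based pooling, adaptation methods, or network code; the array shapes, toy data, and function names are assumptions made for this illustration.

    import numpy as np

    def stats_pooling(frames: np.ndarray) -> np.ndarray:
        """Conventional statistics pooling: concatenate the per-dimension
        mean and standard deviation of frame-level features.

        frames: (num_frames, feat_dim) matrix of frame-level features.
        Returns a (2 * feat_dim,) utterance-level vector.
        """
        mean = frames.mean(axis=0)
        std = frames.std(axis=0)
        return np.concatenate([mean, std])

    def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
        """Cosine similarity between two speaker embeddings, i.e. the
        simple backend-free scoring mentioned in the abstract."""
        denom = np.linalg.norm(emb_a) * np.linalg.norm(emb_b) + 1e-12
        return float(emb_a @ emb_b / denom)

    # Toy usage: two utterances, each with 300 frames of 40-dimensional features.
    rng = np.random.default_rng(0)
    utt_a = rng.standard_normal((300, 40))
    utt_b = rng.standard_normal((300, 40))
    print(cosine_score(stats_pooling(utt_a), stats_pooling(utt_b)))
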

Files in This Item:
File | Description | Size | Format
5256.pdf | For All Users | 2.91 MB | Adobe PDF


Copyright Undertaking

As a bona fide Library user, I declare that:

  1. I will abide by the rules and legal ordinances governing copyright regarding the use of the Database.
  2. I will use the Database for the purpose of my research or private study only and not for circulation or further reproduction or any other purpose.
  3. I agree to indemnify and hold the University harmless from and against any loss, damage, cost, liability or expenses arising from copyright infringement or unauthorized usage.

By downloading any item(s) listed above, you acknowledge that you have read and understood the copyright undertaking as stated above, and agree to be bound by all of its terms.


Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/10812