Author: Lin, Weiwei
Title: Robust speaker recognition using deep neural networks
Advisors: Mak, M. W. (EIE)
Degree: Ph.D.
Year: 2020
Subject: Automatic speech recognition
Hong Kong Polytechnic University -- Dissertations
Department: Department of Electronic and Information Engineering
Pages: xvii, 113 pages : color illustrations
Language: English
Abstract: Speaker recognition refers to recognizing a person using his/her voice. Although state-of-the-art speaker recognition systems have shown remarkable performance, there are still some unsolved problems. Firstly, speaker recognition systems' performance degrades significantly when training data and test data have domain mismatch. Domain mismatch is prevalent and is expected to happen during system deployment. This could occur when the new environment has some specific noise or involves speakers speaking different languages than training speakers. Directly using the existing system in these situations could result in poor performance. Secondly, the statistics pooling layer in state-of-the-art systems does not have rich representation power to capture the complex characteristics of frame-level features. The statistics pooling layer only uses the mean and standard deviation of frame-level features. However, mean and standard deviation are insufficient for summarizing a complex distribution. Thirdly, state-of-the-art systems still rely on a PLDA backend, which makes deployment difficult and hinders the potential of the DNN frontend. This thesis proposes several solutions to the problems mentioned above. For reducing the domain mismatch, this thesis proposes adaptation methods for both DNN frontend and PLDA backend. The proposed backend adaptation uses an auto-encoder to minimize the domain mismatch between i-vectors, while the frontend adaptation focuses on producing speaker embedding that is both discriminative and domain-invariant. Using the proposed adaptation framework, we achieve an EER of 8.69% and 7.95% in NIST SRE 2016 and 2018, respectively, which are significantly better than the previously proposed DNN adaptation methods. For better frame-level information aggregation in the DNN, this thesis proposes an attention-based statistics pooling method, which uses an expectation-maximization (EM) like algorithm to produce multiple means and standard deviations for summarizing frame-level features distribution. Applying the proposed attention mechanism to a 121-layer Densenet, we achieve an EER of 1.1% in VoxCeleb1 and an EER of 4.77% in the VOiCES 2019 evaluation set. For facilitating end-to-end speaker recognition, this thesis proposes several strategies to eliminate the need of a backend model. Experiments on NIST SRE 2016 and 2018 show that with the proposed strategies, the DNN can achieve state-of-the-art performance using simple cosine similarity and requires only half of the computational cost of the x-vector network.
Rights: All rights reserved
Access: open access

Files in This Item:
File Description SizeFormat 
5256.pdfFor All Users2.91 MBAdobe PDFView/Open


Copyright Undertaking

As a bona fide Library user, I declare that:

  1. I will abide by the rules and legal ordinances governing copyright regarding the use of the Database.
  2. I will use the Database for the purpose of my research or private study only and not for circulation or further reproduction or any other purpose.
  3. I agree to indemnify and hold the University harmless from and against any loss, damage, cost, liability or expenses arising from copyright infringement or unauthorized usage.

By downloading any item(s) listed above, you acknowledge that you have read and understood the copyright undertaking as stated above, and agree to be bound by all of its terms.

Show full item record

Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/10812