|Deep speaker embedding for far-field speaker recognition
|Mak, M. W. (EIE)
|Automatic speech recognition
Hong Kong Polytechnic University -- Dissertations
|Department of Electronic and Information Engineering
|vii, 81 pages : color illustrations
|Speaker recognition is regarded as an important direction in the future development of artificial intelligence. As the technology progresses, people are expected to communicate with machines directly through speech, so speaker recognition has attracted increasing attention. Demand for far-field speaker recognition is growing, and recognition under complex acoustic environments, which may contain background noise, reverberation, and/or interfering voices, has great potential. Improving the performance of far-field speaker recognition, however, remains a challenging task. The main contributions of this dissertation are as follows.
Reverberation and noise can severely degrade speech quality and speaker recognition performance. This dissertation presents a speech enhancement algorithm based on Weighted Prediction Error (WPE) and the Dual-signal Transformation LSTM Network (DTLN). WPE performs dereverberation quickly and accurately, and DTLN is capable of real-time noise suppression. WPE requires long speech segments, whereas DTLN can process short ones. The dissertation uses WPE to establish a linear mapping from the original speech to de-reverberated speech and adopts DTLN to build a nonlinear mapping from de-reverberated speech to enhanced speech; the combination is expected to improve both dereverberation and denoising. Experimental results show that the proposed algorithm effectively extracts enhanced speech from far-field signals corrupted by reverberation and noise, thereby improving speaker recognition performance.
X-vector-based speaker recognition systems achieve good performance in text-independent tasks and are more robust than traditional speaker recognition systems. Deep residual networks (ResNets) use shortcut connections to enable the training of very deep neural networks. This dissertation therefore presents an improved speaker recognition algorithm that extracts x-vectors by combining time-delay neural networks (TDNNs) with deep ResNets and then verifies speakers by applying Probabilistic Linear Discriminant Analysis (PLDA) to the resulting x-vectors. The deep ResNet model is further improved by introducing multi-scale features, so that the model extracts and fuses features at different scales. Experimental results show that the TDNN and multi-scale deep ResNet structure achieves the best accuracy and robustness compared with the TDNN and convolutional neural network (CNN) structure and the TDNN and deep ResNet structure.
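The two ideas named in the abstract, a time-delay layer with a shortcut connection and a fixed-length utterance embedding, can be illustrated with a minimal sketch. This is not the dissertation's architecture: the layer sizes, the context offsets, the random weights, and all function names are assumptions made for illustration, and the mean-and-standard-deviation pooling step is the one commonly used in x-vector systems rather than something the abstract states. PLDA scoring and the multi-scale fusion are left out.

```python
import math
import random


def splice(frames, t, context):
    """Concatenate the frames at the given offsets around time t (edges clamped)."""
    last = len(frames) - 1
    spliced = []
    for offset in context:
        idx = min(max(t + offset, 0), last)
        spliced.extend(frames[idx])
    return spliced


def residual_tdnn_layer(frames, weights, bias, context=(-1, 0, 1)):
    """Time-delay layer with a shortcut: y[t] = x[t] + relu(W . splice(x, t) + b)."""
    out = []
    for t, x_t in enumerate(frames):
        ctx = splice(frames, t, context)
        hidden = [
            max(0.0, sum(w * c for w, c in zip(row, ctx)) + b)
            for row, b in zip(weights, bias)
        ]
        # Shortcut connection: add the layer input back onto the transformed output.
        out.append([x + h for x, h in zip(x_t, hidden)])
    return out


def statistics_pooling(frames):
    """Map a variable-length frame sequence to a fixed-length [mean, std] vector."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    std = [
        math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return mean + std


def embed(frames, weights, bias):
    """Toy utterance embedding: one residual time-delay layer, then pooling."""
    return statistics_pooling(residual_tdnn_layer(frames, weights, bias))


def random_utterance(num_frames, dim, rng):
    """Hypothetical stand-in for acoustic features such as MFCC frames."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_frames)]


rng = random.Random(0)
dim = 4
# Untrained random weights; a real system would learn these from labeled speech.
weights = [[rng.gauss(0.0, 0.1) for _ in range(3 * dim)] for _ in range(dim)]
bias = [0.0] * dim

short = embed(random_utterance(50, dim, rng), weights, bias)
long = embed(random_utterance(200, dim, rng), weights, bias)
print(len(short), len(long))  # both embeddings have 2 * dim = 8 values
```

Utterances of 50 and 200 frames both map to an 8-dimensional vector, which is what lets a back end such as PLDA compare recordings of different lengths. With all-zero weights the layer reduces to the identity, which shows why shortcut connections make very deep stacks easier to train.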
|All rights reserved