Feature and model transformation techniques for robust speaker verification

Yiu, Kwok-kwong Michael

Author:	Yiu, Kwok-kwong Michael
Title:	Feature and model transformation techniques for robust speaker verification
Degree:	Ph.D.
Year:	2005
Subject:	Hong Kong Polytechnic University -- Dissertations; Automatic speech recognition
Department:	Department of Electronic and Information Engineering
Pages:	xvii, 158 leaves : ill. (some col.) ; 30 cm
Language:	English
Abstract:	Speaker verification is the task of verifying a speaker's identity from his or her voice. It has potential applications in securing remote access services such as phone banking and mobile commerce. While today's speaker verification systems perform reasonably well under controlled conditions, their performance is often compromised in real-world environments. In particular, variations in handset characteristics are known to be the major cause of performance degradation. This dissertation addresses the robustness of speaker verification systems from three angles: speaker modeling, feature transformation, and model transformation.
The dissertation begins with an investigation of the effectiveness of three kernel-based neural networks for speaker modeling: probabilistic decision-based neural networks (PDBNNs), Gaussian mixture models (GMMs), and elliptical basis function networks (EBFNs). Based on the thresholding mechanism of PDBNNs, the original PDBNN training algorithm was modified to make PDBNNs appropriate for speaker verification. Experimental results show that GMM- and PDBNN-based speaker models outperform EBFN-based ones in both clean and noisy environments. It was also found that the modified PDBNN learning algorithm finds decision thresholds that reduce the variation in false acceptance rates, whereas the ad hoc threshold-determination approach used by the EBFNs and GMMs causes a large variation in false acceptance rates. This property makes the performance of PDBNN-based systems more predictable.
The effect of handset variation can be suppressed by transforming clean speech models to fit the handset-distorted speech. To this end, this dissertation proposes a model-based transformation technique that combines handset-dependent model transformation and reinforced learning. Specifically, the approach transforms the clean speaker model and the clean background model to fit the distorted speech by using maximum-likelihood linear regression (MLLR), and then adapts the transformed models via the PDBNN's reinforced learning. It was found that MLLR brings the clean models to a region close to the distorted speech and that reinforced learning is a good means of fine-tuning the transformed models to enhance the distinction between client speakers and impostors.
In addition to model-based approaches, handset variation can also be suppressed by feature-based approaches. Current feature-based approaches typically identify the handset in use as one of the known handsets in a handset database and use a priori knowledge about the identified handset to modify the features. However, systems that need no handset detector would be much more practical and cost-effective. To this end, this dissertation proposes a blind compensation algorithm to handle the situation in which no a priori knowledge about the handset is available (i.e., the handset in use is not in the handset database). Specifically, a composite statistical model formed by fusing a speaker model and a background model is used to represent the characteristics of the enrollment speech. Based on the difference between the claimant's speech and the composite model, a stochastic-matching type of approach is proposed to transform the claimant's speech to a region close to the enrollment speech. The algorithm can therefore estimate the transformation online without detecting the handset type. Experimental results based on the 2001 NIST Speaker Recognition Evaluation set show that the proposed approach achieves significant improvement in both equal error rate and minimum detection cost compared with cepstral mean subtraction, Znorm, and short-time Gaussianization.
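For readers unfamiliar with the baselines and the metric named at the end of the abstract, the following is a minimal Python sketch of cepstral mean subtraction, Znorm score normalization, and the equal error rate, using their generic textbook definitions. It is not taken from the dissertation: the function names, array shapes, and the simple threshold sweep are illustrative assumptions, and the dissertation's own implementation may differ.

```python
import numpy as np

def cepstral_mean_subtraction(features):
    """Remove the per-utterance mean of each cepstral coefficient.

    features: array of shape (num_frames, num_coeffs) (assumed layout).
    A fixed channel appears as an additive offset in the cepstral
    domain, so subtracting the mean suppresses it.
    """
    features = np.asarray(features, dtype=float)
    return features - features.mean(axis=0, keepdims=True)

def znorm(score, impostor_scores):
    """Standardize a raw verification score with the mean and standard
    deviation of impostor scores obtained against the same speaker model."""
    mu = np.mean(impostor_scores)
    sigma = np.std(impostor_scores)
    return (score - mu) / sigma

def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping the threshold over all observed
    scores and taking the point where the false acceptance rate and
    false rejection rate are closest."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= t)  # impostors accepted at threshold t
        frr = np.mean(genuine < t)    # clients rejected at threshold t
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer
```

With perfectly separated scores, for example `equal_error_rate([2, 3, 4], [-1, 0, 1])`, the sweep finds a threshold at which both error rates are zero. Real score distributions overlap, and evaluations usually interpolate between thresholds rather than picking the nearest observed one.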
Rights:	All rights reserved
Access:	open access

Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/237