Full metadata record
DC Field: Value [Language]
dc.contributor: Department of Electronic and Information Engineering [en_US]
dc.creator: Yiu, Kwok-kwong Michael
dc.identifier.uri: https://theses.lib.polyu.edu.hk/handle/200/237
dc.language: English [en_US]
dc.publisher: Hong Kong Polytechnic University
dc.rights: All rights reserved [en_US]
dc.title: Feature and model transformation techniques for robust speaker verification [en_US]
dcterms.abstract [en_US]:
Speaker verification aims to verify the identity of a speaker based on his or her voice. It has potential applications in securing remote-access services such as phone banking and mobile commerce. While today's speaker verification systems perform reasonably well under controlled conditions, their performance is often compromised in real-world environments. In particular, variation in handset characteristics is known to be the major cause of performance degradation. This dissertation addresses the robustness of speaker verification systems from three angles: speaker modeling, feature transformation, and model transformation.

The dissertation begins with an investigation of the effectiveness of three kernel-based neural networks for speaker modeling: probabilistic decision-based neural networks (PDBNNs), Gaussian mixture models (GMMs), and elliptical basis function networks (EBFNs). Based on the thresholding mechanism of PDBNNs, the original PDBNN training algorithm was modified to make PDBNNs suitable for speaker verification. Experimental results show that GMM- and PDBNN-based speaker models outperform the EBFN-based ones in both clean and noisy environments. It was also found that the modified learning algorithm of PDBNNs finds decision thresholds that reduce the variation in false acceptance rates, whereas the ad hoc threshold-determination approach used by the EBFNs and GMMs causes a large variation in false acceptance rates. This property makes the performance of PDBNN-based systems more predictable.

The effect of handset variation can be suppressed by transforming clean speech models to fit handset-distorted speech. To this end, the dissertation proposes a model-based transformation technique that combines handset-dependent model transformation with reinforced learning. Specifically, the approach transforms the clean speaker model and the clean background model to fit the distorted speech using maximum-likelihood linear regression (MLLR), and then adapts the transformed models via the PDBNN's reinforced learning. It was found that MLLR brings the clean models to a region close to the distorted speech, and that reinforced learning is an effective means of fine-tuning the transformed models to sharpen the distinction between client speakers and impostors.

In addition to model-based approaches, handset variation can also be suppressed by feature-based approaches. Current feature-based approaches typically identify the handset in use as one of the known handsets in a handset database and use a priori knowledge about the identified handset to modify the features. It would be far more practical and cost-effective, however, to adopt systems that need no handset detector. To this end, the dissertation proposes a blind compensation algorithm for the situation in which no a priori knowledge about the handset is available (i.e., the handset in use is not in the handset database). Specifically, a composite statistical model, formed by fusing a speaker model with a background model, represents the characteristics of the enrollment speech. Based on the difference between the claimant's speech and the composite model, a stochastic-matching approach transforms the claimant's speech to a region close to the enrollment speech. The algorithm can therefore estimate the transformation online without having to detect the handset type.

Experimental results on the 2001 NIST Speaker Recognition Evaluation set show that the proposed approach achieves significant improvements in both equal error rate and minimum detection cost compared with cepstral mean subtraction, Znorm, and short-time Gaussianization. (Illustrative sketches of the main computations described above follow this record.)
dcterms.extent: xvii, 158 leaves : ill. (some col.) ; 30 cm [en_US]
dcterms.isPartOf: PolyU Electronic Theses [en_US]
dcterms.issued: 2005 [en_US]
dcterms.educationalLevel: All Doctorate [en_US]
dcterms.educationalLevel: Ph.D. [en_US]
dcterms.LCSH: Hong Kong Polytechnic University -- Dissertations [en_US]
dcterms.LCSH: Automatic speech recognition [en_US]
dcterms.accessRights: open access [en_US]
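
The abstract describes GMM-based speaker models that are scored against a background model. As a rough illustration of that verification step (a minimal sketch, not code from the thesis; all function names, parameter shapes, and the diagonal-covariance assumption are mine), the following computes a frame-averaged log-likelihood ratio between a claimed speaker's GMM and a background model and compares it against a decision threshold:

import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.
    X: (T, D) feature frames (e.g. cepstral vectors); weights: (M,);
    means, variances: (M, D)."""
    diff = X[:, None, :] - means[None, :, :]                        # (T, M, D)
    log_comp = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)[None, :] \
               - 0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)
    logw = np.log(weights)[None, :] + log_comp                      # (T, M)
    m = logw.max(axis=1, keepdims=True)                             # log-sum-exp
    return m[:, 0] + np.log(np.exp(logw - m).sum(axis=1))

def verify(X, speaker_gmm, background_gmm, threshold):
    """Accept the claimed identity when the average log-likelihood ratio
    of the speaker model over the background model exceeds a threshold.
    Each *_gmm argument is a (weights, means, variances) tuple."""
    llr = gmm_log_likelihood(X, *speaker_gmm).mean() \
          - gmm_log_likelihood(X, *background_gmm).mean()
    return llr > threshold, llr

The thesis's point about PDBNNs concerns how that threshold is learned rather than fixed ad hoc; this sketch leaves the threshold as an input.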
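The model-transformation chapter uses MLLR to move clean models toward handset-distorted speech. The snippet below shows only the application of a global MLLR mean transform (mu' = A mu + b) to a model's Gaussian means; estimating (A, b) by maximum likelihood from adaptation data, and the subsequent PDBNN reinforced-learning refinement described in the abstract, are omitted. The function name is hypothetical.

def apply_mllr_mean_transform(means, A, b):
    """Apply a global MLLR mean transform mu' = A @ mu + b to every
    Gaussian mean (means: (M, D), A: (D, D), b: (D,)), moving a clean
    model toward handset-distorted speech."""
    return means @ A.T + b[None, :]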
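The blind compensation algorithm estimates a feature transformation online from the mismatch between the claimant's speech and a composite (speaker-plus-background) model. The sketch below is a deliberately simplified, bias-only variant of stochastic matching: it re-estimates a single cepstral offset by EM so that the shifted features fit a composite GMM, with no handset detector. The composite model is assumed here to be an ordinary GMM (e.g. pooled speaker and background components); the thesis's actual transformation is richer than a single bias.

import numpy as np

def component_posteriors(X, weights, means, variances):
    """Responsibilities gamma[t, m] of each diagonal-covariance Gaussian
    component for each feature frame."""
    diff = X[:, None, :] - means[None, :, :]                        # (T, M, D)
    log_comp = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)[None, :] \
               - 0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)
    logw = np.log(weights)[None, :] + log_comp
    logw -= logw.max(axis=1, keepdims=True)                         # stabilize exp
    g = np.exp(logw)
    return g / g.sum(axis=1, keepdims=True)

def blind_bias_compensation(Y, weights, means, variances, n_iter=5):
    """Bias-only stochastic matching: find one offset b so that the
    shifted features Y - b best fit the composite GMM. Returns the
    compensated features and the estimated bias."""
    b = np.zeros(Y.shape[1])
    for _ in range(n_iter):
        gamma = component_posteriors(Y - b, weights, means, variances)  # E-step
        # M-step: precision-weighted average of per-component residuals
        resid = (Y[:, None, :] - means[None, :, :]) / variances[None, :, :]
        num = (gamma[:, :, None] * resid).sum(axis=(0, 1))
        den = (gamma[:, :, None] / variances[None, :, :]).sum(axis=(0, 1))
        b = num / den
    return Y - b, b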
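The reported results are given as equal error rate and minimum detection cost. For readers unfamiliar with those metrics, here is a standard way to compute both from genuine and impostor score lists; the cost parameters (C_miss = 10, C_fa = 1, P_target = 0.01) are the commonly cited NIST settings and are an assumption here, not taken from the thesis.

import numpy as np

def equal_error_rate(genuine, impostor):
    """Sweep the threshold over all observed scores and return the
    operating point where false rejection ~= false acceptance."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    frr = np.array([(genuine < t).mean() for t in thresholds])    # misses
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false alarms
    i = int(np.argmin(np.abs(frr - far)))
    return (frr[i] + far[i]) / 2.0

def min_detection_cost(genuine, impostor, p_target=0.01, c_miss=10.0, c_fa=1.0):
    """Minimum of the detection cost function over all thresholds."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    costs = [c_miss * (genuine < t).mean() * p_target
             + c_fa * (impostor >= t).mean() * (1.0 - p_target)
             for t in thresholds]
    return min(costs)

# Toy usage with synthetic scores (purely illustrative):
rng = np.random.default_rng(0)
genuine = rng.normal(1.0, 1.0, 500)
impostor = rng.normal(-1.0, 1.0, 5000)
print(equal_error_rate(genuine, impostor), min_detection_cost(genuine, impostor))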

Files in This Item:
File: b18099191.pdf (For All Users), 5.71 MB, Adobe PDF




Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/237