Full metadata record
DC Field: Value [Language]
dc.contributor: Department of Electronic and Information Engineering [en_US]
dc.creator: Yiu, Kwok-kwong Michael
dc.identifier.uri: https://theses.lib.polyu.edu.hk/handle/200/237
dc.language: English [en_US]
dc.publisher: Hong Kong Polytechnic University
dc.rights: All rights reserved [en_US]
dc.title: Feature and model transformation techniques for robust speaker verification [en_US]
dcterms.abstract [en_US]:
Speaker verification aims to verify the identity of a speaker based on his or her voice. It has potential applications in securing remote-access services such as phone banking and mobile commerce. While today's speaker verification systems perform reasonably well under controlled conditions, their performance is often compromised in real-world environments. In particular, variation in handset characteristics is known to be the major cause of performance degradation. This dissertation addresses the robustness of speaker verification systems from three angles: speaker modeling, feature transformation, and model transformation.

The dissertation begins with an investigation of the effectiveness of three kernel-based neural networks for speaker modeling: probabilistic decision-based neural networks (PDBNNs), Gaussian mixture models (GMMs), and elliptical basis function networks (EBFNs). Based on the thresholding mechanism of PDBNNs, the original PDBNN training algorithm was modified to make PDBNNs suitable for speaker verification. Experimental results show that GMM- and PDBNN-based speaker models outperform the EBFN-based ones in both clean and noisy environments. It was also found that the modified learning algorithm of PDBNNs finds decision thresholds that reduce the variation in false acceptance rates, whereas the ad hoc threshold-determination approach used by the EBFNs and GMMs causes a large variation in false acceptance rates. This property makes the performance of PDBNN-based systems more predictable.

The effect of handset variation can be suppressed by transforming clean speech models to fit handset-distorted speech. To this end, the dissertation proposes a model-based transformation technique that combines handset-dependent model transformation with reinforced learning. Specifically, the approach transforms the clean speaker model and the clean background model to fit the distorted speech using maximum-likelihood linear regression (MLLR), and then adapts the transformed models via the PDBNN's reinforced learning. It was found that MLLR brings the clean models to a region close to the distorted speech, and that reinforced learning is an effective means of fine-tuning the transformed models to sharpen the distinction between client speakers and impostors.

In addition to model-based approaches, handset variation can also be suppressed by feature-based approaches. Current feature-based approaches typically identify the handset in use as one of the known handsets in a handset database and use a priori knowledge about the identified handset to modify the features. It would be far more practical and cost-effective, however, to adopt systems that need no handset detector. To this end, the dissertation proposes a blind compensation algorithm for the situation in which no a priori knowledge about the handset is available (i.e., the handset in use is not in the handset database). Specifically, a composite statistical model, formed by fusing a speaker model with a background model, represents the characteristics of the enrollment speech. Based on the difference between the claimant's speech and the composite model, a stochastic-matching approach transforms the claimant's speech to a region close to the enrollment speech. The algorithm can therefore estimate the transformation online without having to detect the handset type.

Experimental results on the 2001 NIST Speaker Recognition Evaluation set show that the proposed approach achieves significant improvements in both equal error rate and minimum detection cost compared with cepstral mean subtraction, Znorm, and short-time Gaussianization. (Illustrative sketches of the main computations described above follow this record.)
dcterms.extent: xvii, 158 leaves : ill. (some col.) ; 30 cm [en_US]
dcterms.isPartOf: PolyU Electronic Theses [en_US]
dcterms.issued: 2005 [en_US]
dcterms.educationalLevel: All Doctorate [en_US]
dcterms.educationalLevel: Ph.D. [en_US]
dcterms.LCSH: Hong Kong Polytechnic University -- Dissertations [en_US]
dcterms.LCSH: Automatic speech recognition [en_US]
dcterms.accessRights: open access [en_US]
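
The abstract describes GMM-based speaker models that are scored against a background model. As a rough illustration of that verification step (a minimal sketch, not code from the thesis; all function names, parameter shapes, and the diagonal-covariance assumption are mine), the following computes a frame-averaged log-likelihood ratio between a claimed speaker's GMM and a background model and compares it against a decision threshold:

import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.
    X: (T, D) feature frames (e.g. cepstral vectors); weights: (M,);
    means, variances: (M, D)."""
    diff = X[:, None, :] - means[None, :, :]                        # (T, M, D)
    log_comp = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)[None, :] \
               - 0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)
    logw = np.log(weights)[None, :] + log_comp                      # (T, M)
    m = logw.max(axis=1, keepdims=True)                             # log-sum-exp
    return m[:, 0] + np.log(np.exp(logw - m).sum(axis=1))

def verify(X, speaker_gmm, background_gmm, threshold):
    """Accept the claimed identity when the average log-likelihood ratio
    of the speaker model over the background model exceeds a threshold.
    Each *_gmm argument is a (weights, means, variances) tuple."""
    llr = gmm_log_likelihood(X, *speaker_gmm).mean() \
          - gmm_log_likelihood(X, *background_gmm).mean()
    return llr > threshold, llr

The thesis's point about PDBNNs concerns how that threshold is learned rather than fixed ad hoc; this sketch leaves the threshold as an input.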
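The model-transformation chapter uses MLLR to move clean models toward handset-distorted speech. The snippet below shows only the application of a global MLLR mean transform (mu' = A mu + b) to a model's Gaussian means; estimating (A, b) by maximum likelihood from adaptation data, and the subsequent PDBNN reinforced-learning refinement described in the abstract, are omitted. The function name is hypothetical.

def apply_mllr_mean_transform(means, A, b):
    """Apply a global MLLR mean transform mu' = A @ mu + b to every
    Gaussian mean (means: (M, D), A: (D, D), b: (D,)), moving a clean
    model toward handset-distorted speech."""
    return means @ A.T + b[None, :]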
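The blind compensation algorithm estimates a feature transformation online from the mismatch between the claimant's speech and a composite (speaker-plus-background) model. The sketch below is a deliberately simplified, bias-only variant of stochastic matching: it re-estimates a single cepstral offset by EM so that the shifted features fit a composite GMM, with no handset detector. The composite model is assumed here to be an ordinary GMM (e.g. pooled speaker and background components); the thesis's actual transformation is richer than a single bias.

import numpy as np

def component_posteriors(X, weights, means, variances):
    """Responsibilities gamma[t, m] of each diagonal-covariance Gaussian
    component for each feature frame."""
    diff = X[:, None, :] - means[None, :, :]                        # (T, M, D)
    log_comp = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)[None, :] \
               - 0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)
    logw = np.log(weights)[None, :] + log_comp
    logw -= logw.max(axis=1, keepdims=True)                         # stabilize exp
    g = np.exp(logw)
    return g / g.sum(axis=1, keepdims=True)

def blind_bias_compensation(Y, weights, means, variances, n_iter=5):
    """Bias-only stochastic matching: find one offset b so that the
    shifted features Y - b best fit the composite GMM. Returns the
    compensated features and the estimated bias."""
    b = np.zeros(Y.shape[1])
    for _ in range(n_iter):
        gamma = component_posteriors(Y - b, weights, means, variances)  # E-step
        # M-step: precision-weighted average of per-component residuals
        resid = (Y[:, None, :] - means[None, :, :]) / variances[None, :, :]
        num = (gamma[:, :, None] * resid).sum(axis=(0, 1))
        den = (gamma[:, :, None] / variances[None, :, :]).sum(axis=(0, 1))
        b = num / den
    return Y - b, b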
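The reported results are given as equal error rate and minimum detection cost. For readers unfamiliar with those metrics, here is a standard way to compute both from genuine and impostor score lists; the cost parameters (C_miss = 10, C_fa = 1, P_target = 0.01) are the commonly cited NIST settings and are an assumption here, not taken from the thesis.

import numpy as np

def equal_error_rate(genuine, impostor):
    """Sweep the threshold over all observed scores and return the
    operating point where false rejection ~= false acceptance."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    frr = np.array([(genuine < t).mean() for t in thresholds])    # misses
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false alarms
    i = int(np.argmin(np.abs(frr - far)))
    return (frr[i] + far[i]) / 2.0

def min_detection_cost(genuine, impostor, p_target=0.01, c_miss=10.0, c_fa=1.0):
    """Minimum of the detection cost function over all thresholds."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    costs = [c_miss * (genuine < t).mean() * p_target
             + c_fa * (impostor >= t).mean() * (1.0 - p_target)
             for t in thresholds]
    return min(costs)

# Toy usage with synthetic scores (purely illustrative):
rng = np.random.default_rng(0)
genuine = rng.normal(1.0, 1.0, 500)
impostor = rng.normal(-1.0, 1.0, 5000)
print(equal_error_rate(genuine, impostor), min_detection_cost(genuine, impostor))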

Files in This Item:
File: b18099191.pdf (For All Users), 5.71 MB, Adobe PDF




Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/237