Author: Rao, Wei
Title: Utterance partitioning for supervector and i-vector speaker verification
Advisors: Mak, Man-wai (EIE)
Degree: Ph.D.
Year: 2015
Subject: Automatic speech recognition.
Signal, Image and Speech Processing.
Hong Kong Polytechnic University -- Dissertations
Department: Department of Electronic and Information Engineering
Pages: xxiv, 173 pages : illustrations (some color) ; 30 cm
Language: English
Abstract: In recent years, GMM-SVM and i-vectors with probabilistic linear discriminant analysis (PLDA) have become prominent approaches to text-independent speaker verification. The idea of GMM-SVM is to derive a GMM-supervector by stacking the mean vectors of a target-speaker-dependent, MAP-adapted GMM. The supervector is then presented to a speaker-dependent support vector machine (SVM) for scoring. However, a problematic issue of this approach is the severe imbalance between the numbers of speaker-class and impostor-class utterances available for training the speaker-dependent SVMs. Unlike high-dimensional GMM-supervectors, i-vectors have the major advantage of representing speaker-dependent information in a low-dimensional space, which opens up the opportunity of using statistical techniques such as linear discriminant analysis (LDA), within-class covariance normalization (WCCN), and PLDA to suppress channel- and session-variability. While these techniques have achieved state-of-the-art performance in recent NIST Speaker Recognition Evaluations (SREs), they require multiple training speakers, each providing a sufficient number of sessions, to train the transformation matrices or loading matrices. However, collecting such a corpus is expensive and inconvenient. In a typical training dataset, the number of speakers could be fairly large, but the number of speakers who can provide many sessions is quite limited. The lack of multiple sessions per speaker could cause numerical problems in the within-speaker scatter matrix, an issue known as the small sample-size problem in the literature. Although the data imbalance problem and the small sample-size problem have different causes, both can be overcome by an utterance partitioning and resampling technique proposed in this thesis.
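The GMM-supervector construction described above (stacking the mean vectors of a MAP-adapted GMM) can be sketched as follows. This is a minimal illustration, not the thesis code; the mixture count and feature dimension are arbitrary placeholders.

```python
import numpy as np

# Illustrative sizes only (real systems use e.g. 512-2048 mixtures
# and 39-60 dimensional acoustic features).
n_mixtures, feat_dim = 4, 3
rng = np.random.default_rng(0)

# Stand-in for the mean vectors of a target-speaker-dependent,
# MAP-adapted GMM: one row per mixture component.
adapted_means = rng.standard_normal((n_mixtures, feat_dim))

# The GMM-supervector is the concatenation (stacking) of all component
# means, giving a fixed-length vector of dimension n_mixtures * feat_dim.
supervector = adapted_means.reshape(-1)
print(supervector.shape)  # (12,)
```

In GMM-SVM verification, this fixed-length supervector is what gets presented to the speaker-dependent SVM for scoring.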
Specifically, the sequence order of acoustic vectors in an enrollment utterance is first randomized; the randomized sequence is then partitioned into a number of segments. Each of these segments is used to compute a GMM-supervector or an i-vector. The desired number of supervectors/i-vectors can be produced by repeating this randomization and partitioning process a number of times. This method is referred to as utterance partitioning with acoustic vector resampling (UP-AVR). Experiments on the NIST 2002, 2004 and 2010 SREs show that UP-AVR can help the SVM training algorithm to find better decision boundaries, so that SVM scoring outperforms other speaker comparison methods such as cosine distance scoring. Furthermore, results demonstrate that UP-AVR can enhance the capability of LDA and WCCN in suppressing session variability, especially when the number of conversations per training speaker is limited.
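The randomize-then-partition procedure above can be sketched as below. This is a hedged outline under stated assumptions: the mapping from each segment to a supervector or i-vector (the front end) is not modelled, and the utterance size, partition count, and number of rounds are illustrative.

```python
import numpy as np

def up_avr(frames, n_partitions, n_rounds, rng):
    """Utterance partitioning with acoustic vector resampling (sketch).

    frames: (T, D) array of acoustic vectors from one enrollment utterance.
    Returns a list of frame subsets; in a full system each subset would be
    mapped to a GMM-supervector or an i-vector by the front end.
    """
    segments = []
    for _ in range(n_rounds):
        # 1. Randomize the temporal order of the acoustic vectors.
        shuffled = frames[rng.permutation(len(frames))]
        # 2. Partition the randomized sequence into roughly equal segments.
        segments.extend(np.array_split(shuffled, n_partitions))
    return segments

rng = np.random.default_rng(42)
frames = rng.standard_normal((100, 20))  # toy utterance: 100 frames, 20-dim
segs = up_avr(frames, n_partitions=4, n_rounds=3, rng=rng)
print(len(segs))  # 12 segments: 4 partitions x 3 rounds
```

Because each round reshuffles the frames before partitioning, the resulting segments draw different frame subsets, which is what multiplies the number of training vectors per target speaker.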
This thesis also proposes a new channel compensation method called multi-way LDA, which uses not only the speaker labels but also the microphone labels in the training i-vectors for estimating the LDA projection matrix. This method was found to strengthen the discriminative capability of LDA and overcome the small sample-size problem. To overcome the implicit use of background information in conventional PLDA scoring for i-vector speaker verification, this thesis proposes a method called PLDA-SVM scoring, which uses empirical kernel maps to create a PLDA score space for each target speaker and trains an SVM that operates in this score space to produce verification scores. Given a test i-vector and the identity of the target speaker under test, a score vector is constructed by computing the PLDA scores of the test i-vector with respect to the target speaker's i-vectors and a set of nontarget speakers' i-vectors. As a result, the bases of the score space are divided into two parts: one defined by the target speaker's i-vectors and another defined by the nontarget speakers' i-vectors. To ensure a proper balance between the two parts, utterance partitioning is applied to create multiple target-speaker i-vectors from a single or a small number of utterances. Under the new evaluation protocol introduced by NIST SRE, this thesis shows that PLDA-SVM scoring not only performs significantly better than conventional PLDA scoring and utilizes the multiple enrollment utterances of target speakers effectively, but also opens up the opportunity of adopting sparse kernel machines for PLDA-based speaker verification systems. Specifically, this thesis shows that it is possible to take advantage of the empirical kernel maps by incorporating them into a more advanced kernel machine called the relevance vector machine (RVM). Experiments on NIST 2012 SRE suggest that the performance of PLDA-RVM regression is slightly better than that of PLDA-SVM after performing UP-AVR.
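The empirical kernel map underlying PLDA-SVM scoring can be sketched as follows. The `plda_score` function below is a hypothetical placeholder (a plain dot product); the thesis uses the actual PLDA log-likelihood ratio, which is not reproduced here. The i-vector dimension and set sizes are illustrative only.

```python
import numpy as np

def plda_score(x, y):
    # Placeholder for the PLDA log-likelihood-ratio score between two
    # i-vectors; a simple dot product stands in for illustration.
    return float(np.dot(x, y))

def empirical_kernel_map(test_ivec, target_ivecs, nontarget_ivecs):
    """Map a test i-vector into the PLDA score space of one target speaker.

    The score vector's bases split into two parts: scores against the
    target speaker's i-vectors (which UP-AVR can multiply from a single
    utterance) and scores against a set of nontarget speakers' i-vectors.
    """
    target_part = [plda_score(test_ivec, t) for t in target_ivecs]
    nontarget_part = [plda_score(test_ivec, n) for n in nontarget_ivecs]
    return np.array(target_part + nontarget_part)

rng = np.random.default_rng(1)
dim = 10                                         # toy i-vector dimension
test_ivec = rng.standard_normal(dim)
target_ivecs = rng.standard_normal((5, dim))     # e.g. produced by UP-AVR
nontarget_ivecs = rng.standard_normal((20, dim))
score_vec = empirical_kernel_map(test_ivec, target_ivecs, nontarget_ivecs)
print(score_vec.shape)  # (25,): 5 target bases + 20 nontarget bases
```

The resulting score vector, not the raw i-vector, is what the per-speaker SVM (or RVM) operates on to produce the final verification score.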
Rights: All rights reserved
Access: open access

Files in This Item:
File: b28068889.pdf (For All Users, 4.09 MB, Adobe PDF)

Copyright Undertaking

As a bona fide Library user, I declare that:

  1. I will abide by the rules and legal ordinances governing copyright regarding the use of the Database.
  2. I will use the Database for the purpose of my research or private study only and not for circulation or further reproduction or any other purpose.
  3. I agree to indemnify and hold the University harmless from and against any loss, damage, cost, liability or expenses arising from copyright infringement or unauthorized usage.

By downloading any item(s) listed above, you acknowledge that you have read and understood the copyright undertaking as stated above, and agree to be bound by all of its terms.

