|Title:||Voice activity detection for nist speaker recognition evaluations|
|Subject:||Automatic speech recognition.|
Hong Kong Polytechnic University -- Dissertations
|Department:||Department of Electronic and Information Engineering|
|Pages:||66 leaves : ill. (some col.) ; 30 cm.|
|Abstract:||Since 2008, interview-style speech has become an important part of the NIST Speaker Recognition Evaluations (SREs). Unlike telephone speech, interview speech has a substantially lower signal-to-noise ratio, which necessitates robust voice activity detectors (VADs). This dissertation highlights the characteristics of interview speech files in NIST SREs and discusses the difficulties in performing speech/non-speech segmentation in these files. To overcome these difficulties, this dissertation proposes using speech enhancement techniques as a pre-processing step for enhancing the reliability of energy-based and statistical-model-based VADs. A decision strategy is also proposed to overcome the undesirable effects caused by impulsive signals and sinusoidal background signals. The proposed VAD is compared with five popular VADs. 1. Average-Energy (AE)-Based VAD. This is an energy-based VAD with decisions governed by the linear combination of average magnitude of background noises and signal peaks. 2. Automatic Speech Recognition (ASR) Transcripts. In this VAD, speech/non-speech decisions are based on the ASR transcripts provided by NIST. 3. VAD in the ETSI-AMR Option 2 Coder. This VAD is part of the Adaptive Multi-Rate (AMR) codec released by the European Telecommunication Standard Institute (ETSI). 4. Statistical-Model (SM)-Based VAD. This VAD assumes that the complex frequency components of signals and noises follow a Gaussian distribution and uses likelihood-ratio tests in the frequency domain for speech/non-speech decisions. 5. Gaussian-Mixture-Model (GMM)-Based VAD. This is an extension of the statistical-model-based VAD, which considers the long-term temporal information and harmonic structure in noisy speech. These five VADs have been evaluated on the NIST 2010 dataset. The comparison of VADs leads to seven findings: 1. Noise reduction is vital for VAD under extremely low SNR; 2. Removal of the sinusoidal background noise is of primary importance as this kind of background signal could lead to many false detection in AE-based VAD; 3. A reliable threshold strategy is required to address the impulsive signals; 4. ASR transcripts provided by NIST do not produce accurate speech and non-speech segmentations; 5. Spectral subtraction contributes to both AE-and SM-based VADs; 6. Spectral subtraction makes better use of background spectra than the likelihood-ratio tests in the SM-based VAD; and 7. The proposed SS+AE-VAD outperforms the SM-based VAD, the GMM-based VAD, the AMR speech coder, and the ASR transcripts provided by NIST SRE Workshop.|
As a bona fide Library user, I declare that:
- I will abide by the rules and legal ordinances governing copyright regarding the use of the Database.
- I will use the Database for the purpose of my research or private study only and not for circulation or further reproduction or any other purpose.
- I agree to indemnify and hold the University harmless from and against any loss, damage, cost, liability or expenses arising from copyright infringement or unauthorized usage.
By downloading any item(s) listed above, you acknowledge that you have read and understood the copyright undertaking as stated above, and agree to be bound by all of its terms.
Please use this identifier to cite or link to this item: