Cantonese speech recognition

Pao Yue-kong Library Electronic Theses Database

Author: Lam, Yan-yan
Title: Cantonese speech recognition
Degree: M.Phil.
Year: 2001
Subject: Automatic speech recognition
Speech processing systems
Cantonese dialects -- Data processing
Hong Kong Polytechnic University -- Dissertations
Department: Dept. of Computing
Pages: 128 leaves : ill. ; 30 cm
Language: English
Abstract: Cantonese speech recognition consists of three parts: recognizing the tone pattern of the perceived speech, recognizing its syllables, and converting both into text using the contextual information in the passage. Our research focuses on the first part, Cantonese tone recognition, and leaves syllable recognition as future work. Language modeling requires further linguistic knowledge and search algorithms, so it should be completed as a separate research project. Starting from this goal, we develop our research framework based on pitch synchronization. Pitch synchronization means that information is extracted in phase with the movement of pitch in the speech signal. In our research, information refers to the tonal patterns of Chinese speech. Thus, pitch synchronization for tone recognition means that the pitch contour, that is, the change in the fundamental frequency of the speech signal over time, is extracted by first identifying the beginning and end of each pitch period and then measuring the interval between each pair of pitch marks. The advantage of this pitch-synchronous pitch extraction over conventional non-pitch-synchronous methods, such as the autocorrelation method and cepstrum pitch determination, is that it does not depend on an analysis frame size that must suit different speakers. Hence it can handle both low-pitched and high-pitched speakers. Formally, the identification of these pitch marks is called epoch detection, in which each glottal closure instant during voicing is located. From our survey of existing epoch detection methods, the major problems are degraded performance in noise-contaminated environments and difficulty in identifying epochs at the boundaries of utterances. Wavelets are known for their good singularity detection ability, but they leave much room for improvement under these conditions.
The difficulties come from the wavelet being 'too good' at singularity detection: viewed from another perspective, it is sensitive to noise and ineffective for weaker excitation. Hence, a matching scheme to confirm the existence of the epochs is a must, and detection correctness depends largely on this matching scheme. Our proposed Combined Wavelet Epoch Detector (CWED) is based on two wavelets, the Spline and the Gaussian wavelet, to improve the deterministic matching scheme. The rationale is to retain the good singularity detection property of the Spline wavelet for epoch detection while exploiting the coarse but robust epoch occurrence identification property of the Gaussian wavelet, which was found experimentally. Our proposed scheme is tested under different noise conditions and achieves a 26% improvement in recall while retaining a relative position consistency of 1.4 ms. The detected epochs are applied to tone recognition with our proposed Smoothed Contour Tone Recognizer (SCTR). The pitch contour is not measured directly from the intervals between epoch marks, owing to identification errors made during detection. Instead, a smoothing algorithm is proposed and applied before the pitch frequencies are extracted for feature extraction and tone recognition. This smoothing algorithm uses a pitch tracking algorithm to distinguish perceptually good pitch frequencies from the irregular pitch frequencies caused by mistaken epochs, and estimates the complete pitch contour by linear/quadratic interpolation of the former subset. The accuracy for recognizing the six non-entering tones, averaged over the different noise types and noise levels (down to 0 dB), is 72% (male) and 75% (female) for the single-speaker cases, and 59% (male) and 69% (female) for the multiple-speaker cases.
The overall improvement in accuracy across all SNRs (from clean down to -18 dB) compared with the baseline tone recognizer is 23% and 18% (male) and 19% and 14% (female) for the single-speaker and multiple-speaker cases, respectively. A further performance comparison of the SCTR was conducted by replacing the combined wavelet epoch detector (CWED) with the K&B algorithm. However, the results give no evidence that the CWED improves on the K&B algorithm in terms of tone recognition accuracy.
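
Illustrative sketch (not taken from the thesis): the abstract describes measuring pitch from the intervals between epoch marks and then repairing the contour where epochs were misdetected. The Python below shows one minimal way this could look. The function names, the median-based outlier rule, and the `tolerance` parameter are assumptions made for illustration; the thesis uses a pitch tracking algorithm for the good/irregular distinction and linear/quadratic interpolation, whereas this sketch uses a simple median test and linear interpolation only. Epoch marks are assumed to be sorted sample indices.

```python
from statistics import median


def f0_from_epochs(epoch_samples, sample_rate):
    """Pitch-synchronous F0: one estimate per pair of consecutive epoch marks."""
    contour = []
    for start, end in zip(epoch_samples, epoch_samples[1:]):
        period = end - start  # pitch period in samples
        if period <= 0:
            continue  # skip duplicated marks
        time_s = (start + end) / (2.0 * sample_rate)  # midpoint of the period
        contour.append((time_s, sample_rate / period))
    return contour


def smooth_contour(contour, tolerance=0.3):
    """Replace irregular F0 values by linear interpolation of trusted neighbours.

    Hypothetical rule: a value is trusted if it lies within `tolerance`
    (as a fraction) of the median F0 of the contour.
    """
    if not contour:
        return []
    ref = median([f0 for _, f0 in contour])
    trusted = {i for i, (_, f0) in enumerate(contour)
               if abs(f0 - ref) <= tolerance * ref}
    smoothed = []
    for i, (t, f0) in enumerate(contour):
        if i in trusted:
            smoothed.append((t, f0))
            continue
        left = max((j for j in trusted if j < i), default=None)
        right = min((j for j in trusted if j > i), default=None)
        if left is not None and right is not None:
            t0, f_left = contour[left]
            t1, f_right = contour[right]
            w = (t - t0) / (t1 - t0)
            smoothed.append((t, f_left + w * (f_right - f_left)))
        elif left is not None:
            smoothed.append((t, contour[left][1]))  # hold last trusted value
        elif right is not None:
            smoothed.append((t, contour[right][1]))  # hold next trusted value
        else:
            smoothed.append((t, f0))  # nothing trusted: leave unchanged
    return smoothed


if __name__ == "__main__":
    # 200 Hz voice sampled at 16 kHz with the epoch at sample 240 missed.
    raw = f0_from_epochs([0, 80, 160, 320, 400, 480], 16000)
    for (t, before), (_, after) in zip(raw, smooth_contour(raw)):
        print(f"{t * 1000:6.2f} ms  raw {before:6.1f} Hz  smoothed {after:6.1f} Hz")
```

In this example the missed epoch doubles one measured period, so the raw estimate there drops to 100 Hz; the smoothing step interpolates it back from the neighbouring 200 Hz values. Because the period is measured between marks, no analysis frame size has to be chosen per speaker.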

Files in this item

Files Size Format
b15784988.pdf 5.069Mb PDF

