Author: | Jiang, Yuechi |
Title: | Acoustic and speech signal classification : from features to classifiers |
Advisors: | Leung, H. F. Frank (EIE) |
Degree: | Ph.D. |
Year: | 2021 |
Subject: | Automatic speech recognition ; Speech processing systems ; Signal processing -- Digital techniques ; Hong Kong Polytechnic University -- Dissertations
Department: | Department of Electronic and Information Engineering |
Pages: | xxii, 141 pages : color illustrations |
Language: | English |
Abstract: | With the rapid development of multimedia technologies, acoustic and speech signals are playing increasingly important roles. They can deliver useful information for applications such as security checks, audio authentication, environment analysis, and context-aware navigation. The study of acoustic and speech signals is most often conducted through classification. For example, verifying the similarity of two acoustic signals can be treated as checking whether they belong to the same class, and detecting the occurrence of an acoustic event can be treated as finding a segment that belongs to a specific class. The work discussed in this thesis focuses on the classification of acoustic and speech signals, a fundamental problem covering a wide range of applications. Three key components form a classification system: 1) feature representations, 2) classifiers, and 3) feature transformation techniques. All three are important to the success of a classification system and deserve a comprehensive investigation.

Good feature representations are crucial to a classification system. In general, a good feature representation should carry enough information to describe the acoustic sample well, which often implicitly requires its dimensionality to be high. In this thesis, we consider two high-dimensional feature representations, viz. the Gaussian supervector (GSV) and the identity vector (i-vector). The GSV is fast to compute, but its dimensionality is fixed. The i-vector has an adjustable dimensionality, but its computation can be time-consuming because additional model parameters must be estimated. To balance computational efficiency against dimensional flexibility, we propose feature representations based on the mixture of factor analyzers (MFA), such as the MFA latent vector (MFALV). The MFALV is comparable to the GSV and the i-vector in effectiveness; it matches the i-vector in dimensional flexibility while being computationally more efficient. By analyzing the similarity between different feature representations, we propose the generic supervector, which generalizes the GSV and the MFALV; the i-vector can then be obtained by post-processing the generic supervector. Notably, the generic supervector can explain the structure of the classic convolutional neural network and the residual network.

The support vector machine (SVM) and the probabilistic linear discriminant analysis (PLDA) model are two prevalent classifiers for high-dimensional feature representations such as the GSV and the i-vector. Although PLDA may outperform SVM for speaker verification, it is inefficient when handling large amounts of training data, especially when the dimensionality of the feature representation is high. To address this inefficiency, we propose a scalable formulation that enables PLDA to classify efficiently regardless of the amount of training data. The sparse representation (SR) and the SR-based classifier (SRC) are also well suited to classifying high-dimensional feature representations. However, computing the SR is slow because of the L1-norm constraint in the objective function. The collaborative representation (CR), which replaces the L1-norm constraint with an L2-norm constraint, is computationally more efficient than the SR.
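To illustrate why the CR is cheaper to compute, the two objectives can be written in standard form (a minimal sketch in common notation, assuming the usual textbook formulations rather than quoting the thesis):

    SR:  \hat{x} = \arg\min_x \|y - Dx\|_2^2 + \lambda \|x\|_1
    CR:  \hat{x} = \arg\min_x \|y - Dx\|_2^2 + \lambda \|x\|_2^2

where y is the test feature vector, D is a dictionary whose columns are training feature vectors, and \lambda is a regularization weight. The SR objective must be solved iteratively, whereas the CR objective admits the closed-form solution \hat{x} = (D^T D + \lambda I)^{-1} D^T y; classification then assigns y to the class whose training columns yield the smallest reconstruction residual.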
To boost the discrimination ability of the CR, we propose the discriminative CR (DCR), which incorporates class information and thus better suits classification tasks.

Two probabilistic models are also investigated, viz. the Gaussian mixture model (GMM) and the restricted Boltzmann machine (RBM). Although both can be used for probability estimation and classification, their different model assumptions determine where each is applicable: the GMM suits low-dimensional decorrelated feature representations, whereas the RBM suits high-dimensional correlated feature representations. Both, however, require a large amount of training data. Another important use of the RBM is as the basic building block of a deep belief net (DBN); adding a softmax layer on top of the DBN yields a deep neural network (DNN). This DBN-DNN is a discriminative classification model, and our experiments validate that a high feature dimensionality is important for the DBN-DNN to take effect.

If the original feature representation does not work well, suitable feature transformation techniques may help. Two feature transformation techniques are popular in speech processing, viz. the nuisance attribute projection (NAP) and the linear discriminant analysis (LDA); a sketch of these two linear projections follows the abstract. As a generalization, their kernel versions, viz. the kernel NAP (KNAP) and the kernel discriminant analysis (KDA), introduce an implicit feature mapping before performing the projection, which may be beneficial in some circumstances. The detailed derivations of the kernel-based formulations are given in this thesis, and comparative experiments are conducted to investigate the effectiveness of the different feature transformation techniques.

To comprehensively investigate the performance of different feature representations, classifiers, and feature transformation techniques, we perform experiments on four datasets: two speech datasets for speaker identification and two acoustic datasets for acoustic scene classification. The experimental results and discussions reveal the characteristics of the different feature representations, feature transformation techniques, and classifiers. In general, no single type of feature representation or classifier surpasses the others under all conditions, which underlines the importance of choosing a suitable combination. We hope these analyses help devise new feature representations, classifiers, and feature transformations. |
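For reference, the two linear projections named in the abstract admit standard formulations (a minimal sketch, assuming textbook definitions rather than the thesis's exact notation):

    NAP:  x -> (I - V V^T) x,  where the columns of V form an orthonormal basis of the nuisance subspace
    LDA:  W = \arg\max_W \mathrm{tr}\big( (W^T S_w W)^{-1} (W^T S_b W) \big),  where S_w and S_b are the within-class and between-class scatter matrices

NAP removes nuisance directions (e.g., channel variability) from each feature vector, while LDA seeks directions that separate the classes; KNAP and KDA apply the same projections after an implicit kernel-induced feature mapping.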
Rights: | All rights reserved |
Access: | open access |
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/11107