Author: Wan, Shibiao
Title: Protein subcellular localization : gene ontology based machine learning approaches
Degree: Ph.D.
Year: 2014
Subject: Proteins -- Analysis.
Proteins -- Analysis -- Mathematics.
Hong Kong Polytechnic University -- Dissertations
Department: Department of Electronic and Information Engineering
Pages: xxx, 250 pages : color illustrations ; 30 cm
Language: English
Abstract: Proteins, which are essential macromolecules for organisms, need to be located in appropriate physiological contexts within a cell to exhibit their tremendous diversity of biological functions. Aberrant protein subcellular localization may lead to a broad range of diseases. Knowing where a protein resides within a cell can give insights into drug target discovery and drug design. Computational methods are required to assist the laborious and time-consuming conventional wet-lab experiments for accurate, fast, reliable and large-scale predictions in proteomics research. This thesis proposes several Gene Ontology (GO) based machine learning approaches for the prediction of subcellular localization of both single-location and multi-location proteins. For the prediction of single-location proteins, two GO-based single-label predictors, namely GOASVM and FusionSVM, are proposed. GOASVM exploits GO information from the Gene Ontology Annotation (GOA) database, while FusionSVM extracts GO information from InterProScan and then combines it with profile alignment information. It was found that GOASVM (extracting GO from the GOA database) performs significantly better than FusionSVM (extracting GO from InterProScan). Moreover, GOASVM also remarkably outperforms existing state-of-the-art single-label predictors. For the prediction of multi-location proteins, an efficient multi-label predictor, namely mGOASVM, is proposed. mGOASVM extends GOASVM from single-location prediction to multi-location prediction. It possesses the following desirable properties: (1) it uses the frequency of occurrences of GO terms instead of 1-0 values; (2) it uses a more efficient multi-label SVM classifier to handle multi-label problems; (3) it selects a relevant GO-vector subspace by finding distinct GO terms instead of using the full GO-vector space; and (4) it adopts a successive-search strategy to incorporate more useful homologous information for classification. It was found that these properties make mGOASVM outperform other GO-based multi-label predictors.
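The frequency-based GO vectors described above (counts of GO terms rather than 1-0 values, taken over a subspace of distinct GO terms) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the toy accessions "P1" and "P2", and the choice to build the subspace from training annotations alone are assumptions made for illustration.

```python
from collections import Counter


def build_go_subspace(training_annotations):
    """Collect the distinct GO terms seen in the training annotations.

    Assumption: the relevant subspace is the sorted set of distinct terms.
    """
    distinct = set()
    for terms in training_annotations.values():
        distinct.update(terms)
    return sorted(distinct)


def frequency_vector(terms, subspace):
    """Count how often each subspace GO term occurs for one protein.

    Terms outside the subspace are ignored.
    """
    counts = Counter(terms)
    return [counts[term] for term in subspace]


# Hypothetical toy annotations: accession -> GO terms retrieved for it
# (a term may occur more than once).
training = {
    "P1": ["GO:0005634", "GO:0005634", "GO:0005737"],
    "P2": ["GO:0005739"],
}

subspace = build_go_subspace(training)
vectors = {acc: frequency_vector(terms, subspace) for acc, terms in training.items()}
```

Each resulting vector has one dimension per distinct GO term, so a term retrieved twice contributes 2 rather than 1; such vectors would then be passed to the multi-label classifier.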
Based on mGOASVM, several more advanced multi-label predictors are proposed. These predictors further improve the performance of mGOASVM by enhancing the following aspects of the prediction process:

  1. Classification Refinement. The classifier adopted by mGOASVM to tackle multi-label problems is rather primitive, so refining the classification process is necessary. To this end, two multi-label predictors, namely AD-SVM and mPLR-Loc, are proposed. The former adopts an adaptive decision scheme for multi-label SVM classification. The scheme essentially converts the linear SVMs in the classifier into piecewise linear SVMs, which effectively reduces the number of over-predicted instances while having little influence on the correctly predicted ones, thus improving the prediction performance. The latter adopts a multi-label penalized logistic regression classifier equipped with an adaptive decision scheme, which can also boost the performance.
  2. Deeper Feature Extraction. mGOASVM only considers the frequency of occurrences of GO terms, which may not be sufficient for accurate prediction. To overcome this limitation, a multi-label predictor called SS-Loc, which further exploits the semantic similarity over GO, is proposed. Based on SS-Loc, an even more advanced predictor called HybridGO-Loc, which uses both GO frequency features and GO semantic similarity features, is developed. Experimental results demonstrate that HybridGO-Loc performs the best among all of the proposed multi-label predictors as well as other existing GO-based predictors.
  3. Dimensionality Reduction. Although a relevant GO-vector subspace has been selected, the feature vectors in mGOASVM are still of high dimensionality. To address the curse of dimensionality, an ensemble method based on random projection (RP) is applied to construct two dimensionality-reduction multi-label predictors, namely RP-SVM and R3P-Loc. The former uses multi-label SVM classifiers and the latter uses multi-label ridge regression classifiers. Experimental results suggest that both predictors outperform mGOASVM as well as other state-of-the-art predictors while at the same time substantially reducing the feature dimensionality.
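The random-projection ensemble mentioned under Dimensionality Reduction can be sketched generically as follows. This is a hedged illustration of the general technique, not the RP-SVM or R3P-Loc code: the Gaussian projection matrix, the 1/sqrt(n_out) scaling, the seeding scheme, and averaging as the way to combine per-member scores are all assumptions, and every function name is hypothetical.

```python
import random


def random_projection_matrix(n_in, n_out, seed):
    """Build an n_out x n_in matrix of Gaussian entries scaled by 1/sqrt(n_out)."""
    rng = random.Random(seed)
    scale = 1.0 / (n_out ** 0.5)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_in)] for _ in range(n_out)]


def project(vector, matrix):
    """Map a high-dimensional vector to the lower-dimensional space."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def ensemble_projections(vector, n_out, n_members, base_seed=0):
    """Project one vector with n_members independent random matrices."""
    return [
        project(vector, random_projection_matrix(len(vector), n_out, base_seed + k))
        for k in range(n_members)
    ]


def average_scores(score_lists):
    """Combine per-member class scores by averaging (an assumed fusion rule)."""
    return [sum(column) / len(column) for column in zip(*score_lists)]


# Hypothetical 8-dimensional GO frequency vector reduced to 3 dimensions, 4 times.
x = [2, 1, 0, 0, 3, 0, 1, 0]
views = ensemble_projections(x, n_out=3, n_members=4)
```

In such a design each reduced view would feed its own classifier (SVM or ridge regression in the predictors named above), and the per-class scores from all members would be fused before the final multi-label decision.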
Rights: All rights reserved
Access: open access

Files in This Item:
File: b27629855.pdf
Description: For All Users
Size: 3.19 MB
Format: Adobe PDF


Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/7770