Author: | Wang, Wei |
Title: | Fast subcellular localization by extracting informative regions of protein sequences for profile alignment |
Degree: | M.Phil. |
Year: | 2011 |
Subject: | Proteins -- Analysis. Amino acid sequence. Bioinformatics. Hong Kong Polytechnic University -- Dissertations |
Department: | Department of Electronic and Information Engineering |
Pages: | vi, 61 p. : ill. (some col.) ; 30 cm. |
Language: | English |
Abstract: | The determination of protein subcellular localization is vital for understanding the functions of proteins and for the design of drugs. However, experimental methods of subcellular localization are expensive and time-consuming. Computational methods, on the other hand, have the potential to annotate large protein datasets in a cost-effective and time-efficient manner. With the ever-increasing number of sequenced proteins, the gap between newly found protein sequences and the knowledge of their subcellular localization has widened rapidly. Thus, it is imperative to speed up subcellular localization algorithms. In this thesis, a cascaded fusion of cleavage site prediction and subcellular localization prediction is developed to alleviate the computational burden of homology-based prediction methods. Specifically, the informative region (signal peptide or transit peptide) of a protein sequence is first determined by a cleavage site predictor. Then, only the informative segment is passed to a homology-based predictor for the determination of subcellular locations. A cleavage site predictor based on conditional random fields (CRFs) is developed. It was found that CRFs outperform neural networks and hidden Markov models in the prediction of cleavage site positions. To minimize the training and classification time of the subcellular localization predictors, a kernel Fisher discriminator is proposed. Specifically, the profile of the informative segment of a protein sequence is first generated by PSI-BLAST. The profile is then vectorized by computing the profile-alignment scores between the profile and all of the training profiles. The resulting vector is projected onto a low-dimensional space by using a new form of kernel discriminant analysis called kernel perturbation discriminant analysis. The vector in the low-dimensional space is then classified by a support-vector-machine classifier.
It was found that the reduction in dimension leads to further computational savings when compared with the direct classification of profile-alignment vectors. The proposed method was evaluated on a newly created redundancy-removed data set using five-fold cross-validation. Results show that the method can attain accurate localization while reducing the computational time substantially when compared to some state-of-the-art methods. In particular, it was found that truncating the sequences at their cleavage sites can reduce the profile creation time (by PSI-BLAST) as compared to truncating the profiles. A sensitivity analysis suggests that subcellular localization accuracy is inversely proportional to the discrepancy of the truncation positions with respect to the ground-truth cleavage sites. It was also found that the subcellular localization accuracy of chloroplast transit peptides (cTP) is highly dependent on the correct prediction of their cleavage sites, suggesting that further investigation is necessary to improve the cleavage site prediction of cTP. |
Rights: | All rights reserved |
Access: | open access |
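
The truncation and vectorization steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names are hypothetical, the cleavage site is supplied by hand rather than predicted by a CRF, and a toy global alignment score on raw sequences stands in for the PSI-BLAST profile-alignment score. The dimension reduction and SVM stages are omitted.

```python
def truncate_at_cleavage_site(sequence, cleavage_site):
    """Keep only the N-terminal segment up to the cleavage site.

    `cleavage_site` is assumed to come from some upstream predictor;
    here it is simply an integer index supplied by the caller.
    """
    return sequence[:cleavage_site]


def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Toy global alignment score (Needleman-Wunsch, linear gap penalty).

    Stand-in for a profile-alignment score; the scoring values are
    illustrative assumptions, not taken from the thesis.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def score_vector(segment, training_segments):
    """Vectorize a segment as its scores against every training segment."""
    return [alignment_score(segment, t) for t in training_segments]


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    training = ["MKTLL", "MAALS", "MKKLL"]
    segment = truncate_at_cleavage_site("MKTLLAVGDEQ", 5)
    print(segment, score_vector(segment, training))
```

The resulting vector has one entry per training item, which is why a later projection to a low-dimensional space reduces the cost of classification.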
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
b24561861.pdf | For All Users | 1.28 MB | Adobe PDF |
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/6175