Identification of protein-ligand binding site using machine learning and hybrid pre-processing techniques

Pao Yue-kong Library Electronic Theses Database


Author: Wong, Yi Kwan Ginny
Title: Identification of protein-ligand binding site using machine learning and hybrid pre-processing techniques
Degree: Ph.D.
Year: 2015
Subject: Drugs -- Structure-activity relationships.
Proteins.
Ligands (Biochemistry)
Ligand binding (Biochemistry)
Hong Kong Polytechnic University -- Dissertations
Department: Dept. of Electronic and Information Engineering
Pages: xxix, 139 pages : illustrations (some color)
Language: English
InnoPac Record: http://library.polyu.edu.hk/record=b2823812
URI: http://theses.lib.polyu.edu.hk/handle/200/8117
Abstract: The identification of protein-ligand binding sites is an important task in structure-based drug design and docking algorithms. Over the past two decades, different approaches have been developed to predict binding sites, such as geometric, energetic and sequence-based methods. These approaches usually base their predictions on scores defined from a single protein property, and a threshold on the scores is then set to determine the binding sites. However, it is difficult to set the threshold value, even after considering the mean and standard deviation of practical data. This thesis investigates the computational prediction of protein-ligand binding sites from the structure and sequence of proteins. Binding site prediction can be formulated as a binary classification problem: discriminating whether or not a location is likely to bind the ligand. Once the scores are calculated from the protein properties, the choice of classification algorithm becomes very important, as it affects the prediction results significantly. In this thesis, a Support Vector Machine (SVM) is proposed to classify the pockets that are most likely to bind ligands, considering the attributes of geometric characteristics, interaction potential, offset from protein, conservation score, and properties surrounding the pockets. Several kinds of protein properties are used for classification, instead of the single protein property used in some published approaches. First, the grid points near the protein surface are used to represent the locations of binding sites. Our method is compared to eight existing methods on the datasets of LigASite and 198 drug-target complexes. The results show that the proposed method improves the success rate in terms of F-measure and area under the receiver operating characteristic curve (AUC).
Our method improves the AUC measure from 66 to 81 percent without decreasing the F-measure values, and increases the success rate of locating the binding sites within the three largest pockets from 74 to 82 percent. Our method also provides more comprehensive results than the others. Like many datasets in bioinformatics, the datasets of protein binding sites suffer from class imbalance and from the complexity of classification. Re-sampling has become an important step in pre-processing imbalanced data. It aims to balance the datasets by increasing the samples of the smaller class (the minority class) and/or decreasing the samples of the larger class (the majority class), which are known as over-sampling and under-sampling respectively. Most machine learning tools (including SVM) are biased towards the majority class, so the classification of the minority class might not be done satisfactorily. To deal with the imbalanced dataset of binding sites, random under-sampling is used at this stage.
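The random under-sampling step described above can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `random_under_sample`, the label convention (1 for the minority class) and the toy data are assumptions made only for illustration.

```python
import random


def random_under_sample(samples, labels, minority_label=1, seed=0):
    """Randomly drop majority-class samples until both classes have equal size."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Keep every minority sample and an equally sized random subset of the majority.
    kept_majority = rng.sample(majority, min(len(minority), len(majority)))
    kept = sorted(minority + kept_majority)
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Hypothetical toy data: 2 minority samples (label 1) among 8 majority samples (label 0).
X = [[float(i), float(i) / 2] for i in range(10)]
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
X_bal, y_bal = random_under_sample(X, y)
print(y_bal.count(1), y_bal.count(0))  # prints "2 2"
```

The balanced set keeps all minority samples, so the information lost by this step comes only from the discarded majority samples.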
After that, two hybrid pre-processing re-sampling methods and one under-sampling method are proposed. The first one applies the Synthetic Minority Over-sampling Technique (SMOTE) to create new samples of the minority class. However, the resulting large sample size increases the complexity of the classification model and decreases the efficiency of the learning algorithm applied to it. Therefore, an evolutionary algorithm (EA) is introduced to under-sample the synthetic samples and the samples of the majority class. The chosen EA is the CHC algorithm. Since this method uses an existing technique (SMOTE) to over-sample the data, its advantages over some previous hybrid methods are not significant. However, it can decrease the over-sampling rate by about 50 percent. The second hybrid pre-processing re-sampling method uses fuzzy logic to create new samples of the minority class, and CHC as a data cleaning method for the over-sampled dataset. This pre-processing method is found to offer an obvious improvement over some previous over-sampling and hybrid methods. In the experimental results, our method outperforms the other methods in terms of F-measure and AUC with the lowest over-sampling rate. It is also robust with respect to data complexity. Large imbalanced datasets cause many difficulties for classification. Therefore, an under-sampling method is proposed to reduce the data size. It uses fuzzy logic to select samples of the majority class, and CHC is employed to further reduce the data size. The experimental results show that our proposed method improves both the F-measure and AUC. The complexity of the classification model is also compared, and our proposed method is found to yield the lowest complexity among all methods under comparison.
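The over-sampling half of the first hybrid method relies on SMOTE, whose core idea is to interpolate between a minority sample and one of its nearest minority neighbours. The following is a minimal sketch of that interpolation only, assuming Euclidean distance; the function name `smote_sketch`, the parameters `n_synthetic` and `k`, and the toy points are hypothetical, and the CHC-based under-sampling and fuzzy-logic steps of the thesis are not shown.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=2, seed=0):
    """Create synthetic minority samples by interpolating towards near neighbours.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # New point lies on the segment between the base sample and the neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class points on the corners of the unit square.
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority_points, n_synthetic=6)
print(len(new_points))  # prints 6
```

Because each synthetic point lies on a segment between two existing minority samples, the minority class grows without leaving the region it already occupies; the number of synthetic samples requested here corresponds to the over-sampling rate that the hybrid methods aim to keep low.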
Finally, a general comparison of the three proposed pre-processing methods is presented. One of the hybrid methods is selected and applied to the datasets for predicting the protein-ligand binding sites. An SVM with the proposed attributes is employed to identify the binding sites. On the testing datasets of 198 drug-target complexes, the results improve on our previous method, which does not use the hybrid pre-processing method. Improved results are also obtained on the dataset of 210 bound structures. The improvement in success rate is 3 percent and 6 percent respectively. Our method is also compared to five other prediction methods. The results show that our method locates more protein-ligand binding sites successfully.

Files in this item

Files Size Format
b28238126.pdf 4.590Mb PDF

     
