Genomic sequence search and clustering using Q-gram

Yuen, Man-chun

Full metadata record

DC Field	Value	Language
dc.contributor	Department of Computing	en_US
dc.creator	Yuen, Man-chun	-
dc.identifier.uri	https://theses.lib.polyu.edu.hk/handle/200/1337	-
dc.language	English	en_US
dc.publisher	Hong Kong Polytechnic University	-
dc.rights	All rights reserved	en_US
dc.title	Genomic sequence search and clustering using Q-gram	en_US
dcterms.abstract	With advances in technology, the amount of biological data, such as DNA sequences and microarray data, has increased tremendously in the past decade. To obtain knowledge from these data, e.g., to enhance our understanding of evolutionary changes and the causes of severe diseases, one has to search for patterns in large, high-dimensional databases. Information retrieval and data mining are powerful tools for extracting information from databases and other information repositories. In the past several years, there have been attempts to apply these two branches of intelligent techniques to different bioinformatics applications. However, the performance of these existing techniques has not been optimized for the characteristics and requirements of biological data, e.g., extremely long genomic sequences with high dimensionality and the need for interpretable search and mining results. In this thesis, we focus on how to improve search and clustering performance in genomic sequence databases. A Q-gram based genomic search (QgramSearch) algorithm and a Q-gram based genomic sequence clustering (QgramClust) algorithm are proposed. QgramSearch efficiently searches a database for sequences homologous to a query sequence. It makes use of two novel hashing techniques to enhance the efficiency of indexing and retrieval. These two hashing techniques better capture the overlapping characteristics of the Q-gram based index. As demonstrated by the experimental results, they run faster than the existing data structures. In addition, we measure the similarity of sequences based on the significance of Q-grams instead of expensive sequence alignment. Thus, our search algorithm can run faster than the well-known BLAST algorithm. Following the idea of QgramSearch, a Q-gram based genomic sequence clustering algorithm (QgramClust) is proposed.
Existing clustering algorithms face the challenge of expensive pairwise sequence comparison on large sequence databases. QgramClust instead employs an inverted index of Q-grams for sequence comparison, which makes the clustering process efficient. Our clustering algorithm is a hybrid of the partitioning method and the hierarchical method. It first quickly clusters groups of nearest neighbors and then merges the resulting clusters. Our experimental results show that QgramClust runs faster than BlastClust.	en_US
dcterms.extent	viii, 85 leaves : ill. (some col.) ; 30 cm.	en_US
dcterms.isPartOf	PolyU Electronic Theses	en_US
dcterms.issued	2007	en_US
dcterms.educationalLevel	All Master	en_US
dcterms.educationalLevel	M.Phil.	en_US
dcterms.LCSH	Hong Kong Polytechnic University -- Dissertations.	en_US
dcterms.LCSH	Genomics.	en_US
dcterms.LCSH	Bioinformatics.	en_US
dcterms.accessRights	open access	en_US
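
Illustrative note (not part of the original record): the abstract describes searching with an inverted index of Q-grams rather than sequence alignment. The sketch below shows only that general idea. It is not the thesis's QgramSearch: the two hashing techniques and the Q-gram significance measure are not specified in this record, so a plain dictionary and a simple count of shared Q-grams stand in for them. The function names, the toy sequences, and the choice q=3 are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def qgrams(seq, q):
    """Return all overlapping substrings of length q, in order."""
    return [seq[i:i + q] for i in range(len(seq) - q + 1)]


def build_index(sequences, q):
    """Inverted index: map each q-gram to the ids of sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for gram in qgrams(seq, q):
            index[gram].add(seq_id)
    return index


def search(index, query, q):
    """Rank sequences by how many distinct query q-grams they contain.

    This shared-q-gram count is an assumed stand-in for the
    significance-based similarity mentioned in the abstract.
    """
    hits = Counter()
    for gram in set(qgrams(query, q)):
        for seq_id in index.get(gram, ()):
            hits[seq_id] += 1
    return hits.most_common()


# Toy database (hypothetical sequences, not data from the thesis).
database = {
    "seq1": "ACGTACGTGA",
    "seq2": "TTTTGGGGCC",
    "seq3": "ACGTTTGACC",
}
index = build_index(database, q=3)
ranking = search(index, "ACGTAC", q=3)
print(ranking)
```

Because only sequences that share at least one Q-gram with the query are touched, unrelated sequences (here seq2) are never compared at all, which is the efficiency argument the abstract makes for an index-based approach.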

Files in This Item:

File	Description	Size	Format
b21657580.pdf	For All Users	1.63 MB	Adobe PDF

Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/1337