Parallel data mining algorithms for multi-dimensional points on GPUs

Matsumoto, Takazumi

Author:	Matsumoto, Takazumi
Title:	Parallel data mining algorithms for multi-dimensional points on GPUs
Advisors:	Yiu, Man Lung (COMP) Hung, Edward (COMP) Chung, Korris (COMP)
Degree:	Ph.D.
Year:	2015
Subject:	Data mining. Data mining -- Mathematics High performance computing Hong Kong Polytechnic University -- Dissertations
Department:	Department of Computing
Pages:	xvi, 176 pages : illustrations
Language:	English
Abstract:	Data mining tasks such as clustering, outlier detection and similarity search typically employ a series of algorithms to operate on a large set of data, making them amenable to parallelization. Thus parallelization of data mining operations such as distance computation has been extensively studied in the literature. In recent years, the use of Graphics Processing Units (GPUs) for data mining has been steadily increasing. As state-of-the-art processors now include both CPU and GPU, it is important to consider which approaches benefit from GPU processing and which do not, and apply a heterogeneous processing approach to improve the efficiency when applicable. Similarity search, also referred to as k-nearest neighbor search, is a particular application of distance computation where only to top k results (e.g., top 10) are required. It is used extensively in multimedia search, where only a small subset of possible results are used, and numerous approaches using both exact and approximate k-nearest neighbor methods have been proposed over the years. Our contribution is a new exact distance algorithm that not only outperforms competing exact GPU methods, but can compete with leading approximate GPU methods. It can also leverage heterogeneous (CPU-GPU) processing for improved efficiency. Outlier detection, also known as anomaly detection, involves detecting data points that deviate from expected patterns. It is an important part of applications such as fraud detection and system fault detection. In many real-world applications such as wireless sensor readings and weather forecasts, data are uncertain in nature. Rather than discarding uncertainty information or storing many sample readings, it is possible to instead approximate the probability distribution function (pdf) to compactly store and efficiently calculate uncertain distance when required. Existing GPU approaches are limited to working with certain data, while our contribution is a parallel method for outlier detection on uncertain data. Clustering is another common operation in data mining. Our work focuses on a particular application, label generation for web search results that automatically generates labels for related topics returned by a user's web search. This is accomplished using a fuzzy transduction-based relational model. We also contribute a GPU implementation of our proposed solution.
Rights:	All rights reserved
Access:	open access

Files in This Item:

File	Description	Size	Format
b28157369.pdf	For All Users	5.51 MB	Adobe PDF	View/Open

Copyright Undertaking

As a bona fide Library user, I declare that:

I will abide by the rules and legal ordinances governing copyright regarding the use of the Database.
I will use the Database for the purpose of my research or private study only and not for circulation or further reproduction or any other purpose.
I agree to indemnify and hold the University harmless from and against any loss, damage, cost, liability or expenses arising from copyright infringement or unauthorized usage.

By downloading any item(s) listed above, you acknowledge that you have read and understood the copyright undertaking as stated above, and agree to be bound by all of its terms.

Show full item record

Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/8035