Speeding up querying and mining operations on data with modern hardware

Wang, Fang

Author:	Wang, Fang
Title:	Speeding up querying and mining operations on data with modern hardware
Advisors:	Yiu, Man Lung (COMP); Shao, Zili (COMP)
Degree:	Ph.D.
Year:	2022
Subject:	Data mining; Querying (Computer science); Hong Kong Polytechnic University -- Dissertations
Department:	Department of Computing
Pages:	xviii, 163 pages : color illustrations
Language:	English
Abstract:	In recent years, the rapid growth in the variety, velocity, and volume of data has made it increasingly challenging to query and mine data efficiently. In this thesis, we identify three challenging problems in querying and mining data, and propose optimized solutions that exploit modern hardware such as GPUs and emerging non-volatile memory.
Data mining enables us to discover hidden knowledge from data. The past decades have witnessed great successes of data mining in many applications, such as bioinformatics, software engineering, business intelligence, and search engines. Similarity computation is a core subroutine of many mining tasks on multi-dimensional data, which often involve massive, high-dimensional datasets. However, the ever-expanding volume and dimensionality of data make similarity computation the bottleneck that prolongs the mining process. In these mining tasks, the performance bottleneck is caused by the memory wall problem, as a substantial amount of data needs to be transferred from memory to the processors. Recent advances in non-volatile memory (NVM) based processing-in-memory (PIM) make it possible to process data without moving it out of memory, which reduces data transfer and thus alleviates the performance bottleneck of the mining tasks. Nevertheless, NVM PIM supports only specific operations, not arbitrary ones. We tackle this challenge and carefully exploit NVM PIM to accelerate similarity-based mining tasks on multi-dimensional data without compromising the accuracy of the results. Experimental results show that our proposed method achieves up to 11.0x speedup for representative mining algorithms such as kNN classification and k-means clustering.
Blockchains are distributed systems that provide decentralized, secure, and shared data access among untrusted parties. They have been used in applications such as banking, supply chain, healthcare, and IoT scenarios. New data such as transactions are recorded into a block in an append-only (and immutable) manner. A blockchain maintains a linked list of blocks and grows by mining new blocks. However, the mining process incurs significant computational overhead and easily delays data storage. This is because the mining process requires the validity of new data to be verified through a consensus mechanism, proof-of-work, which expends computational effort on solving an arbitrary mathematical puzzle. To improve data storage performance, we propose an NVM PIM architecture to accelerate blockchain mining. NVM PIM can process data directly in the memory arrays. The large number (e.g., tens of thousands) of memory arrays in NVM offers massive parallelism, which is promising for speeding up blockchain mining, a task that demands expensive computation resources. We use matrix transformation to map the operations in blockchain mining onto the matrix multiplication operation, which is supported by NVM PIM. We further propose an intra-transaction and inter-transaction parallel framework to make full use of the parallelism of NVM PIM. The experimental results show that our proposed method significantly outperforms CPU-based and GPU-based implementations.
Analytical query processing is an important function in data warehouses, supporting systematic data reporting and analysis. Efficient query processing is essential for sound and timely strategic decisions in today's competitive, fast-evolving enterprise industry. However, with the growth in data volume and in the complexity of analysis scenarios in real applications, a query that joins multiple relations can easily take hours or even days. Cardinality estimation estimates the sizes of intermediate result relations. Query processing relies on the estimated cardinalities to evaluate the costs of execution plans and can find the optimal execution plan if the estimates are error-free. Deep learning has proven effective at providing more accurate estimates than traditional methods. Nevertheless, learning-based estimators take more estimation time because model inference triggers expensive computation. The GPU is a prevalent accelerator for deep learning model inference due to its high parallelism across many cores. We propose a GPU-enabled learning-based progressive cardinality estimator (LPCE) to speed up end-to-end query execution. LPCE runs on GPUs and offers both short inference time and high estimation accuracy. In addition to serving cardinality estimation before query execution, LPCE can progressively refine the estimates during query execution. We integrate LPCE into PostgreSQL and conduct extensive experiments on real datasets. The results show that LPCE significantly outperforms existing cardinality estimators in end-to-end query execution time.
In summary, this thesis studies how to leverage modern hardware, especially NVM and GPUs, to optimize three types of querying and mining operations on data: similarity-based data mining, blockchain mining, and analytical query execution.
Rights:	All rights reserved
Access:	open access

Files in This Item:

File	Description	Size	Format
6461.pdf	For All Users	4.24 MB	Adobe PDF

Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/11985