Author: | Xu, Wenjian |
Title: | Towards efficient analytic query processing in main-memory column-stores |
Advisors: | Lo, Eric (COMP) |
Degree: | Ph.D. |
Year: | 2018 |
Subject: | Hong Kong Polytechnic University -- Dissertations Querying (Computer science) Computer algorithms |
Department: | Department of Computing |
Pages: | xx, 150 pages : illustrations |
Language: | English |
Abstract: | Recently, there is a resurgence of interest in main-memory analytic databases because of the large RAM capacity of modern servers and the increasing demand for real-time analytic platforms. In such databases, operations like scan, sort and join are at the heart of almost every query plan. However, current implementations of these operations have not fully leveraged the new features (e.g., SIMD, multi-core) provided by modern hardware. The goal of this dissertation is to design efficient algorithms for scan, sort and join by judiciously exploiting every bit of RAM and all the available parallelisms in each processing unit. Scan is a crucial operation since it is closest to the underlying data in the query plan. To accelerate scans, a state-of-the-art in-memory data layout chops data into multiple bytes and exploits early-stop capability by high-order bytes comparisons. As column widths are usually not multiples of byte, the last-byte of such layout is padded with 0's, wasting memory bandwidth and computation power. To fully leverage the resources, we propose to weave a secondary index into the vacant bits (i.e., bits originally padded with 0's), forming our new storage layout. This storage layout enables skip-scan, a new fast scan that enables both data skipping and early stopping without any space overhead. With the advent of fast scans and denormalization techniques, sorting could become the new bottleneck. Queries with multiple attributes in clauses like GROUP BY, ORDER BY, SQL:2003 PARTITION BY are common in real workloads. When executing such queries, state-of-the-art main-memory column-stores require one round of sorting per input column. To accelerate that kind of multiĀcolumn sorting operation, we propose a new technique called "code massaging", which manipulates the bits across the columns so that the overall sorting time can be reduced by eliminating some rounds of sorting and/or by improving the degree of SIMD data level parallelism. Join stays as a time-consuming operation when the denormalization overhead is too large to be applicable. Hash joins have been studied, improved, and reexamined over decades. Its major optimization direction is to partition the input columns to make the working set fit into the caches, such that the locality of hash probing is improved. As an alternative, we propose to utilize a secondary index to improve hash joins without physical partitioning. Specifically, in the build phase, hash values are scattered evenly into logical partitions of the hash table; in the probe phase, the secondary index is used as hints to re-order the probing sequence, such that the locality of hash probing is increased. Finally, we benchmark the performance of the proposed techniques in our column-store research prototype. Extensive experiments on benchmarks and real data show that our methods offer significant performance improvement over their counterparts. In addition, our methods also show decent scalability on modern multi-core CPUs. |
Rights: | All rights reserved |
Access: | open access |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
991022141358003411.pdf | For All Users | 4.28 MB | Adobe PDF | View/Open |
Copyright Undertaking
As a bona fide Library user, I declare that:
- I will abide by the rules and legal ordinances governing copyright regarding the use of the Database.
- I will use the Database for the purpose of my research or private study only and not for circulation or further reproduction or any other purpose.
- I agree to indemnify and hold the University harmless from and against any loss, damage, cost, liability or expenses arising from copyright infringement or unauthorized usage.
By downloading any item(s) listed above, you acknowledge that you have read and understood the copyright undertaking as stated above, and agree to be bound by all of its terms.
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/9503