Full metadata record
| DC Field | Value | Language |
| --- | --- | --- |
| dc.contributor | Department of Computing | en_US |
| dc.contributor.advisor | Lo, Eric (COMP) | - |
| dc.creator | Xu, Wenjian | - |
| dc.identifier.uri | https://theses.lib.polyu.edu.hk/handle/200/9503 | - |
| dc.language | English | en_US |
| dc.publisher | Hong Kong Polytechnic University | - |
| dc.rights | All rights reserved | en_US |
| dc.title | Towards efficient analytic query processing in main-memory column-stores | en_US |
| dcterms.abstract | Recently, there has been a resurgence of interest in main-memory analytic databases because of the large RAM capacity of modern servers and the increasing demand for real-time analytic platforms. In such databases, operations like scan, sort and join are at the heart of almost every query plan. However, current implementations of these operations have not fully leveraged the new features (e.g., SIMD, multi-core) provided by modern hardware. The goal of this dissertation is to design efficient algorithms for scan, sort and join by judiciously exploiting every bit of RAM and all the available parallelism in each processing unit. Scan is a crucial operation since it is closest to the underlying data in the query plan. To accelerate scans, a state-of-the-art in-memory data layout chops data into multiple bytes and exploits early stopping by comparing high-order bytes first. As column widths are usually not multiples of a byte, the last byte in such a layout is padded with 0's, wasting memory bandwidth and computation power. To fully leverage the resources, we propose to weave a secondary index into the vacant bits (i.e., bits originally padded with 0's), forming our new storage layout. This storage layout enables skip-scan, a new fast scan that supports both data skipping and early stopping without any space overhead. | en_US |
| dcterms.abstract | With the advent of fast scans and denormalization techniques, sorting could become the new bottleneck. Queries with multiple attributes in clauses like GROUP BY, ORDER BY, and SQL:2003 PARTITION BY are common in real workloads. When executing such queries, state-of-the-art main-memory column-stores require one round of sorting per input column. To accelerate this kind of multi-column sorting operation, we propose a new technique called "code massaging", which manipulates the bits across the columns so that the overall sorting time can be reduced by eliminating some rounds of sorting and/or by improving the degree of SIMD data-level parallelism. Join remains a time-consuming operation when the overhead of denormalization is too large for it to be applicable. Hash joins have been studied, improved, and reexamined over decades. Their major optimization direction is to partition the input columns so that the working set fits into the caches, improving the locality of hash probing. As an alternative, we propose to utilize a secondary index to improve hash joins without physical partitioning. Specifically, in the build phase, hash values are scattered evenly into logical partitions of the hash table; in the probe phase, the secondary index is used as a hint to re-order the probing sequence, such that the locality of hash probing is increased. Finally, we benchmark the performance of the proposed techniques in our column-store research prototype. Extensive experiments on benchmarks and real data show that our methods offer significant performance improvements over their counterparts. In addition, our methods also show decent scalability on modern multi-core CPUs. | en_US |
| dcterms.extent | xx, 150 pages : illustrations | en_US |
| dcterms.isPartOf | PolyU Electronic Theses | en_US |
| dcterms.issued | 2018 | en_US |
| dcterms.educationalLevel | Ph.D. | en_US |
| dcterms.educationalLevel | All Doctorate | en_US |
| dcterms.LCSH | Hong Kong Polytechnic University -- Dissertations | en_US |
| dcterms.LCSH | Querying (Computer science) | en_US |
| dcterms.LCSH | Computer algorithms | en_US |
| dcterms.accessRights | open access | en_US |

Files in This Item:
| File | Description | Size | Format |
| --- | --- | --- | --- |
| 991022141358003411.pdf | For All Users | 4.28 MB | Adobe PDF |


Copyright Undertaking

As a bona fide Library user, I declare that:

  1. I will abide by the rules and legal ordinances governing copyright regarding the use of the Database.
  2. I will use the Database for the purpose of my research or private study only and not for circulation or further reproduction or any other purpose.
  3. I agree to indemnify and hold the University harmless from and against any loss, damage, cost, liability or expenses arising from copyright infringement or unauthorized usage.

By downloading any item(s) listed above, you acknowledge that you have read and understood the copyright undertaking as stated above, and agree to be bound by all of its terms.

Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/9503