Author: | Li, Zhe |
Title: | Analytical range query acceleration |
Advisors: | Yiu, Ken (COMP) |
Degree: | Ph.D. |
Year: | 2022 |
Subject: | Querying (Computer science); Information retrieval -- Data processing; Hong Kong Polytechnic University -- Dissertations |
Department: | Department of Computing |
Pages: | xxii, 165 pages : color illustrations |
Language: | English |
Abstract: | With the ever-increasing volume of data in daily business operations, retrieving query results efficiently has become considerably more difficult. This is especially true for analytical range queries, which must process a large portion of the data to generate analytical results. In this thesis, we identify three analytical range queries, each of which is a read-only query containing a range parameter that limits the scope of the dataset or the size of the output. We provide efficient algorithms to reduce their query response time. The first query is the range query on block-based storage systems, such as HDFS and Databricks. The range parameter here can be represented by a hyper-rectangle, which is used to select the required records from a multi-dimensional dataset. Due to the large result size, I/O and data scanning are the bottlenecks for such a query. To reduce the query cost, existing approaches split the dataset into small partitions to avoid unnecessary data scans. These techniques mostly rely on historical queries to determine the partition layout (i.e., split positions). However, such query-driven approaches assume that future queries are identical to the historical ones, which is rarely the case in practice. In this work, we fill the research gap of query-driven partitioning when future queries differ from the historical ones but are similar in general. We formally define the similarity and propose split functions to minimize the query cost for future queries. Experimental results show that our method can be up to 70 times more efficient than the state of the art. The second query is the range aggregate query in one or two dimensions, such as COUNT, SUM, MIN, and MAX. The range parameter here can be expressed as an interval (in one dimension) or a rectangle (in two dimensions). A selected aggregate function is then evaluated on the records within this range. Such aggregate queries are frequently used both by analysts (in OLAP) and in various OLTP scenarios.
For example, Foursquare, with more than 50 million monthly active users, helps users find the number of specific POIs (e.g., restaurants) within given regions. In this work, we investigate how to provide approximate range aggregate query results efficiently with bounded errors. We offer an index-based solution that represents discrete points with polynomial functions and provides the best tradeoff between query response time, accuracy, and space among all competitors. The third query is similarity search on keyword-induced point groups, where the range parameter contains a value K that restricts the result size, so that only the K most similar point groups are returned. A keyword-induced point group is formed by geo-positions (e.g., a tweet's geo-tag) sharing the same keyword (e.g., the topic of a tweet). We found that if two keyword-induced point groups are close in Hausdorff distance, their keywords are highly likely to have semantic connections. Such information is crucial to targeted marketing and recommendation. However, as the time complexity of this similarity search is proportional to the query point group's size and the dataset size, analysts are unable to retrieve query results quickly. To speed up this query, we propose a pruning-based method. Experiments on Twitter data show that our technique is up to 6 times faster than the state of the art. |
Rights: | All rights reserved |
Access: | open access |
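The abstract's third query ranks keyword-induced point groups by Hausdorff distance. The record does not describe the thesis's pruning method, so the sketch below shows only the unpruned brute-force baseline whose cost grows with the query group's size and the dataset size. The function names, the dictionary layout, and the sample coordinates are all hypothetical and chosen for illustration.

```python
import heapq
import math


def directed_hausdorff(a, b):
    """Largest distance from any point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point groups."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def top_k_similar(query_group, groups, k):
    """Return the k (distance, keyword) pairs closest to `query_group`.

    `groups` maps a keyword to its list of (x, y) positions (assumed layout).
    Every group is scanned in full; no pruning is applied.
    """
    scored = ((hausdorff(query_group, pts), kw) for kw, pts in groups.items())
    return heapq.nsmallest(k, scored)


# Hypothetical data: three keywords, each with two geo-positions.
groups = {
    "coffee": [(0.0, 0.0), (1.0, 0.0)],
    "espresso": [(0.0, 0.5), (1.0, 0.5)],
    "ski": [(10.0, 10.0), (11.0, 10.0)],
}
query = [(0.0, 0.0), (1.0, 0.0)]
nearest = top_k_similar(query, groups, k=2)
```

With this toy data, the two groups nearest to the query are "coffee" and "espresso", while the distant "ski" group is excluded. A pruning-based approach such as the one the abstract mentions would aim to skip most of the pairwise distance computations performed here.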
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/11711