Author: Zhang, Jiahao
Title: Interactive analytics over similarity search
Advisors: Yiu, Ken (COMP)
Degree: Ph.D.
Year: 2023
Subject: Database management
Big data
Hong Kong Polytechnic University -- Dissertations
Department: Department of Computing
Pages: xxii, 170 pages : color illustrations
Language: English
Abstract: The volume and variety of datasets have been expanding at astounding speed in recent years. A fundamental way of investigating large datasets is similarity search, which is essential for many applications, such as moving object analysis and anomalous electrocardiogram detection. Beyond straightforward similarity search queries, users often seek more valuable and comprehensive results, e.g., how to attract potential customers in electronic commerce; nevertheless, they are initially constrained by their limited knowledge of newly arriving datasets. Therefore, database management systems are expected to do more than just quickly return similarity search results; they also need to show summaries of datasets, offer suggestions for users' further questions, and effectively answer their advanced queries. This thesis studies an interactive analysis procedure for users, from simple similarity search to complex analytical queries.
To achieve this research objective, we first accelerate the range query, a main subclass of similarity search, over GPS trajectory data. The trajectory range query is a core subroutine in spatial-temporal database management and has numerous applications, e.g., city traffic monitoring. Given a query trajectory, a trajectory dataset, and a distance threshold, the trajectory range query reports all trajectories in the dataset that are within the given distance threshold of the query trajectory. We adopt the discrete Fréchet distance (DFD) as the similarity measure; it is a widely used metric that captures the geographical similarity between two trajectories well. To speed up query evaluation, we propose several lower and upper bounds for DFD, and devise two novel techniques, early termination and invalid cell ignoring, for reducing the cost of exact DFD computation. The experimental evaluation shows that our advanced solution is up to 50 times faster than the baseline on three real datasets.
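The range query defined above can be illustrated with a minimal Python sketch. It uses the textbook dynamic program for DFD together with two generic pruning rules: an endpoint lower bound, and stopping as soon as an entire row of the table exceeds the threshold. These rules are assumptions chosen for illustration and are not necessarily the bounds or the early-termination technique proposed in the thesis; the function names `dfd_within` and `range_query` are hypothetical.

```python
import math


def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def dfd_within(P, Q, eps):
    """Return True iff the discrete Frechet distance of P and Q is <= eps."""
    # Generic lower bound: any coupling must match first with first and
    # last with last, so either endpoint distance bounds DFD from below.
    if dist(P[0], Q[0]) > eps or dist(P[-1], Q[-1]) > eps:
        return False
    prev = None
    for i, p in enumerate(P):
        cur = []
        for j, q in enumerate(Q):
            d = dist(p, q)
            if i == 0 and j == 0:
                best = d
            elif i == 0:
                best = max(d, cur[j - 1])
            elif j == 0:
                best = max(d, prev[0])
            else:
                best = max(d, min(prev[j], prev[j - 1], cur[j - 1]))
            cur.append(best)
        # Every coupling passes through row i, so if the whole row is
        # above eps the final value must be above eps as well.
        if min(cur) > eps:
            return False
        prev = cur
    return prev[-1] <= eps


def range_query(query, dataset, eps):
    """Indices of all trajectories within DFD eps of the query."""
    return [i for i, t in enumerate(dataset) if dfd_within(query, t, eps)]


query = [(0, 0), (1, 0), (2, 0)]
dataset = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 5), (1, 5), (2, 5)],
    [(0, 0), (2, 0)],
]
print(range_query(query, dataset, 1.0))
```

Only two rows of the table are kept in memory at a time, which suffices because the query needs a yes/no answer against the threshold rather than the optimal coupling itself.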
Next, in addition to reporting similarity search answers quickly, we intend to provide useful statistical summaries of datasets and inspire users to formulate new questions. Therefore, we define the distance distribution problem, which can be applied to a variety of datasets and similarity measures. The distance distribution problem has been extensively studied in dozens of applications, e.g., human genome clustering and parameter tuning. Specifically, a given dataset is converted to the distances among its objects; these distances are then sorted and plotted as the cumulative distance distribution (CDD) or the distance distribution histogram (DDH). Since computing the exact CDD and DDH is unacceptably slow, we provide a computation framework to plot approximate distance distribution functions with bounded error guarantees. We then devise a suite of optimization techniques to support low-latency interactive analysis. We evaluate our proposed solutions on three real datasets with three widely used measures. Compared with the sampling-based solution, our method is superior in both accuracy and efficiency, as the experimental results show.
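To make the CDD and DDH definitions concrete, the following Python sketch computes both exactly from all pairwise distances, and adds a simple pair-sampling estimate of the CDD of the kind the abstract refers to as the sampling-based solution. It assumes Euclidean distance over point tuples, and the function names are hypothetical. It does not implement the thesis's bounded-error framework; the exact version is quadratic in the number of objects, which is the cost that motivates approximation.

```python
import bisect
import itertools
import math
import random


def pairwise_distances(points):
    """Sorted list of distances over all unordered pairs (quadratic cost)."""
    return sorted(math.dist(p, q) for p, q in itertools.combinations(points, 2))


def cdd(sorted_dists, r):
    """Exact CDD value: fraction of pairs whose distance is <= r."""
    return bisect.bisect_right(sorted_dists, r) / len(sorted_dists)


def ddh(sorted_dists, edges):
    """Exact DDH: number of pairs in each bucket (edges[k], edges[k + 1]]."""
    counts = [bisect.bisect_right(sorted_dists, e) for e in edges]
    return [hi - lo for lo, hi in zip(counts, counts[1:])]


def sampled_cdd(points, r, n_pairs, seed=0):
    """Estimate the CDD at r from n_pairs randomly drawn pairs (no error bound)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_pairs):
        p, q = rng.sample(points, 2)
        hits += math.dist(p, q) <= r
    return hits / n_pairs


points = [(0, 0), (3, 4), (6, 8)]
dists = pairwise_distances(points)
print(dists, cdd(dists, 5), ddh(dists, [0, 5, 10]))
```

Sorting the distances once lets every later CDD or DDH lookup run as a binary search, which is convenient when a user repeatedly changes the radius or bucket edges.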
The last step in this interactive process is to effectively respond to a variety of analytical queries. To pose these challenging queries, users often need some expert knowledge of the datasets (e.g., the distribution of product attributes); unfortunately, the queries are also highly costly. In this phase, we focus on top-k related queries with continuous preferences, e.g., the uncertain top-k query. These queries have been adopted in many applications such as advertising and marketing analysis, but are expensive to process. By fully exploring the high-dimensional continuous space, we design a generic index, τ-LevelIndex, to answer these queries efficiently. In this context, we propose several construction approaches that, for the first time, build the τ-LevelIndex structure in practice. We then provide efficient index-based query processing methods for different queries. We conduct extensive experiments on both real and synthetic datasets, and the evaluation results show that our approaches can construct the index structure with affordable time and space costs. For three representative queries, our index-based solutions outperform the state-of-the-art solutions by up to two to three orders of magnitude.
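As a point of reference for the queries discussed above, the sketch below evaluates a single top-k query by brute force. It assumes a linear scoring function, i.e., a record's score is the weighted sum of its attributes under a preference (weight) vector; this scoring model and the function name `top_k` are assumptions for illustration. It shows only the naive baseline that scans every record for one weight vector and does not implement τ-LevelIndex.

```python
import heapq


def top_k(records, weights, k):
    """Indices of the k highest-scoring records under a linear preference."""

    def score(i):
        return sum(w * x for w, x in zip(weights, records[i]))

    # nlargest returns the indices ordered from best to worst score.
    return heapq.nlargest(k, range(len(records)), key=score)


records = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9)]
print(top_k(records, (1.0, 0.0), 2))
print(top_k(records, (0.0, 1.0), 2))
```

The two calls return different answers for different weight vectors. Because the weights range over a continuous space, queries that reason about many or all possible preferences cannot simply enumerate them, which is why an index over that space is attractive.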
Rights: All rights reserved
Access: open access

Files in This Item:
File        Description      Size       Format
6806.pdf    For All Users    7.02 MB    Adobe PDF




Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/12358