Uncertainty analysis and data quality assurance in spatial big data

Chen, Pengfei

Author:	Chen, Pengfei
Title:	Uncertainty analysis and data quality assurance in spatial big data
Advisors:	Shi, Wenzhong (LSGI)
Degree:	Ph.D.
Year:	2020
Subject:	Geographic information systems -- Data processing; Big data; Uncertainty; Spatial data infrastructures -- Quality control; Hong Kong Polytechnic University -- Dissertations
Department:	Department of Land Surveying and Geo-Informatics
Pages:	xv, 142 pages : color illustrations
Language:	English
Abstract:	Since the term 'big data' was coined for the first time in 2005, it has unleashed a worldwide evolution in scientific research and business. Spatial big data (SBD), the big data associated with geographical information, are one of the most valuable products of modern science, driven by the rapid development of smart technology and sensor technology. Generally, SBD can be classified as earth observation data and human activity data. Thus far, SBD has stimulated a continuous wave of innovations in a wide range of disciplines, such as geoscience, urbanology and environmental science. Uncertainty has long been recognised as an essential element affecting the entire process of spatial data production and analysis. Inappropriate uncertainty management can result in misleading knowledge and cause tremendous losses. Over the past decades, considerable efforts have been made to develop theories and methods for uncertainties in spatial data and analytics. However, given the continuously increasing complexity and volume of SBD, traditional uncertainty analytics (especially those that involve external information, intensive labour and personal intuition) has become less efficient and even invalid. On this basis, this thesis aims to propose efficient methods based on data mining techniques, which are less dependent on external resources, for uncertainty evaluation, modelling and quality assurance in selected SBD types. Special attention has been paid to spatial vector data, trajectory data and spatial time-series data, which are amongst the most representative SBD types and significant in practical applications. During spatial data production, quality assessment and control (QAC) is the primary process that controls data uncertainty and reliability. Traditionally, reference data are required during QAC for direct comparison to discover errors.
However, in the context of SBD, complete and accurate reference data are often unavailable for many reasons, such as the vast area of coverage and the barriers between administrative divisions. In such a situation, developing reference-reduced or reference-free methods for QAC is necessary and promising. Therefore, this thesis started with a reference-free method to locate potential errors in multilayer vector data, the most representative data structure in practice. Spatial relationship complexity was adopted as an indicator for the identification of potential errors. The linkage between spatial relationship complexity and errors was initially discussed. A contribution function based on distance measurement was introduced to estimate the contribution of each vector layer, and the results were further taken as input into an entropy-based indicator to obtain the overall complexity measurement. On the basis of experiments on simulated and real-life datasets, the proposed approach outperformed state-of-the-art methods in providing realistic complexity measurements, and the resultant complexity map could provide useful information to facilitate manual inspection during QAC for large-scale vector data. To extend this idea to a single vector layer, another reference-free method was proposed to identify potential classification errors in land use/land cover (LULC) data. In this method, land patches belonging to the same land class were assumed to present similar spectral-spatial features. In view of the influence of production scale on feature extraction, an adaptive segmentation strategy based on a local variance index was designed to obtain homogeneous segments. A clustering operation was further applied to the extracted features to distinguish outliers conservatively. Finally, an entropy-based indicator was developed to measure the likelihood that a land patch is erroneously classified, based on the clustering results.
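Both indicators described above are entropy-based. As a rough illustration only, the sketch below turns per-layer distances into contributions and then into a normalised Shannon entropy. The exponential decay in distance_contribution, its scale parameter, and all function names are assumptions made for this example; the abstract does not state the thesis's actual contribution function or indicator.

```python
import math


def distance_contribution(distance, scale=100.0):
    """Hypothetical contribution function: decays exponentially with distance."""
    return math.exp(-distance / scale)


def normalized_entropy(weights):
    """Shannon entropy of the normalised weights, scaled to the range [0, 1]."""
    total = sum(weights)
    if total <= 0 or len(weights) < 2:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    # Divide by the maximum possible entropy, reached when all weights are equal.
    return entropy / math.log(len(weights))


def complexity_at(distances_to_layers, scale=100.0):
    """Complexity at one location from its distance to the nearest feature of each layer."""
    contributions = [distance_contribution(d, scale) for d in distances_to_layers]
    return normalized_entropy(contributions)
```

Under these assumptions, a location that is equally close to features of every layer scores near 1, whereas a location dominated by a single layer scores near 0.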
Experiments showed that the proposed method is more accurate than other state-of-the-art methods. During the QAC for traditional classification data, the influence of the data production specification has always been neglected; however, this may be inapplicable to SBD due to its vast data volume. For this reason, this study proposed an evaluation method specifically for the uncertainty caused by the minimum mapping unit in LULC data. An assumption was initially made on the skewed distribution of land patch sizes and validated on open data. The optimal skewed distribution was determined through curve fitting. Thus, the omission errors could be evaluated based on the fitting results. The resultant omission errors were further used to estimate the commission errors by considering the conversion between land classes based on the statistics of their adjacency. Finally, a confusion matrix could be obtained to evaluate the overall classification accuracy. Experiments on a real-life land cover dataset showed that the proposed method could accurately estimate the classification uncertainty for most land classes. Trajectory and spatial time-series data are two of the most representative SBD types commonly used in current studies on human behaviour, mobility and transportation. Therefore, specific efforts have been made to address prominent uncertainty issues of the two data types. For trajectory data, this study focused on modelling the uncertainty caused by sampling and measurement errors. This issue was selected because an effective uncertainty model is critical for many applications on trajectories, such as spatial query and visualisation. To reduce redundant uncertain regions in state-of-the-art models, an adaptive error ellipse (AEE) model was established, in which the optimal size for an error ellipse was obtained based on the Minkowski distance metric through mining the intrinsic characteristics in the trajectory data.
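For orientation, the sketch below shows the general shape of an error-ellipse uncertainty region between two consecutive trajectory samples: the set of positions whose summed distance to both samples does not exceed a major-axis length. It is not the AEE model itself. The function names, the maximum-speed rule for the major axis, and the way the Minkowski order is exposed as a parameter are all assumptions made for illustration; the abstract does not say how the AEE size is derived.

```python
import math


def minkowski(p, q, order=2.0):
    """Minkowski distance between two points; order=2 is the Euclidean case."""
    return sum(abs(a - b) ** order for a, b in zip(p, q)) ** (1.0 / order)


def major_axis_from_speed(max_speed, t_start, t_end):
    """Assumed baseline sizing rule: farthest total path length at maximum speed."""
    return max_speed * (t_end - t_start)


def in_uncertain_region(point, start, end, major_axis, order=2.0):
    """True if the object could have passed through `point` between the two samples."""
    detour = minkowski(point, start, order) + minkowski(point, end, order)
    return detour <= major_axis


def ellipse_area(start, end, major_axis):
    """Area of the Euclidean ellipse whose foci are the two samples."""
    a = major_axis / 2.0                 # semi-major axis
    c = minkowski(start, end) / 2.0      # half the distance between the foci
    if a < c:
        raise ValueError("major axis is shorter than the distance between samples")
    return math.pi * a * math.sqrt(a * a - c * c)
```

Shrinking major_axis shrinks the area returned by ellipse_area, which is the sense in which a tighter, data-driven size reduces redundant uncertain regions.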
A broad AEE model was further developed to include measurement error during the model construction, and an ellipse formulation was deduced to avoid intensive computation of the theoretical model and enhance practical applicability. Experiments on five real-life datasets showed that, in comparison with the state-of-the-art methods, the proposed models could significantly narrow the uncertain ellipses while retaining comparable accuracy. A case study on trajectory similarity analysis was further conducted to exemplify the practical advantages of the proposed models compared with the state-of-the-art methods. Lastly, this study discussed the uncertainty in spatial time series, and special attention was paid to evaluating its predictability. To reduce the interference from the randomness in human behaviour on predictability evaluation, a novel evaluation method was proposed on the basis of entropy indexes and a time series decomposition technique. Experiments were conducted on a real-life metro ridership dataset to validate the effectiveness of the proposed method. Results showed that the proposed indicators correlated more strongly with the real predictability results than the traditional indicators. To further demonstrate its usefulness, an uncertainty-based loss function was implemented using the predictability measurements and was applied to the classical long short-term memory model for validation. Experiments showed that the proposed loss function could significantly improve prediction accuracy. The proposed method is theoretically extensible to other SBD in the form of time series. Uncertainty will always be a major topic in future studies of SBD. The adoption of data mining techniques may improve the execution efficiency of uncertainty analytics and data quality assurance and further enhance the reliability of related applications in a broad sense.
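The combination of decomposition and entropy mentioned above can be sketched as follows, under stated assumptions. Here the periodic component is removed by subtracting the mean of each phase (for example, each hour of the day), and the irregularity of a series is scored with normalised permutation entropy. Both choices, and all function names, are stand-ins chosen for this example; the abstract does not name the specific entropy indexes or decomposition technique used in the thesis.

```python
import math
from collections import Counter


def remove_periodic_mean(series, period):
    """Subtract the mean of each phase (position within the period) from the series."""
    sums = [0.0] * period
    counts = [0] * period
    for i, x in enumerate(series):
        sums[i % period] += x
        counts[i % period] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return [x - means[i % period] for i, x in enumerate(series)]


def permutation_entropy(series, order=3):
    """Normalised permutation entropy in [0, 1]; higher means less regular."""
    n_windows = len(series) - order + 1
    if n_windows <= 0:
        raise ValueError("series is shorter than the pattern order")
    # Each window is reduced to the ordering of its values (ties keep index order).
    patterns = Counter(
        tuple(sorted(range(order), key=lambda k: series[i + k]))
        for i in range(n_windows)
    )
    entropy = -sum(
        (c / n_windows) * math.log(c / n_windows) for c in patterns.values()
    )
    return entropy / math.log(math.factorial(order))
```

For a strictly periodic series, remove_periodic_mean leaves a residual of zeros and permutation_entropy of that residual is 0. For real ridership data, the residual entropy gives one possible measure of how much irregular behaviour remains after the regular pattern is accounted for.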
Rights:	All rights reserved
Access:	open access

Files in This Item:

File	Description	Size	Format
5126.pdf	For All Users	13.01 MB	Adobe PDF

Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/10722