Author: | Szeto, Chi Cheong |
Title: | Modeling and querying probabilistic RDFS data with correlated triples using Bayesian networks |
Degree: | Ph.D. |
Year: | 2014 |
Subject: | RDF (Document markup language); Semantic Web; Hong Kong Polytechnic University -- Dissertations |
Department: | Department of Computing |
Pages: | xv, 119 p. : ill. ; 30 cm. |
Language: | English |
Abstract: | Resource Description Framework (RDF) is a World Wide Web Consortium (W3C) data model for the Semantic Web. RDF data are RDF triples, and an RDF triple is a triple (subject, property, object). RDF Schema (RDFS) extends RDF by providing a vocabulary to describe application-specific classes and properties, class and property hierarchies, and which classes and properties are used together. RDFS reasoning leverages the vocabulary to derive additional RDF triples from the data. In recent years, probabilistic models for RDF have been proposed to better represent real-life information, which is full of uncertainty. Existing models either have limited capabilities to model correlated data or ignore the semantics of the data. We argue that being able to model correlated RDF data is necessary. First, RDF data using the RDFS vocabulary are correlated. Second, correlated data occur in practice. Hence, we introduce a probabilistic model called probabilistic RDFS (pRDFS), which encodes statistical relationships among correlated RDF triples and satisfies the RDFS semantics. Representing correlated data and performing probabilistic inference on them are expensive. We use Bayesian networks to represent the correlated data and probabilistic logic sampling to perform approximate inference. Since there may exist some truth value assignments that violate the RDFS semantics, we devise a consistency checking algorithm for pRDFS. The algorithm checks that the probabilities of all inconsistent truth value assignments for the correlated RDF triples are zero. It is executed once on static data. For data that are frequently updated, we propose an incremental approach that provides fast rechecking each time the data are updated. SPARQL is a W3C query language for RDF. The pattern of a SPARQL query is a conjunction of triple patterns, and a triple pattern is an RDF triple any member of which can be replaced with a variable. A solution to the query is the bindings of the query variables such that the query pattern matches the data or the data derived through the RDFS reasoning. We extend the query by including truth values in the triple patterns to match the uncertain data. Apart from the bindings of the query variables, an answer to the extended query includes the probability of the bindings, which is equal to the probability of the matched data. pRDFS fully specifies the probability distribution of declared data, but not derived data. A single probability value may not be able to specify the probability of the matched data containing derived data, and we show how to compute the probability bounds of the matched data in this case. Finally, we present an experimental evaluation of the running-time performance of our proposed algorithms with respect to the data size, the percentage of uncertain data, the size of correlated data (by varying the number of nodes in a Bayesian network), and the complexity of the probability distributions (by varying the degree of nodes in a network). The algorithms were tested on the Berlin SPARQL Benchmark, the Lehigh University Benchmark, and random uncertain data. |
Rights: | All rights reserved |
Access: | open access |
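The abstract describes representing correlated uncertain triples with a Bayesian network and estimating probabilities by probabilistic logic sampling. The following is a minimal illustrative sketch of that general idea, not the thesis's implementation: the two triples, the `ex:` names, the conditional probability values, and the function names `sample_assignment` and `estimate_probability` are all assumptions made up for this example.

```python
import random

# Hypothetical uncertain triples (subject, property, object); not from the thesis.
T1 = ("ex:alice", "rdf:type", "ex:Student")
T2 = ("ex:alice", "ex:takes", "ex:course1")

# Toy Bayesian network listed in topological order: (node, parents, CPT).
# Each CPT maps a tuple of parent truth values to P(node is true | parents).
NETWORK = [
    (T1, (), {(): 0.7}),
    (T2, (T1,), {(True,): 0.9, (False,): 0.2}),
]


def sample_assignment(network, rng):
    """Draw one truth value assignment, sampling each node after its parents."""
    assignment = {}
    for node, parents, cpt in network:
        p_true = cpt[tuple(assignment[p] for p in parents)]
        assignment[node] = rng.random() < p_true
    return assignment


def estimate_probability(network, pattern, n_samples, seed=0):
    """Estimate the probability that every triple in `pattern` has the
    requested truth value, as the fraction of matching samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        sample = sample_assignment(network, rng)
        if all(sample[triple] == value for triple, value in pattern.items()):
            hits += 1
    return hits / n_samples


if __name__ == "__main__":
    # For this toy network the exact value is 0.7 * 0.9 = 0.63;
    # the sampled estimate should be close to it.
    print(estimate_probability(NETWORK, {T1: True, T2: True}, 20000))
```

The `pattern` argument mirrors the idea of a query that attaches truth values to triples: the estimate is the fraction of sampled truth value assignments in which the matched triples take the requested values.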
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
b2747284x.pdf | For All Users | 1.18 MB | Adobe PDF |
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/7489