Author:  Szeto, Chi Cheong 
Title:  Modeling and querying probabilistic RDFS data with correlated triples using Bayesian networks 
Degree:  Ph.D. 
Year:  2014 
Subject:  RDF (Document markup language); Semantic Web; Hong Kong Polytechnic University -- Dissertations 
Department:  Dept. of Computing 
Pages:  xv, 119 p. : ill. ; 30 cm. 
Language:  English 
InnoPac Record:  http://library.polyu.edu.hk/record=b2747284 
URI:  http://theses.lib.polyu.edu.hk/handle/200/7489 
Abstract:  Resource Description Framework (RDF) is a World Wide Web Consortium (W3C) data model for the Semantic Web. RDF data are RDF triples, and an RDF triple is a triple (subject, property, object). RDF Schema (RDFS) extends RDF by providing a vocabulary to describe application-specific classes and properties, class and property hierarchies, and which classes and properties are used together. RDFS reasoning leverages the vocabulary to derive additional RDF triples from the data. In recent years, probabilistic models for RDF have been proposed to better represent real-life information, which is full of uncertainty. Existing models either have limited capabilities to model correlated data or ignore the semantics of the data. We argue that being able to model correlated RDF data is necessary. First, RDF data using the RDFS vocabulary are correlated. Second, correlated data occur in practice. Hence, we introduce a probabilistic model called probabilistic RDFS (pRDFS), which encodes statistical relationships among correlated RDF triples and satisfies the RDFS semantics. Representing and performing probabilistic inference on correlated data are expensive. We use Bayesian networks to represent the correlated data and probabilistic logic sampling to perform approximate inference. Since there may exist some truth value assignments that violate the RDFS semantics, we devise a consistency checking algorithm for pRDFS. The algorithm checks that the probabilities of all inconsistent truth value assignments for the correlated RDF triples are zero. It is executed once on static data. For data that are frequently updated, we propose an incremental approach that provides fast rechecking each time the data are updated. SPARQL is a W3C query language for RDF. The pattern of a SPARQL query is a conjunction of triple patterns, and a triple pattern is an RDF triple any member of which can be replaced with a variable. 
A solution to the query is the bindings of the query variables such that the query pattern matches the data or the data derived through the RDFS reasoning. We extend the query by including truth values in the triple patterns to match the uncertain data. Apart from the bindings of the query variables, an answer to the extended query includes the probability of the bindings, which is equal to the probability of the matched data. pRDFS fully specifies the probability distribution of declared data, but not derived data. A single probability value may not be able to specify the probability of the matched data containing derived data, and we show how to compute the probability bounds of the matched data in this case. Finally, we present an experimental evaluation of the running time performance of our proposed algorithms with respect to the data size, the percentage of uncertain data, the size of correlated data (by varying the number of nodes in a Bayesian network), and the complexity of the probability distributions (by varying the degree of nodes in a network). The algorithms were tested on the Berlin SPARQL Benchmark, the Lehigh University Benchmark, and random uncertain data. 
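The ideas in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the two triples, the `ex:` names, the probabilities, and every function name below are hypothetical, and the consistency check is a brute-force enumeration standing in for the thesis's algorithm. The sketch assumes a certain schema triple (ex:Student rdfs:subClassOf ex:Person), models two correlated uncertain triples as a two-node Bayesian network, checks that every truth value assignment violating the subclass rule has probability zero, and estimates the probability of a matched pattern by probabilistic logic sampling (forward sampling in topological order).

```python
import itertools
import random

# Hypothetical uncertain triples. The schema triple
# (ex:Student rdfs:subClassOf ex:Person) is assumed to be certain.
T1 = ("ex:Tom", "rdf:type", "ex:Student")
T2 = ("ex:Tom", "rdf:type", "ex:Person")

# Bayesian network listed in topological order: (node, parents, CPT).
# Each CPT maps a tuple of parent truth values to P(node is True).
# The numbers are made up for illustration.
NETWORK = [
    (T1, (), {(): 0.6}),
    (T2, (T1,), {(True,): 1.0, (False,): 0.3}),
]


def joint_probability(assignment):
    """Exact probability of a full truth value assignment (chain rule)."""
    p = 1.0
    for node, parents, cpt in NETWORK:
        p_true = cpt[tuple(assignment[q] for q in parents)]
        p *= p_true if assignment[node] else 1.0 - p_true
    return p


def violates_rdfs(assignment):
    """Subclass rule: if Tom is a Student, Tom must also be a Person."""
    return assignment[T1] and not assignment[T2]


def is_consistent():
    """Brute-force check: every violating assignment has zero probability."""
    nodes = [node for node, _, _ in NETWORK]
    for values in itertools.product([True, False], repeat=len(nodes)):
        assignment = dict(zip(nodes, values))
        if violates_rdfs(assignment) and joint_probability(assignment) > 0.0:
            return False
    return True


def sample_once(rng):
    """Draw one assignment by sampling each node given its sampled parents."""
    assignment = {}
    for node, parents, cpt in NETWORK:
        p_true = cpt[tuple(assignment[q] for q in parents)]
        assignment[node] = rng.random() < p_true
    return assignment


def estimate(pattern, n_samples=20000, seed=0):
    """Fraction of samples in which every triple in `pattern` has the
    requested truth value; an approximation of the match probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        assignment = sample_once(rng)
        if all(assignment[t] == v for t, v in pattern.items()):
            hits += 1
    return hits / n_samples
```

Calling `is_consistent()` on this network returns `True`, because the only violating assignment (T1 true, T2 false) has probability 0.6 × 0 = 0. Calling `estimate({T2: True})` approximates the exact value 0.6 × 1.0 + 0.4 × 0.3 = 0.72. Enumeration is exponential in the number of correlated triples, so it is only workable for toy networks like this one.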
Files  Size  Format 

b2747284x.pdf  1.210Mb 

