Structural similarity on XML data and its applications

Ng, Kar-leung

Author:	Ng, Kar-leung
Title:	Structural similarity on XML data and its applications
Degree:	Ph.D.
Year:	2007
Subject:	Hong Kong Polytechnic University -- Dissertations. XML (Document markup language) Data structures (Computer science)
Department:	Department of Computing
Pages:	xiii, 154 p. : ill. ; 30 cm.
Language:	English
Abstract:	This dissertation addresses the problem of detecting the structural similarity of XML (Extensible Markup Language) documents from heterogeneous sources, and its applications to querying and web mining. This topic has attracted much attention, and a number of similarity measures have been proposed in recent years. Unlike most distance metrics, which are based on the direct transformation between documents, a successful similarity measure should be able to assign higher scores to documents of similar types. To address the problem, we detect and analyze the conformity of a document against a schema that governs the document structure. The goal of our study is therefore to investigate the issues involved in defining a structural measure that supports the detection of documents of similar types. (1) We first present a formal framework for defining the structural similarity of a document against a schema. We illustrate that the choice of schema language, DTD or XML Schema, does not make a major difference to the framework. (2) We extend the framework to compare documents without the prerequisite of a schema. Structural similarity has a wide variety of applications in automatic document processing. In the second half of the dissertation, we demonstrate its applicability to XML indexing, proximity querying and group detection using clustering. We first propose RRSi, a novel structural index designed for structure-based query lookup on heterogeneous sources of XML documents, supporting proximate query answers. The index successfully avoids the redundant processing of structurally irrelevant candidates that might show good content relevance. An optimized version of the index, oRRSi, is also developed to further reduce both space and computational complexity. To the best of our knowledge, these structural indexes are the first work supporting proximity twig queries on XML documents.
The experimental results show that RRSi- and oRRSi-based query processing significantly outperforms previously proposed techniques on XML repositories with structural heterogeneity. We then examine the applicability of structural similarity in the area of web mining. A sitemap is a convenient navigation link system reflecting the key structure of a website, and has become a standard website feature. Although website owners may choose to present their services or information in a variety of ways, a certain level of similarity in web structure and content is often observed among websites in the same domain, since they typically follow some evolved community standard. Clustering sitemaps by structure helps to detect groups of websites in identical domains and is complementary to link-based ranking algorithms. We examine in this dissertation how to cluster sitemaps as tree-structured documents. We introduce a new similarity measure between sitemaps, which reflects their key characteristics in the scoring. Moreover, the measure supports a centroid-based clustering algorithm that avoids pair-wise comparisons and achieves a significant gain in efficiency. We implemented the proposed clustering algorithm and ran extensive experiments on real and synthetic datasets, showing its effectiveness and efficiency over other clustering algorithms based on previous similarity metrics.
Rights:	All rights reserved
Access:	open access

Files in This Item:

File	Description	Size	Format
b21167539.pdf	For All Users	3.45 MB	Adobe PDF

Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/2634