Author: Mak, Yuen-shan
Title: Efficient information access in Web
Degree: M.Sc.
Year: 2001
Subject: Hong Kong Polytechnic University -- Dissertations; Internet searching; Web search engines
Department: Multi-disciplinary Studies; Department of Computing
Pages: 102 leaves : ill. (some col.) ; 30 cm
Language: English
Abstract: The Internet has led to the build-up of a global information infrastructure. With the ever-growing amount of information available there, searching for information is no easy matter. One may have to spend a lot of time every day searching for useful information and filtering out unwanted information. When exploring a web site, a site map can serve as a shortcut that makes the target information much easier to reach. If no site map is provided, the only way is to drill into the web site by repeatedly following the hyperlinks in each web document, reading the information, filtering out what is unwanted, and so forth. This is very time-consuming. Although search engines can be used to find related web documents on the Internet in a short time, the traversal between different web pages still implies high bandwidth usage and leads to inefficient web surfing. To address these problems, this project introduces a methodology for Document Information Extraction. This is a kind of Inter-document Information Extraction that aims to intelligently re-organize a document cluster with respect to a requested web page, called the root document. The document cluster works like a site map, but it is dynamic and can span several web sites. It can be pictured as a multi-branched, tree-like structure. With the use of an algorithm, a document cluster is constructed by repeatedly following all outgoing hyperlinks of the root document, of its sub-document pages, and of each further sub-document page. Left unchecked, the cluster may grow without limit. Hence, the Greedy Clustering Algorithm is adopted to intelligently restrict the growth of the cluster to a reasonable size. Based on the algorithm, a prototype will be developed. It is mainly composed of two modules, a Page Content Proxy Server and a Page Content Proxy Client. The algorithm requires information to be collected from a large number of web pages.
In order to reduce the processing time for each web page, a procedure will be carried out to perform Document Context Extraction. Further, an offline module will be used to analyze web access logs in order to provide more realistic statistical data on the hit rate of each web page. Afterwards, experiments will be conducted to fine-tune the algorithm so that it gives a more reasonable and acceptable document cluster upon request. To gain greater flexibility, XML (Extensible Markup Language) is adopted in the prototype. The document cluster will be generated on the server side in the form of XML pages, which will be sent to the client modules for further processing. Upon receiving the XML data stream from the server, the client module will parse it and convert it into other presentable data formats. Currently, two formats are presented, using VRML (Virtual Reality Modeling Language) and DOM (Document Object Model). For the VRML format, a transformer is built in at the client side so that a three-dimensional model view can be visualized. For DOM, JavaScript is used to access the parsed XML data, and a two-dimensional tree format is output. In this project, XML is adopted extensively to promote platform independence and to facilitate further extension to other existing or newly developed protocols, e.g. Wireless Markup Language (WML) on Wireless Application Protocol (WAP).
Rights: All rights reserved
Access: restricted access

Files in This Item:
File: b1668140x.pdf
Description: For All Users (off-campus access for PolyU Staff & Students only)
Size: 7.72 MB
Format: Adobe PDF


Copyright Undertaking

As a bona fide Library user, I declare that:

  1. I will abide by the rules and legal ordinances governing copyright regarding the use of the Database.
  2. I will use the Database for the purpose of my research or private study only and not for circulation or further reproduction or any other purpose.
  3. I agree to indemnify and hold the University harmless from and against any loss, damage, cost, liability or expenses arising from copyright infringement or unauthorized usage.

By downloading any item(s) listed above, you acknowledge that you have read and understood the copyright undertaking as stated above, and agree to be bound by all of its terms.

Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/2296