Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor | Department of Computing | en_US |
dc.creator | Choi, Kam-chang | - |
dc.identifier.uri | https://theses.lib.polyu.edu.hk/handle/200/5648 | - |
dc.language | English | en_US |
dc.publisher | Hong Kong Polytechnic University | - |
dc.rights | All rights reserved | en_US |
dc.title | Web crawling : duplication and freshness of web pages | en_US |
dcterms.abstract | This dissertation studies challenges and issues faced in the implementation of a Web crawler. A crawler is a system that retrieves web pages from the Web and stores them locally to serve different purposes, such as indexing for a Web search engine, building a digital library, and so on. A crawler typically downloads hundreds of millions, or even billions, of pages in a short period and is responsible for keeping the downloaded pages up-to-date. In addition, the crawler is responsible for avoiding the download of duplicated pages, so that its limited resources are put to the best use. This dissertation studies how to build an effective Web crawler that uses as few resources as possible to retrieve "high quality" pages and to keep the page collection fresh and of high value. To achieve these goals, we first identify common definitions of "(near) duplication" and "freshness" of pages, and study mathematical functions for the quality metrics "freshness" and "age" of pages. We also discuss topics and technologies closely related to duplicate page detection, web site mirroring detection, and URL normalization. I then propose a data structure and associated algorithms that speed up the duplication detection process while reducing the chance of false positives. In addition, I make recommendations for improving the page revisiting policy, based on the reduction in the number of pages achieved by duplication detection. Three experiments are conducted on the proposed data structure and algorithms. Finally, I conclude the dissertation with the findings of the experiments and suggest a robust integrated architecture for duplication detection and page revisiting. | en_US |
dcterms.extent | vii, 156 leaves : ill. ; 31 cm. | en_US |
dcterms.isPartOf | PolyU Electronic Theses | en_US |
dcterms.issued | 2010 | en_US |
dcterms.educationalLevel | All Master | en_US |
dcterms.educationalLevel | M.Sc. | en_US |
dcterms.LCSH | Hong Kong Polytechnic University -- Dissertations | en_US |
dcterms.LCSH | Web search engines. | en_US |
dcterms.LCSH | Data mining -- Quality control. | en_US |
dcterms.LCSH | Web sites -- Design | en_US |
dcterms.accessRights | restricted access | en_US |
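
The abstract above refers to mathematical functions for the quality metrics "freshness" and "age" of pages. As a point of reference only, the following is a minimal sketch of the definitions commonly used in the web-crawling literature; it is an assumption about the intended metrics, not necessarily the exact formulation adopted in this dissertation.

```latex
% Commonly used freshness and age metrics for a locally stored page p at time t
% (assumed standard definitions; the dissertation's own formulation may differ).
F(p; t) =
\begin{cases}
  1 & \text{if the local copy of } p \text{ matches the live page at time } t,\\
  0 & \text{otherwise,}
\end{cases}
\qquad
A(p; t) =
\begin{cases}
  0 & \text{if the local copy of } p \text{ is up-to-date at time } t,\\
  t - t_{\mathrm{mod}}(p) & \text{otherwise,}
\end{cases}
```

where \(t_{\mathrm{mod}}(p)\) denotes the time of the earliest modification of \(p\) that the crawler has not yet synchronized; collection-level freshness and age are then obtained by averaging \(F\) and \(A\) over all pages and over time.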
Files in This Item:
File | Description | Size | Format
---|---|---|---
b23526506.pdf | For All Users (off-campus access for PolyU Staff & Students only) | 3.75 MB | Adobe PDF