Author: | Xu, Jian |
Title: | Named entity disambiguation from web text |
Degree: | Ph.D. |
Year: | 2014 |
Subject: | Text processing (Computer science); Natural language processing (Computer science); Names; Hong Kong Polytechnic University -- Dissertations |
Department: | Department of Computing |
Pages: | xii, 148 pages : color illustrations ; 30 cm |
Language: | English |
Abstract: | Named entity disambiguation is the problem of grouping name mentions into clusters, with each cluster referring to the same underlying entity. In this thesis, we focus on named entity disambiguation from web text, because finding information about a person on the Internet is one of the most common activities of online users. Person names, however, are highly ambiguous, with a large number of people sharing the same name. Named entity disambiguation therefore becomes increasingly important for many applications, such as information retrieval, question answering, cross-document co-reference, and relation discovery. This leads to our study of named entity disambiguation over the Internet. In general, named entity disambiguation for web text includes two tasks: (1) Web Person Disambiguation (WPD), which groups search results into different clusters, with each cluster referring to the same person; and (2) personal profile extraction (PPE), which helps build the relational information of each person in a cluster. The main challenges in named entity disambiguation include (1) how to select meaningful features that uniquely identify named entities; (2) how to guarantee high performance in WPD even when there is no prior knowledge of the number of persons sharing the same name; and (3) how to obtain and select quality training data from an external knowledge base for PPE, since manually annotated data is costly to produce and limited in scale. In this thesis, we explore the use of more semantically relevant information for named entity disambiguation on web text. For WPD, our supervised approach makes good use of naturally annotated resources, Wikipedia in particular, to alleviate manual annotation effort and domain dependence problems. We also investigate the use of keywords as semantically more meaningful information units for WPD. 
Based on meaningful keyword features, we investigate a hierarchical co-reference resolution technique to place ambiguous person names into different clusters. Our disambiguation method does not require a predefined number of persons and can produce good quality clusters for each person. For PPE, we build a personalized profile by identifying relational facts. Our approach incorporates two semantic constraints, trigger words and entity types, which help reduce noisy data in profile extraction. Both WPD and PPE are built within the framework of graphical models, which provide a sequential structure for semantic feature extraction and a tree structure for both name disambiguation and profile extraction. The methods in this thesis are evaluated on publicly available datasets so that their performance can be compared with state-of-the-art work, and the results show that our approach is effective for named entity disambiguation. |
Rights: | All rights reserved |
Access: | open access |
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
b27630018.pdf | For All Users | 3.5 MB | Adobe PDF |
Copyright Undertaking
As a bona fide Library user, I declare that:
- I will abide by the rules and legal ordinances governing copyright regarding the use of the Database.
- I will use the Database for the purpose of my research or private study only and not for circulation or further reproduction or any other purpose.
- I agree to indemnify and hold the University harmless from and against any loss, damage, cost, liability or expenses arising from copyright infringement or unauthorized usage.
By downloading any item(s) listed above, you acknowledge that you have read and understood the copyright undertaking as stated above, and agree to be bound by all of its terms.
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/7776