Disambiguating the ambiguities in natural language processing by using UML models

Wang, Lu

Full metadata record

DC Field	Value	Language
dc.contributor	Department of English	en_US
dc.creator	Wang, Lu	-
dc.identifier.uri	https://theses.lib.polyu.edu.hk/handle/200/7668	-
dc.language	English	en_US
dc.publisher	Hong Kong Polytechnic University	-
dc.rights	All rights reserved	en_US
dc.title	Disambiguating the ambiguities in natural language processing by using UML models	en_US
dcterms.abstract	This research analyzes wrong part-of-speech (POS) tags in Natural Language Processing (NLP) and proposes resolutions for some of them. It has three objectives: to identify the wrong tags in the parser output; to analyze the causes underlying these wrong tags and how they are distributed across genres and words; and to attempt to resolve some of them by using UML models. The study is valuable in that it might help to improve the POS tagging accuracy of NLP tools. A Corpus of Travel and Tourism Texts (TnT) is used to obtain data from authentic English texts. The TnT corpus contains eight hundred thousand words in four genres: academic papers, promotional literature, travelogue and online discussion. Each genre contains about two hundred thousand words. After all the texts are tagged with POS tags, a selection of key words in the texts of each genre is examined closely to determine the causes of their incorrect tags, and a connection is then established between each genre’s linguistic features and those causes. The results show three major causes of wrong tagging: nominal group, preposition ‘to’ and omission. Analysis of the relationship between genre and these causes reveals that the nominal group is the major contributor in academic papers, while omission is the central cause in online discussion. These characteristics are evidently related to the linguistic features of each genre. Further study of the relationship between the individual key words and the causes of wrong tagging shows that the words with larger numbers of wrong tags are lexically ambiguous. Words that are not prone to lexical ambiguity, on the other hand, are incorrectly tagged mainly because of the statistical approach of the NLP tool.	en_US
dcterms.extent	vii, 68 pages	en_US
dcterms.isPartOf	PolyU Electronic Theses	en_US
dcterms.issued	2013	en_US
dcterms.educationalLevel	All Master	en_US
dcterms.educationalLevel	M.A.	en_US
dcterms.LCSH	English language -- Semantics.	en_US
dcterms.LCSH	Natural language processing (Computer science)	en_US
dcterms.LCSH	Hong Kong Polytechnic University -- Dissertations	en_US
dcterms.accessRights	restricted access	en_US

Files in This Item:

File	Description	Size	Format
b26876188.pdf	For All Users (off-campus access for PolyU Staff & Students only)	441.92 kB	Adobe PDF

Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/7668