An evolutionary approach to discover composite features for effective text classification of small classes

Pao Yue-kong Library Electronic Theses Database


Author: Wong, Ka-shing Alex
Title: An evolutionary approach to discover composite features for effective text classification of small classes
Degree: M.Phil.
Year: 2008
Subject: Hong Kong Polytechnic University -- Dissertations.
Text processing (Computer science)
Semantics -- Data processing.
Department: Dept. of Computing
Pages: vii, 133 leaves : ill. ; 30 cm.
Language: English
InnoPac Record: http://library.polyu.edu.hk/record=b2239230
URI: http://theses.lib.polyu.edu.hk/handle/200/3096
Abstract: In real-world environments, text classification through machine learning often faces special problems caused by a small number of positive training samples and significantly skewed distributions. The overwhelming number of negative samples and their features may significantly bias the classifier learning process. In addition, features that appear only in negative samples may be irrelevant to determining the target class. In text classification, the difficulties caused by imbalanced data are aggravated by the large number of features available. Hence, finding a small number of good features is essential to improving the classification of small classes. Apart from basic word tokens, composite features such as n-gram phrases and sparse phrases are a possible source of good features. They can be generated by combining word tokens to represent the co-occurrence of multiple words, and they can provide more precise information for distinguishing a class. However, a major problem with this approach is the enormous number of possible combinations. This thesis studies the efficient generation of effective composite features for text classification when the target class is small. We show that this can be done by focusing on features in positive samples and by a heuristic-based exploration of the composite feature space. Experimental results in our study showed that the features in positive samples could offer performance comparable to that of the features in all samples. At the same time, by focusing on positive samples, the number of features used could be greatly reduced. This simple application of the sampling concept to feature selection offers a key to speeding up feature exploration. Furthermore, by applying several proposed techniques together with the concept of an evolutionary approach, a heuristics-based method was developed to efficiently explore the space of composite features.
The flexibility of this approach made it feasible to search, with limited resources, for an optimal set of features in the very large space of composite features. The effectiveness of our approach for classification, particularly small-class classification, was evaluated and compared using different classifiers and a commonly used data set. In general, our experiments showed that our approach was able to produce high-quality composite features by generating and examining a much smaller pool of features than would otherwise be required.
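
The idea of restricting feature candidates to positive samples can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy documents, the labels, and the helper names `tokenize`, `word_and_bigram_features`, and `candidate_pool` are all assumptions made for illustration, and only word tokens and bigrams are generated (sparse phrases and the evolutionary search are not covered).

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def word_and_bigram_features(text):
    """Return the set of word tokens and adjacent-word bigrams in one document."""
    tokens = tokenize(text)
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return set(tokens) | set(bigrams)


def candidate_pool(docs, labels, positives_only=True):
    """Collect candidate features, optionally from positive documents only.

    labels uses 1 for the (small) target class and 0 for everything else.
    """
    pool = set()
    for text, label in zip(docs, labels):
        if positives_only and label != 1:
            continue  # skip negative samples when building the pool
        pool |= word_and_bigram_features(text)
    return pool


# Hypothetical toy corpus: two positive documents, four negative ones.
docs = [
    "wheat export prices rise",
    "wheat harvest export falls",
    "bank raises interest rates",
    "oil prices fall sharply",
    "central bank cuts rates",
    "stock market closes higher",
]
labels = [1, 1, 0, 0, 0, 0]

positive_pool = candidate_pool(docs, labels, positives_only=True)
full_pool = candidate_pool(docs, labels, positives_only=False)

# The positive-only pool is a subset of the full pool and is smaller here.
print(len(positive_pool), len(full_pool))
```

In this toy setting the positive-only pool is a strict subset of the full pool, which mirrors the abstract's point that the candidate space shrinks when negative-only features are never generated. Whether classification quality is preserved is an empirical question that the sketch does not address.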

Files in this item

Files           Size       Format
b22392300.pdf   4.538 MB   PDF

     
