A comparative study on term weighting schemes for text categorization using support vector machine

Pao Yue-kong Library Electronic Theses Database


Author: Sun, Mengyu
Title: A comparative study on term weighting schemes for text categorization using support vector machine
Degree: M.Sc.
Year: 2015
Subject: Text processing (Computer science); Artificial intelligence; Hong Kong Polytechnic University -- Dissertations
Department: Dept. of Computing
Pages: vii, 66 leaves : illustrations ; 30 cm
Language: English
OneSearch: https://www.lib.polyu.edu.hk/bib/b2780019
URI: http://theses.lib.polyu.edu.hk/handle/200/7841
Abstract: Text categorization is the process of automatically classifying a set of texts, using a computer (or other entities or objects), according to a given classification system or standard. The most commonly used method in text categorization is to convert the textual documents into vectors so that they can be identified and processed by computers. Term weighting is therefore a significant step in text categorization. Term weighting methods calculate the values of terms in documents. Selecting an appropriate term weighting scheme contributes greatly to the accuracy of automatic text categorization. In this article, we compare three widely used term weighting schemes in combination with the SVM algorithm, using texts about hot events for binary classification of document sentiment into a positive or a negative category. The results of the three term weighting methods are evaluated by three indicators: F-value, recall and precision. The closer an indicator's value is to 1, the more effective the term weighting scheme and the better the result. In the experiments, all three term weighting schemes performed well: most of the time they reached an accuracy of more than seventy percent. The controlled experiment showed that the TF-IDF scheme performed consistently better than the other two term weighting methods. Comparing the three schemes, the IDF factor is the more effective one; that is, the IDF factor improved the terms' discriminating power for text categorization. The TF-CHI scheme, on the other hand, performed worst of the three.
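
As a rough illustration of the TF-IDF weighting and of the three evaluation indicators named in the abstract, the following Python sketch may help. It is not taken from the thesis: the function names `tf_idf` and `precision_recall_f` are hypothetical, the TF-IDF variant (raw term count times log(N / df)) is only one common formulation and is assumed here, and the toy documents are invented for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumption: TF is the raw term count and IDF is log(N / df).
    The thesis may use a different TF-IDF variant.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted


def precision_recall_f(y_true, y_pred, positive=1):
    """Compute precision, recall and F-value for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-value as the harmonic mean of precision and recall (F1).
    if precision + recall:
        f_value = 2 * precision * recall / (precision + recall)
    else:
        f_value = 0.0
    return precision, recall, f_value


if __name__ == "__main__":
    # Invented toy corpus: two tokenized documents.
    toy_docs = [["good", "movie"], ["bad", "movie"]]
    for weights in tf_idf(toy_docs):
        print(weights)
    print(precision_recall_f([1, 1, 0, 0], [1, 0, 1, 0]))
```

In this sketch a term that occurs in every document (here "movie") receives weight 0, which reflects the abstract's point that the IDF factor raises the discriminating power of terms. The resulting weight vectors would then be passed to an SVM classifier, a step omitted here.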

Files in this item

Files            Size       Format
b27800192.pdf    2.672Mb    PDF

