Author: | Sun, Mengyu |
Title: | A comparative study on term weighting schemes for text categorization using support vector machine |
Advisors: | Chan, Keith C. C. (COMP) |
Degree: | M.Sc. |
Year: | 2015 |
Subject: | Text processing (Computer science); Artificial intelligence; Hong Kong Polytechnic University -- Dissertations |
Department: | Department of Computing |
Pages: | vii, 66 leaves : illustrations ; 30 cm |
Language: | English |
Abstract: | Text categorization is the process of automatic classification, using a computer (or other entities or objects) to assign a set of texts to categories according to a given classification system or standard. The most common approach in text categorization is to convert textual documents into vectors so that they can be identified and processed by computers. Term weighting, which calculates the value of each term in a document, is therefore a significant step in text categorization, and selecting an appropriate term weighting scheme contributes greatly to the accuracy of automatic text categorization. In this work, we compare three widely used term weighting schemes in combination with the SVM algorithm, using texts about hot events for binary classification of document sentiment into a positive or a negative category. The results of the three term weighting methods are evaluated by three indicators: F-Value, Recall and Precision. A term weighting scheme is considered more effective, and its result better, the closer the value of an indicator is to 1. In the experiments, all three term weighting schemes performed well: most of the time they reached an accuracy of more than seventy percent. The controlled experiment showed that the TF-IDF scheme consistently outperformed the other two term weighting methods. Comparing the three schemes, the IDF factor is the more effective one; that is, the IDF factor improves a term's discriminating power for text categorization. The TF-CHI scheme, on the other hand, performed worst of the three. |
Rights: | All rights reserved |
Access: | restricted access |
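
The abstract refers to TF-IDF weighting and to Precision, Recall and F-Value without giving formulas. The following is a minimal illustrative sketch, not the thesis's implementation. It assumes one common TF-IDF variant (relative term frequency multiplied by the natural log of N / document frequency) and assumes that F-Value means the balanced F1 score. The function names `tf_idf` and `precision_recall_f` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length,
    idf = ln(number of documents / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def precision_recall_f(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value
```

Under this variant, a term that occurs in every document receives a weight of zero, which is one way to see how the IDF factor favours terms that discriminate between documents.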
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
b27800192.pdf | For All Users (off-campus access for PolyU Staff & Students only) | 2.61 MB | Adobe PDF | View/Open |
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/7841