Full metadata record
DC Field | Value | Language
dc.contributor | Department of Computing | en_US
dc.creator | Sze, Chun-ming Donahue | -
dc.identifier.uri | https://theses.lib.polyu.edu.hk/handle/200/3756 | -
dc.language | English | en_US
dc.publisher | Hong Kong Polytechnic University | -
dc.rights | All rights reserved | en_US
dc.title | Preprocessing frameworks for threaded discussion analysis by graphical probabilistic modeling | en_US
dcterms.abstract | User generated content (UGC) has become the fastest growing sector of the World Wide Web. Today, one major type of massive UGC data is generated from web forums. The web forum, similar to USENET, is a bulletin board commonly used by users to exchange ideas, publish topics, or simply send replies via an HTML-based browser. Since almost all computers are equipped with a pre-installed browser through which forums can be easily accessed, the web forum has become more popular and is considered a significant contributor of UGC data. With the growing importance of such web forum data, there is an increasing and compelling need to develop techniques to help analyze such large volumes of data, for example by grouping them in a meaningful and user-friendly manner. Recently, Bayesian methods have grown from a specialist niche to the mainstream in the field of pattern recognition and machine learning. The graphical probabilistic model (GPM), induced by probability and graph theories, offers numerous useful properties for analyzing data by using diagrammatic representations of probability distributions under the Bayesian perspective. By using effective algorithms such as Gibbs sampling, one may formulate topical problems (e.g. hot topics in a forum) in a latent variable model and obtain quality results in a tractable manner. In addition, we may also infer the relationships between different textual type variables (e.g. author, entity, word, and sentiment) in Markov random fields. To analyze a web forum, one of the easiest ways is to directly convert a post or a thread into a bag of words (BOW) vector space representation and apply a graphical probabilistic model, for instance latent variable modeling (for topical modeling) or Markov random fields (for non-topical modeling). However, transforming threaded text into a bag of words may lead to a serious loss of important information, making the analysis or mining process ineffective.
By using different graph models and inference techniques, we can develop a set of preprocessing frameworks to facilitate the analysis of web forum data. For topical modeling, we propose a framework for word-thread matrix formation. To provide a more representative bag of words for latent variable modeling, our framework is designed to measure both implicit and explicit relationships between posts and replies. It consists of two parts. In the first part, a threaded text is transformed into a directed acyclic graph (DAG) by a set of feature link generation functions. In the second part, different graph-based ranking algorithms can be applied. Our framework then extracts a list of words by weighting the importance ranking value with a traditional feature selection method. For non-topical modeling, on the other hand, we propose a distributional similarity model (DSM) to analyze the relationships between different textual type variables of a thread in Markov random fields. This model measures not only co-occurrence but also distributional similarity at the different types of distance levels commonly found in threaded text. Empirical results obtained on popular Hong Kong web forums show that the proposed methods are effective. | en_US
dcterms.extent | 103 leaves : ill. (some col.) ; 30 cm. | en_US
dcterms.isPartOf | PolyU Electronic Theses | en_US
dcterms.issued | 2009 | en_US
dcterms.educationalLevel | All Master | en_US
dcterms.educationalLevel | M.Phil. | en_US
dcterms.LCSH | Hong Kong Polytechnic University -- Dissertations. | en_US
dcterms.LCSH | Graphical modeling (Statistics) | en_US
dcterms.LCSH | User-generated content. | en_US
dcterms.LCSH | World Wide Web. | en_US
dcterms.accessRights | open access | en_US

Files in This Item:
File | Description | Size | Format
b23214284.pdf | For All Users | 1.99 MB | Adobe PDF



Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/3756