Preprocessing frameworks for threaded discussion analysis by graphical probabilistic modeling

Sze, Chun-ming Donahue

Author:	Sze, Chun-ming Donahue
Title:	Preprocessing frameworks for threaded discussion analysis by graphical probabilistic modeling
Degree:	M.Phil.
Year:	2009
Subject:	Hong Kong Polytechnic University -- Dissertations. Graphical modeling (Statistics). User-generated content. World Wide Web.
Department:	Department of Computing
Pages:	103 leaves : ill. (some col.) ; 30 cm.
Language:	English
Abstract:	User-generated content (UGC) has become the fastest-growing sector of the World Wide Web. Today, one major source of massive UGC data is the web forum. The web forum, similar to USENET, is a bulletin board on which users exchange ideas, publish topics, or simply post replies through an HTML-based browser. Since almost all computers come with a pre-installed browser, web forums are easy to access, have become increasingly popular, and are considered a significant contributor of UGC data. With the growing importance of web forum data, there is an increasing and compelling need for techniques that help analyze such large volumes of data, for example by grouping them in a meaningful and user-friendly manner. Recently, Bayesian methods have grown from a specialist niche to the mainstream in the field of pattern recognition and machine learning. The graphical probabilistic model (GPM), derived from probability and graph theory, offers numerous useful properties for analyzing data through diagrammatic representations of probability distributions under the Bayesian perspective. By using effective algorithms such as Gibbs sampling, one may formulate topical problems (e.g. hot topics in a forum) in a latent variable model and obtain quality results in a tractable manner. In addition, one may infer the relationships between different textual type variables (e.g. author, entity, word, and sentiment) in Markov random fields. One of the easiest ways to analyze a web forum is to convert a post or a thread directly into a bag-of-words (BOW) vector space representation and then apply a graphical probabilistic model, for instance latent variable modeling (for topical modeling) or Markov random fields (for non-topical modeling). However, transforming threaded text into a bag of words may lead to a serious loss of important information, making the analysis or mining process ineffective.
By using different graph models and inference techniques, we develop a set of preprocessing frameworks to facilitate the analysis of web forum data. For topical modeling, we propose a framework for word-thread matrix formation. To provide a more representative bag of words for latent variable modeling, the framework is designed to measure both implicit and explicit relationships between posts and replies. It consists of two parts. In the first part, a threaded text is transformed into a directed acyclic graph (DAG) by a set of feature link generation functions. In the second part, different graph-based ranking algorithms can be applied. The framework then extracts a list of words by weighting the importance ranking value with a traditional feature selection method. For non-topical modeling, on the other hand, we propose a distributional similarity model (DSM) to analyze the relationships between different textual type variables of a thread in Markov random fields. This model measures not only co-occurrence but also distributional similarity at the different types of distance level commonly found in threaded text. Empirical results obtained on popular Hong Kong web forums show that the proposed methods are effective.
Rights:	All rights reserved
Access:	open access
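
Illustrative note: the two-part topical framework described in the abstract (thread to DAG, then graph-based ranking feeding word selection) can be pictured with the minimal Python sketch below. It is not the thesis's implementation. The abstract does not specify the feature link generation functions, the ranking algorithms, or the feature selection method, so everything here is an assumption made for illustration: only explicit reply links are used as edges, a plain PageRank power iteration stands in for the ranking step, raw term counts stand in for the feature selection weight, and the names `build_reply_graph`, `pagerank`, `weighted_word_scores`, and the sample `posts` are hypothetical.

```python
from collections import Counter, defaultdict


def build_reply_graph(posts):
    """Part 1 (assumed): one edge reply -> parent per explicit reply link.

    Reply links in a thread always point to an earlier post, so the
    resulting graph is a DAG.
    """
    return [(p["id"], p["parent"]) for p in posts if p["parent"] is not None]


def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Part 2 (assumed): PageRank by power iteration over the reply graph."""
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = out_links[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Posts with no outgoing link (thread starters) spread
                # their rank evenly so the total stays at 1.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


def weighted_word_scores(posts, rank):
    """Weight each word occurrence by the rank of the post it appears in."""
    scores = Counter()
    for p in posts:
        for word in p["text"].lower().split():
            scores[word] += rank[p["id"]]
    return scores


# Hypothetical four-post thread: post 1 starts it, 2 and 3 reply to 1,
# and 4 replies to 2.
posts = [
    {"id": 1, "parent": None, "text": "camera lens advice"},
    {"id": 2, "parent": 1, "text": "lens price"},
    {"id": 3, "parent": 1, "text": "weather today"},
    {"id": 4, "parent": 2, "text": "price drop"},
]
nodes = [p["id"] for p in posts]
rank = pagerank(nodes, build_reply_graph(posts))
scores = weighted_word_scores(posts, rank)
top_words = [word for word, _ in scores.most_common(3)]
```

In this toy thread, words from posts that attract replies score higher than words from an unanswered off-topic reply, which is the kind of thread structure a plain bag-of-words conversion discards.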

Files in This Item:

File	Description	Size	Format
b23214284.pdf	For All Users	1.99 MB	Adobe PDF


Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/3756