Author:  Dang, Kaifung Edward 
Title:  New document-context term weights and clustering for information retrieval 
Degree:  Ph.D. 
Year:  2010 
Subject:  Hong Kong Polytechnic University -- Dissertations; Information retrieval 
Department:  Dept. of Computing 
Pages:  xiii, 160 p. : ill. ; 30 cm. 
InnoPac Record:  http://library.polyu.edu.hk/record=b2393034 
URI:  http://theses.lib.polyu.edu.hk/handle/200/5901 
Abstract:  In this thesis we investigate new methods to deal with the polysemy and word mismatch problems in information retrieval (IR). We tackle polysemy by using 'document-contexts', which are text windows centred on query terms in a document. Analysis of the words in the vicinity of a query term can identify its specific meaning in that context. In IR, many of the commonly used term weights are variants of the TF-IDF form. The traditional TF-IDF weight of a term depends only on the occurrence statistics of the term itself. We have studied a novel 'context-dependent' term weight, which incorporates information based on the words found in the document-contexts of a term. These term weights are generated by a Boost and Discount (B&D) procedure, which uses any available relevance information to estimate the probability of relevance of a context. Such relevance information may come from actual relevance judgments that a user makes on a (small) number of documents, as in 'relevance feedback' (RF). The theoretical justification of our scheme for calculating the new term weights is provided by a probabilistic non-relevance decision model of IR. We present experiments in the RF setting to test the context-dependent term weights. We demonstrate that using the new term weights can yield a statistically significant improvement in retrieval compared with the traditional weights. Regarding the word mismatch problem, one plausible solution is to use clustering techniques. A traditional clustering evaluation measure used in IR is MK1, which is a score calculated for the single 'optimal cluster' that can be extracted from the clustering result. MK1 is appropriate if a single retrieved cluster is desired. However, in some applications it may be desirable for the retrieval results to be presented in multiple clusters according to subtopics. For this case, we introduce a new evaluation measure, called CS, which corresponds to finding an optimal combination of clusters. 
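As a rough illustration of the 'document-contexts' mentioned in the abstract, the following minimal Python sketch extracts fixed-width text windows centred on each occurrence of a query term. The function name `document_contexts`, the `half_width` parameter, and the whitespace tokenisation are assumptions made for this example; the abstract does not specify the window size or tokeniser used in the thesis.

```python
from typing import List


def document_contexts(tokens: List[str],
                      query_terms: List[str],
                      half_width: int = 2) -> List[List[str]]:
    """Return text windows centred on each occurrence of a query term.

    half_width is a hypothetical parameter: the number of tokens kept on
    each side of the query term (windows are clipped at document edges).
    """
    wanted = {t.lower() for t in query_terms}
    contexts = []
    for i, tok in enumerate(tokens):
        if tok.lower() in wanted:
            start = max(0, i - half_width)
            end = min(len(tokens), i + half_width + 1)
            contexts.append(tokens[start:end])
    return contexts


# A polysemous query term ("bank") appearing in two different senses.
doc = "the river bank flooded after the bank raised interest rates".split()
windows = document_contexts(doc, ["bank"], half_width=2)
for w in windows:
    print(" ".join(w))
```

The two windows surround the same query term with different neighbouring words, which is the kind of evidence a context-dependent term weight could draw on.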
We define a subclass of CS, called CS1, applicable when the clusters are disjoint. By reformulating the optimization as a 0-1 linear fractional programming problem, we demonstrate that an exact solution of CS1 can be obtained by a linear-time algorithm. We discuss how our approach can be generalized to overlapping clusters, and present greedy algorithms to obtain optimal estimates. We claim that one particular 'cost effectiveness' algorithm yields the globally optimal solution for clusters that overlap only by nesting. A mathematical proof of this claim by induction is presented. We have also investigated whether clustering techniques can further improve the retrieval effectiveness in relevance feedback using context-dependent term weights. B&D uses information extracted from the judged documents to provide evidence of relevance or non-relevance in the unseen documents. We use clustering to seek contexts from unseen documents that are similar to those in the judged documents. In this way, additional relevance information can be obtained for B&D. Experiments on the TREC-2005 collection show that a 'clustered SVM' scheme is effective in further improving relevance feedback effectiveness as compared to standard B&D, yielding small but statistically significant improvements in MAP. Thus, this is a promising direction for further research. 
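To make the idea of a 0-1 linear fractional program over disjoint clusters more concrete, here is a minimal Python sketch. It is not the linear-time algorithm of the thesis, and the abstract does not give the definition of CS1. The sketch assumes, purely for illustration, an F1-style objective 2 * (relevant documents selected) / (total relevant + documents selected), and solves it with Dinkelbach's parametric method, a standard technique for fractional programs. The function name and input format are hypothetical.

```python
from fractions import Fraction
from typing import List, Tuple


def best_cluster_combination(clusters: List[Tuple[int, int]],
                             total_relevant: int):
    """Pick the subset of disjoint clusters maximising an assumed F1 score.

    clusters: (relevant_in_cluster, cluster_size) pairs (hypothetical format).
    Objective (an assumption, not the thesis's CS1 definition):
        F(x) = 2 * sum(r_i * x_i) / (total_relevant + sum(n_i * x_i)),
    with x_i in {0, 1}.
    """
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    lam = Fraction(0)
    while True:
        # For a fixed lam the problem is separable: keep cluster i
        # exactly when its contribution 2*r_i - lam*n_i is positive.
        chosen = [i for i, (r, n) in enumerate(clusters) if 2 * r - lam * n > 0]
        num = 2 * sum(clusters[i][0] for i in chosen)
        den = total_relevant + sum(clusters[i][1] for i in chosen)
        new_lam = Fraction(num, den)
        if new_lam == lam:  # Dinkelbach fixed point: lam is the optimum
            return chosen, new_lam
        lam = new_lam


# Three disjoint clusters, 12 relevant documents in the collection.
chosen, score = best_cluster_combination([(8, 10), (3, 10), (1, 20)], 12)
print(chosen, float(score))
```

Exact rational arithmetic (`Fraction`) is used so that the convergence test is not affected by floating-point rounding.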
Files  Size  Format 

b23930342.pdf  1.438Mb 

