New document-context term weights and clustering for information retrieval

Pao Yue-kong Library Electronic Theses Database

Author: Dang, Kai-fung Edward
Title: New document-context term weights and clustering for information retrieval
Degree: Ph.D.
Year: 2010
Subject: Hong Kong Polytechnic University -- Dissertations
Information retrieval
Department: Dept. of Computing
Pages: xiii, 160 p. : ill. ; 30 cm.
InnoPac Record: http://library.polyu.edu.hk/record=b2393034
URI: http://theses.lib.polyu.edu.hk/handle/200/5901
Abstract: In this thesis we investigate new methods to deal with the polysemy and word mismatch problems in information retrieval (IR). We tackle polysemy by using 'document-contexts', which are text windows centred on query terms in a document. Analysis of the words in the vicinity of a query term can identify its specific meaning in that context. In IR, many of the commonly used term weights are variants of the TF-IDF form. The traditional TF-IDF weight of a term depends only on the occurrence statistics of the term itself. We have studied a novel 'context-dependent' term weight, which incorporates information based on the words found in the document-contexts of a term. These term weights are generated by a Boost and Discount (B&D) procedure, which utilizes any available relevance information to estimate the probability of relevance of a context. Such relevance information may come from actual relevance judgments that a user makes on a (small) number of documents, as in 'relevance feedback' (RF). The theoretical justification of our scheme to calculate the new term weights is provided by a probabilistic non-relevance decision model of IR. We present experiments in the RF setting to test the context-dependent term weights. We demonstrate that using the new term weights can yield statistically significant improvement in retrieval compared with the traditional weights.

Regarding the word mismatch problem, one plausible solution is to use clustering techniques. A traditional clustering evaluation measure used in IR is MK1, which is a score calculated for the single 'optimal cluster' that can be extracted from the clustering result. MK1 is appropriate if a single retrieved cluster is desired. However, in some applications it may be desirable for the retrieval results to be presented in multiple clusters according to sub-topics. For this case, we introduce a new evaluation measure, called CS, which corresponds to finding an optimal combination of clusters. We define a sub-class of CS, called CS1, applicable when the clusters are disjoint. By reformulating the optimization as a 0-1 linear fractional programming problem, we demonstrate that an exact solution of CS1 can be obtained by a linear time algorithm. We discuss how our approach can be generalized to overlapping clusters, and present greedy algorithms that estimate the optimum. We claim that one particular 'cost effectiveness' algorithm yields the globally optimal solution for clusters that overlap only by nesting. A mathematical proof of this claim by induction is presented.

We have also investigated whether clustering techniques can further improve the retrieval effectiveness of relevance feedback using context-dependent term weights. B&D utilizes information extracted from the judged documents to provide evidence of relevance or non-relevance in the unseen documents. We use clustering to seek contexts from unseen documents that are similar to those in the judged documents. In this way, additional relevance information can be obtained for B&D. Experiments on the TREC-2005 collection show that a 'clustered SVM' scheme further improves relevance feedback effectiveness compared with standard B&D, yielding small but statistically significant improvements in MAP. Thus, this is a promising direction for further research.
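
The abstract describes 'document-contexts' as text windows centred on query terms in a document. The following is a minimal sketch of that idea only, not the thesis's implementation: the function name `document_contexts`, the `half_width` parameter, the simple lowercase alphanumeric tokenization, and the example sentence are all assumptions made for illustration.

```python
import re
from typing import List


def document_contexts(text: str, query_term: str, half_width: int = 5) -> List[List[str]]:
    """Return one token window per occurrence of query_term, centred on it.

    Assumption: tokens are lowercase alphanumeric runs; the thesis may
    tokenize and size its windows differently.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    term = query_term.lower()
    contexts = []
    for i, tok in enumerate(tokens):
        if tok == term:
            # Clip the window at the start of the document; slicing
            # already clips it at the end.
            start = max(0, i - half_width)
            contexts.append(tokens[start:i + half_width + 1])
    return contexts


# Hypothetical example: the polysemous term "bank" appears in two senses.
doc = "The bank raised interest rates. We walked along the river bank at dusk."
contexts = document_contexts(doc, "bank", half_width=2)
for window in contexts:
    print(" ".join(window))
```

In this toy example the neighbouring words ("interest" in one window, "river" in the other) are what distinguish the two senses of the query term, which is the kind of evidence a context-dependent term weight could draw on.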

Files in this item

Files Size Format
b23930342.pdf 1.438Mb PDF
