Advanced techniques for Chinese chunk segmentation and the similarity measure of Chinese sentences

Pao Yue-kong Library Electronic Theses Database

Author: Wang, Rongbo
Title: Advanced techniques for Chinese chunk segmentation and the similarity measure of Chinese sentences
Degree: Ph.D.
Year: 2006
Subject: Hong Kong Polytechnic University -- Dissertations; Chinese language -- Data processing; Chinese language -- Sentences; Chinese language -- Word formation; Chinese language -- Machine translating
Department: Dept. of Electronic and Information Engineering
Pages: xviii, 156 leaves : ill. ; 30 cm
Language: English
InnoPac Record: http://library.polyu.edu.hk/record=b2069703
URI: http://theses.lib.polyu.edu.hk/handle/200/2660
Abstract: This thesis addresses two important problems in Chinese information processing: Chinese chunk segmentation and the similarity measure of Chinese sentences. The three main contributions reported in this thesis are: (1) a novel Chinese chunk segmentation technique that combines a statistical model with correction rules generated by an error-correction mechanism; (2) a novel similarity measure of Chinese sentences that uses both the word/chunk sequences and the POS (part-of-speech) tag sequences of the sentences; and (3) the optimization of the parameters used in the combined similarity measure by applying a relevance feedback technique and a neural network model.

In the first investigation, a statistical model combined with correction rules generated by an error-correction mechanism is proposed for Chinese chunk segmentation. The sentences in the training corpus were segmented into chunks manually to provide a ground truth for training the statistical model, from which preliminary chunk segmentation results are obtained. The preliminary result (both correctly and incorrectly segmented chunks) is then used to generate a set of correction rules for refining the segmentation. These rules are generated by an error-correction mechanism that compares the preliminary segmentation result with the manual segmentation. The statistical model and the learned correction rules can then be used to perform Chinese chunk segmentation of unseen sentences.

In the second investigation, novel similarity measures of Chinese sentences are proposed that use the word/chunk sequences and POS tag sequences of the sentences. The sentence similarity measure is a very important component of example-based machine translation (EBMT). Unlike English sentences, Chinese sentences have no delimiter between words. Hence, Chinese word/chunk delimitation must be performed before a sentence similarity measure can be computed. Both the word/chunk sequence feature and the POS tag sequence feature used in our proposed similarity measures are based on word/chunk segmentation. Sentence structure information is partially reflected in the POS tag sequence. For the proposed word-sequence-matching-based (WSMB) method, we take into consideration three factors between two sentences: the number of identical word sequences, the length of each identical word sequence, and the average weighting (AW) of each identical word sequence. In computing AW, we weight every POS tag according to its importance. The POS-tag-sequence-matching-based (PTSMB) method measures the similarity of Chinese sentences in terms of their structures. If the constituents of two Chinese sentences are similar, then we can judge that the two sentences are similar in structure. The main idea of this similarity measure is to match the POS tag sequences of two Chinese sentences using directed graphs. The POS weighting is also utilized in the process.

In the third investigation, we propose a human-computer interaction approach to optimize the parameters used in the combined similarity measure of Chinese sentences, based on a relevance feedback scheme and a neural network model. In the relevance feedback process, users' intentions and preferences in ranking the candidate sentences are captured and used to modify the parameters of the similarity measure. For the parameter optimization research, a web-based questionnaire was designed to collect users' feedback data. In this pioneering study, we constructed 50 groups of sentences. Each group contains one source sentence and ten test sentences to be retrieved. The ten test sentences are shown in descending order of their computed similarity to the source sentence. The user is asked to provide a new ranking according to his or her own judgment if he or she does not agree with the ranking produced by the computer. The new ranking is converted to a set of numerals and stored in a database for parameter optimization using a neural network model. One clear advantage of this approach is its ability to fine-tune the measure to reflect the preferences of one or more users in matching Chinese sentences. Experimental results show a visible improvement in the performance of the similarity measure.

In addition to the theoretical and experimental studies of Chinese chunk segmentation and the similarity measure of Chinese sentences, we implemented both in an EBMT prototype, in which we also addressed other issues such as data structures, sentence indexing, and user-friendly interface design.
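
Illustrative note (not part of the thesis record): the abstract names the three factors used by the WSMB method but does not give its formula. The Python sketch below shows one way those factors could be combined, purely as an assumption. The function names (common_runs, average_weight, wsmb_similarity), the POS weight table, the greedy longest-first matching, and the scoring formula are all hypothetical and are not taken from the thesis. Input sentences are assumed to be already segmented and POS-tagged.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, POS tag); segmentation and tagging assumed done

# Hypothetical POS weights. The abstract says tags are weighted by importance
# but does not list the values, so these numbers are placeholders.
POS_WEIGHTS: Dict[str, float] = {"n": 1.0, "v": 1.0, "a": 0.8, "d": 0.5, "u": 0.2}
DEFAULT_WEIGHT = 0.5


def common_runs(a: List[Token], b: List[Token]) -> List[List[Token]]:
    """Greedily extract non-overlapping identical word sequences, longest first."""
    used_a = [False] * len(a)
    used_b = [False] * len(b)
    runs: List[List[Token]] = []
    while True:
        best = (0, 0, 0)  # (length, start in a, start in b)
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and not used_a[i + k] and not used_b[j + k]
                       and a[i + k][0] == b[j + k][0]):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        length, i, j = best
        if length == 0:
            return runs
        for k in range(length):
            used_a[i + k] = used_b[j + k] = True
        runs.append(a[i:i + length])


def average_weight(run: List[Token]) -> float:
    """Average POS weight (AW) of one identical word sequence."""
    return sum(POS_WEIGHTS.get(pos, DEFAULT_WEIGHT) for _, pos in run) / len(run)


def wsmb_similarity(a: List[Token], b: List[Token]) -> float:
    """Assumed scoring: weighted matched length, penalized by fragmentation."""
    if not a or not b:
        return 0.0
    runs = common_runs(a, b)
    matched = sum(len(run) * average_weight(run) for run in runs)
    fragmentation = max(len(runs) - 1, 0)  # more separate runs lowers the score
    return matched / (max(len(a), len(b)) + fragmentation)


if __name__ == "__main__":
    s1 = [("学生", "n"), ("学习", "v"), ("中文", "n")]
    s2 = [("老师", "n"), ("学习", "v"), ("中文", "n")]
    print(wsmb_similarity(s1, s2))
```

In this sketch, longer identical sequences and sequences made of heavily weighted tags raise the score, while splitting the same matched words across more separate sequences lowers it. Any real use would need the weights and the combination rule from the thesis itself.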

Files in this item

Files           Size       Format
b2069703x.pdf   2.061 MB   PDF
Copyright Undertaking
As a bona fide Library user, I declare that:
  1. I will abide by the rules and legal ordinances governing copyright regarding the use of the Database.
  2. I will use the Database for the purpose of my research or private study only and not for circulation or further reproduction or any other purpose.
  3. I agree to indemnify and hold the University harmless from and against any loss, damage, cost, liability or expenses arising from copyright infringement or unauthorized usage.
By downloading any item(s) listed above, you acknowledge that you have read and understood the copyright undertaking as stated above, and agree to be bound by all of its terms.

     
