Author: Xu, Ruifeng
Title: Post-processing for handwritten Chinese character recognition
Degree: M.Phil.
Year: 2001
Subject: Optical character recognition devices
Chinese character sets (Data processing)
Chinese characters -- Data processing
Pattern recognition systems
Optical data processing
Hong Kong Polytechnic University -- Dissertations
Department: Department of Computing
Pages: xiii, 162 leaves : col. ill. ; 30 cm
Language: English
Abstract: Some post-processing techniques for improving the performance of Handwritten Chinese Character Recognition (HCCR) system by selecting the most promising candidate characters are presented here. Aiming to remove mis-recognized and unrecognized characters in the recognition result, three post-processing approaches, namely the one based on contextual linguistics information, the one based on confusing character characteristics produced by a recognizer, and the one based on a hybrid approach, are studied in this thesis and their performance are evaluated and compared. In the study of the post-processing approach based on contextual linguistics information, the dictionary-based post-processing method is presented. The dictionary-based techniques, including sentence fragments detection and contextual approximate word matching for removing erroneous characters, are studied and its performance is evaluated. Post-processing Techniques based on statistical language models are then proposed. A Chinese word BI-gram model is established and employed in HCCR post-processing to identify a most linguistic-promising sentence with the maximum word co-occurrence production by selecting plausible candidate characters. To obtain the description capacity of long-distance restrictions among Chinese sentences, the word BI-Gram model is extended to a distant word BI-Gram model with a maximum distance 3 and prior to post-processing. Their upgrading performances are evaluated and compared. To recover the unrecognized characters and enhance the theoretical upper improvement limit for the post-processing approach based on contextual linguistics information, the post-processing techniques based on the characteristics of confusing characters produced by recognizer are studied. Analyzing the recognition results for the training samples, the confusing characters for each character category are collected and constructed into a confusing character set. Based on this set, a statistical Noisy-Channel model is used to identify the most promising input character when a candidate sequence is given. This method proves to be effective in removing unrecognized characters. Considering the confusing characters as observed features of character categories, the classification algorithm based on neural networks can be employed to identify the most plausible input as the production of the candidate sequence. All together 3755 character categories in GB2312-80 character-set are clustered into several hundred groups after searching through the transitive closure of the similarity matrix associated with the confusing character set. A group of neural networks for these category groups are established and trained to produce a candidate to match the input character and to adjust the confidence parameter of candidates for a given candidate sequence. A better performance in comparing with the one based on Noisy-Channel model is achieved. A three-stage hybrid post-processing system is then built. The post-processing technique based on confusing character characteristics of a recognizer is firstly conducted to append similar-shaped characters into the candidate set. Then the dictionary-based method is employed to append linguistic-prone characters and bind the candidate characters into a word-lattice. Finally the statistical language model is applied to identify a most promising sentence by selecting plausible words from the word-lattice. On the average, this hybrid post-processing system achieves 6.2% recognition rate improvement for the first candidate when the character recognition rate is 90% for the first candidate and 95% for the top-10 candidates by online HCCR engine. For the offline HCCR engine with the original recognition rate of 81% and 92% for the first and the top-l0 candidates, 12% recognition rate improvement for the first candidate is achieved.
Rights: All rights reserved
Access: open access

Files in This Item:
File Description SizeFormat 
b15731807.pdfFor All Users8.4 MBAdobe PDFView/Open


Copyright Undertaking

As a bona fide Library user, I declare that:

  1. I will abide by the rules and legal ordinances governing copyright regarding the use of the Database.
  2. I will use the Database for the purpose of my research or private study only and not for circulation or further reproduction or any other purpose.
  3. I agree to indemnify and hold the University harmless from and against any loss, damage, cost, liability or expenses arising from copyright infringement or unauthorized usage.

By downloading any item(s) listed above, you acknowledge that you have read and understood the copyright undertaking as stated above, and agree to be bound by all of its terms.

Show full item record

Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/3750