Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor | Department of Computing | en_US |
dc.creator | Xu, Ruifeng | - |
dc.identifier.uri | https://theses.lib.polyu.edu.hk/handle/200/3750 | - |
dc.language | English | en_US |
dc.publisher | Hong Kong Polytechnic University | - |
dc.rights | All rights reserved | en_US |
dc.title | Post-processing for handwritten Chinese character recognition | en_US |
dcterms.abstract | This thesis presents post-processing techniques that improve the performance of Handwritten Chinese Character Recognition (HCCR) systems by selecting the most promising candidate characters. Aiming to remove mis-recognized and unrecognized characters from the recognition result, three post-processing approaches are studied, and their performance is evaluated and compared: one based on contextual linguistic information, one based on the confusing-character characteristics produced by a recognizer, and a hybrid of the two. For the approach based on contextual linguistic information, a dictionary-based post-processing method is presented first. Its techniques, including sentence fragment detection and contextual approximate word matching for removing erroneous characters, are studied and their performance is evaluated. Post-processing techniques based on statistical language models are then proposed. A Chinese word bigram model is established and employed in HCCR post-processing to identify the most linguistically promising sentence, the one with the maximum word co-occurrence product, by selecting plausible candidate characters. To capture long-distance restrictions within Chinese sentences, the word bigram model is extended, prior to post-processing, to a distant word bigram model with a maximum distance of 3. The improvements the two models yield are evaluated and compared. To recover unrecognized characters and raise the theoretical upper limit on the improvement achievable by the approach based on contextual linguistic information, post-processing techniques based on the characteristics of the confusing characters produced by a recognizer are studied. By analyzing the recognition results for the training samples, the confusing characters for each character category are collected and assembled into a confusing character set. Based on this set, a statistical Noisy-Channel model is used to identify the most promising input character when a candidate sequence is given. This method proves effective in recovering unrecognized characters. Treating the confusing characters as observed features of character categories, a classification algorithm based on neural networks can be employed to identify the most plausible input that produced the candidate sequence. Altogether, the 3755 character categories in the GB2312-80 character set are clustered into several hundred groups by searching through the transitive closure of the similarity matrix associated with the confusing character set. A group of neural networks for these category groups is established and trained to produce a candidate that matches the input character and to adjust the confidence parameters of the candidates in a given candidate sequence. This achieves better performance than the method based on the Noisy-Channel model. A three-stage hybrid post-processing system is then built. First, the post-processing technique based on the confusing-character characteristics of a recognizer is applied to append similar-shaped characters to the candidate set. Then the dictionary-based method is employed to append linguistically likely characters and to bind the candidate characters into a word lattice. Finally, the statistical language model is applied to identify the most promising sentence by selecting plausible words from the word lattice. On average, this hybrid post-processing system improves the first-candidate recognition rate by 6.2% when the online HCCR engine's recognition rate is 90% for the first candidate and 95% for the top-10 candidates. For the offline HCCR engine, with original recognition rates of 81% and 92% for the first and the top-10 candidates, a 12% improvement in the first-candidate recognition rate is achieved. | en_US |
dcterms.extent | xiii, 162 leaves : col. ill. ; 30 cm | en_US |
dcterms.isPartOf | PolyU Electronic Theses | en_US |
dcterms.issued | 2001 | en_US |
dcterms.educationalLevel | All Master | en_US |
dcterms.educationalLevel | M.Phil. | en_US |
dcterms.LCSH | Optical character recognition devices | en_US |
dcterms.LCSH | Chinese character sets (Data processing) | en_US |
dcterms.LCSH | Chinese characters -- Data processing | en_US |
dcterms.LCSH | Pattern recognition systems | en_US |
dcterms.LCSH | Optical data processing | en_US |
dcterms.LCSH | Hong Kong Polytechnic University -- Dissertations | en_US |
dcterms.accessRights | open access | en_US |
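
The abstract describes selecting, from the recognizer's candidates at each position, the sequence that a bigram language model scores highest. The sketch below illustrates that general idea with a Viterbi-style search; it is not the thesis's implementation. The function name `best_sequence`, the `floor` back-off value, the use of log-probabilities, and the toy data are all assumptions made for illustration. The thesis works with words bound into a word lattice, whereas this sketch scores one token per position.

```python
import math


def best_sequence(candidates, bigram_logp, floor=-10.0):
    """Pick one token per position so that the summed score is maximal.

    candidates  : list of positions; each position is a list of
                  (token, recognizer_log_prob) pairs (hypothetical format).
    bigram_logp : dict mapping (previous_token, token) -> log-probability.
    floor       : assumed log-probability for unseen bigrams.
    """
    if not candidates:
        return []

    # Each layer maps token -> (best score so far, best previous token).
    prev = {tok: (lp, None) for tok, lp in candidates[0]}
    history = [prev]

    for position in candidates[1:]:
        cur = {}
        for tok, lp in position:
            best_prev, best_score = None, -math.inf
            for ptok, (pscore, _) in prev.items():
                score = pscore + bigram_logp.get((ptok, tok), floor) + lp
                if score > best_score:
                    best_prev, best_score = ptok, score
            cur[tok] = (best_score, best_prev)
        history.append(cur)
        prev = cur

    # Backtrack from the best-scoring token in the last layer.
    tok = max(prev, key=lambda t: prev[t][0])
    path = [tok]
    for layer in reversed(history[1:]):
        tok = layer[tok][1]
        path.append(tok)
    path.reverse()
    return path


# Toy data: the recognizer prefers "a" at position 0, but the bigram
# ("b", "y") is far more likely than any pair starting with "a".
toy_candidates = [[("a", -0.1), ("b", -0.5)], [("x", -0.2), ("y", -0.3)]]
toy_bigrams = {("a", "x"): -5.0, ("b", "y"): -0.1}
print(best_sequence(toy_candidates, toy_bigrams))
```

In this toy case the language model overrides the recognizer's first choice at position 0, which is the kind of correction the abstract attributes to post-processing based on contextual linguistic information.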
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
b15731807.pdf | For All Users | 8.4 MB | Adobe PDF | View/Open |
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/3750