Author: Chong, Fu-shan
Title: Fast Chinese approximate search
Degree: M.Sc.
Year: 2001
Subject: Chinese character sets (Data processing)
Hong Kong Polytechnic University -- Dissertations
Department: Multi-disciplinary Studies
Department of Computing
Pages: 48, [15] leaves : ill. ; 30 cm
Language: English
Abstract: Nowadays, almost every business area has been computerized. In many areas where information must be interchanged between different entities, e.g. the banking system, those entities may have different kinds of computer systems and different data layouts. For example, a personal name may be stored as a single string field in one system and as separate surname and other-name string fields in another. Hence, the data being interchanged may differ to some extent because of data conversion or spelling mistakes. A central database system storing the key information would solve the problem perfectly, but this cannot be achieved in many business areas because of data privacy and politics. Hence, intelligent approximate string matching may be introduced to solve the problem [11]. Information could then be interchanged with a certain degree of freedom. However, matching records between systems may yield one-to-many or one-to-zero results because of minor differences and different format standards. To cater for these cases and minimize incorrect matches, several steps are introduced. First, the input string is reformatted based on key words, such as 有限公司 ("Limited Company"), so that the major key words in the string can be isolated for the next steps. These rules are context dependent: one set for personal names, one for company names and one for addresses. These rules are to be determined. The next important step is to compare strings whose major key words have been isolated. Many different rules could be applied here, depending on the business area, the knowledge of the users, etc. Two common ways to weight the match are by pronunciation and by character appearance of the major key words. These rules are to be determined. Searching tools for English names have been developed in a very flexible way, e.g. IBM's algorithm [13], but not for Chinese names. This is because the Chinese language is different from the Indo-European language system. Besides the above features, the search engine should run fast enough for on-line searching, which is a great challenge for the algorithm chosen for implementation. The search engine should also fully utilize the functionality provided by relational database systems, such as the B+ index [1], for easy management. This search engine could be applied in a wide range of applications, such as internet searching, personal matching, etc.
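The two steps described in the abstract (isolating the major key words, then comparing the isolated strings) can be illustrated with a minimal Python sketch. It is not the method of the dissertation, whose rules are stated as still to be determined. The key-word list `COMPANY_KEYWORDS`, the function names and the 0.6 threshold are hypothetical, and the character-level comparison uses Python's standard `difflib.SequenceMatcher` as a stand-in for the pronunciation and character-appearance weighting.

```python
from difflib import SequenceMatcher

# Hypothetical generic key words for company names (longest first).
# The abstract leaves the real, context-dependent rules open.
COMPANY_KEYWORDS = ("有限公司", "公司")


def isolate_key(name, keywords=COMPANY_KEYWORDS):
    """Remove whitespace and one trailing generic key word, keeping the major part."""
    cleaned = "".join(name.split())
    for kw in keywords:
        if cleaned.endswith(kw) and len(cleaned) > len(kw):
            return cleaned[: -len(kw)]
    return cleaned


def similarity(a, b):
    """Character-level similarity in [0, 1] between the isolated major parts."""
    return SequenceMatcher(None, isolate_key(a), isolate_key(b)).ratio()


def search(query, records, threshold=0.6):
    """Return records whose similarity to the query meets the threshold, best first."""
    scored = [(similarity(query, r), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for score, r in scored if score >= threshold]


if __name__ == "__main__":
    records = ["大昌貿易公司", "恒生銀行有限公司"]
    print(search("大昌貿易有限公司", records))
```

A linear scan like this would not meet the on-line speed requirement on a large table; it only shows the shape of the matching step, not how it would be combined with a database index.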
Rights: All rights reserved
Access: restricted access

Files in This Item:
File: b15996098.pdf
Description: For All Users (off-campus access for PolyU Staff & Students only)
Size: 1.83 MB
Format: Adobe PDF



Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/2795