Author: | Chong, Fu-shan |
Title: | Fast Chinese approximate search |
Degree: | M.Sc. |
Year: | 2001 |
Subject: | Chinese character sets (Data processing); Hong Kong Polytechnic University -- Dissertations |
Department: | Multi-disciplinary Studies; Department of Computing |
Pages: | 48, [15] leaves : ill. ; 30 cm |
Language: | English |
Abstract: | Almost every business area has now been computerized. In many areas where information must be interchanged between different entities, e.g. the banking system, those entities may run different kinds of computer systems with different data layouts. For example, a personal name may be stored as a single string field in one system, and as a surname field plus an other-names field in another. Hence, the interchanged data may differ to some extent because of data conversion or spelling mistakes. A central database storing the key information would solve the problem completely, but in many business areas this cannot be achieved because of data privacy and politics. Intelligent approximate string matching may therefore be introduced to solve the problem [11], so that information can be interchanged with a certain degree of freedom. However, because of minor differences and differing format standards, matching between systems may yield one-to-many or one-to-zero records. To handle these cases and minimize incorrect matches, several steps are introduced. First, the input string is reformatted based on certain key words, such as 有限公司 ("limited company"), so that the major key words in the string are isolated for the following steps. These rules are context dependent: one set for personal names, one for company names and one for addresses. These rules are to be determined. The next important step is the way strings are compared once their major key words have been isolated. Many different rules could be applied here, depending on the business area, the knowledge of the users, etc. Two common ways to weight a match are by the pronunciation and by the character appearance of the major key words. These rules are to be determined. Searching tools for English names have been developed in a very flexible way, e.g. IBM's algorithm [13], but not for Chinese names, because the Chinese language differs from the Indo-European language family. Besides the above features, the search engine should run fast enough for on-line searching, which is a great challenge for the algorithm chosen for implementation. The search engine should also fully utilize the functionality provided by relational database systems, such as B+ indexes [1], for easy management. This search engine could be applied in a wide range of applications, such as Internet searching, personal matching, etc. |
Rights: | All rights reserved |
Access: | restricted access |
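Illustration: the two matching steps named in the abstract (isolating the major key words, then comparing what remains) can be sketched roughly as follows. This is not the thesis's algorithm. The keyword list `COMPANY_KEYWORDS` and both function names are hypothetical, and a generic character-level similarity from Python's standard `difflib` stands in for the pronunciation and character-appearance weighting, which the abstract leaves "to be determined".

```python
from difflib import SequenceMatcher

# Hypothetical keyword list for company names; longer keywords come first
# so that "有限公司" is removed before its substring "公司".
COMPANY_KEYWORDS = ("有限公司", "公司")


def isolate_major_keywords(name: str) -> str:
    """Strip generic company keywords so only the distinctive part remains."""
    core = name.strip()
    for keyword in COMPANY_KEYWORDS:
        core = core.replace(keyword, "")
    return core


def similarity(a: str, b: str) -> float:
    """Return a character-level similarity in [0, 1] after keyword isolation.

    Assumption: plain sequence matching replaces the pronunciation- and
    shape-based weights described in the abstract.
    """
    return SequenceMatcher(
        None, isolate_major_keywords(a), isolate_major_keywords(b)
    ).ratio()
```

Under these assumptions, `similarity("大華貿易有限公司", "大華貿易公司")` treats the two names as identical once the generic suffixes are removed, while traditional and simplified spellings of the same name score lower because the sketch has no notion of character variants.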
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
b15996098.pdf | For All Users (off-campus access for PolyU Staff & Students only) | 1.83 MB | Adobe PDF |
Please use this identifier to cite or link to this item:
https://theses.lib.polyu.edu.hk/handle/200/2795