Discovering clusters in databases using an evolutionary approach

Pao Yue-kong Library Electronic Theses Database


Author: Chung, Lap-hang Lewis
Title: Discovering clusters in databases using an evolutionary approach
Year: 2000
Subject: Data mining
Hong Kong Polytechnic University -- Dissertations
Department: Dept. of Computing
Pages: vii, 123 leaves : ill. ; 30 cm
Language: English
InnoPac Record: http://library.polyu.edu.hk/record=b1535255
URI: http://theses.lib.polyu.edu.hk/handle/200/3186
Abstract: Data mining is concerned with the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. The grand challenge of data mining is to collectively handle the problems imposed by the nature of real-world databases, which tend to be dynamic, incomplete, redundant, noisy, and very large. Among the many problems that data mining is concerned with, the problem of discovering clusters in databases has recently received more attention. The clustering problem is concerned with discovering meaningful groupings of data records in a database based on their attribute values. The ability to do so has applications in many areas, including business and finance, computing and engineering, and the natural and social sciences. Many existing clustering techniques were developed to handle a special type of data mining problem called spatial data mining. The databases involved contain continuous-valued records, and the techniques used are, by and large, based on distance measures that can be defined in Euclidean space. As a result, these techniques are not very useful for handling mixed continuous- and discrete-valued data records. Many of the clustering techniques that can deal with mixed data use separate distance measures for continuous- and discrete-valued data. Moreover, they are not good at handling data records that are noisy or that contain missing values. They are also unable to discover clusters whose boundaries overlap. Furthermore, many of them do not make explicit the characteristics of each cluster discovered or the differences between clusters, which makes the results difficult to interpret and use. To overcome these problems, we propose a new clustering algorithm in this thesis. Compared with existing algorithms, it has several advantageous features.
It is able to (i) handle mixed continuous- and discrete-valued data; (ii) discover overlapping clusters; (iii) perform data transformation; (iv) handle noisy and missing values; and (v) explicitly represent the characteristics of each discovered cluster. The proposed clustering algorithm is based on a simple genetic algorithm (GA). By representing a cluster label as a gene, a particular grouping of records is encoded in a chromosome. Once different groupings are generated, the most interesting chromosomes are evolved using the operators of selection, crossover, and mutation. To determine how interesting the chromosomes in the population are, all of them are evaluated by a fitness function. The fitness function is defined in terms of a probabilistic similarity measure and can be interpreted as an objective measure of the interestingness of the rules that characterize the particular grouping in a chromosome. Since the similarity measure is probabilistic, it can be defined even when the data contain noisy, dynamic, incomplete, missing, or erroneous values. In addition, the measure enables us to discover overlapping clusters. The ultimate goal of the evolutionary process is, therefore, to identify the fittest chromosome by maximizing the fitness function. During the process, a set of rules is discovered to characterize the specific grouping encoded in each chromosome. This non-black-box approach makes it possible for the patterns underlying the databases to be made explicit. To evaluate the performance of the proposed clustering approach, we used many sets of real and simulated data in our experiments. In addition, for comparison with existing methods, we introduced an evaluation criterion. The results of the experiments show that the proposed approach is able to handle real problems more effectively.
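
The encoding described in the abstract (one gene per record holding a cluster label, evolved by selection, crossover, and mutation) can be illustrated with a minimal sketch. This is not the thesis's algorithm: the probabilistic similarity measure is not given in this record, so the `fitness` function below is a stand-in assumption that scores how often records agree with the most common attribute value of their cluster. The function names, the tournament selection, the elitism step, and all parameter values are likewise hypothetical choices made for illustration, and the records are assumed to be already discretized.

```python
import random
from collections import Counter

def fitness(chromosome, records, k):
    """Stand-in fitness (an assumption, not the thesis's measure):
    fraction of attribute values that match their cluster's most common value."""
    score = 0
    for c in range(k):
        members = [r for r, g in zip(records, chromosome) if g == c]
        if not members:
            continue
        for j in range(len(records[0])):
            score += Counter(m[j] for m in members).most_common(1)[0][1]
    return score / (len(records) * len(records[0]))

def tournament(pop, scores, rng, size=3):
    """Selection: pick the fittest of `size` randomly drawn chromosomes."""
    contenders = rng.sample(range(len(pop)), size)
    return pop[max(contenders, key=lambda i: scores[i])]

def crossover(a, b, rng):
    """One-point crossover of two label vectors (needs at least 2 records)."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(chromosome, k, rate, rng):
    """Mutation: reassign each gene to a random cluster label with probability `rate`."""
    return [rng.randrange(k) if rng.random() < rate else g for g in chromosome]

def evolve(records, k=2, pop_size=30, generations=60, rate=0.05, seed=0):
    rng = random.Random(seed)
    n = len(records)
    # Each chromosome holds one gene per record; a gene is a cluster label.
    pop = [[rng.randrange(k) for _ in range(n)] for _ in range(pop_size)]
    best = max(pop, key=lambda ch: fitness(ch, records, k))
    for _ in range(generations):
        scores = [fitness(ch, records, k) for ch in pop]
        nxt = [best[:]]  # elitism (an added assumption): carry the best forward
        while len(nxt) < pop_size:
            child = crossover(tournament(pop, scores, rng),
                              tournament(pop, scores, rng), rng)
            nxt.append(mutate(child, k, rate, rng))
        pop = nxt
        best = max(pop, key=lambda ch: fitness(ch, records, k))
    return best, fitness(best, records, k)

if __name__ == "__main__":
    # Hypothetical toy records with two discrete attributes each.
    data = [("red", "small"), ("red", "small"), ("blue", "large"),
            ("blue", "large"), ("red", "small"), ("blue", "large")]
    labels, score = evolve(data, k=2)
    print(labels, round(score, 3))
```

Because each gene assigns a record to exactly one label, this sketch produces hard, non-overlapping clusters and does not handle missing values. The overlapping clusters and rule discovery described in the abstract depend on the thesis's own similarity measure, which is not reproduced here.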

Files in this item

Files Size Format
b15352559.pdf 4.797Mb PDF
