Author: Li, Yan
Title: Building a decision cluster classification model by a clustering algorithm to classify large high dimensional data with multiple classes
Degree: Ph.D.
Year: 2010
Subject: Hong Kong Polytechnic University -- Dissertations
Cluster analysis -- Data processing
Dimensional analysis -- Data processing
Computer algorithms
Data mining
Department: Department of Computing
Pages: xv, 144 p. : ill. ; 30 cm.
Language: English
Abstract: Clustering and classification are two basic tasks in data mining. As the complexity of data increases, existing classification techniques face many challenges, for instance classifying large, high dimensional data with multiple classes, so new techniques are needed to handle data of large volume and high dimensionality. In this thesis, we propose a way to attack this problem by integrating a clustering algorithm into the classification process. We propose a new classification framework consisting of three phases: (i) a clustering algorithm is called recursively to build a decision cluster tree; (ii) a classification model is built from this decision cluster tree; (iii) new samples are classified by this model. Several research problems arise within this framework, and this thesis describes our methodology for addressing them.

Within this framework, we propose a new classification method, ADCC (Automatic Decision Cluster Classifier), which uses the variable weighting k-means algorithm W-k-means to build the decision cluster tree, so that variable weights for each dimension can be learned from the training data and used in classification. In partitioning the training data, W-k-means automatically computes the variable weights according to the data distribution, so that important variables receive larger weights and noisy variables receive smaller ones. When clustering a data set (i.e., a node), the class variable is removed from the data so that it has no impact on the clustering result; it is used only to determine the dominant class of each cluster. To build a better cluster tree, we introduce effective methods for selecting the number of clusters and the initial cluster centers at each node. Furthermore, we use several tests, including the Anderson-Darling test, to decide whether a node should be partitioned further; in this way the distribution of the training samples at a node is considered together with the node's purity and size. A decision cluster classifier consists of a set of disjoint decision clusters, each labeled with a dominant class that determines the class of new objects falling into the cluster. A series of experiments on both synthetic and real data sets shows that ADCC outperforms KNN, decision trees and SVM in accuracy and scalability, and that it is particularly suitable for large, high dimensional data with many classes.
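The variable-weighting step of W-k-means is central to ADCC. Below is a minimal, illustrative Python sketch of weighted k-means in the spirit of W-k-means, flattened to a single partition rather than the recursive tree the thesis builds; `beta`, the iteration count, the random initialization, and the helper names (`w_kmeans`, `dominant_classes`, `adcc_predict`) are assumptions for illustration, not the thesis's settings.

```python
import numpy as np

def w_kmeans(X, k, beta=2.0, n_iter=50, seed=0):
    """Variable-weighted k-means: returns labels, centers and per-dimension weights."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, m = X.shape
    centers = X[rng.choice(n, k, replace=False)].copy()  # random initial centers
    weights = np.full(m, 1.0 / m)                        # start from equal weights

    for _ in range(n_iter):
        # Assign each object to the nearest center under the weighted distance.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2 * weights**beta).sum(axis=2)
        labels = d.argmin(axis=1)

        # Recompute each non-empty cluster's center as the mean of its members.
        for l in range(k):
            if np.any(labels == l):
                centers[l] = X[labels == l].mean(axis=0)

        # Per-dimension within-cluster dispersion: small dispersion means the
        # dimension separates the clusters well, so it gets a larger weight,
        # while noisy dimensions are down-weighted.
        D = ((X - centers[labels]) ** 2).sum(axis=0)
        D = np.maximum(D, 1e-12)                         # guard against zero dispersion
        weights = 1.0 / ((D[:, None] / D[None, :]) ** (1.0 / (beta - 1))).sum(axis=1)

    return labels, centers, weights

def dominant_classes(labels, y, k):
    # Label each decision cluster with its dominant (majority) class.
    return np.array([np.bincount(y[labels == l]).argmax() if np.any(labels == l)
                     else -1 for l in range(k)])

def adcc_predict(x, centers, weights, cluster_class, beta=2.0):
    # A new object gets the dominant class of its nearest decision cluster.
    d = (((centers - x) ** 2) * weights**beta).sum(axis=1)
    return cluster_class[int(d.argmin())]
```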
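One of the stopping tests named above is the Anderson-Darling test. A hedged sketch of how such a test could gate further partitioning of a node, following the common G-means-style recipe of testing normality along the first principal component; the thesis combines this with purity and size criteria, and `min_size` and the 5% level are illustrative choices:

```python
import numpy as np
from scipy.stats import anderson

def node_is_splittable(X, min_size=30):
    X = np.asarray(X, dtype=float)
    if len(X) < min_size:                     # too small: keep the node as a leaf
        return False
    Xc = X - X.mean(axis=0)
    # Project the node's objects onto their first principal direction.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ vt[0]
    result = anderson(proj, dist='norm')
    # Compare the A-D statistic with the 5% critical value: if the projection
    # already looks Gaussian, treat the node as a single cluster.
    crit_5pct = result.critical_values[list(result.significance_level).index(5.0)]
    return result.statistic > crit_5pct       # non-Gaussian => partition further
```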
Sometimes ADCC generates weak decision clusters in which no single class dominates. Their presence can hurt the classification performance of the model: in a weak decision cluster there is no dominant class, so it is difficult to decide the class of new objects, and we show that classification accuracy improves when weak decision clusters are removed from the model. Weak decision clusters arise because objects of different classes are mixed during the clustering that generates the decision clusters. If we assume that objects of the same class have their own cluster distribution, we can separate objects by their class labels, generate a decision cluster tree for each class, and then combine the decision clusters of the different classes into the decision cluster classification model; in this way weak decision clusters are avoided. We therefore propose the Decision Cluster Forest (DCF) method, which builds a set of decision cluster trees (a decision cluster forest) that together form a classification model. Instead of building a single decision cluster tree from the entire training data, we build one tree from each class's subset of the training data. The premise is that objects of the same class tend to have their own spatial distribution in the data space, so decision clusters are found per class, all decision clusters in a tree share the same dominant class, and no weak cluster can be created. A decision cluster model is selected from the set of leaf decision clusters of the forest, so the model is called a decision cluster forest classification model (DCFC). The DCF method is advantageous for data with multiple classes because the DCFC model is guaranteed to contain decision clusters of every class, making it a more intuitive and direct multi-class classification method.
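A minimal sketch of the DCF idea under the assumption stated above, reusing the `w_kmeans` sketch from earlier and flattening each per-class tree to a set of cluster centers; `k_per_class` is an illustrative parameter, not the thesis's model-selection procedure:

```python
import numpy as np

def build_dcf(X, y, k_per_class=3, seed=0):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    forest = []                                   # list of (center, weights, class)
    for c in np.unique(y):
        Xc = X[y == c]                            # cluster each class separately
        k = min(k_per_class, len(Xc))             # small classes get fewer clusters
        _, centers, weights = w_kmeans(Xc, k, seed=seed)   # sketch defined above
        forest.extend((z, weights, c) for z in centers)
    return forest

def dcf_predict(x, forest, beta=2.0):
    # Every cluster carries exactly one class, so weak clusters cannot occur,
    # and the model is guaranteed to contain clusters of every class.
    dists = [(((z - x) ** 2) * w**beta).sum() for z, w, _ in forest]
    return forest[int(np.argmin(dists))][2]
```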
We further propose a different classification method based on the tree structure: the Crotch Ensemble classification model for high dimensional data with multiple classes. Generated from a decision cluster tree, a crotch is an inner node of the tree together with its direct children. If the dominant classes of a crotch's children are not all the same, the crotch is a crotch predictor, a classifier in its own right, and a crotch ensemble consists of a set of such crotch predictors. To classify a new object, a subset of crotch predictors is selected according to the distances between the object and the crotches, and the object is assigned the class predicted by the crotch predictors with the maximum accumulated weight (see the sketches below). Experimental results on both synthetic and real data show that the Crotch Ensemble model is efficient and effective in classifying new samples.

We also present a special application of the framework to text classification, in which a subspace clustering algorithm is integrated to build the decision cluster tree and the cosine distance metric is adopted. Experimental results show that the framework can accommodate different clustering algorithms and other methods and yields better classification results on text data.

Finally, we give a theoretical analysis of the error bound of the DCC model and prove that our cluster-based classification model (the DCC model) is better than the object-based classification method.
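A hedged sketch of the crotch-predictor voting described above: each crotch votes for the dominant class of its child nearest to the new object. The nearest-crotch selection rule, the `top` parameter, and the inverse-distance vote weights are illustrative stand-ins; the thesis's actual selection and weighting scheme may differ.

```python
import numpy as np

class Crotch:
    """An inner node of a decision cluster tree together with its direct children."""
    def __init__(self, child_centers, child_classes):
        self.child_centers = np.asarray(child_centers, dtype=float)
        self.child_classes = list(child_classes)

    def predict(self, x):
        # The crotch predicts the dominant class of the child nearest to x,
        # and reports that distance so the ensemble can weight its vote.
        d = ((self.child_centers - x) ** 2).sum(axis=1)
        i = int(d.argmin())
        return self.child_classes[i], float(d[i])

def crotch_ensemble_predict(x, crotches, top=5):
    # Select the `top` crotch predictors closest to x (illustrative rule)...
    scored = sorted((c.predict(x) for c in crotches), key=lambda t: t[1])[:top]
    # ...then accumulate inverse-distance weights per predicted class and
    # return the class with the maximum accumulated weight.
    votes = {}
    for cls, dist in scored:
        votes[cls] = votes.get(cls, 0.0) + 1.0 / (1.0 + dist)
    return max(votes, key=votes.get)
```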
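For the text application, distances would be measured with the cosine metric on document vectors (typically tf-idf) instead of weighted Euclidean distance; the subspace clustering algorithm itself is not sketched here. A minimal cosine distance:

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 for identical directions, up to 2 for opposite.
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0 or nb == 0:
        return 1.0                 # convention for empty documents
    return 1.0 - float(a @ b) / (na * nb)
```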
Rights: All rights reserved
Access: open access

Files in This Item:
File: b23930512.pdf | Description: For All Users | Size: 2.72 MB | Format: Adobe PDF



Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/5914