Author:  Li, Yan 
Title:  Building a decision cluster classification model by a clustering algorithm to classify large high dimensional data with multiple classes 
Degree:  Ph.D. 
Year:  2010 
Subject:  Hong Kong Polytechnic University -- Dissertations; Cluster analysis -- Data processing; Dimensional analysis -- Data processing; Computer algorithms; Data mining 
Department:  Dept. of Computing 
Pages:  xv, 144 p. : ill. ; 30 cm. 
InnoPac Record:  http://library.polyu.edu.hk/record=b2393051 
URI:  http://theses.lib.polyu.edu.hk/handle/200/5914 
Abstract:  Clustering and classification are two basic tasks in data mining. As the complexity of data increases, existing classification techniques face many challenges, for instance, classifying large high dimensional data with multiple classes. New techniques are therefore needed to deal with data of large volume and high dimensionality. In this thesis, we propose a possible way to solve this problem by integrating a clustering algorithm into the classification process. We propose a new classification framework that consists of three phases: (i) a clustering algorithm is called recursively to build a decision cluster tree, (ii) a classification model is built from this decision cluster tree, (iii) new samples are classified by this classification model. This framework raises a number of research problems, and the thesis describes our methodology for addressing them. 
Within this framework, we propose a new classification method, ADCC (Automatic Decision Cluster Classifier), which uses a variable weighting k-means algorithm, W-k-means, to build a decision cluster tree so that the variable weights of each dimension can be obtained from the training data and used in classification. In partitioning the training data, W-k-means automatically computes the variable weights according to the data distributions, so that important variables get more weight and noisy variables get less weight. In clustering a data set (i.e., a node), the class variable is removed from the data, so the class variable has no impact on the clustering results. The class variable is used in determining the dominant class for each cluster. To build a better cluster tree, effective methods for selecting the number of clusters and the initial cluster centers at each node are introduced. Furthermore, we use various tests, including the Anderson-Darling test, to determine whether a node can be further partitioned or not. In this way, the distribution of the training samples at each node is considered together with the purity and the size of the node. A decision cluster classifier consists of a set of disjoint decision clusters, each labeled with a dominant class that determines the class of new objects falling in the cluster. A series of experiments on both synthetic and real data sets has been conducted. The results show that the new classification method (ADCC) performed better in accuracy and scalability than the existing methods of KNN, decision tree and SVM. It is particularly suitable for large, high dimensional data with many classes. 
The ADCC method sometimes generates weak decision clusters in which no single class dominates. The existence of weak decision clusters in the model can affect its classification performance: because a weak decision cluster has no dominant class, it is difficult to determine the class of new objects falling in it. It has been shown that classification accuracy could be improved when weak decision clusters were removed from the model. Weak decision clusters occur because objects of different classes are mixed in the clustering process that generates decision clusters. If we assume that objects in the same class have their own cluster distributions, we can separate objects of different classes according to the object class labels and generate a decision cluster tree for each class of objects. Then, we combine the decision clusters of different classes to form the decision cluster classification model. In this way, weak decision clusters can be avoided. We propose a Decision Cluster Forest (DCF) method to build a set of decision cluster trees (a decision cluster forest) which form a classification model. Instead of building a single decision cluster tree from the entire training data, we build a set of cluster trees from subsets of the training data set to form a decision cluster forest. Each tree in the forest is built from the subset of objects in the same class. The premise of this method is that the objects in the same class tend to have their own spatial distributions in the data space. Therefore, decision clusters of objects in the same class are found. The decision clusters in the same tree have the same dominant class, so no weak cluster is created in such a decision cluster tree. A decision cluster model can be selected from the set of leaf decision clusters of the decision cluster forest, so the model is called a decision cluster forest classification model (DCFC). The decision cluster forest method has advantages in classifying data with multiple classes because the DCFC model is guaranteed to contain decision clusters in all classes. The DCFC model is a more intuitive and direct multiclass classification method. 
We also propose a different classification method based on the tree structure: a Crotch Ensemble classification model for high dimensional data with multiple classes. Generated from a decision cluster tree, a crotch is an inner node of the tree together with its direct children. If the dominant classes of the children of a crotch are not all the same, the crotch is defined as a crotch predictor, which is a classifier by itself. A crotch ensemble consists of a set of crotch predictors. When classifying a new object, a subset of crotch predictors is selected according to the distances between the object and the crotches. The object is assigned the class predicted by the crotch predictors with the maximum accumulated weight. The experimental results on both synthetic and real data have shown that the Crotch Ensemble model is efficient and effective when classifying new samples. 
We propose a special application of our framework to text data classification. A subspace clustering algorithm is integrated to build the decision cluster tree, and we adopt the cosine distance metric for this application. Experimental results have shown that our framework can integrate different clustering algorithms and other possible methods, and can achieve better classification results for text classification. Finally, we give a theoretical analysis of the error bound of our DCC model. We prove that our Cluster-based classification model (DCC model) is better than the Object-based classification method. 
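The three-phase framework described in the abstract can be illustrated with a small sketch. This is not the thesis's ADCC implementation: it assumes plain k-means in place of the variable-weighting k-means algorithm, stops splitting on purity and node size only (no Anderson-Darling test), and classifies a new sample by its nearest leaf cluster center. All function names, parameters and the toy data are hypothetical.

```python
import random
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (a stand-in without variable weights).
    Class labels are never seen here. Returns non-empty groups of indices."""
    centers = random.Random(seed).sample(points, k)
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(i)
        # Recompute centers; an empty group keeps its previous center.
        centers = [mean([points[i] for i in g]) if g else centers[j]
                   for j, g in enumerate(groups)]
    return [g for g in groups if g]

def build_leaves(X, y, k=2, min_size=2, min_purity=0.95, seed=0):
    """Phases (i) and (ii): split nodes recursively and return the leaf
    decision clusters as (center, dominant_class) pairs."""
    dominant, count = Counter(y).most_common(1)[0]
    if len(X) < max(min_size, k) or count / len(X) >= min_purity:
        return [(mean(X), dominant)]
    groups = kmeans(X, k, seed=seed)
    if len(groups) < 2:  # the node could not be partitioned further
        return [(mean(X), dominant)]
    leaves = []
    for g in groups:
        leaves += build_leaves([X[i] for i in g], [y[i] for i in g],
                               k, min_size, min_purity, seed)
    return leaves

def classify(leaves, x):
    """Phase (iii): label x with the dominant class of the nearest leaf."""
    _, label = min(leaves, key=lambda leaf: sq_dist(x, leaf[0]))
    return label

# Toy data (hypothetical): three well-separated 3x3 grids, one per class.
X, y = [], []
for label, (cx, cy) in {"a": (0, 0), "b": (10, 10), "c": (0, 10)}.items():
    for dx in (-0.5, 0.0, 0.5):
        for dy in (-0.5, 0.0, 0.5):
            X.append([cx + dx, cy + dy])
            y.append(label)

leaves = build_leaves(X, y)
print(classify(leaves, [0.1, 0.2]))
```

The class labels are used only to name each cluster's dominant class and to decide when to stop splitting, which mirrors the abstract's statement that the class variable has no impact on the clustering results.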
Files  Size  Format 

b23930512.pdf  2.781Mb 

