Unsupervised pattern discovery for sequence and mixed attribute databases

Wu, Pak-kit

Author:	Wu, Pak-kit
Title:	Unsupervised pattern discovery for sequence and mixed attribute databases
Degree:	M.Phil.
Year:	2011
Subject:	Database management. Database searching. Pattern recognition systems. Hong Kong Polytechnic University -- Dissertations
Department:	Department of Computing
Pages:	xi, 143 leaves : ill. ; 30 cm.
Language:	English
Abstract:	As the world's digital information grows ever larger and ever faster, there is a great need to reveal insights previously hidden in data of mixed types, so that the information can be well structured, effectively organized and applied to analysis, classification, interpretation, understanding and summarization. Because the data in databases come from diverse sources, much of it is not provided with explicit class information. A pattern discovery method that automatically discovers patterns and knowledge from data without relying on prior classificatory knowledge is therefore greatly needed. For a large database, how to discover statistically significant patterns and how to discretize continuous data into interval events remain open research and practical problems. Discovering patterns in a large mixed-mode database, whose attributes may be a mixture of interval-scaled, symmetric binary, asymmetric binary, categorical, ordinal and ratio-scaled types, is regarded as a classification problem when the classes of the samples are given. It is then solved as a discrete-data problem by discretizing each continuous attribute into intervals that maximize the interdependence between that attribute and the class labels. However, when class information is unavailable, discovering patterns becomes difficult. To tackle these problems without class information, which is the problem of unsupervised pattern discovery, one searches the database for statistically significant patterns. The proposed method takes a probabilistic approach to detect statistically significant patterns and transforms them into a relational table that represents the original data. Given a mixed-mode dataset, we partition its attributes into a number of attribute clusters, each of which contains correlated attributes. This process is known as attribute clustering.
Once the optimal attribute clusters are found, the most representative attribute in each cluster, called the mode, can be identified. To deal with the discretization problem, a mode-driven discretization algorithm is introduced: it treats the mode like a class label and uses it to drive the discretization of the other continuous attributes in the attribute group by maximizing the interdependence between each continuous attribute and the mode. With the intervals treated as discrete events, association patterns can then be discovered. If the attribute clusters obtained are crisp, significant patterns that span different clusters cannot be found. A new method of "fuzzifying" the crisp attribute clusters is therefore introduced to detect significant patterns that overlap different fuzzy clusters. To validate the premises proposed in the thesis, extensive experiments were conducted on a number of synthetic data sets, data sets from the UCI machine learning archive and two large data sets from real-world databases. In particular, to demonstrate the usefulness of the proposed approach, two large real-world data sets were analyzed: one from a number of meteorological surface stations and the other from a delayed coking unit in a petrochemical refinery. The patterns discovered in the weather station data reflect the local and global characteristics of the correlated meteorological parameters. The findings from the delayed coking data reveal the relationships among the large number of sensors and controllers in the coking plant facilities. These findings provide significant evidence of the usefulness and effectiveness of the proposed approaches in analyzing data to extract significant patterns and knowledge for interpretation, understanding and summarization.
Rights:	All rights reserved
Access:	open access
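
Illustrative note: the mode-driven discretization described in the abstract can be pictured with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the thesis: mutual information stands in for the "interdependence" measure, only a single cut point is searched, and the names mutual_information and best_cut and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences.

    Assumption: used here as a stand-in for the interdependence measure.
    """
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def best_cut(values, mode_labels):
    """Return (cut, score) for the single threshold whose two intervals
    share the most information with the discrete mode attribute.

    Candidate cuts are midpoints between consecutive distinct values.
    Returns (None, 0.0) when no cut improves on zero dependence.
    """
    distinct = sorted(set(values))
    best = (None, 0.0)
    for lo, hi in zip(distinct, distinct[1:]):
        cut = (lo + hi) / 2
        bins = [v > cut for v in values]  # two interval events: <= cut, > cut
        score = mutual_information(bins, mode_labels)
        if score > best[1]:
            best = (cut, score)
    return best


# Hypothetical toy data: a continuous attribute and an already-discrete mode.
temperature = [12.1, 13.4, 14.0, 21.5, 22.3, 23.8]
mode = ["low", "low", "low", "high", "high", "high"]
cut, score = best_cut(temperature, mode)
print(cut, score)
```

In this toy case the chosen cut falls between 14.0 and 21.5, where the two intervals line up exactly with the two mode values. A fuller treatment would search several cut points per attribute and repeat the search for every continuous attribute in the cluster.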

Files in This Item:

File	Description	Size	Format
b24562063.pdf	For All Users	3.44 MB	Adobe PDF

Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/6189