Author: Hu, Yanxing
Title: Feature representation for large-scale data set
Advisors: You, Jane (COMP)
Liu, N. K. James (COMP)
Degree: Ph.D.
Year: 2021
Subject: Machine learning
Algorithms -- Data processing
Data mining
Hong Kong Polytechnic University -- Dissertations
Department: Department of Computing
Pages: xviii, 175 pages : color illustrations
Language: English
Abstract: Feature representation is one of the most important research topics in machine learning (ML). In machine learning, feature representation means mapping the raw data into a new feature space that can be exploited effectively in learning tasks. Many supervised and unsupervised approaches, including supervised dictionary learning, fuzzy and rough logic, Principal Component Analysis (PCA), and locally linear embedding, have been employed for feature representation of different types of data sets. The arrival of the big data era brings both opportunities and challenges to the study of feature representation. In real applications, the scale and complexity of the data far exceed those of earlier scenarios. On the one hand, the large volume of data allows more complicated models to be employed for feature representation; on the other hand, multiple data sources, complicated data structures, and high computational requirements bring new difficulties to feature representation for huge data sets. This study concentrates on the feature representation problem for large-scale data sets and related applications, and proposes new algorithms so that the resulting feature mappings yield better results in machine learning tasks. Our study starts with feature representation for data sets with discrete values. In such data sets, the features often carry categorical information about the data points. This study addresses feature representation for this kind of data with a novel rough set-based feature reduction approach that efficiently and reliably extracts the necessary information from the features while removing redundant information from the data set.
Our second work provides a matrix decomposition-based unsupervised pre-training approach for feature representation. One important family of unsupervised feature representation approaches is based on clustering models. However, clustering approaches are time-consuming, especially for large-scale data sets. An eigenvector-based unsupervised pre-training approach is therefore proposed for feature representation and used as the first layer of a Radial Basis Function Neural Network (RBFNN). Our third work concentrates on feature representation for data from multiple sources/views. A canonical correlation-based autoencoder model is proposed for the feature fusion representation of multi-domain data sets. The proposed model is then applied to wind speed forecasting to improve forecasting accuracy. Finally, we propose a localized generalization error-based data reduction approach. This approach can reliably reduce the training set for some large-scale data sets, which offers an idea for large-scale learning tasks. It is closely related to the distribution of the values of each feature, and this work shows that the representation of the features can affect the number of training samples needed. In summary, we make the following contributions: (i) algorithms and applications for feature representation on different types of large-scale data sets; (ii) a multi-domain feature fusion approach and its applications; (iii) algorithms for computing the safe regions for the sum-optimal point notification problem.
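The abstract's first contribution, rough set-based feature reduction for discrete data, can be illustrated with the classical rough-set idea of a reduct: drop an attribute whenever the remaining attributes still determine the decision for the same rows. The sketch below is only that textbook idea, not the novel approach proposed in the thesis; the function names, the greedy removal order, and the toy table are all assumptions made for illustration.

```python
from collections import defaultdict

def positive_region_size(rows, attrs, decision):
    """Count rows whose equivalence class under `attrs` has one decision value."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in attrs)   # rows with equal keys are indiscernible
        classes[key].append(row[decision])
    return sum(len(ds) for ds in classes.values() if len(set(ds)) == 1)

def greedy_reduct(rows, attrs, decision):
    """Drop attributes one by one while the positive region stays unchanged."""
    full = positive_region_size(rows, attrs, decision)
    kept = list(attrs)
    for a in attrs:
        trial = [b for b in kept if b != a]
        if positive_region_size(rows, trial, decision) == full:
            kept = trial                     # `a` is redundant given the others
    return kept

# Hypothetical toy table: "shape" duplicates the information in "color".
rows = [
    {"color": "red",  "size": "S", "shape": "round",  "label": "yes"},
    {"color": "red",  "size": "L", "shape": "round",  "label": "no"},
    {"color": "blue", "size": "S", "shape": "square", "label": "no"},
    {"color": "blue", "size": "L", "shape": "square", "label": "yes"},
    {"color": "red",  "size": "S", "shape": "round",  "label": "yes"},
    {"color": "blue", "size": "L", "shape": "square", "label": "yes"},
]
attrs = ["color", "size", "shape"]
reduct = greedy_reduct(rows, attrs, "label")
print(reduct)  # ['size', 'shape']
```

Because "color" and "shape" carry the same information in this table, one of them can be removed without losing the ability to determine the label. Which one survives depends on the removal order, so a greedy pass like this returns one valid reduct rather than a unique or minimal one.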
Rights: All rights reserved
Access: open access

Files in This Item:
File: 5850.pdf
Description: For All Users
Size: 4.24 MB
Format: Adobe PDF




Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/11412