Author: Liu, Yongxu
Title: Generalizing oversampling methods from euclidean space data to graph structured data for the class imbalance problem
Advisors: Liu, Yan (COMP)
Degree: Ph.D.
Year: 2022
Subject: Computer algorithms
Pattern recognition systems
Hong Kong Polytechnic University -- Dissertations
Department: Department of Computing
Pages: xv, 110 pages : color illustrations
Language: English
Abstract: Over the last decade, classification, one of the most important pattern recognition tasks, has enabled a wide range of real-world applications. Although many classic classification algorithms have been proposed, some of them perform poorly when the class distribution is imbalanced, that is, when some minority classes have far fewer samples than the majority classes. By generating new samples for the minority class, oversampling methods offer a promising way to alleviate the class imbalance problem. Two concerns remain: the noise generated by oversampling methods and the narrow scope on which they focus. First, oversampling algorithms use interpolation to generate samples between chosen source samples. If a minority sample is far from its minority neighbors, the interpolated samples can easily fall in the majority region and make classification harder. Second, most existing oversampling methods focus on numerical variables and binary classification, yet categorical variables and multi-class classification are ubiquitous in the real world, and extending existing methods to these more complex scenarios raises several non-trivial challenges. Furthermore, we delve deeper into the design of oversampling algorithms and observe that most of them assume the data lies in Euclidean space, which severely limits their scope. Non-Euclidean data, such as graph-structured data, arise in numerous applications.
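The interpolation step described above can be illustrated with a minimal sketch. It assumes a SMOTE-style scheme with squared Euclidean distance and a single nearest minority neighbor; the function names and the toy data are hypothetical and are not taken from the thesis.

```python
import random

def nearest_minority_neighbor(index, minority):
    """Index of the closest other minority sample (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    candidates = [j for j in range(len(minority)) if j != index]
    return min(candidates, key=lambda j: sq_dist(minority[index], minority[j]))

def interpolate(source, neighbor, rng=random):
    """Return a synthetic point on the segment between two minority samples."""
    gap = rng.random()  # interpolation weight in [0, 1)
    return [s + gap * (n - s) for s, n in zip(source, neighbor)]

# Toy minority class: two nearby samples and one isolated sample.
minority = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]]
rng = random.Random(0)
j = nearest_minority_neighbor(2, minority)  # the isolated sample's neighbor is far away
synthetic = interpolate(minority[2], minority[j], rng)
```

Because the isolated sample's nearest minority neighbor is distant, the synthetic point can land anywhere on a long segment, which is the situation in which it may fall inside a majority region.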
Thus, this thesis, titled Generalizing Oversampling Methods from Euclidean Space Data to Graph Structured Data for the Class Imbalance Problem, proposes three progressive steps to alleviate the above issues. First, to reduce the noise in generated samples, we propose PABIO, an efficient position-aware safe boundary interpolation oversampling algorithm for Euclidean space data. We use a combined clustering algorithm that does not merge two dense clusters into one, so that new samples can be generated safely within the discovered clusters. Moreover, we leverage majority class information to learn a safe boundary for generating samples: we force the synthetic samples to be closer to their minority neighbors than to any majority instance in the embedding space. This avoids generating noisy samples, especially for minority samples that are far from their minority neighbors.
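The safe-boundary condition can be read as a simple distance test. The sketch below is only an illustration under strong assumptions: it checks the condition after the fact in the input space with Euclidean distance, whereas the abstract describes a boundary learned in an embedding space. The function `is_safe` and the toy points are hypothetical.

```python
import math

def is_safe(candidate, minority_neighbors, majority):
    """True if the candidate is nearer to every given minority neighbor
    than to its closest majority sample (a strict reading, assumed here)."""
    nearest_majority = min(math.dist(candidate, m) for m in majority)
    farthest_minority = max(math.dist(candidate, n) for n in minority_neighbors)
    return farthest_minority < nearest_majority

majority = [[5.0, 5.0]]

# Candidate interpolated between two close minority samples, far from the majority.
safe = is_safe([0.5, 0.0], [[0.0, 0.0], [1.0, 0.0]], majority)

# Candidate interpolated between two distant minority samples, next to a majority sample.
unsafe = is_safe([4.5, 4.5], [[0.0, 0.0], [9.0, 9.0]], majority)
```

A filter of this kind would reject the second candidate, which sits between two far-apart minority samples but right beside a majority instance.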
Second, we propose NROMM, a noise-robust oversampling algorithm for multi-class data with categorical variables and an underlying Euclidean space. NROMM uses a heterogeneous distance metric to measure the difference between samples with numerical and categorical variables. We propose a one-versus-ensemble decomposition method to binarize the multi-class problem. Except for the class with the largest number of samples, we oversample one class at a time, and we pool the original and synthetic samples from the classes larger than the class being oversampled to form the majority class. We also propose a cleaning strategy to alleviate the aggregated class overlapping problem in the multi-class scenario: after all classes have been oversampled, we identify the overlapping sets and then remove the inland samples in these sets to enrich the class boundaries.
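Two of the ideas above can be sketched briefly. The abstract does not say which heterogeneous metric NROMM uses, so the example substitutes the well-known Heterogeneous Euclidean-Overlap Metric (HEOM) purely as a stand-in. The decomposition helper only lists which classes would be pooled as the majority for each oversampling round; its name, its tie-breaking rule, and the toy labels are assumptions.

```python
import math
from collections import Counter

def heom(a, b, categorical, ranges):
    """HEOM distance between two mixed-type samples (stand-in metric).

    categorical: set of attribute indices treated as categorical.
    ranges: dict mapping each numerical index to (max - min) over the data.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in categorical:
            d = 0.0 if x == y else 1.0  # overlap distance
        else:
            d = abs(x - y) / ranges[i] if ranges[i] else 0.0  # range-normalized
        total += d * d
    return math.sqrt(total)

def one_versus_ensemble_tasks(labels):
    """List (target_class, pooled_majority_classes) pairs, largest class skipped.

    Classes are ordered by descending size; ties are broken by name (assumed).
    """
    counts = Counter(labels)
    order = sorted(counts, key=lambda c: (-counts[c], str(c)))
    return [(order[pos], order[:pos]) for pos in range(1, len(order))]

d = heom([1.0, "red", 30.0], [3.0, "blue", 30.0],
         categorical={1}, ranges={0: 4.0, 2: 50.0})
tasks = one_versus_ensemble_tasks(["a"] * 5 + ["b"] * 3 + ["c"] * 1)
```

In this toy run, class `b` would be oversampled against `a`, and class `c` against the pooled samples of `a` and `b`; in the method described above the pool would also include the synthetic samples already generated for `b`.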
Third, we propose GATSMOTE, which enables oversampling on one kind of non-Euclidean data, graph-structured data. We focus on adding new edges between synthetic samples and the original graph while preserving the local graph structure and shortening the path length. We use the attention mechanism to compute weighted edge connections adaptively. The attention coefficients are calculated only within a given neighborhood, so more of the local graph structure is preserved. Moreover, we follow the homophily principle from graph theory and add edges between similar nodes, that is, nodes with similar feature vectors. In addition, we propose to add edges between nodes within each class: adding edges between nodes with identical labels shortens the path length and facilitates message passing along edges.
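As a rough illustration of attaching a synthetic node to similar nodes with neighborhood-normalized weights, the sketch below connects a new node to its k most similar existing nodes and normalizes the weights with a softmax over that neighborhood only. This is an assumption-laden stand-in: it uses fixed cosine similarity rather than learned attention coefficients, and `attach_node`, `k`, and the toy features are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def attach_node(new_feat, features, k=2):
    """Return {node_index: weight} edges from a new node to its k most similar nodes.

    Weights are a softmax over the selected neighborhood only, so they sum to 1.
    """
    scored = sorted(((cosine(new_feat, f), i) for i, f in enumerate(features)),
                    reverse=True)[:k]
    exps = [math.exp(score) for score, _ in scored]
    z = sum(exps)
    return {i: e / z for (_, i), e in zip(scored, exps)}

# Toy graph with three existing nodes; the new node resembles nodes 0 and 1.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
edges = attach_node([1.0, 0.05], features, k=2)
```

Restricting the normalization to the selected neighbors mirrors the idea of computing coefficients only within a given neighborhood; a label-based variant could additionally restrict candidates to nodes of the same class.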
Rights: All rights reserved
Access: open access

Files in This Item:
File: 6647.pdf
Description: For All Users
Size: 4.43 MB
Format: Adobe PDF



Please use this identifier to cite or link to this item: https://theses.lib.polyu.edu.hk/handle/200/12239