Imbalanced Data Oversampling Technique Based on Convex Combination Method

Document Type: Original Article

Authors

1 Computer Science Department, Faculty of Computers and Information, Menoufia University

2 Computer Science Department, Faculty of Computers and Information, Menoufia University

3 Faculty of Computers and Information, Menoufia University

Abstract

Classification is the process of predicting a label for a given set of inputs. This task becomes difficult when the dataset is imbalanced. Most existing machine learning classifiers struggle with imbalanced data because it biases them heavily towards the majority class, and this bias may lead to lower accuracy in minority class prediction. Data oversampling is one of the most important solutions for balancing data, particularly when the dataset is small and/or imbalanced. Synthetic Minority Over-sampling Technique (SMOTE), Borderline-SMOTE, Adaptive Synthetic sampling (ADASYN), and Weighted SMOTE (W-SMOTE) are the most popular oversampling techniques. However, the main drawback of SMOTE and ADASYN is that they increase the overlap between classes, so the generated samples are not representative of the original data distribution, while Borderline-SMOTE may neglect some important samples when producing new ones. To overcome the problems in existing oversampling techniques, in this paper we propose a new data oversampling method that relies on the convex combination method to generate new samples of the minority class. The convex combination allows us to produce new samples that follow the original data distribution. We evaluated our approach on four standard imbalanced datasets (Yeast, Glass Identification, Paw, and Wisconsin Prognosis Breast Cancer (WPBC)). The experimental results show that our proposed method gives better performance in terms of accuracy, precision, recall, F1-measure, and area under the curve (AUC).
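To illustrate the general idea of convex-combination-based oversampling, the sketch below generates each synthetic minority sample as a random convex combination (non-negative weights summing to 1) of a few existing minority samples, which keeps every new point inside the convex hull of the original minority data. This is only a minimal illustration under assumed choices (the function name, the number of combined samples k, and the Dirichlet weight sampling are illustrative), not the authors' exact algorithm.

```python
import numpy as np

def convex_combination_oversample(X_min, n_new, k=3, random_state=None):
    """Generate n_new synthetic samples as random convex combinations
    of k existing minority samples (illustrative sketch only)."""
    rng = np.random.default_rng(random_state)
    X_min = np.asarray(X_min, dtype=float)
    n_min, n_features = X_min.shape
    synthetic = np.empty((n_new, n_features))
    for i in range(n_new):
        # Pick up to k distinct minority samples at random.
        idx = rng.choice(n_min, size=min(k, n_min), replace=False)
        # Convex weights: non-negative and summing to 1.
        w = rng.dirichlet(np.ones(len(idx)))
        # Weighted average stays inside the convex hull of X_min[idx].
        synthetic[i] = w @ X_min[idx]
    return synthetic

if __name__ == "__main__":
    # Toy minority class with 5 samples, oversampled by 15 new points.
    X_minority = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.1],
                           [1.1, 2.3], [1.3, 2.0]])
    X_new = convex_combination_oversample(X_minority, n_new=15,
                                          k=3, random_state=42)
    print(X_new.shape)  # (15, 2)
```

Because the weights are constrained to be non-negative and sum to 1, the synthetic points cannot fall outside the region spanned by the selected minority samples, which is what keeps the generated data close to the original minority distribution.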

Keywords