A Comparative Study for Different Resampling Techniques for Imbalanced Datasets

Document Type : Original Article




Imbalanced data is a significant challenge for researchers in supervised machine learning. Current data mining algorithms are not effective at processing imbalanced data. In fact, this problem reduces classification accuracy because the prediction of minority classes is inaccurate. The classification of imbalanced data is a major challenge that has received significant attention. Therefore, the use of sampling techniques to improve classification performance has been a significant consideration in related work. In this paper, a comparative study of six different sampling algorithms is performed. The employed algorithms are drawn from three families of sampling techniques: two oversampling algorithms, two undersampling algorithms, and two algorithms that combine oversampling and undersampling. The oversampling techniques are random oversampling and SMOTE, the undersampling techniques are random undersampling and NearMiss, and the combined techniques are SMOTE-Tomek and SMOTEEN. This comparative study aims to examine the impact of the employed sampling algorithms on the performance of three classifiers: SVM, KNN, and logistic regression. Cross-validation experiments on 12 standard datasets show that the SMOTEEN sampling algorithm achieves significant improvements compared with the other typical algorithms.
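
To make the resampling ideas named in the abstract concrete, the following is a minimal sketch of three of them in plain Python: random oversampling, random undersampling, and a simplified SMOTE-style interpolation. It is an illustration under stated assumptions, not the implementation used in the study. The function names (`random_oversample`, `random_undersample`, `smote_like`), the toy data, and the default `k=3` are all hypothetical. NearMiss, SMOTE-Tomek, and SMOTEEN are not sketched here.

```python
import math
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Randomly keep only as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(pool, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


def smote_like(X, y, minority, k=3, seed=0):
    """Simplified SMOTE-style step (an assumption, not the full algorithm):
    create synthetic minority points on the segment between a minority point
    and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    pool = [x for x, lab in zip(X, y) if lab == minority]
    X_out, y_out = list(X), list(y)
    for _ in range(target - counts[minority]):
        base = rng.choice(pool)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in pool if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        X_out.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
        y_out.append(minority)
    return X_out, y_out


# Hypothetical toy data: 8 majority points (label 0), 3 minority points (label 1).
X = [(float(i), 0.0) for i in range(8)] + [(0.0, 5.0), (1.0, 6.0), (2.0, 5.0)]
y = [0] * 8 + [1] * 3

for name, resample in [
    ("random oversampling", random_oversample(X, y)),
    ("random undersampling", random_undersample(X, y)),
    ("SMOTE-style interpolation", smote_like(X, y, minority=1)),
]:
    print(name, dict(Counter(resample[1])))
```

In each case the resampled labels end up with equal class counts. In an evaluation such as the cross-validation described above, resampling would normally be applied to the training folds only, so that the test folds keep the original class distribution.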