A Comparative Study for Arabic Text Classification Based on BOW and Mixed Words Representations

Document Type: Original Article

Authors

1 Faculty of Applied Sciences, Taiz University, Yemen

2 Faculty of Computers and Information, Menoufia University

3 Faculty of Computers and Information, Menofia University, Egypt

Abstract

This paper compares two feature representation methods for Arabic text classification: bag of words (BOW), meaning word-level unigrams, and mixed words representations. The mixed words representation combines bag-of-words unigrams with pairs of adjacent words in different proportions. The main objective is to measure the accuracy of each method and to determine which representation mode is more accurate for Arabic text classification. Each method is applied with normalization and stemming. The results show that the mixed words representation achieves the highest accuracy, 98.61%, when normalization is used.
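The difference between the two representations can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the underscore used to join adjacent words, and the `pair_share` parameter (one possible way to realize "different proportions") are all assumptions made for illustration, and normalization and stemming are omitted.

```python
from collections import Counter


def unigrams(tokens):
    """Word-level unigrams, i.e. the plain BOW terms."""
    return list(tokens)


def adjacent_pairs(tokens):
    """Pairs of adjacent words, joined with an underscore (an assumption)."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def build_vocabulary(docs, size, pair_share):
    """Select the most frequent unigrams and adjacent-word pairs.

    pair_share is a hypothetical knob: the fraction of the vocabulary
    reserved for adjacent-word pairs. A value of 0.0 gives plain BOW.
    """
    uni_counts = Counter(t for d in docs for t in unigrams(d))
    pair_counts = Counter(p for d in docs for p in adjacent_pairs(d))
    n_pairs = round(size * pair_share)
    n_uni = size - n_pairs
    vocab = [t for t, _ in uni_counts.most_common(n_uni)]
    vocab += [p for p, _ in pair_counts.most_common(n_pairs)]
    return vocab


def vectorize(tokens, vocab):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(unigrams(tokens) + adjacent_pairs(tokens))
    return [counts[term] for term in vocab]


# Two toy documents, already tokenized on whitespace.
docs = [
    "القراءة مفيدة جدا".split(),
    "القراءة مفيدة للعقل".split(),
]

bow_vocab = build_vocabulary(docs, size=4, pair_share=0.0)    # unigrams only
mixed_vocab = build_vocabulary(docs, size=4, pair_share=0.5)  # half pairs

bow_vector = vectorize(docs[0], bow_vocab)
mixed_vector = vectorize(docs[0], mixed_vocab)
```

With `pair_share=0.0` the vocabulary contains only single words, while with `pair_share=0.5` half of its entries are adjacent-word pairs; the resulting count vectors could then be passed to any standard classifier.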

Keywords