C5050: An Efficient Framework for Author Identification Using Deep Learning

Document Type: Original Article


1 Department of Computer Science, Faculty of Computers and Information, Menoufia University, Shebin El Kom, Egypt

2 Computer Science, Faculty of Computers and Information, Menoufia University

3 Computer Science Dept, Faculty of Computers and Information, Menoufia University, Egypt.


Author identification aims to determine who wrote a given text. It is a growing research field with applications in literary analysis, cybersecurity, forensics, and social media investigations. This paper presents an analysis of author identification in two parts. The first part applies six machine learning (ML) techniques, namely Decision Trees (DT), Logistic Regression (LR), k-Nearest Neighbors (k-NN), Random Forests (RF), Support Vector Machines (SVM), and Naive Bayes (NB), using TF-IDF for feature extraction. The second part experiments with two Deep Learning (DL) models, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), using word embeddings as the input representation. To validate the approach, we conducted an experimental study on the Reuters 50_50 dataset under two evaluation modes: hold-out and 10-fold cross-validation. The results, measured in terms of Accuracy (ACC), Precision (PREC), Recall (REC), and F1-score (F1), show that the DL models evaluated with 10-fold cross-validation outperform current state-of-the-art methods. Overall, the proposed DL models yield the best results for author identification in our experiments.
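As an illustration of the evaluation protocol named above, the following is a minimal Python sketch of a 10-fold split and of the four reported metrics. It is not the authors' implementation: the function names `k_fold_indices` and `macro_scores` are hypothetical, and macro averaging over authors is an assumption, since the abstract does not state how PREC, REC, and F1 are averaged across classes.

```python
from collections import Counter


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k contiguous folds.

    Assumption: samples are already shuffled; a real experiment would
    usually shuffle or stratify by author before splitting.
    """
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def macro_scores(y_true, y_pred):
    """Return ACC and macro-averaged PREC, REC, F1 (averaging is assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted author was wrong
            fn[truth] += 1  # true author was missed
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pred_total = tp[label] + fp[label]
        true_total = tp[label] + fn[label]
        prec = tp[label] / pred_total if pred_total else 0.0
        rec = tp[label] / true_total if true_total else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {
        "ACC": sum(tp.values()) / len(y_true),
        "PREC": sum(precisions) / n,
        "REC": sum(recalls) / n,
        "F1": sum(f1s) / n,
    }


if __name__ == "__main__":
    # Toy labels for two hypothetical authors, only to exercise the functions.
    print(macro_scores(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
    print([len(test) for _, test in k_fold_indices(25, k=10)])
```

In hold-out mode a single train/test partition is scored once, whereas in 10-fold mode each fold serves as the test set in turn and the ten per-fold scores are typically averaged.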