Text Classification Technologies in Document Categorization Systems. A Survey
Abstract
This paper presents a review of the literature published from 2013 to 2022 on the technologies and datasets used in text classification. The review covers 110 sources from 5 scientific databases; the main inclusion criterion was the presence of an experimental part involving a classifier or other technologies related to the classification process. The study examines the classification process and identifies three main stages of text classification: data preparation, classifier training, and evaluation of results. Scholarly articles dealing with text classification problems were collected and analyzed using Kitchenham's Systematic Literature Review methodology: an initial sample of 243 articles was obtained and, after screening, reduced to a final sample of 110 articles. Guided by two research questions, this sample was analyzed and the results are presented in graphical form. For each identified stage of classification, the frequency of use of the main technologies was analyzed, and each technology was reviewed within its respective source. Considerable attention was also given to the datasets used for text classification, with a particular focus on the less frequently used ones. The analysis of dataset usage frequency shows that researchers often rely on proven, popular datasets to demonstrate the effectiveness of their methods; less frequently, datasets are used to solve localized text classification problems. One notable trend is the increasing prevalence of deep learning technologies in text classification. These technologies, including neural networks, recurrent neural networks (RNNs), convolutional neural networks (CNNs), transformers, and attention mechanisms, have gained considerable popularity among researchers.
This study offers insight into the evolution of text classification by shedding light on the variety of technologies, approaches, and datasets used by researchers. As text classification continues to evolve and diversify, this review can serve as a valuable resource for scholars and practitioners in the field.
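As a minimal illustration of the three stages identified in this review (data preparation, classifier training, and evaluation of results), the following Python sketch runs a toy pipeline end to end. It is not taken from any of the reviewed sources: the choice of a multinomial Naive Bayes classifier with add-one smoothing, the four-document training corpus, and all function names (prepare, train, predict, accuracy) are assumptions made purely for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def prepare(text):
    """Stage 1, data preparation: lowercase the text and split it into tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(samples):
    """Stage 2, classifier training: gather counts for multinomial Naive Bayes."""
    class_counts = Counter()            # number of documents per label
    word_counts = defaultdict(Counter)  # token frequencies per label
    vocab = set()
    for text, label in samples:
        tokens = prepare(text)
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        # Add-one (Laplace) smoothing over the training vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in prepare(text):
            if token in vocab:  # tokens unseen during training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, samples):
    """Stage 3, evaluation of results: fraction of correctly labeled samples."""
    correct = sum(predict(model, text) == label for text, label in samples)
    return correct / len(samples)


# Hypothetical toy data, invented for this sketch only.
TRAIN = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the bank raised interest rates", "finance"),
    ("stock markets fell on inflation fears", "finance"),
]
TEST = [
    ("a goal decided the match", "sports"),
    ("interest rates and inflation", "finance"),
]

model = train(TRAIN)
print(accuracy(model, TEST))
```

In practice each stage would be far richer (for example stop-word removal or embeddings in preparation, a neural model in training, and precision, recall, and F1 in evaluation), but the division of responsibilities between the three functions stays the same.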