
Hindawi

Computational Intelligence and Neuroscience


Volume 2022, Article ID 1883698, 26 pages
https://doi.org/10.1155/2022/1883698

Review Article
A Complete Process of Text Classification System Using
State-of-the-Art NLP Models

Varun Dogra,1 Sahil Verma,2,3 Kavita,2,4 Pushpita Chatterjee,5 Jana Shafi,6 Jaeyoung Choi,7 and Muhammad Fazal Ijaz8

1 School of Computer Science and Engineering, Lovely Professional University, Phagwara, Punjab, India
2 Department of Computer Science and Engineering, Chandigarh University, Mohali 140413, India
3 Bio and Health Informatics Research Lab, Chandigarh University, Mohali 140413, India
4 Machine Learning and Data Science Research Lab, Chandigarh University, Mohali 140413, India
5 Tennessee State University, Nashville, TN, USA
6 Department of Computer Science, College of Arts and Science, Prince Sattam Bin Abdul Aziz University, Wadi Ad-Dwasir 11991, Saudi Arabia
7 School of Computing, Gachon University, Seongnam-si 13120, Republic of Korea
8 Department of Intelligent Mechatronics Engineering, Sejong University, Seoul 05006, Republic of Korea

Correspondence should be addressed to Jaeyoung Choi; [email protected] and Muhammad Fazal Ijaz; [email protected]

Received 4 March 2022; Revised 20 April 2022; Accepted 9 May 2022; Published 9 June 2022

Academic Editor: Sumarga Kumar Sah Tyagi

Copyright © 2022 Varun Dogra et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
With the rapid advancement of information technology, online information has been exponentially growing day by day, especially
in the form of text documents such as news events, company reports, reviews on products, stocks-related reports, medical reports,
tweets, and so on. Due to this, online monitoring and text mining have become prominent tasks. During the past decade, significant efforts have been made on mining text documents using machine learning and deep learning models, whether supervised, semisupervised, or unsupervised. Our discussion covers state-of-the-art learning models for text mining or solving
various challenging NLP (natural language processing) problems using the classification of texts. This paper summarizes several
machine learning and deep learning algorithms used in text classification with their advantages and shortcomings. This paper
would also help the readers understand various subtasks, along with old and recent literature, required during the process of text
classification. We believe that readers would be able to find scope for further improvements in the area of text classification or to
propose new techniques of text classification applicable in any domain of their interest.

1. Introduction

In recent years, we have seen a growth in the amount of digital textual data available, which has generated new perspectives and so created new areas of research. With the emergence of information technology, the monitoring of such digital textual data is of great importance in many areas such as the stock market: gathering data from news sources to forecast the movement of underlying asset volatility [1], forecasting the stock prices of green firms in emerging markets [2], understanding the impact of tone of communications on stock prices [3], and determining indicators for stock prices volatility [4]; healthcare: disease surveillance [5, 6]; politics: developing a probabilistic framework on politics using short text classification [7]; education: understanding pedagogical aspects of the learners [8]; tourism: analyzing travelers' sentiments [9]; and e-commerce: predicting success by evaluating users' reviews [10].

News is widely available in electronic format on the World Wide Web these days, and it has proven to be a valuable data source [11]. The volume of news, on the other hand, is enormous, and it is unclear how to use it most efficiently for domain-specific research. Therefore, a framework or architecture is required for a domain-specific

news monitoring system, as well as a classification mechanism for classifying relevant online news into distinct subject groups automatically [5]. News monitoring is a type of oversight system that monitors and ensures the quality of each news instance generated, used, and retained for a purpose. Processes for assessing news to guarantee its completeness, consistency, and correctness, as well as security and validity, are included in these methods. There is always a need for a methodology that can extract meaningful information from a pool of textual documents belonging to distinct subject groups intended for certain research, as shown in Figure 1.

Figure 1: Monitoring and downloading relevant text documents to subject groups (data are monitored and downloaded from news-wires or company websites using web crawler code or monitoring tools, important data are extracted from web pages using pattern matching, and the information is stored in different formats).

Indeed, the majority of the digital data are available in the form of text, but this is usually unstructured or semistructured [12]. Thus, to make data useful for decision-making, structuring this textual data became a necessity [13, 14]. However, because of the high volume of data, it is quite impossible to process the data manually. Text classification has evolved due to this challenge. It is defined as assigning the text documents to one or more categories (called labels) according to their content and semantics. Traditionally, the majority of classification tasks were solved manually, but it was expensive to scale. Classification can be thought of as writing rules for assigning a class to similar text documents. These rules include some related information that identifies a class. Handwritten rules can perform well, but creating and maintaining them over time requires much manpower. A technical expert can frame rules by writing regular expressions that could maximize the accuracy of the classifier. The existing studies have proposed various techniques to automatically classify text documents using machine learning [15, 16]. In this approach, the set of rules or criteria for selecting a classifier is learned automatically from the training data. Under each class, it requires a lot of training documents and expertise to label the documents. The labeling is a process of assigning each document to its associated class. The labeling process was easier than writing handcrafted rules. Moreover, there exist various supervised and semisupervised learning techniques that can even reduce the burden of manual labeling [17, 18]. This can be performed using automatic labeling. Automated text classification methods can be divided into three groups: rule-based methods, data-driven methods, and hybrid methods.

Using a set of predefined rules, rule-based techniques classify text into various categories as shown in Figure 2. For example, the "fruit" label is applied to any document with the words "apple," "grapes," or "orange." These techniques require a thorough knowledge of the domain, and it is difficult to maintain the systems. Data-driven methods, on the other hand, learn to make classifications based on previous data values. A machine learning algorithm can learn the inherent associations between pieces of text and their labels using prelabeled examples as training data. It can detect hidden patterns in the data, is more flexible, and can be applied to different tasks. As the name indicates, hybrid approaches use a mixture of rule-based and machine learning (data-driven) methods for making predictions.

Figure 2: Labeling text documents with appropriate predefined classes or labels (e.g., business, technology, sports) during the process of text classification.
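As a minimal sketch of the rule-based idea described above (the keyword lists here are purely illustrative, not taken from any cited system), a classifier can be expressed as a dictionary of hand-written keyword rules:

```python
# Minimal rule-based classifier: a label is assigned when any of its
# hand-written keywords appears in the document (keyword lists are illustrative).
RULES = {
    "fruit": {"apple", "grapes", "orange"},
    "technology": {"software", "network", "processor"},
}

def rule_based_label(document):
    tokens = set(document.lower().split())
    for label, keywords in RULES.items():
        if tokens & keywords:          # any keyword of this label occurs in the text
            return label
    return "unknown"                   # no rule fired

print(rule_based_label("fresh orange and apple juice on sale"))  # -> "fruit"
```

Maintaining such keyword lists across many classes and domains is exactly the manual burden that the data-driven methods discussed next aim to remove.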

In recent decades, models of machine learning have attracted a lot of interest [19, 20]. Most conventional models based on machine learning follow the common two-step method, where certain features are extracted from the text documents in the first step, and those features are fed to a classifier in the second step to make a prediction. The popular feature representation models are BOW (bag-of-words), TF-IDF (term frequency-inverse document frequency), and so on, and the common classifiers are naïve Bayes, KNN, SVM, decision trees, random forests, and so on. These models are discussed in detail in the following sections. Deep learning models have been applied to a wide variety of tasks in NLP, improving language modeling for more extended context [21-23]. These models attempt, in an end-to-end fashion, to learn the feature representations and perform classification. They not only have the potential to uncover latent trends in data but also are far more transferable from one project to another. Quite significantly, in recent years, these models have become the mainstream paradigm for the various tasks of text classification. The following are some of the natural language challenges solved with text classification.

Topic modeling is widely used to extract semantic information from text data. An unsupervised method of topic modeling learns the collection of underlying themes for a batch of documents as well as the affinities of each document to these topics.

News classification: online news reporting is one of the most significant sources of information. The task of finding and deriving structured information about news events in any text and assigning the relevant label is referred to as news classification.

Sentiment classification is an automatic technique of discovering views in text and classifying them as negative, positive, or neutral based on the emotions expressed in the text. Sentiment classification, which uses NLP to evaluate subjective data, can help understand how people think about a company's products or services.

Question answering has rapidly evolved as an NLP challenge that promises to deliver more intuitive means of knowledge acquisition. In contrast to the typical information retrieval approach of creating queries and perusing results, a question answering system simply takes user information requests stated in ordinary language and returns a brief answer.

Language translation models have been attempting to translate a statement from one language to another, sometimes resulting in perplexing and offensively inaccurate results. NLP algorithms based on text classification may be trained on texts in a variety of languages, allowing them to create the equivalent meaning in another language. This approach is even applicable to languages such as Russian and Chinese, which have historically been more difficult to translate due to differences in alphabet structure and the use of characters rather than letters, respectively.

Nevertheless, it is observed that most text classification literature studies for solving NLP challenges are limited to showcasing the results of text classification using standard or state-of-the-art methods and focusing on specific research domains. For example, the authors mention the application of text analytics in industry, but the task of monitoring and collecting text data was not detailed, and the scope of the proposed models appeared limited to particular domains [24]. In another study, the authors discuss information extraction from tweets for monitoring truck fleets to model truck trips, but it does not cover the feature selection or extraction methods needed to achieve information extraction [25]. Other studies [26, 27] focus on text classification for a domain-specific search engine based on rule-based annotated data; however, they do not cover the semisupervised or unsupervised approaches of labeling data to achieve text classification [28]. Moreover, these works do not reveal the latest techniques being used in the area of natural language processing. The deep learning-based pretrained language representation model can be explored in information extraction and classification. These studies also lack detail on the subtasks required to initiate research in text classification, that is, data collection, data preprocessing, and semisupervised or unsupervised data labeling for training machine learning models. To the best of our knowledge, there are no similar review studies available that cover in-depth presentations of the various subtasks of text classification.

In this paper, we focus on overcoming the above-mentioned issues. We put a lot of effort into creating qualitative research for text classification to help understand its subtasks or elements. Moreover, this paper presents the old and latest techniques used in each subtask of text classification, as shown in Figure 3, along with their benefits and limitations. It also presents the research gap in the area of text classification by examining various existing studies. The key contributions of the study are mentioned below:

(i) Discussing the subtasks of text classification
(ii) Presenting the most recent and former techniques used in each subtask
(iii) Presenting benefits and limitations of various models used in the process of text classification
(iv) Presenting the research scope for further improvements in existing techniques and proposing new techniques with their application in different domains

Section 2 presents the process of text classification along with the comprehensive literature on each subtask; Section 3 presents the evaluation methods of classification techniques; Section 4 presents the comparison of approaches or models used in the subtasks of the text classification system, mentioning their benefits and limitations; Section 5 presents the research gap and further scope for research; and Section 6 concludes the existing studies.

2. Text Classification: Framework

Text classification is a problem formulated as a learning process where a classifier is trained to differentiate between predefined classes based on features extracted from the collection of text documents [29]. The accuracy of the classifier depends upon the classification granularity and how well separated the training documents are among classes [30, 31]. In text classification, a set of labels or classes is given, and we need to evaluate which class/label a particular text document relates to. Usually, a class or label is a general topic such as sports or business, but it may be significantly more difficult to distinguish between documents that are about more similar classes such as networks and the Internet of Things. Certain features represent the potential overlap between classes; the learning task would be simplified by removing such overlapping features. If the gap between classes could be increased, the classification performance would increase. This can be achieved through feature weighting and selecting valuable features. Text classification has been studied and applied by many researchers in real-world scenarios such as sentiment classification of stock market news and its impact [31], news classification for syndromic surveillance [5], microblog topic classification [32], domain adaptation for sentiment classification [33, 34], and brand promotion based on social media sentiments [35, 36].
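To make the framework concrete before its subtasks are detailed, the following is a minimal sketch of the train/evaluate cycle described in this section, using scikit-learn. It is not the authors' implementation; the toy corpus, labels, and the choice of TF-IDF features with a naïve Bayes classifier are illustrative only.

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Toy corpus with predefined labels (illustrative only).
texts = [
    "stocks rallied after the quarterly earnings report",
    "the striker scored twice in the final minutes",
    "the central bank raised interest rates again",
    "the home team won the championship match",
    "investors sold shares ahead of the merger vote",
    "the goalkeeper saved a penalty in extra time",
]
labels = ["business", "sports", "business", "sports", "business", "sports"]

# Split into training data and testing data.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=0)

# Feature extraction (TF-IDF) followed by a classifier, as one pipeline.
model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
model.fit(X_train, y_train)

# Evaluation on the held-out test set.
print(classification_report(y_test, model.predict(X_test)))
```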

Figure 3: Subtasks of the text classification process, covering state-of-the-art data collection, text representation (bag-of-words, TF-IDF, word embedding, sentence embedding), dimensionality reduction (feature selection and feature extraction methods), machine/deep learning models for classification, and model evaluation, for classifying text documents to an associated predefined class/label.

The text classification process is described as classifying a set of N documents. First, we build a classifier T. There is a collection of text documents D, and every text document is given a class/label by an expert. Secondly, we need to train a classifier for each class/label by giving as input the corresponding set of documents in D. Now we need to apply the trained classifier C to classify the N documents, and each document in N will be assigned to a predefined class/label by C. Text classification is a comprehensive process that includes not just model training but also several other steps including data preprocessing, transformation, and dimensionality reduction. The process starts with the collection of textual content from various sources. The textual content may belong to a domain(s) representing some events, business processes, or public information. Then these text documents require preprocessing to generate an appropriate text representation for the learning model. This is done in two phases: in phase 1, the features are extracted from the processed text using a feature extraction algorithm, and in phase 2, the features are reduced by applying feature selection techniques. This reduction of features tends to decrease the dimensions of the data required for the learning method. After these phases, the learning algorithms are chosen to train on the data to generate the best classifier for recognizing a target category or class. The text data required to train a classifier is known as training data. The data is divided into two sets: the majority of the data are taken for training the model, and the rest of the data is taken for testing the classifier, known as testing data. Similarly, the model is trained to recognize each target class representing its data available in the associated text documents. During the testing phase, when a classification method is developed, it is executed on test data to determine the target class of the input text, and the result is produced in the form of weights or probabilities. Finally, the result, that is, the accuracy of the text classifier, is evaluated using evaluation techniques. These are the main phases or subtasks of the text classification process, also shown in Figure 4. The different approaches that have been used in each phase of text classification are discussed in the next subsections of the study.

2.1. Data Collection. The initial stage in text classification is to acquire text data from different sources as per the research domain. There are several online open data sets available, for example, various newsgroups (Bloomberg, Reuters, Financial Express), Kaggle, and WebKB, for solving a classification problem. Researchers have used such database architecture for their research purposes [37-39]. The corpus can also be built with data that could be anything from emails, language articles, company financial reports, and medical reports to news events. In one study, the authors created a fine-grained sentiment analysis corpus for annotating product reviews. However, they faced the most challenging tasks that had not been targeted in applications such as sentiment analysis, target-aspect pair extraction, and implicit polarity recognition, for recognizing aspects and searching polarity with nonsentiment sentences [40].
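As one hedged example of acquiring an open data set of the kind listed above, the 20 Newsgroups corpus bundled with scikit-learn can be downloaded directly; the corpus and the categories requested are purely illustrative choices, not the data used by the studies cited here.

```python
from sklearn.datasets import fetch_20newsgroups

# Download a few subject groups of an openly available newsgroup corpus.
categories = ["rec.sport.hockey", "sci.med", "talk.politics.misc"]  # illustrative choice
train = fetch_20newsgroups(subset="train", categories=categories,
                           remove=("headers", "footers", "quotes"))

print(len(train.data), "documents across", train.target_names)
print(train.data[0][:200])  # raw text of the first document
```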

Figure 4: A text classification framework (raw text data are partitioned into training and test data; text document representation and feature selection, dimensionality reduction and feature extraction, the classifier model, and evaluation follow). Note: black connecting lines represent the training phase and blue connecting lines represent the testing phase.

2.2. Text Document Representation: Feature Construction and Weighting. Text classification is the most demanding area of machine learning for understanding texts written in any natural language. One of the most essential tasks that must be completed before any classification process is text representation. Moreover, texts cannot be provided directly as input to machine learning models because almost all algorithms take as input numeric feature vectors with a predefined size instead of textual data with variable length. To resolve this issue, the textual data first need to be transformed into document vectors. This can be done in two different ways in general. The first is a context-independent approach in which a document is represented as a set of terms with their corresponding frequency in the document, but independent of the sequence of terms in the collection. The second approach is to represent text as strings, with each document consisting of a sequence of terms. The following subtopics cover the various representations in natural language processing from the early days to the latest state-of-the-art models.

2.2.1. Context-Independent Approaches. The bag-of-words [41] is the most commonly used model in document transformation; it considers every word in the text document as a token/feature, although the words' order and their context are ignored. Each word, sometimes across tens or hundreds of dimensions, is represented by a real-valued vector called a one-hot representation [42]. The feature vector has the same length as the vocabulary size, and only one dimension is on, as shown in Figure 5. However, the one-hot representation of a word suffers from data sparsity. On the other hand, words with very high frequency may cause biases and dominate the results of the model [44]. To overcome the weaknesses of BOW, the text documents are represented with weighted frequencies in a document-term matrix, where a column signifies a token and a row signifies a document. This scheme of assigning weights to token frequencies in the form of a matrix is called TF-IDF (term frequency-inverse document frequency). During the implementation of this model using a matrix, the value w_ij in each cell corresponds to the weight of t_j in d_i, calculated as w_ij = tf(t_j, d_i)/n_di, where tf(t_j, d_i) represents the count of token t_j in text document d_i and n_di represents the total number of tokens in document d_i. Due to the simplicity of the model, it is preferably used in natural language processing. An improved feature subset using this approach has been obtained by combining the characteristics of term frequency and document frequency [45-47]. However, even a small collection of documents may consist of a large number of meaningful words, which leads to the problem of scalability or high dimensionality. This offers opportunities to find effective ways to decrease running time or reduce high dimensionality in the case of a large number of documents.

Figure 5: One-hot representation, a tensor used to represent each document (example sentence: "The quick brown fox jumped over the brown dog"). Each document tensor is made up of a potentially lengthy sequence of 0/1 vectors, resulting in a massive and sparse representation of the document corpus [43].
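A small sketch of the two context-independent representations discussed above, using scikit-learn's vectorizers. Note that scikit-learn's TF-IDF applies a smoothed inverse-document-frequency term rather than the simple count-over-document-length weighting written above; the toy documents are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the quick brown fox jumped over the brown dog",
        "the cat slept on the mat"]

# Bag-of-words: a document-term matrix of raw token counts.
bow = CountVectorizer()
X_counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())   # the vocabulary (columns of the matrix)
print(X_counts.toarray())            # one row per document, one column per token

# TF-IDF: the same matrix with weighted frequencies instead of raw counts.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)
print(X_tfidf.toarray().round(2))    # terms frequent everywhere, such as "the", get lower weight
```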

The primary alternative has emerged in the form of statistical language modeling for modeling complex text classification or other natural language tasks. In its beginning, however, it used to struggle with the curse of dimensionality when studying typical probability functions of language models [48]. This led to the inspiration to learn distributed representations of terms in a low-dimensional space. The distributed representations describe a co-occurrence matrix of terms x terms that considers the frequency of each term appearing in the context of another term, within a window size of k [49]. Singular value decomposition was used for text representation, where the matrix decomposition technique was used for reducing a given matrix to its constituent matrices via an extension of the polar decomposition, with the idea of making subsequent matrix calculations simpler. It gives the top rank-k constituent parts of the original data. The singular value decomposition breaks this into the best rank approximation, capturing information from the most relevant to the least relevant parts [49].

The big popularization of word embedding was possibly due to the continuous bag-of-words (CBOW) paradigm, which creates high-quality distributed vector representations effectively. A solution was designed to counter the curse of dimensionality where a distributed representation for each word is learned concurrently along with the probability distribution for word sequences represented in terms of such representations [21]. The continuous bag-of-words is a prediction-based model that directly learns word representation, as shown in Figure 6(a). The distributed representations of context (or surrounding words) are combined in the CBOW model to predict the word in the middle. The CBOW has reshaped word embedding [51]. The continuous bag-of-words representation is applied with a neural network model to achieve improved accuracy in classification [52, 53]. Another model, called skip-gram, further reshaped word embedding [54]; its architecture works in reverse of what the continuous bag-of-words model does. The model predicts each context word from the target word, as shown in Figure 6(b). It iterates over the words of each sentence in the given corpus and uses the current word to predict its neighbors (its context); thus, the model is called "skip-gram" (local context window) [55].

Weighted words calculate document similarity directly from the word-count space, which takes longer to compute for large vocabularies. While counts of unique words give independent evidence of similarity, semantic similarities between words are not taken into consideration. Word embedding techniques solve this problem, but they are constrained by the need for a large corpus of text data for training. The word embedding algorithms were developed using the word and its closest neighbors. An approach was suggested by the authors for generating a word embedding called GloVe (global vectors, which combines count- and predict-based methods) for distributed word representation. It is an unsupervised learning algorithm, where a model is trained on the overall statistics of word-word co-occurrence, that is, how often words appear together in a corpus, and the result is a vector representation of words with linear substructures of the word vector space [56]. GloVe's solution is to count how many times a term i (context word) occurs in the context of another term j (target word). The purpose is to establish a meaning for word i and word j as to whether the two words occur close together, N words apart, or not. The encoding vector includes the ratio of the two words, specifically recognized as a count-based system of co-occurrence probabilities. The prediction-based approach has received popularity, but GloVe's authors claim that the count-based methodology incorporates global statistics and may be more effective because it outperforms other word representations in testing on word comparison, term similarity, and named entity recognition tasks.

An enhanced text document representation system was developed to work on the issues of traditional feature-based extraction techniques that included only nouns and noun phrases to represent the important events, called event detection techniques [57]. This technique has used fewer tokens or features than bag-of-words to handle the problem of scalability or high dimensionality of documents. Furthermore, this technique has led to another representation based on named entities. The authors have presented the classification of tweets using named entity recognition to filter out noise and retain the required information [30, 58]. It works by finding proper nouns in the documents that belong to predefined categories. This process involves systematically assigning categories to each term or entity while developing a corpus or labeling process. But this corpus-based representation was unable to represent some domain-specific words that are infrequent during training. The authors have proposed techniques to deal with infrequent or unseen words during labeling [59].

Previous representations did not consider the morphological relation of words to disambiguate unseen words. Many studies have presented methods that automatically extract features from the documents. These have used infrequent words that produce a high variety of low anticipated relations between the text documents. This kind of information, once aggregated, provides potentially less obvious and hidden relations in the text documents. Using less-frequent words with lexical constraints has reduced the associated cost of knowledge re-engineering, and it was able to process many documents from a large number of domains [59, 60]. These methods help to better represent text documents, especially by handling unseen or less-frequent words, and the problem of scalability was also controlled and associated with the word's semantic approach. There is another approach that has proven most efficient in domain-specific text representation, proper nouns, an intermediate solution between noun phrases and named entities. This technique has reduced the ambiguity that occurred due to particular nouns being associated with more than one named entity category [31]. Recent approaches are concentrating on capturing context beyond the word level to improve performance by giving a more structured and semantic notion of text [61].
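Returning to the word2vec family introduced earlier in this subsection, a brief sketch of training the CBOW and skip-gram variants with Gensim follows; parameter names follow Gensim 4.x, and the tiny corpus and hyperparameters are illustrative only.

```python
from gensim.models import Word2Vec

# Tokenized toy corpus; in practice this would be a large collection of sentences.
corpus = [
    ["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"],
    ["the", "brown", "dog", "slept", "near", "the", "quick", "fox"],
]

# sg=0 selects CBOW (predict the middle word from its context);
# sg=1 selects skip-gram (predict the context words from the target word).
cbow_model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1)

print(cbow_model.wv["fox"].shape)                  # a 100-dimensional dense vector
print(skipgram_model.wv.most_similar("dog", topn=3))
```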

Figure 6: The word2vec algorithm uses two alternative methods: (a) continuous bag of words (CBoW) and (b) skip-gram (SG) [50].

2.2.2. Context-Aware Approaches. Context-aware classification approaches essentially find and employ term association information to increase classification effectiveness. They allow the presence or absence of a term to impact how it contributes to a classification outcome. Context is a concise term referring to high-level semantics. It may be taken in several ways and used in a variety of dimensions. We categorize context-based classification systems according to how the context was understood and what features were used to determine it.

The authors have come up with improved embeddings for texts, such as word2vec, which transforms a word into an n-dimensional vector. To map the words into a Euclidean space, we can go through an approach of creating sequence embeddings that brings a sequence into a Euclidean space. One of the sequential data function learning problems is called sequence embedding [62], where the aim is to convert a sequence into a fixed-length embedding. This approach is a highly strong tool for identifying correlations in text corpora as well as word similarity. However, it falls short when it comes to capturing out-of-vocabulary words from a corpus. RNN-based models interpret the text as a sequence of words and are intended for text classification to capture word dependencies and text structures [63]. By adding a memory cell to remember values over arbitrary time intervals and three gates (input gate, output gate, and forget gate) to control the flow of information into and out of the cell, LSTM addresses the gradient vanishing or exploding problems experienced by RNNs. In a recursive method, the authors expand the chain-structured LSTM to tree structures, using a memory cell to store the background of multiple child cells or multiple descendants. They claim that the new model offers a realistic way to understand interactions between hierarchies over long distances, for example, language parse structures [64]. To capture text features, the authors also incorporate a bidirectional LSTM (Bi-LSTM) model with two-dimensional max-pooling [65]. The seq2seq model is used in various NLP applications [66, 67]. Most real-world problems have a data set with a substantial number of unusual words. The embeddings learned from these data sets are unable to produce the correct representation of such words. To do this, the data set needs to have a large vocabulary, and words that appear frequently help to create a large vocabulary. Second, when learning embeddings from scratch, the number of trainable parameters grows; as a result, the training process is slowed. Learning embeddings from scratch may also leave one confused about how the words are represented [68].
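As a hedged sketch of the kind of recurrent classifier discussed above, the following is a generic Bi-LSTM text classifier in Keras; it is not the specific Bi-LSTM with two-dimensional max-pooling of [65], and all sizes are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, embed_dim, num_classes = 20000, 128, 4   # illustrative sizes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(200,)),                     # padded sequences of 200 token ids (illustrative)
    layers.Embedding(vocab_size, embed_dim),          # learned word embeddings
    layers.Bidirectional(layers.LSTM(64)),            # reads the token sequence in both directions
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),  # probability for each predefined class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, ...) would then be called on integer-encoded, padded sequences.
```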

Pretrained word embeddings are the solution to all of the above difficulties. Studies have consistently shown the importance of transfer learning by pretraining a neural network model on an established problem in the field of computer vision and then fine-tuning, utilizing the learned neural network as the foundation for a new purpose-specific model. It has been demonstrated in recent years that a related approach may be effective in several tasks relating to natural language. This is another kind of word embedding; the classification algorithms gain a greater sense of the features learned by such embeddings. Yet such embeddings do not take the word order or the word meaning of each sentence into consideration. This is where ELMo (embeddings from language models) comes into action. ELMo is a contextual embedding that takes into consideration the terms that surround a word. It models word use characteristics such as morphology and how a word is used in different contexts. The term vectors are learned features of the internal state of a deep bidirectional language model (biLM) that is pretrained on a broad text corpus. The authors have demonstrated that these representations can be readily applied to current frameworks and greatly strengthen state-of-the-art NLP tasks such as addressing queries, textual entailment, and interpretation of emotions [62].

A Transformer [69] is another solution for working with long dependencies, such as LSTM. LSTM, long short-term memory, is a sort of neural network that has a "memory unit" capable of maintaining knowledge in memory over long periods, helping it to learn longer-term dependencies [22]. The Transformer is based, through the encoder and decoder, on an attention process, as shown in Figure 7. The Transformer allows the use of this method to store knowledge about the specific meaning of a given term in its word vector.

Unlike past deep contextualized language representation studies [62] that take into account the backward model, as in the ELMo bidirectional LSTM, the authors proposed a new type of language representation named BERT, which stands for bidirectional encoder representations from transformers. BERT uses a Transformer, a mechanism of attention that learns contextual connections between words (or subwords) in a document. The authors have argued that conventional technologies limit the power of the pretrained representations, especially for the fine-tuning approaches [70]. The main constraint is the unidirectional nature of modern language models, which restricts the range of frameworks that can be used during pretraining. BERT is structured to pretrain deep bidirectional representations from unlabeled text documents by jointly conditioning on both the left and right contexts in all layers [71]. In language modeling, transformers can acquire longer-term dependence but are constrained by a fixed-length context. The authors suggested a novel Transformer-XL neural architecture that allows learning dependence beyond a fixed length without disrupting temporal coherence. It consists of a recurrence function at the segment level and a novel positional encoding scheme [69]. Learning regarding inductive transfer has significantly impacted computer vision, but current techniques in NLP still need complex task modifications and preparation from scratch. The authors suggested universal language model fine-tuning (ULMFiT), an important transfer learning approach that can be extended to any NLP task, and implemented techniques that are essential to fine-tuning a language model [72].

Figure 7: The transformer architecture (stacked encoder and decoder blocks with multi-head attention, masked multi-head attention, add & norm, feed-forward, and positional encoding layers, with input/output embeddings and a final linear and softmax layer producing output probabilities).

A total of 15% of the words in every sequence are substituted with a mask token before feeding word sequences into BERT. The model then tries to determine the actual context of the masked words in the list, based on the context given by the other, unmasked, words. The BERT loss function only considers the estimation of the masked terms and excludes the estimation of the unmasked phrases. As a result, the model converges more slowly than directional ones, a feature offset by its enhanced understanding of the context. Centered on BERT's masking technique [71], the authors developed a novel language representation model enhanced by a knowledge-masking technique named ERNIE (enhanced representation through knowledge integration) [73], which involves masking at the entity level and masking at the phrase level. The entity-level strategy covers entities that typically consist of several terms. The phrase-level technique covers the whole phrase consisting of many words that serve together as a cohesive unit. Their experimental findings indicate that ERNIE outperforms other standard approaches by obtaining modern state-of-the-art outcomes on natural language processing activities, including natural language inference, conceptual similarity, named-entity identification, emotion analysis, and question answering. DistilBERT, a technique for pretraining a smaller general-purpose language representation model that can later be fine-tuned with high performance on a wide range of tasks like its bigger equivalents, was created. While most previous research focused on using distillation to build task-specific models [74, 75], this study uses knowledge distillation during the pretraining phase and demonstrates that it is possible to reduce the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster [76].
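A hedged sketch of obtaining contextual vectors from a pretrained BERT with the Hugging Face transformers library follows; the checkpoint name and sentences are illustrative, and fine-tuning for classification would add a task-specific head on top of these representations.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The word "bank" receives a different contextual vector in each sentence.
sentences = ["The bank raised interest rates.", "They sat on the river bank."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

# (batch_size, sequence_length, hidden_size) contextual embeddings for every token.
print(outputs.last_hidden_state.shape)
```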

The Transformer paradigm is generally popular in several tasks relating to natural language processing. Using a transformer is, however, an expensive operation since it requires the self-attention method. The Transformer employs an encoder-decoder design that includes stacked encoder and decoder layers. Two sublayers comprise the encoder layers: self-attention and a positionwise feed-forward layer. Self-attention, encoder-decoder attention, and a positionwise feed-forward layer are the three sublayers that comprise the decoder layers. Self-attention means that the attention task is conducted on the sentence itself, as compared to two separate sentences. Self-attention allows defining the connections between the words within a single sentence. It is the function of self-attention that adds to the expense of utilizing a transformer, and the quadratic structure of self-attention constrains its operation on long text. Attention is described using the (query, key, value) model. A query Q is a "context," and in previous equations, the prior concealed state is employed as the query context. Based on what we already know, we want to know what happens next. The value represents the features of the input. The "key" is just an encoding of the "value." To compute attention, the query's relevancy to the keys is established; the values that are unrelated to the query are then hidden. The authors follow a method of fine-to-coarse attention on multi-scale spans by binary partitioning (BP); they suggest the BP-Transformer. The BP-Transformer has a strong balance between the complexity of computation and the capability of the model. The authors performed a series of experiments on text classification and language processing, showing that the BP-Transformer performs better than previous self-attention models on long text [77]. The Binary-Partitioning Transformer attempts to boost the self-attention mechanism's usefulness by considering the transformer as a graph neural network, where any node in this graph represents an input token.

Another research work suggests a variant, the Neural Attentive Bag-of-Entities, which is a neural network algorithm that uses entities in a knowledge base to conduct text classification. Entities provide unambiguous and specific syntactic and semantic signals that are useful for capturing semantics in documents. The authors put together easy high-recall dictionary-based entity recognition with a neural attention system that helps the model concentrate on a limited number of unambiguous and specific entities in a text [78]. The model first identifies the entities to which a name might refer (e.g., Apple Inc., apple (food)) and then describes the entity using the weighted average of all entities' embeddings. The weights are measured using a modern method of neural attention that helps the model concentrate on a specific subset of entities that are less ambiguous in context and more important to the document.

It is the time for NLP when the transition started, as we listed the unsupervised pretrained language models that had made breakthroughs in different tasks of understanding natural language, such as named-entity identification, emotion interpretation, and question answering, recording state-of-the-art performance one after another in that short period. These NLP models indicate that a lot more is yet to come, and the authors look forward to researching and implementing them. However, the authors proved that a single pretrained language model may be applied as "a zero-shot task transfer" to execute basic NLP tasks without the requirement for fine-tuning on a training example data set. While this was an encouraging proof of concept, the best-case performance only equaled certain supervised baselines on a single data set, and performance on most tasks was still well behind even simple supervised baselines. Across one order of magnitude of scaling, the study found generally consistent log-linear trends in performance on both transfer tasks and language modeling loss. GPT-3 performs well on NLP tasks in the zero- and one-shot settings and, in the few-shot setting, is sometimes comparable with, and occasionally surpasses, the state of the art (although the state of the art is held by fine-tuned models). This might imply that bigger models are better meta-learners [79].

GPT-3 approaches the performance of a fine-tuned RoBERTa baseline on the "Challenge" version of the data set, which has been filtered to questions (multiple-choice questions on common sense reasoning collected from 3rd to 9th-grade science examinations) that standard statistical or information retrieval algorithms are unable to accurately answer. GPT-3 marginally outperforms the same fine-tuned RoBERTa baseline on the "Easy" version of the data set. However, both of these findings are much inferior to MBART's overall SOTAs.

2.3. Data Preprocessing: Data Cleaning. There are a lot of text data available today, and data are growing daily in structured, semiunstructured, or fully unstructured forms. To perform the text classification task, it is always required to process the raw corpus data. There are many steps involved in data processing: generally, data cleaning, that is, organizing the data as per the structure and removing unneeded subtexts; tokenization, that is, breaking up text into words; normalization, that is, converting all texts into the same case, removing punctuation, and reducing words to root forms via stemming and lemmatization; and substitution, that is, identifying candidate words for translation and performing word sense disambiguation [80]. In one of the studies [81], the researchers have also focused on how machine learning techniques need to be designed to recognize similar texts when text data are downloaded from multiple and heterogeneous resources. In the labeling task, the text documents are labeled with two commonly used approaches: one is to label each part of the text individually, and the second is to label a group of texts. The first approach includes different supervised learning methods, and the second is called multi-instance learning [82, 83].
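A minimal sketch of the cleaning, tokenization, normalization, and lemmatization steps listed above, using NLTK; the regular expression and the English stop-word list are illustrative choices, not the exact procedure of the cited studies.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time resource downloads, e.g.:
# nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                                   # normalization: same case
    text = re.sub(r"[^a-z\s]", " ", text)                 # remove punctuation and digits
    tokens = nltk.word_tokenize(text)                     # tokenization
    tokens = [t for t in tokens if t not in stop_words]   # drop unneeded subtexts (stop words)
    return [lemmatizer.lemmatize(t) for t in tokens]      # reduce words to root forms

print(preprocess("The quarterly reports were downloaded from the company's websites!"))
```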

accuracy [89]. /e goal of feature selection is to decrease the certain label or class, and the word is also present in the
feature space by picking a subset of Kay attributes and document. It can be written as follows:
minimizing overfitting while maintaining text classification m m
performance [90]. For instance, a feature or attribute set IG(t) � − 􏽘 P Ci 􏼁logp Ci 􏼁 + p(t) 􏽘 P Ci | t􏼁logp Ci | t􏼁
X � 􏼈X1 , X2 , . . . ., XN 􏼉 i�1 i�1
When N is a group of feature sets, 2N feature subsets are m
created, each of which is represented by a vector of size N. + p(t) 􏽘 P Ci | t􏼁logp Ci | t􏼁.
/e methods identify a feature subset of size K, K < N i�1
without losing the accuracy of the full feature set. It has been (2)
the subject of investigation, and the writers have so far
provided many techniques. Filter, wrapper, and embedded /e utility of feature t in the classification is measured
are the three types of approaches available [91, 92]. Filter- by this formula. If IG is higher than the prior value
based feature selection offers several ways to evaluate the without the feature t, the current feature t is more
information value of each feature. /e filter approach picks relevant for classification. In other words, the dis-
the top-N features based on the results of various statistical criminating power of the term t increases as the value of
tests to identify a connection with the target label or to establish which attributes are more predictive of the target label, and it is independent of any learning algorithm. Because this technique ignores feature dependencies, it is computationally efficient. The wrapper technique evaluates a subset of features based on their utility to a given class. It uses a learning model to assess the subset of features based on their predictive power, but it is significantly more costly owing to repetitive learning and cross-validation. Embedded techniques are analogous to wrapper methods, except that they include feature selection during the training phase [93].

(1) Filter Methods

Univariate Feature Selection. To determine the link between the features and the target variable, univariate feature selection evaluates each feature independently. Because they pertain to linear classifiers created using single variables, the following univariate feature selection methods are linear.
The filter approach chooses attributes without focusing on the core goal of improving any classifier's performance. To score the attributes, it uses the data's most important attributes. If d features or attributes are identified as S, the goal of a filter-based approach is to pick a subset T of m < d features which maximizes some function F:

τ* = arg max_{τ⊆S} F(τ),  s.t. |τ| = m.   (1)

It finally settles on the top-m rated features with the highest scores. This quantity is known as joint mutual information, and maximizing it is an NP-hard optimization problem since the number of potential feature combinations rises exponentially.
The following are the often used linear univariate filter techniques in text classification:
The information gain approach selects features based on the item's frequency concerning the class/label prediction. Researchers have demonstrated that by removing superfluous features without modifying the remaining features, the approach may lower the vector dimensionality of text and enhance classification results [94]. It adds the most value when the text corresponds to a class such that the information gain IG increases. Here, IG stands for information gain; Ci is the i-th class; P(Ci) is the probability of the i-th class; and m is the number of target classes. P(t) is the probability that the feature t appears in the documents, and P(t̄) is the probability that the feature t does not appear in the documents. P(Ci|t) is the conditional probability of the i-th class given that the feature t appears, and P(Ci|t̄) is the conditional probability of the i-th class given that the feature t does not appear.
The chi-square test is a statistical strategy for assessing the relationship between a set of categorical features using their frequency distribution and determining how much the findings differ from the predicted output [95, 96]. This may be determined given events A and B, which are considered to be independent if

p(AB) = p(A)p(B).   (3)

The occurrence of the term and the occurrence of the class are the two events in feature selection. The terms are then ranked according to the following value. Chi-square can be calculated from the following equation:

CHI²(t, C) = Σ_{t∈{0,1}} Σ_{C∈{0,1}} (N_{t,C} − E_{t,C})² / E_{t,C},   (4)

where t denotes the feature, C denotes the specific class, N_{t,C} is the observed frequency with which feature t and class C occur together, and E_{t,C} is the corresponding expected frequency under independence. The chi-square between each feature and class is computed, and the features with the highest chi-square scores are chosen.
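As an illustration, the chi-square ranking described above maps directly onto scikit-learn's SelectKBest; the snippet below is a minimal sketch on a made-up four-document corpus, where the documents, labels, and the value of k are assumptions introduced only for this example and not taken from the studies cited here.

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative corpus with binary class labels (1 = finance, 0 = sport).
docs = [
    "stocks rallied as the market gained",
    "the team won the final match",
    "shares and bonds fell in the market",
    "the coach praised the players after the match",
]
labels = [1, 0, 1, 0]

# Bag-of-words counts, then keep the k terms with the highest chi-square score.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
selector = SelectKBest(score_func=chi2, k=5)
X_reduced = selector.fit_transform(X, labels)

terms = vectorizer.get_feature_names_out()
for idx in selector.get_support(indices=True):
    print(f"{terms[idx]:>10s}  chi2={selector.scores_[idx]:.3f}")
```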

Fisher score calculates the variance of the predicted value from the actual value to get the information score, that is, how much knowledge one variable has about the unknown parameter on which the variable depends; when the variance is the smallest, the information score is the highest. For a support-vector-based feature ranking model, researchers employed Fisher's linear discriminant [97–99]. For instance, let µ_k^j and σ_k^j be the mean and standard deviation of the k-th class concerning the j-th feature, and let µ^j and σ^j represent the mean and standard deviation of the entire training data concerning the j-th feature. The Fisher equation for determining the j-th feature's score is stated as follows:

F(x_j) = Σ_{k=1}^{c} n_k (µ_k^j − µ^j)² / (σ^j)²,   (5)

where (σ^j)² is computed as Σ_{k=1}^{c} n_k (σ_k^j)². The top-m features with the highest Fisher scores are chosen by the algorithms.
Pearson's correlation coefficient is used to measure the linear dependency between two continuous variables by dividing their covariance by the product of their standard deviations, and its value ranges from −1 to +1. For two variables, a −1 value signifies a negative correlation, a +1 value shows a positive correlation, and a 0 value represents no linear association [93]. Using vectors, Pearson's coefficient r can be computed as follows:

r = ((x1 − mean(x1)·L) · (x2 − mean(x2)·L)) / (|x1 − mean(x1)·L| |x2 − mean(x2)·L|),   (6)

where mean(x1) is the mean of the vector x1 and similarly for x2, L is the vector of 1s, and |x| is the magnitude of vector x.
Variance threshold is a technique for reducing vector dimensionality by deleting all low-variance features. Features that have a lower training-set variance than the threshold will be deleted [100, 101].

s = sqrt( Σ_i (x_i − mean(x))² / (n − 1) ).   (7)

The equation may be used to find features whose variation falls below a given threshold. When a feature does not vary much within itself, it is seen to have low predictive potential.

Multi-Variate Filter Methods. During the assessment of the multi-variate filter selection approach, the interdependencies of features are also taken into account to choose relevant features.
It is based on mutual information, which discovers the features in a feature set with the highest dependency on the target label. However, it is not appropriate when the goal is to achieve high accuracy with a small number of features.
Alternatively, it may utilize max relevance, which detects features with a dependency by averaging all mutual information values between the individual features x_i and the target label c. S refers to the feature set, and I represents mutual information; in the following equation, it is calculated between feature i and class c:

max D(S, c),  D = (1/|S|) Σ_{x_i ∈ S} I(x_i; c).   (8)

However, this results in a high level of redundancy, that is, a higher level of dependency across features. As a result, to locate mutually exclusive features, minimum redundancy can be used [102]:

min R(S),  R = (1/|S|²) Σ_{x_i, x_j ∈ S} I(x_i, x_j),   (9)

where I(x_i, x_j) is the mutual information between features i and j.

Multi-Variate Relative Discriminative Criterion. The author offers a multi-variate selection strategy that takes into account both feature relevance and redundancy in the selection process. The RDC is used to assess the relevance, whereas Pearson's correlation is used to assess redundancy between features [103]. This measure boosts the rankings of terms that are exclusively found in one class or whose term counts in one class are much higher than in the other:

RDC(w_i, tc_j(w_i)) = |df_pos(w_i) − df_neg(w_i)| / (min(df_pos(w_i), df_neg(w_i)) · tc_j(w_i)),   (10)

where df_pos(w_i) and df_neg(w_i) are the collections of positive and negative text documents, respectively, in which the term w_i occurs. The word may be repeated several times in specific documents, which is represented by tc_j(w_i). Instead of adding together the RDC values for all term counts of a term, the area under the curve (AUC) of a difference graph is treated as the term rank.
Researchers have been looking for novel approaches to increase classification accuracy while also reducing processing time. The author has provided a distinguishing feature selector, a filter-based strategy that picks unique features possessing term properties while eliminating uninformative ones [92]. It provided efficiency by reducing processing time and improving classification accuracy.
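The max-relevance/min-redundancy trade-off of equations (8) and (9) can be approximated greedily. The sketch below is a simplified illustration using binary term-presence features; the corpus, labels, and the number of selected features are assumptions made only for the example, and it is not the exact procedure of any study cited above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import mutual_info_score

docs = [
    "market shares rallied strongly today",
    "the striker scored twice in the match",
    "bond prices and shares slipped in trading",
    "fans cheered as the match ended in a draw",
]
labels = np.array([1, 0, 1, 0])

# Binary presence/absence features so mutual information is between discrete variables.
vec = CountVectorizer(binary=True)
X = vec.fit_transform(docs).toarray()
terms = vec.get_feature_names_out()

n_select = 3
# Relevance: mutual information between each term and the class label (eq. (8)).
relevance = np.array([mutual_info_score(labels, X[:, j]) for j in range(X.shape[1])])

selected = [int(np.argmax(relevance))]          # start with the most relevant term
while len(selected) < n_select:
    best_j, best_score = None, -np.inf
    for j in range(X.shape[1]):
        if j in selected:
            continue
        # Redundancy: average mutual information with already selected terms (eq. (9)).
        redundancy = np.mean([mutual_info_score(X[:, j], X[:, s]) for s in selected])
        score = relevance[j] - redundancy       # mRMR difference criterion
        if score > best_score:
            best_j, best_score = j, score
    selected.append(best_j)

print([terms[j] for j in selected])
```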

(2) Wrapper Methods. Wrapper approaches are bound to a certain classifier; the methods choose a subset of features based on their influence on the classifier by assessing the prediction performance of all potential feature subsets in a given space. It signifies that the feature subset will be assessed by interacting with the classifier, which will improve the classification technique's accuracy. As the feature space expands, the computing efficiency suffers as a result of this method. Wrappers are used to choose features for other models, as filters are. The procedure may be accomplished in three ways: the first methodology employs a best-first search technique; the second methodology employs a stochastic approach such as random selection; and the third methodology uses heuristics such as forward and backward passes to include and omit features.

Multi-Variate Feature Selection. Univariate feature selection approaches are computationally efficient, but they eliminate features owing to a lack of interaction between features that, when combined, may have offered important information regarding classification [104, 105]. When evaluating the performance of features, multi-variate selection takes into account the interdependencies between them. "Linear multi-variate" methods employ linear classifiers made up of a subset of features, with the score of feature subsets being calculated based on classification performance. Nonlinear multi-variate methods, on the other hand, use nonlinear classifiers to complete the task.
The following are the most often used linear multi-variate wrapper approaches in text classification:

Recursive Feature Elimination. It is a recursive strategy that ranks features according to a key measure. During each cycle, the significance of the features is assessed, and the less relevant features are removed. To design the ranking, the opposite process is utilized, in which features are rejected. From this rating, this technique extracts the top-N features [106]. This is a greedy optimization that seeks the highest performing feature subset.

Forward/Backward Stepwise Selection. It is an iterative procedure that begins with the examination of each feature and picks the one that produces the best performing model, based on some predetermined criterion (like prediction accuracy). The next step is to examine every potential combination of that selected feature and the following feature, and if it improves the model, the second feature is chosen. The model continuously appends to the list of features that best improve the model's performance in each iteration until the requisite feature subset is picked. In the backward feature selection approach, the method starts with the whole collection of features and discards the least relevant feature in each iteration, improving the method's speed. This method is repeated until no improvement is shown when features are removed, and the best subset of features is found. In comparison to other techniques, the researcher developed a rapid forward selection methodology for picking the optimal subset of features that required less computing work [107].

The Genetic Algorithm uses a feature set to generate a better subset of features that is free of noise. At each step, a new subset is formed by picking individual features in the correct sequence and merging them using natural genetics procedures. The result is the cross-validated variance divided by the percentage of right predictions. The end outcome may be mutated. This procedure aids in the creation of a feature set of individual features that are more appropriate for the model than the initial feature set. The chaotic genetic algorithm was designed to simplify the feature selection procedure and improve the classification technique's accuracy [108, 109].
The commonly preferred nonlinear multi-variate wrapper methods in text classification are discussed as follows:
Nonlinear kernel multiplicative updates entail iteratively training a classifier and rescaling the feature set by multiplying it by a scaling factor that lowers the value of less impactful features. Nonlinear techniques can outperform linear algorithms by selecting a subset of features [110].
Relief is based on instance-based learning. Each feature receives a value ranging from −1 to +1 based on how well it matches the desired label. The algorithm's scope is binary-classification-compatible [111, 112].

(3) Embedded Methods. In terms of computing, "embedded methods" outperform wrappers, but they conduct feature selection as a subpart of the learning methodology, which is primarily exclusive to the learning model and may not function with any other classifier.
The commonly preferred embedded methods in text classification are discussed as follows:
In the social sciences, the LASSO method is generally used [113]. To alleviate the dimensionality problem, it penalizes features with large coefficients by inserting a penalty during the log-likelihood maximization procedure. By picking a correct weight and reducing dimensionality, LASSO assigns zero to some coefficients. When there is a strong correlation between some features, it creates a difficulty [114].
Ridge regression lowers the complexity of a model by reducing coefficients while keeping all of its features. The issue with ridge regression is that features are retained; if the feature collection is huge, the problem remains complicated [115].
Elastic net calculates a penalty that is a mix of the LASSO and ridge penalties. The elastic net penalty may be readily adjusted to give the LASSO or ridge penalty extra weight. It has a grouping effect, with highly correlated features tending to be in or out of the feature subset together. It incorporates both L1 and L2 regularization techniques (LASSO and ridge). By fine-tuning the settings, it aids in the implementation of both strategies [116].
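As a hedged sketch of how such an embedded penalty is commonly applied to text features in practice, an L1-regularized linear model can zero out uninformative terms while the classifier is being trained. The corpus, labels, and regularization strength below are illustrative assumptions only, not a reproduction of the cited methods.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "central bank raises interest rates",
    "midfielder signs a new contract with the club",
    "inflation and interest rates worry investors",
    "the club announces its new stadium",
]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# L1 penalty (LASSO-style) drives many term weights exactly to zero during training.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X, labels)

terms = vec.get_feature_names_out()
kept = np.flatnonzero(clf.coef_[0])
print("terms kept by the L1 penalty:", [terms[i] for i in kept])
```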

2.4.2. Text Feature Extraction Methods. After selecting the features, N documents are represented by d-dimensional feature vectors {X1, X2, ..., XN}. Sometimes the original terms used as features may not be optimal dimensions for text document representation. Text feature extraction methods try to solve this problem by creating a new feature space Y of artificial terms [117]. This requires (a) a method to convert old terms to new terms and (b) a method to convert the document representation from the old space to the new one. A popular example commonly used for this purpose is principal component analysis (PCA), in which a feature set Yi is selected in such a manner that the variance of the original feature vectors is maximized in the direction of the new feature vectors. This is done by computing eigenvectors of the covariance matrix of the original vectors. The drawback of PCA is the time it takes to evaluate the eigenvalue decomposition to compute the principal components for each class/label when applied to a large data set. This is overcome by the researcher by using the power factorization method (PFM) to find a fixed quantity of eigenvectors from a data set [118]. Another commonly preferred method for feature extraction is latent semantic indexing (LSI), which uses singular value decomposition of the term correlation matrix computed from a large collection of text documents. This technique is used to address the problem arising from the use of synonymous and polysemous words as dimensions of the text document representation. But the disadvantage is computing the correct number of latent components, which proves computationally expensive. Another method that helps for optimal discrimination of data is linear discriminant analysis (LDA) [99]. It identifies the linear combination of features that best explains the data and tries to find the model that can explicitly differentiate between the classes of data. Latent Dirichlet allocation is another method; it describes each text document as a mixture of latent topics, with each word in that document attributable to one of the topics of that document. It is a generative probabilistic model and is most preferred for topic modeling [119].
A newer approach is a simplified version of stochastic neighbor embedding, known as t-SNE, that creates much better visuals by eliminating the tendency to crowd points together in the center of the map. It visualizes high-dimensional data by assigning a two- or three-dimensional map to each data point. When it comes to constructing a single map that displays structures at several scales, t-SNE outperforms previous approaches. On almost all of the data sets, the analysis shows that t-SNE produces visuals that are much superior to those produced by the other approaches [120]. Furthermore, the authors provide a technique, UMAP (uniform manifold approximation and projection), that is comparable to t-SNE in terms of visualization quality and, in certain ways, retains more of the global structure while providing better run time efficiency [121]. It is based on Laplacian eigenmaps as a mathematical foundation. UMAP can scale to far bigger data sets than t-SNE. It is a general-purpose dimension reduction strategy for machine learning since it has no computational constraints on embedding dimensions. The approach is used in the study to evaluate the uniqueness of subjects, important phrases and features, information dissemination speed, and network behaviors for COVID-19 tweets and analysis [122]. To increase the detection of relevant themes in the corpus and analyze the quality of the created topics, they use UMAP, which finds distinctive clustering behavior of separate topics [120]. Another study used UMAP to depict the matching word vector spaces of a pretrained language model using LaTeX mathematical formulations. In the LaTeX formula domain, they develop a state-of-the-art BERT-based text classification model augmented by unlabeled data (UL-BERT) [123].

2.5. Classifiers for Classification Task. The classifier is trained based on the features selected from the text documents. The selection of appropriate features in feature space decides the performance of the learning models. Machines understand numbers more easily than texts as input, so texts as tokens are required to be converted into numbers (vectorization) for most learning algorithms. Vectors are combined to originate a vector space to which statistical methods can be applied for checking document relatedness. Each algorithm offers a different document representation for text classification. Researchers offer several classification methods that work on the vector representation of texts. The basic assumption for vector representation of texts is known as the contiguity hypothesis. It states that text documents belonging to the same class develop a contiguous region and that regions of different classes do not overlap. The relatedness of the documents can be evaluated in 2D space based on cosine similarity or Euclidean distance.
The authors have presented naïve Bayes, a linear classifier, as an approach to vectorize the text documents according to a probability distribution with two commonly used models: the multi-variate Bernoulli event model and the multi-nomial event model; the features with the highest probability were chosen to reduce the dimensionality [124].

P(Ck | di) = P(di | Ck) · P(Ck) / P(di).   (11)

The output of the classifier is the probability that the text document di belongs to each class Ck, and it is a vector of C elements. In a text classification scenario, we could compute P(di | Ck) using bag-of-words as follows:

P(di | Ck) = P(BoW(di) | Ck) = P(w_{1,i}, w_{2,i}, ..., w_{|V|,i} | Ck).   (12)

The problem can be reduced to computing the probability of each word w_{j,i} in class Ck as follows:

P(di | Ck) = Π_{j=1}^{|V|} P(w_{j,i} | Ck).   (13)
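A minimal sketch of this multinomial naïve Bayes formulation, using a scikit-learn pipeline over raw word counts; the toy documents, labels, and test sentence are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "the market rallied on strong earnings",
    "the team lost the away match",
    "investors sold shares after the report",
    "the striker scored a late winning goal",
]
train_labels = ["finance", "sport", "finance", "sport"]

# Bag-of-words counts feed the multinomial event model of equations (11)-(13).
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

test_doc = ["shares in the club rallied after the match"]
print(model.predict(test_doc))
print(dict(zip(model.classes_, model.predict_proba(test_doc)[0].round(3))))
```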

The naïve Bayes classifier has a high bias for a nonlinear problem because it can model only one type of class boundary, that is, a linear hyperplane. The bias is the statistical measure of the performance of a classifier, that is, how accurately the classifier assigns texts to the correct class with little error. The learning task is the last activity in classification, but reducing the feature dimensions is more concerned with the efficiency of the classifier model. The possibility of salient feature reduction caused by using classifiers can be overcome by using the SVM model, which identifies the best decision boundary between the feature vectors of the documents and their categories [125]. The SVM entails an optimal classifier that guarantees the lowest classification error. The SVM computes a hyperplane that falls between the positive and negative examples of the training set:

D = {(xi, yi) | xi ∈ R^D, yi ∈ {−1, 1}},  i = 1, ..., n,   (14)

where the minus sign represents the negative hyperplane, the positive sign points to the positive hyperplane, i ranges from 1 to L (the training examples), (xi, yi) represents the feature vector of each document with its label, R^D is a vector space of dimension D, and the label is evaluated to +1 and −1 for the positive and negative hyperplanes, respectively.
Naïve Bayes itself results in the best classification model if it is trained on a high volume of data. However, feature reduction remains an issue, so naïve Bayes is used as a prestep to SVM that converts text documents into vectors before the classification task starts. This resulted in improving the whole system while spending quite an appropriate classification time by reducing to a low-dimensional space. But, in certain cases, the majority of features are redundant with each other; the author has presented a divergence-based feature selection method that does not rely on a complex dependence model, where the maximum-marginal-relevance-based feature selection was outperformed by the SVM [126]. The paper has suggested the need for novel criteria to measure the relevance and novelty of features separately and provided their linear combination as the metric.
The studies have mentioned KNN, a nonlinear classifier, where the algorithm classifies a document by moving through all the training documents that are similar to that document. The KNN model frames the documents as points in Euclidean space so that the distance between two points u and v can be calculated as follows:

D(u, v)² = ‖u − v‖² = (u − v)^T (u − v) = Σ_{i=1}^{d} (u_i − v_i)².   (15)

The classifier finds the K-value, the factor that represents the collection of documents, among all the documents, closest to the selected document in that space [127]. If there are too many features, KNN may not operate effectively. As a result, dimensionality reduction techniques such as feature selection and principal component analysis may be used.

ŷ(x) = y_{n*},  where n* = arg min_{n∈D} dist(x, x_n).   (16)

The study mentions that using KNN increases the overhead of calculating the K-value of all the documents against all other training documents with the largest similarity or closest to the selected document. Also, the variation in the number of training sample documents in different categories leads to a decline in accuracy. Due to the high variance and complex regions between classes, it becomes sensitive to noise documents. Sometimes a document tends to be misclassified if it lies very close to a noise document in the training set, and it is accurately classified if there are no noise documents in the training set close to it. This ends up in high variance from training set to training set, and high variance leads to overfitting. The goal of finding a good learning classifier is to decrease the learning error. Learning error is calculated from bias and variance, or the bias-variance trade-off. The traditional algorithms possess some limitations that attract researchers to improve the efficiency by (a) reducing the computational overheads by establishing a low-dimensional space, (b) speeding up the computational capacity of finding nearest neighbors in KNN or locating decision boundaries in SVM, and (c) increasing efficiency without compromising accuracy [128]. The model is chosen that optimizes the fit to the training data:

H* = arg max_H fit(H | D).   (17)

The supervised learning classification algorithms such as naïve Bayes, SVM, and KNN use a bag of words to design a classifier. None of these methods take the order of words into consideration, which can lead to the loss of some valuable information. In natural-language-based problems, the order of the words, for example in multi-word expressions, has a meaning (like names of organizations or persons) that is not considered by learning models trained on the individual words of the texts. The n-gram algorithm considers sequences of n adjacent words from the selected text phrases. The n-grams behave like the individual words' representation as feature vectors. The value of n may range from 1 to an upper value [129]. This proves very beneficial in short text documents where the number of n-grams is small [89]. The author proposes another representation based on character n-grams to introduce an enhancement of the skip-gram model that takes subwords into account. It considers the word morphology while training the model, and words are represented by the sum of their character n-grams [55].
Many researchers have used decision-tree-based algorithms (decision support tools) that represent a tree-structured graph of decisions [130]. The commonly used decision tree algorithms are ID3, C4.5, and C5. The algorithm presents each intermediate node (labeled with terms, while branches represent weights) that can split into subtrees and ends at leaf nodes (representing the class/label/outcome of the problem). Decision tree structures are rule-based solutions. A rule can be designed by forming a conjunct of every test that occurs on the path between the root node and a leaf node of the tree. The rules are formed after traversing every path from the root to a leaf node. Once the decision tree and rules are framed, they help assign the class/label for a new case [129, 131]. It was evaluated in the study that decision trees give better results than naïve Bayes in terms of accuracy but a little worse than KNN methods [132]. As a result, the authors introduce the boosting data classification approach. The boosting algorithm is a method for combining many "poor" classifiers into a single, powerful classifier. The boosting technique is applied to decision trees in the study, and the boosted decision tree performs better than an artificial neural network [133]. It is expected to find widespread use in a variety of fields, particularly text classification. Gradient tree boosting is another boosting strategy that builds an additive regression model using decision trees as the weak learner. Trees in stochastic gradient boosting are trained on a randomly selected portion of the training data and are less prone to overfitting than shallow decision trees [134].
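A hedged sketch of gradient tree boosting on TF-IDF features follows; setting subsample below 1.0 gives the stochastic variant mentioned above. The corpus and hyperparameters are illustrative assumptions, and the dense conversion is acceptable only because the example matrix is tiny.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "quarterly profits beat analyst expectations",
    "the goalkeeper saved a penalty in extra time",
    "the central bank kept rates unchanged",
    "the home side dominated the second half",
]
labels = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(docs).toarray()  # dense only because the corpus is tiny

# subsample < 1.0 turns this into stochastic gradient boosting.
booster = GradientBoostingClassifier(
    n_estimators=50, learning_rate=0.1, subsample=0.8, max_depth=2
)
booster.fit(X, labels)
print(booster.predict(X))
```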

Neural networks or deep learning systems use several processing layers to learn hierarchical data representations and have reached state-of-the-art outcomes in most domains. In the sense of natural language processing (NLP), several model designs and approaches have recently progressed. In areas such as computer vision and pattern recognition, deep learning architectures and algorithms have also made remarkable progress. Recent NLP research is now primarily focused on the application of new deep learning approaches, continuing this development. The performance of the word embedding and deep learning strategies referred to in the following section is driving this development.
In the late 90s, the researcher found an application of nonlinear neural networks to text classification or topic modeling [135]. In this model, a three-layered neural network was designed to learn a nonlinear mapping from training documents to each class/label. Later on, the researcher proposed the convolutional neural network (CNN) for text classification by considering the order of words in the phrases, and this outperforms SVM in terms of error rate [136]. CNN uses the vector representation of text data considering the order of words. Each word is considered a pixel, and the document is treated as an image. The image is taken as |D| × 1 pixels, and each pixel represents a word as a |V|-dimensional vector. For instance, let the vocabulary be V = {"classification", "course", "I", "love", "NLP", "text"}, with words taken as dimensions of the vectors in alphabetical order, and let the document be D = "I love NLP course". Then the document vector would be X = [001000 | 000100 | 000010 | 010000].
The researcher mentions that to reduce the dimensionality of the vector space, the vocabulary size must be kept low. Also, the n-gram algorithm ignores the fact that some n-grams share constituent words; this is overcome by a CNN that learns the embedding of text regions by providing the CNN with the constituent words as input, and this technique provides higher accuracy. To construct an informative latent semantic representation of the sentence for downstream activities, CNNs have the potential to extract salient n-gram features from the input sentence. The author proposes a three-way enhanced CNN for sentiment classification, where decisions are divided into three parts: accept, reject, and delay. The instances in boundary regions that are neither accepted nor rejected are reclassified by another classification model. This guarantees that the enhanced CNN deals with boundary regions in a better way, resulting in the model 3W-CNN [137]. Another study has shown the application of CNN in document modeling for personality detection based on text in the context of sentiment analysis [34]. Overall, within contextual windows, CNNs are highly successful in mining semantic hints. They are very data-heavy models, though: they have a huge range of trainable parameters and need tremendous training data. This raises a concern when data shortage happens. Another unresolved concern with CNNs is their failure to model contextual long-distance data and maintain sequential order in their representations [138].
Deep neural networks are difficult to train on data, and a lot of resources are required to get high performance. The feed-forward neural network, commonly known as the multi-layer perceptron (MLP), is the most preferred technique in classification problems. In a feed-forward neural network, the information travels in one direction from the input layer to the hidden layers and then to the output layer. They have no memory of the input received previously, so they lack the ability to predict what comes next. To overcome this, the RNN (recurrent neural network) is preferred, where information moves through a loop. In this paragraph, we discuss the fundamental characteristics that have favored the popularization of RNNs in a variety of NLP tasks. Since an RNN performs sequential processing by modeling units in sequence, it may have the ability to capture the intrinsic sequential structure present in language, where characters, words, or even phrases are the units. Based on the previous words in the sentence, words in a language establish their semantic meaning. The disparity in interpretation between "computer" and "computer vision" is a clear example of this. RNNs are perfect for language and related sequence modeling activities to capture certain context dependencies, which turned out to be a clear incentive for researchers to use RNNs over CNNs in these fields.
RNNs were originally three-layer networks in the NLP sense [139]. While deciding on the current input, the network considers what it has learned from the previous inputs. Basically, in the architecture of a simple RNN, the hidden units create internal representations for the input patterns and recode these patterns, using hidden units and a learning algorithm, in a way that allows the network to generate the appropriate output for a given input. Typically, the hidden state of the RNN is assumed to be the most important feature. It can be regarded as the memory portion of the network that accumulates data from the other steps.
The formula of the current state in an RNN can be written as mentioned below. A nonlinear transformation such as tanh or ReLU is taken to be the function f:

h_t = f(h_{t−1}, x_t),   (18)

where h_t is the new state, h_{t−1} is the previous state, and x_t is the input at time t.
The tanh function is commonly used as the activation function. The recurrent weights can be defined as the matrix W_hh and the input weights by the matrix W_xh:

h_t = tanh(W_hh · h_{t−1} + W_xh · x_t).   (19)

The output can be calculated at test time as follows:

y_t = W_hy · h_t.   (20)

The output is then compared to the actual output, and the error value is computed. The network learns by backpropagating the error through the network to update the weights.
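Equations (18)-(20) translate almost line for line into code. The following is a minimal NumPy sketch of a single forward pass over a toy sequence; all dimensions and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, steps = 4, 8, 2, 5

# Weight matrices from equations (19) and (20).
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))

x_seq = rng.normal(size=(steps, input_dim))   # toy input sequence
h = np.zeros(hidden_dim)                      # initial hidden state

for t in range(steps):
    h = np.tanh(W_hh @ h + W_xh @ x_seq[t])   # eq. (19): new hidden state
    y = W_hy @ h                              # eq. (20): output at time t

print("final hidden state:", h.round(3))
print("final output:", y.round(3))
```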

The usual RNN, however, has only a short-term memory. These basic RNN networks suffer from the vanishing gradient issue, which makes it very difficult to understand and adjust the parameters of the earlier layers in the network. The RNN is therefore used in combination with the LSTM, which has long-term memory and gives an extension to the memory of the usual RNN. Over the basic RNN, the LSTM has additional "forget" gates, allowing the error to backpropagate over an arbitrary number of time steps. Comprising three gates, input, forget, and output, it determines the hidden state by taking a combination of these three gates [22]. Applications based on RNN and LSTM have been used to solve many NLP problems due to their capacity for capturing complex patterns within the text [140]. They have also been used in sequence labeling tasks such as POS (part of speech) tagging. They are preferably used in topic modeling for fake news [141, 142], sentiment analysis [143, 144], and negative speech detection on social media. More recently, authors have suggested another type of recurrent unit, which they refer to as the gated recurrent unit (GRU) [145]. It has been shown that RNNs employing any of these recurrent units perform well in tasks requiring long-term dependency capture (such as machine translation, speech recognition, or dependency parsing in text documents for NER). The application of gated RNNs is not limited to the mentioned tasks; they can be applied to different NLP challenges [146]. The GRU is a form of recurrent neural network (RNN) that can process sequential data using its recurrent architecture. The fundamental issue in text classification is how to improve classification accuracy, and the sparsity of data, as well as the sensitivity of semantics to context, frequently impedes text classification performance. The study introduces a unified framework to evaluate the impacts of word embedding and the gated recurrent unit (GRU) for text classification to overcome this flaw [147]. Recurrent neural networks, in particular long short-term memory [22] and gated recurrent neural networks [62], are firmly established as state-of-the-art approaches in sequence modeling. Within training examples, the inherently sequential nature of recurrent models prevents parallelization, which becomes important at longer sequence lengths, as memory limitations restrict batching across examples. Recent work, through factorization tricks [148] and conditional computation [149], has achieved substantial improvements in computational efficiency while also improving model output in the case of the latter. The authors have presented two simple ways of reducing the number of parameters and speeding up the training of large long short-term memory networks: the first is the "matrix factorization by design" of the LSTM matrix into two smaller matrices, and the second is the division of the LSTM matrix, its inputs, and its states into separate classes. However, the essential restriction of sequential computation persists.
Attention mechanisms have become an integral part of sequence modeling in different applications, allowing dependencies to be modeled regardless of their distance in the input or output sequences. The following paragraph examines some of the most prominent models of attention that have set new state-of-the-art results for text classification. The study's conclusion was largely focused on text classification experiments and could not be extended to many other NLP tasks [150]. The authors mentioned that if attention offers a reliable explanation for model predictions, these properties would be expected to hold: (a) attention weights should align with feature-relevance measurements (e.g., gradient-based measurements), and (b) alternative (or counterfactual) weight configurations should result in corresponding prediction changes. They stated that, in the context of text classification, neither property is consistently observed by a Bi-LSTM with a standard attention mechanism. The attention layers specifically weigh the representations of the input elements; it is also often believed that attention can be used to identify information that was considered relevant by the model. In another study, the authors evaluate whether that assumption holds by modifying the attention weights in already trained text classification models and examining the resulting variations in their predictions. While experimenting with text classification, the authors note several cases in which higher attention weights correlate with a greater impact on model predictions, but they also notice several cases where this does not hold, that is, where gradient-based attention weight rankings predict their effects better than their magnitudes [151]. In contrast to the current work on interpretability, the authors of another study reported that they examined the attention mechanism on a more diverse set of NLP tasks that included text classification, pairwise text classification, and text generation tasks such as neural machine translation [152].
Although the hidden vectors produced by an attention model through encoding can be interpreted as internal memory entries for the model, memory-augmented networks integrate neural networks with an external memory that the model can learn from and respond to. For text classification, the study proposes a memory-augmented neural network called the neural semantic encoder [153]. In another research work, the authors introduce a neural network architecture, the dynamic memory network (DMN), which processes input sequences and questions, shapes episodic memories, and generates appropriate responses. The model possesses an iterative attention mechanism that allows it to condition its attention on the inputs and the outcomes of previous iterations. In a hierarchical recurrent sequence model, these outcomes are then reasoned over to produce answers [154]. The authors also stated that it is possible to train the DMN end-to-end and obtain state-of-the-art text classification results using the Stanford Sentiment Treebank data collection.
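The attention weighting referred to throughout this discussion is, at its core, a softmax over similarity scores. The NumPy sketch below shows generic scaled dot-product self-attention over a toy sequence of word vectors; it is an illustrative assumption rather than a reproduction of any specific model cited above.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Scaled dot-product self-attention with identity projections; X: (seq_len, dim)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)          # pairwise similarity between positions
    weights = softmax(scores, axis=-1)     # attention weights, each row sums to 1
    return weights @ X, weights            # context vectors and the weight matrix

rng = np.random.default_rng(1)
tokens = rng.normal(size=(4, 8))           # 4 toy "word" vectors of dimension 8
context, attn = self_attention(tokens)
print(attn.round(2))                       # how much each position attends to the others
```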

Sequential processing of text is one of the inefficiencies experienced by RNNs. Transformers address this constraint by applying self-attention to measure an "attention ranking" in parallel for each word in a phrase or document, to model the impact each word has on another. Because of this feature, Transformers allow for far more parallelization than CNNs and RNNs, allowing very large models to be efficiently trained on large quantities of data on GPU stacks [155]. The Transformer architecture is especially suitable for pretraining on large text corpora, leading to significant accuracy improvements in downstream tasks, including text classification [69]. They propose a novel Transformer-XL neural architecture that allows learning dependencies beyond a fixed length without disrupting temporal coherence. We have seen the emergence of several large-scale Transformer-based pretrained language models in the current scenario. As stated in Section 2.2.2, these Transformers are pretrained to learn contextual text representations on much greater volumes of text corpora by predicting terms conditioned on their context. These pretrained models were fine-tuned using task-specific tags and, in many subsequent NLP tasks, especially text classification, produced new state-of-the-art results. Fine-tuning is supervised learning, while pretraining is unsupervised.
The authors design the largest model, OpenGPT, a 1.5 B parameter Transformer that ensures state-of-the-art results and comprises 12 layers of Transformer blocks, each composed of a masked multi-head attention unit, followed by a normalization layer and a feed-forward layer. With the addition of task-specific linear classifiers and fine-tuning with task-specific tags, OpenGPT can be extended to text classification. Unlike OpenGPT, which predicts words based on previous predictions, there is another model that comes into use, that is, BERT, intended to pretrain deep bidirectional representations from unlabeled text by conditioning on both the left and right context in all layers [71]. For text classification, BERT variants have been fine-tuned [156]. ALBERT decreases memory usage and improves BERT's training speed [157]. Another variant of BERT, SpanBERT [158], is a pretraining method designed to accurately represent and forecast spans of text. It improves BERT by (1) masking consecutive random spans, rather than random tokens, and (2) training the span boundary representations to estimate the entire content of the masked span without relying on the individual token representations within it. Deep learning provides a way to manage massive volumes of processing and data with next to no engineering by hand [23]. Unsupervised learning has had a catalytic impact on the growing interest in deep learning, but its contributions have since been overshadowed by those of purely supervised learning. In the longer term, we expect unsupervised learning to become even more significant.

3. Evaluation

We have discussed several supervised and unsupervised text classification methods based on machine and deep learning models so far; however, the shortage of uniform data collection procedures is a big issue when testing text classification techniques. Even if a standard collection method is available, it can generate differences in model results simply by selecting different training and test sets [146]. Moreover, to compare the various performance measures used during separate tests, there may be another difficulty related to process evaluation. In general, performance metrics assess attributes of the performance of the classification task and therefore do not necessarily present similar information. Although the underlying mechanics of the various measurement metrics vary, it is important to consider precisely what each of these metrics describes and what kind of data they are attempting to express, for comparability. Some examples of such performance measures include precision, recall, accuracy, microaverage, macroaverage, and F-measure. These calculations are based on a "confusion matrix" composed of true positives, false positives, false negatives, and true negatives [47]. Accuracy is the fraction of accurate predictions among all predictions. The fraction of known positives accurately predicted is referred to as recall. The fraction of predicted positives that are actually positive is called precision.
Another technique for evaluating how well our machine learning models perform on unknown data is cross-validation. If we expose the model to entirely new, previously unknown data, it may not be as accurate in predicting and may fail to generalize over the new data. Overfitting is the term for this issue. Because it is unable to discover patterns, the model also does not always train effectively on the training set and would not do well on the test set in that situation either; underfitting is the term for this issue. We employ cross-validation to address overfitting issues. Cross-validation is a resampling approach in which the data set is split into two parts: training data and test data. The model is trained using the training data, and predictions are made on test data that has not yet been observed. If the model performs well on the test data and has a high level of accuracy, it has not overfitted the training data and may be used to forecast. K-fold cross-validation is the most basic type of cross-validation. Other types of cross-validation include variations on k-fold cross-validation or entail repeating k-fold cross-validation rounds. In k-fold cross-validation, the data is initially partitioned into k equally sized segments or folds. Following that, k iterations of training and validation are undertaken, with each iteration holding out a different fold of the data for validation and the remaining k-1 folds being employed for learning [147]. For each of the k "folds", the following approach is used:
(i) The remaining folds are used as training data to train a model
(ii) The generated model is validated using the held-out data (i.e., it is used as a test set to compute a performance measure such as accuracy)
In cases where there is uncertainty, entropy is especially appealing as a predictor of classification quality. It denotes how well the class membership probabilities are distributed throughout the specified classes. Its usefulness as a predictor of classification accuracy is predicated on the notion that, in an accurate classification, each sentence has a high likelihood of belonging to just one class [159]. Cross-entropy loss is a key indicator for evaluating the performance of a classification problem. The prediction is a probability vector, which means that it reflects the anticipated probabilities of all classes, which add up to one. In a neural network, this is commonly accomplished by activating the final layer with a softmax function, but anything goes; it just has to be a probability vector. Maximum entropy is used in our text classification scenario to estimate the conditional distribution of a class label given a document. A collection of word-count characteristics represents a document. On a class-by-class basis, the labeled training data is utilized to estimate the anticipated value of these word counts. Improved iterative scaling is used to obtain a text classifier with an exponential form that is consistent with the constraints derived from the labeled data [160].
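A minimal sketch of the k-fold protocol and the confusion-matrix-based metrics described above, using scikit-learn; the pipeline, corpus, and the choice of k = 3 are illustrative assumptions only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import make_pipeline

docs = [
    "shares surged after the earnings call", "the defender was sent off early",
    "bond yields rose again this week", "a late goal sealed the victory",
    "the index closed at a record high", "the coach changed tactics at half time",
]
labels = [1, 0, 1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# k-fold cross-validation: each fold is held out once for validation.
scores = cross_val_score(model, docs, labels,
                         cv=StratifiedKFold(n_splits=3), scoring="accuracy")
print("fold accuracies:", scores.round(2))

# Precision, recall, and F-measure from a single train/test split of the same toy data.
model.fit(docs[:4], labels[:4])
pred = model.predict(docs[4:])
print(confusion_matrix(labels[4:], pred))
print(classification_report(labels[4:], pred, zero_division=0))
```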

4. Comparative Analysis

The following section summarizes the benefits and limitations of the feature extraction and feature selection methods and of the supervised and unsupervised machine and deep learning models used for a text classification task.

Table 1: Benefits and limitations of text representation or feature extraction methods.

Bag-of-words. Benefits: works well with unseen words and is easy to implement, as it is based on the most frequent terms in a document. Limitations: does not cover the syntactic and semantic relations of the words; common words impact classification.
TF-IDF. Benefits: unlike the bag-of-words approach, common words are excluded due to IDF, so they do not impact the result. Limitations: does not cover the syntactic and semantic relations of the words.
Word2vec. Benefits: covers the syntactic and semantic relations of the words in the text. Limitations: does not cover the words' polysemy.
GloVe. Benefits: the same as word2vec but performs better; eliminates common words; trained on a large corpus. Limitations: does not cover the words' polysemy and does not work well for unseen words.
Context-aware representation. Benefits: covers the context or meaning of the words in the text. Limitations: huge memory is required for storage, and it does not work well for unseen words.
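The weighted-word representations in Table 1 can be produced in a few lines; the sketch below contrasts raw counts with TF-IDF weights on two made-up documents, which are an illustrative assumption rather than an example from the reviewed studies.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the market gained as shares rallied", "the team rallied to win the match"]

bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(dict(zip(bow.get_feature_names_out(), counts.toarray()[0])))   # raw term counts, doc 0

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
# "the" and "rallied" appear in both documents, so IDF pushes their weights down.
print(dict(zip(tfidf.get_feature_names_out(), weights.toarray()[0].round(2))))
```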

Table 2: Benefits and limitations of feature selection methods.

Univariate filter methods:
Information gain. Benefits: yields the relevance of an attribute or feature. Limitations: biased towards multi-valued attributes and prone to overfitting.
Chi-square. Benefits: reduces training time and avoids overfitting. Limitations: highly sensitive to sample size.
Fisher's score. Benefits: evaluates features individually to reduce the feature set. Limitations: does not handle feature redundancy.
Pearson's correlation coefficient. Benefits: is simplest and fast and measures the linear correlation between features. Limitations: it is only sensitive to a linear relationship.
Variance threshold. Benefits: removes features with variance below a certain cutoff. Limitations: does not consider the relationship with the target variable.

Multi-variate filter methods:
mRMR (minimal redundancy maximum relevance). Benefits: measures the nonlinear relationship between feature and target variable and provides low error accuracies. Limitations: features may be mutually as dissimilar to each other as possible.
Multi-variate relative discriminative criterion. Benefits: best determines the contribution of individual features to the underlying dimensions. Limitations: does not fit a small sample size.

Linear multi-variate wrapper methods:
Recursive feature elimination. Benefits: considers high-quality top-N features and removes the weakest features. Limitations: computationally expensive, and correlation of features is not considered.
Forward/backward stepwise selection. Benefits: is computationally efficient greedy optimization. Limitations: sometimes impossible to find features with no correlation between them.
Genetic algorithm. Benefits: accommodates data sets with a large number of features, and knowledge about the problem is not required. Limitations: stochastic nature and computationally expensive.

Nonlinear multi-variate wrapper methods:
Nonlinear kernel multiplicative updates. Benefits: de-emphasizes the least useful features by multiplying features with a scaling factor. Limitations: the complexity of kernel computation and multiplication.
Relief. Benefits: is feasible for binary classification, is based on nearest-neighbor instance pairs, and is noise-tolerant. Limitations: does not evaluate boundaries between redundant features; not suitable for a low number of training data sets.

Embedded methods:
LASSO. Benefits: L1 regularization reduces overfitting, and it can be applied even when there are more features than data. Limitations: random selection when features are highly correlated.
Ridge regression. Benefits: L2 regularization is preferred over L1 when features are highly correlated. Limitations: reduction of features is a challenge.
Elastic net. Benefits: is better than L1 and L2 alone for dealing with highly correlated features, is flexible, and solves optimization problems. Limitations: high computational cost.
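For the wrapper rows in Table 2, recursive feature elimination is the most commonly available off-the-shelf option; the sketch below wraps a linear model whose coefficients drive the elimination. The corpus, labels, and number of retained features are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

docs = [
    "profits and revenue grew this quarter",
    "the winger crossed for the opening goal",
    "the bank reported record quarterly revenue",
    "the referee booked three players",
]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Repeatedly fit the estimator and drop the weakest feature until 5 remain.
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5, step=1)
selector.fit(X, labels)

terms = vec.get_feature_names_out()
print("retained terms:", [t for t, keep in zip(terms, selector.support_) if keep])
```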

Table 3: Benefits and limitations of machine and deep learning models.

Naïve Bayes. Benefits: it needs less training data; the probabilistic approach handles continuous and discrete data; it is not sensitive to irrelevant features and is easily updatable. Limitations: data scarcity can lead to loss of accuracy because it is based on the assumption that any two features are independent given the output class.
SVM. Benefits: it can also be applied to unstructured data such as text, images, and so on; the kernel provides strength to the algorithm, and it can work for high-dimensional data. Limitations: it needs a long training time on large data sets; it is difficult to choose a good kernel function, and choosing the key parameters varies from problem to problem.
KNN. Benefits: it can be implemented for classification and regression problems and produces the best results if large training data is available, even noisy training data; preferred for multi-class problems. Limitations: the cost of computing the distance for each instance is high; finding attributes for distance-based learning is quite a difficult task; imbalanced data causes problems; and there is no treatment for missing values.
Decision tree. Benefits: it reduces ambiguity in decision-making, implicitly performs feature selection, offers easy representation and interpretation, and requires less effort for data preparation. Limitations: it is unstable, since changes in the data require changes in the whole structure; it is not suitable for continuous values; and it causes overfitting problems.
Boosted decision tree. Benefits: it is highly interpretable, and prediction accuracy is improved; it can model feature interactions and execute feature selection on its own; gradient boosted trees are less prone to overfitting since they are trained on a randomly selected subset of the training data. Limitations: these are computationally expensive and frequently need a large number of trees (>1,000), which can take a long time and consume a lot of memory.
Random forest. Benefits: in contrast to other methods, ensembles of decision trees are very easy to train, and preparation and preprocessing of the input data are not required. Limitations: more trees in a random forest increase the time complexity in the prediction stage, and there is a high chance of overfitting.
CNN. Benefits: it provides fast predictions, is best suited for a large volume of data, and requires no human effort for feature design. Limitations: computationally expensive; requires a large data set for training.
RNN. Benefits: it implements a feedback model, so it is considered best for time series problems and makes more accurate predictions than other ANN models. Limitations: training the model is difficult and takes a long time to find the nonlinearity in data, and the vanishing gradient problem occurs.
LSTM, Bi-LSTM. Benefits: adds short- and long-term memory components to the RNN, so it is considered best for applications that have a sequence and is used for solving NLP problems such as text classification and text generation; computing speed is high; Bi-LSTM solves the issue of predicting fixed sequence to sequence. Limitations: it is expensive and complex due to the backpropagation model, increases the dimensionality of the problem, and makes it harder to find the optimal solution; since Bi-LSTM has double LSTM cells, it is expensive to implement.
Gated RNN (GRU). Benefits: in natural language processing, GRUs learn quicker and perform better than LSTMs on less training data, as they require fewer training parameters; GRUs are simpler and hence easier to modify, for example by adding extra gates if the network requires more input, and they do not need memory units. Limitations: slow convergence and limited learning efficiency are still issues with the GRU.
Transformer with an attention mechanism. Benefits: the issue with RNNs and CNNs is that when sentences are too long, they are not able to keep up with context and content; this limitation is resolved by paying attention to the word that is currently being operated on; the attention strategy is an effort to selectively concentrate on a few important items while avoiding the rest, enabling much more parallelization than RNNs and thus reducing training times. Limitations: at inference time, it is strongly compute-intensive.
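To make the comparisons in Table 3 concrete, the following is a hedged sketch of how several classical classifiers can be benchmarked on the same TF-IDF features with cross-validation; the corpus, labels, and fold count are illustrative assumptions, and a real comparison would use a benchmark corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = [
    "shares jumped after the merger news", "the captain lifted the trophy",
    "oil prices dragged the index lower", "the keeper kept a clean sheet",
    "the regulator fined the brokerage firm", "the club sacked its manager",
]
labels = [1, 0, 1, 0, 1, 0]

candidates = {
    "naive Bayes": MultinomialNB(),
    "linear SVM": LinearSVC(),
    "KNN (k=3)": KNeighborsClassifier(n_neighbors=3),
}

for name, clf in candidates.items():
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipeline, docs, labels, cv=3, scoring="f1")
    print(f"{name:12s} mean F1 = {scores.mean():.2f}")
```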

4.1. Comparative Analysis of Standard Text Representation or Feature Extraction Methods. The two major feature extraction methods highlighted are weighted words and word embeddings. By considering their frequency and co-occurrence details, word embedding approaches learn through sequences of words. These strategies are also unsupervised models to create word vectors. In comparison, the properties of weighted terms are based on counting words in documents and can be used as a basic word representation ranking scheme. Each approach poses specific constraints. Weighted words explicitly quantify text similarities from the word-count space, which increases the computational time for a large vocabulary.

Although counting unique terms offers independent confirmation of similarity, semantic comparisons between words are not taken into consideration [148]. Word embedding techniques solve this challenge but are constrained by the need for a large corpus of text data to train [21]. Therefore, researchers tend to use vectors with pretrained word embeddings [149]. Table 1 presents the advantages and limitations of each technique of text representation or feature extraction.

4.2. Comparative Analysis of Standard Feature Selection Methods. Some studies have preferred Fisher's linear discriminant for the support-vector-based feature ranking model [99, 100]. The author has mentioned that the filter-based method provides a distinguishing feature selector that selects distinctive features possessing term characteristics while eliminating uninformative ones [92].

4.2. Comparative Analysis of Standard Feature Selection Methods. Some studies have preferred Fisher's linear discriminant for the support-vector-based feature ranking model [99, 100]. The authors mention that the filter-based method provides a distinguishing feature selector, which further selects distinctive features that possess term characteristics while eliminating uninformative ones [92]. It improved performance by decreasing processing time and increasing classification accuracy. Another technique in the wrapper category that has gained popularity is the fast forward selection technique for selecting the best subset of features, which demands less computational effort than other methods [107]. In the genetic algorithmic approach, a chaos genetic algorithm was proposed to simplify the feature selection method and obtained higher classification accuracy [108, 109, 161]. Embedded methods were preferred over wrappers in many studies: researchers report that embedded methods perform better than wrappers computationally, but these algorithms perform feature selection as a subpart of the learning technique. These were followed by hybrid feature selection approaches in which filter and wrapper methods are combined, and such approaches have proven more computationally effective than any single selection technique. It was also observed that the original features may not always be the optimal dimensions for text document representation; researchers therefore provided text feature extraction methods that solve the problem by creating a new feature space or artificial terms [117]. Table 2 presents the benefits and limitations of feature selection methods.
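A minimal sketch of the filter-style selection discussed above is given below, using a chi-square score to keep only the top-scoring terms of a TF-IDF matrix; the toy corpus and the value of k are illustrative assumptions rather than settings from the cited studies.

```python
# Sketch: filter-based feature selection with a chi-square score.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["stocks rally on earnings", "team wins the final match",
        "central bank raises rates", "player signs a new contract"]
labels = ["finance", "sports", "finance", "sports"]

X = TfidfVectorizer().fit_transform(docs)       # high-dimensional term space
selector = SelectKBest(chi2, k=5)               # score each term independently
X_reduced = selector.fit_transform(X, labels)   # keep the 5 best-scoring terms
print(X.shape, "->", X_reduced.shape)
```

Wrapper and embedded alternatives would instead evaluate feature subsets against a specific learner, which is where the extra computational cost noted above comes from.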

4.3. Comparison of State-of-the-Art Machine and Deep Learning Models for Text Classification. The performance of a classifier depends on the choice of feature selection and extraction method. Supervised and unsupervised machine learning techniques have offered a variety of classifiers that perform well in a variety of domain-specific classification problems; Table 3 presents their pros and cons. Recent studies have focused on deep learning or neural network-based classifiers such as CNN, RNN, and RNN with LSTM, which have shown better results than conventional algorithms such as SVM and KNN in solving a wide range of problems. RNN and LSTM have been used in many NLP applications due to their capacity for capturing complex patterns within the text [151]. They have also been used in sequence labeling tasks such as POS (part of speech) tagging. There remains considerable future scope in resolving the complexities involved in the backpropagation technique used in RNN and in making the learning model cheaper and faster [104].
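For reference, the following sketch shows the conventional baselines mentioned above (SVM and KNN over TF-IDF features) side by side; the tiny corpus is an illustrative stand-in for a real labeled collection, not data from the reviewed work.

```python
# Sketch: conventional TF-IDF baselines, an SVM and a KNN classifier.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

train_docs = ["shares slump after weak guidance", "striker scores twice in derby",
              "regulator fines the lender", "coach praises the young squad"]
train_labels = ["finance", "sports", "finance", "sports"]

svm_clf = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(train_docs, train_labels)
knn_clf = make_pipeline(TfidfVectorizer(),
                        KNeighborsClassifier(n_neighbors=1)).fit(train_docs, train_labels)

test_doc = ["bank shares rally on results"]
print(svm_clf.predict(test_doc), knn_clf.predict(test_doc))
```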
5. Research Gap and Future Direction

From this review of existing studies, we identified certain gaps that can be addressed in the future. Labeling unstructured text data manually takes a lot of time, because the data must first be understood before it can be categorized, and it requires specialists who understand domain-specific data. In such scenarios, machine learning algorithms do not produce the expected accuracy [12]. Extracting meaning or finding semantic relations between words of unstructured data is a complex task, which several authors have attempted using NLP techniques for years [24, 162]. However, these methods prove inefficient when pursuing a high-quality classification system. This offers opportunities to design semisupervised machine learning models in which some parts of the training data are labeled manually and the remaining data are then handled by machine learning algorithms [163].

Apart from conventional methods for representing data sets and extracting patterns from text data using vector representations based on word embedding and paragraph embedding [38, 58], authors have presented deep neural network models for NLP-based applications using character embedding [164] and unsupervised techniques based on transfer learning, such as BERT fine-tuned with domain-specific data [156]. However, these globally available representations cannot be generalized to unseen texts that are very specific to a particular domain [59]. This offers opportunities to design methodologies for extending the vocabulary of existing representations for specific domains.

Deep learning algorithms have proven good at decision-making for NLP-based applications, but these models cannot handle symbols directly [165]. Also, the computational cost of training such algorithms is very high. This offers scope for designing deep neural network-based architectures that can be fed with linguistic knowledge, lexical knowledge, and word knowledge from different domains.
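The semisupervised direction suggested above can be sketched as follows: label a small portion of the data by hand, mark the rest as unlabeled, and let a self-training wrapper propagate labels. The toy corpus and the confidence threshold are illustrative assumptions; the sketch only demonstrates the general mechanism, not a method proposed in the reviewed studies.

```python
# Sketch: self-training over partially labeled text (-1 marks unlabeled docs).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

docs = ["profits beat expectations", "midfielder joins rival club",
        "shares fall on weak outlook", "goalkeeper saves late penalty"]
labels = np.array([0, 1, -1, -1])               # only the first two are hand-labeled

X = TfidfVectorizer().fit_transform(docs)
model = SelfTrainingClassifier(LogisticRegression(), threshold=0.5)
model.fit(X, labels)                            # confident predictions become pseudo-labels
print(model.predict(X))
```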

6. Conclusions

In this paper, we have provided a detailed review of the complete process of the text classification system. This paper covered the various algorithms and methods used in the subtasks of classification. It presented the techniques for data collection from several online sources. The documents were represented with basic techniques, followed by recent research in document representation for different areas of machine learning and natural language processing. To provide suitable and fast classifiers, the higher-dimensional space of data was reduced to a lower space using feature selection and feature extraction methods. Different algorithms perform differently depending on the domain-specific data collections used to train machine learning text classifiers. The authors have used these algorithms based on the problem statement, and none of the algorithms has proven perfect for all types of problems and data dimensionality.

It is observed that in recent years, some studies have focused on new applications of text classification such as multi-label classification [166, 167] and hierarchical classification [168, 169] in the fields of natural language processing or machine translation and medical sciences, respectively. Neural network-based algorithms are commonly used in NLP-based problems [136, 137, 142]. However, these algorithms mainly focus on generic data. Moreover, CNN is preferably used for image processing, and RNN with LSTM is preferred for time-series problems. Initially, these algorithms were not applied to text data, but in recent years, CNN with character embedding has been used for document representation and feature selection in text documents [58]. The transformer-based unsupervised models also overcome the issues of RNN and LSTM, but these techniques have proven computationally expensive. With the advancement of deep neural networks, we may find in the future that these deep neural networks will be applied efficiently in the automatic monitoring of web-based text data and classifying unseen data into automated labels [7].

Data Availability

The data used in this study are publicly available.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

Jana Shafi would like to thank the Deanship of Scientific Research, Prince Sattam bin Abdul Aziz University, for supporting this work. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT; No. 2022R1C1C1004590).

References
[1] A. Atkins, M. Niranjan, and E. Gerding, "Financial news predicts stock market volatility better than close price," The Journal of Finance and Data Science, vol. 4, no. 2, pp. 120–137, 2018.
[2] J. Robinson, A. Glean, and W. Moore, "How does news impact on the stock prices of green firms in emerging markets?" Research in International Business and Finance, vol. 45, pp. 446–453, 2018.
[3] G. Löffler, L. Norden, and A. Rieber, Salient News and the Stock Market Impact of Tone in Rating Reports, SSRN, Rochester, NY, USA, pp. 1–41, 2016.
[4] M. N. Elagamy, C. Stanier, and B. Sharp, "Stock market random forest-text mining system mining critical indicators of stock market movements," in Proceedings of the 2018 Second International Conference on Natural Language and Speech Processing (ICNLSP), pp. 1–8, IEEE, Algiers, Algeria, June 2018.
[5] Y. Zhang, Y. Dang, H. Chen, M. Thurmond, and C. Larson, "Automatic online news monitoring and classification for syndromic surveillance," Decision Support Systems, vol. 47, no. 4, pp. 508–517, 2009.
[6] J. C. West, D. E. Clarke, F. F. Duffy et al., "Availability of mental health services prior to health care reform insurance expansions," Psychiatric Services, vol. 67, no. 9, pp. 983–989, 2016.
[7] M. Ali, S. Khalid, M. I. Rana, and F. Azhar, "A probabilistic framework for short text classification," in Proceedings of the IEEE Eighth Annu Comput Commun Work Conf CCWC 2018, pp. 742–747, Las Vegas, NV, USA, February 2018.
[8] C. Romero and S. Ventura, "Educational data mining: a survey from 1995 to 2005," Expert Systems with Applications, vol. 33, no. 1, pp. 135–146, 2007.
[9] Q. Ye, Z. Zhang, and R. Law, "Sentiment classification of online reviews to travel destinations by supervised machine learning approaches," Expert Systems with Applications, vol. 36, no. 3, pp. 6527–6535, 2009.
[10] D. Thorleuchter and D. Van Den Poel, "Predicting e-commerce company success by mining the text of its publicly-accessible website," Expert Systems with Applications, vol. 39, no. 17, Article ID 13026, 2012.
[11] G. Yang, M. A. Jan, A. U. Rehman, M. Babar, M. M. Aimal, and S. Verma, "Interoperability and data storage in Internet of multimedia things: investigating current trends, research challenges and future directions," IEEE Access, vol. 8, Article ID 124382, 2020.
[12] A. Gandomi and M. Haider, "Beyond the hype: big data concepts, methods, and analytics," International Journal of Information Management, vol. 35, no. 2, pp. 137–144, 2015.
[13] Y. Sun, A. K. C. Wong, and M. S. Kamel, "Classification of imbalanced data: a review," International Journal of Pattern Recognition and Artificial Intelligence, vol. 23, no. 4, pp. 687–719, 2009.
[14] S. Manne and S. S. Fatima, "A novel approach for text categorization of unorganized data based with information extraction," International Journal of Computational Science and Engineering, vol. 3, pp. 2846–2854, 2011.
[15] B. S. Harish, D. S. Guru, and S. Manjunath, "Representation and classification of text documents: a brief review," IJCA, Special Issue on Recent Trends in Image Processing and Pattern Recognition, pp. 110–119, 2010.
[16] Y. Liang, Y. Liu, C. K. Kwong, and W. B. Lee, "Learning the 'Whys': discovering design rationale using text mining - an algorithm perspective," Computer-Aided Design, vol. 44, no. 10, pp. 916–930, 2012.
[17] B. Liu, X. Li, W. S. Lee, and P. S. Yu, "Text classification by labeling words," Artificial Intelligence, vol. 34, pp. 425–430, 2004.
[18] D. Y. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," Advances in Neural Information Processing Systems, vol. 16, pp. 321–328, 2004.
[19] H. U. Rehman, R. Rafique, and M. C. M. Nasir, "Forecasting CO2 emissions from energy, manufacturing and transport sectors in Pakistan: statistical vs. machine learning methods," Mach Learn Methods, vol. 28, p. 21, 2017.
[20] N. Kundu, G. Rani, V. S. Dhaka et al., "IoT and interpretable machine learning based framework for disease prediction in pearl millet," Sensors, vol. 21, no. 16, p. 5386, 2021.
[21] Y. Bengio, R. Ducharme, and P. Vincent, "A neural probabilistic language model," Advances in Neural Information Processing Systems, vol. 3, pp. 1–11, 2001.
[22] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[23] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[24] A. Ittoo, L. M. Nguyen, and A. Van Den Bosch, "Text analytics in industry: challenges, desiderata and trends," Computers in Industry, vol. 78, pp. 96–107, 2016.
[25] F. Da Costa Albuquerque, M. A. Casanova, J. A. F. De Macedo, M. T. M. de Carvalho, and C. Renso, "A proactive application to monitor truck fleets," in Proceedings of the 2013 IEEE Fourteenth International Conference on Mobile Data Management, vol. 1, pp. 301–304, IEEE, Milan, Italy, July 2013.
[26] S. Schmidt, S. Schnitzer, and C. Rensing, "Text classification based filters for a domain-specific search engine," Computers in Industry, vol. 78, pp. 70–79, 2016.
[27] A. Hussain, S. Nazir, F. Khan et al., "A resource efficient hybrid proxy mobile IPv6 extension for next generation IoT networks," IEEE Internet of Things Journal, p. 1, 2021.
[28] M. Kumar, P. Mukherjee, K. Verma, S. Verma, and D. B. Rawat, "Improved deep convolutional neural network based malicious node detection and energy-efficient data transmission in wireless sensor networks," IEEE Transactions on Network Science and Engineering, p. 1, 2021.

[29] P. Rani, V. S. Kavita, S. Verma, and G. N. Nguyen, "Mitigation of black hole and gray hole attack using swarm inspired algorithm with artificial neural network," IEEE Access, vol. 8, Article ID 121755, 2020.
[30] B. Billal, A. Fonseca, and F. Sadat, "Named entity recognition and hashtag decomposition to improve the classification of tweets," in Proceedings of the Second Workshop on Noisy User-generated Text, pp. 64–73, Osaka, Japan, December 2016.
[31] R. P. Schumaker and H. Chen, "A quantitative stock prediction system based on financial news," Information Processing & Management, vol. 45, no. 5, pp. 571–583, 2009.
[32] Y. Chen, Z. Li, L. Nie, X. Hu, and X. Wang, "Supervised bayesian network model for microblog topic classification," vol. 1, pp. 561–576, 2012.
[33] R. Xia, C. Zong, X. Hu, and E. Cambria, "Feature ensemble plus sample selection: domain adaptation for sentiment classification," in Proceedings of the IJCAI Int Jt Conf Artif Intell 2015, pp. 4229–4233, AAAI Press, California, CA, USA, January 2015.
[34] N. Majumder, S. Poria, A. Gelbukh, and E. Cambria, "Deep learning-based document modeling for personality detection from text," IEEE Intelligent Systems, vol. 32, no. 2, pp. 74–79, 2017.
[35] E. Cambria, "Affective computing and sentiment analysis," IEEE Intelligent Systems, vol. 31, no. 2, pp. 102–107, 2016.
[36] L. Gaur, G. Singh, A. Solanki et al., "Disposition of Youth in Predicting Sustainable Development Goals Using the Neuro-Fuz," Human-Centric Computing and Information Sciences, vol. 11, pp. 2192–1962, 2021.
[37] R. H. W. Pinheiro, G. D. C. Cavalcanti, and I. R. Tsang, "Combining dissimilarity spaces for text categorization," Information Sciences, vol. 406-407, pp. 87–101, 2017.
[38] T. Sabbah, A. Selamat, M. H. Selamat et al., "Modified frequency-based term weighting schemes for text classification," Applied Soft Computing, vol. 58, pp. 193–206, 2017.
[39] V. Dogra, S. Verma, K. Verma, N. Ghosh, U. Le, and D.-N. au, "A comparative analysis of machine learning models for banking news extraction by multiclass classification with imbalanced datasets of financial news: challenges and solutions," International Journal of Interactive Multimedia and Artificial Intelligence, vol. 7, no. 3, p. 35, 2022.
[40] Y. Zhao, B. Qin, and T. Liu, "Creating a fine-grained corpus for Chinese sentiment analysis," IEEE Intelligent Systems, vol. 30, no. 1, pp. 36–43, 2015.
[41] Y. Zhang, R. Jin, and Z. H. Zhou, "Understanding bag-of-words model: a statistical framework," International Journal of Machine Learning and Cybernetics, vol. 1, no. 1-4, pp. 43–52, 2010.
[42] J. Turian, L. Ratinov, and Y. Bengio, "Word representations: a simple and general method for semi-supervised learning," in Proceedings of the Forty-Eighth Annual Meeting of the Association for Computational Linguistics, pp. 384–394, July 2010.
[43] R. Silipo and K. Melcher, Text Encoding: A Review, KDnuggets, 2019, https://www.kdnuggets.com/2019/11/text-encoding-review.html.
[44] J. Sivic and A. Zisserman, "Video Google: a text retrieval approach to object matching in videos," in Proceedings of the Ninth IEEE International Conference on Computer Vision, vol. 2, pp. 1470–1477, IEEE, New York, NY, USA, October 2003.
[45] L.-P. Jing, H.-K. Huang, and H.-B. Shi, "Improved feature selection approach TFIDF in text mining," in Proceedings of the International Conference on Machine Learning and Cybernetics, vol. 2, pp. 944–946, IEEE, New York, NY, USA, November 2002.
[46] E. Montañés, I. Díaz, J. Ranilla, E. Combarro, and J. Fernandez, "Scoring and selecting terms for text categorization," IEEE Intelligent Systems, vol. 20, no. 3, pp. 40–47, 2005.
[47] Z. Li, S. Verma, and M. Jin, "Power allocation in massive MIMO-HWSN based on the water-filling algorithm," Wireless Communications and Mobile Computing, vol. 2021, Article ID 8719066, 11 pages, 2021.
[48] Y. Bengio, R. Ducharme, V. Pascal, and J. Christen, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003.
[49] Y. Xiong, S. Chen, H. Qin et al., "Distributed representation and one-hot representation fusion with gated network for clinical semantic textual similarity," BMC Medical Informatics and Decision Making, vol. 20, no. S1, pp. 72–77, 2020.
[50] K. Kowsari, K. Jafari Meimandi, M. Heidarysafa, S. Mendu, L. Barnes, and D. Brown, "Text classification algorithms: a survey," Information, vol. 10, no. 4, pp. 150–168, 2019.
[51] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," in Proceedings of the First Int Conf Learn Represent ICLR 2013 - Work Track Proc, pp. 1–12, Scottsdale, Arizona, May 2013.
[52] B. Liu, "Text sentiment analysis based on CBOW model and deep learning in big data environment," Journal of Ambient Intelligence and Humanized Computing, vol. 11, no. 2, pp. 451–458, 2020.
[53] H. Schwenk, "Continuous space language models," Computer Speech & Language, vol. 21, no. 3, pp. 492–518, 2007.
[54] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds., pp. 3111–3119, Curran Associates Inc, Red Hook, NY, USA, 2013.
[55] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
[56] J. Pennington, R. Socher, and C. D. Manning, "GloVe: global vectors for word representation," in Proceedings of the 2014 Conf Empir Methods Nat Lang Process (EMNLP), pp. 1532–1543, Doha, Qatar, October 2014.
[57] C.-P. Wei and Y.-H. Lee, "Event detection from online news documents for supporting environmental scanning," Decision Support Systems, vol. 36, no. 4, pp. 385–401, 2004.
[58] Q. Zhu, X. Li, A. Conesa, and C. Pereira, "GRAM-CNN: a deep learning approach with local context for named entity recognition in biomedical text," Bioinformatics, vol. 34, no. 9, pp. 1547–1554, 2018.
[59] V. Prokhorov, M. T. Pilehvar, P. Lio, and N. Collier, "Unseen word representation by aligning heterogeneous lexical semantic spaces," in Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, Honolulu, HI, USA, February 2019.
[60] A. Makazhanov and D. Rafiei, "Predicting political preference of Twitter users," in Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 298–305, Association for Computing Machinery, New York, NY, USA, August 2013.

[61] C. Bosco, V. Patti, and A. Bolioli, "Developing corpora for sentiment analysis: the case of irony and senti-TUT," IEEE Intelligent Systems, vol. 28, no. 2, pp. 55–63, 2013.
[62] M. E. Peters, M. Neumann, M. Iyyer et al., "Deep contextualized word representations," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 2227–2237, New Orleans, LA, USA, June 2018.
[63] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad, M. Chenaghlu, and J. Gao, "Deep learning-based text classification," ACM Computing Surveys, vol. 54, no. 3, pp. 1–40, 2022.
[64] X. Zhu, P. Sobihani, and H. Guo, "Long short-term memory over recursive structures," in Proceedings of the 32nd International Conference on International Conference on Machine Learning, pp. 1604–1612, Lille, France, July 2015.
[65] P. Zhou, Z. Qi, S. Zheng, J. Xu, and H. Bao, "Text classification improved by integrating bidirectional lstm with two-dimensional max pooling," 2016, https://arxiv.org/abs/1611.06639.
[66] W. Wei, J. Li, L. Cao, Y. Ou, and J. Chen, "Effective detection of sophisticated online banking fraud on extremely imbalanced data," World Wide Web, vol. 16, no. 4, pp. 449–475, 2013.
[67] S. I. Alfuraih, N. T. Sui, and D. McLeod, "Using trusted email to prevent credit card frauds in multimedia products," World Wide Web, vol. 5, pp. 245–256, 2004.
[68] D. Yan and S. Guo, "Leveraging contextual sentences for text classification by using a neural attention model," Computational Intelligence and Neuroscience, vol. 2019, Article ID 8320316, 11 pages, 2019.
[69] Z. Dai, Z. Yang, Y. Yang, C. Jaime, V. L. Quoc, and S. Ruslan, "Transformer-XL: attentive language models beyond a fixed-length context," in Proceedings of the ACL 2019 - Fifty-Seventh Annu Meet Assoc Comput Linguist Proc Conf, pp. 2978–2988, arXiv:1901.02860, Florence, Italy, July 2020.
[70] Y. Hu, J. Ding, Z. Dou, and H. Chang, "Short-text classification detector: a bert-based mental approach," Computational Intelligence and Neuroscience, vol. 2022, Article ID 8660828, 11 pages, 2022.
[71] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: pre-training of deep bidirectional transformers for language understanding," 2018, https://arxiv.org/abs/1810.04805.
[72] J. Howard and S. Ruder, "Universal language model fine-tuning for text classification," 2018, https://arxiv.org/abs/1801.06146.
[73] Y. Sun, S. Wang, Y. Li et al., "Enhanced representation through knowledge integration," 2019, https://arxiv.org/abs/1904.09223.
[74] V. Dogra et al., "Analyzing DistilBERT for sentiment classification of banking financial news," in Intelligent Computing and Innovation on Data Science, S.-L. Peng, S.-Y. Hsieh, S. Gopalakrishnan, and Balaganesh, Eds., p. 582, Springer Singapore, Singapore, 2021.
[75] V. Dogra, S. Verma, and A. Singh, "Banking news-events representation and classification with a novel hybrid model using DistilBERT and rule-based features," Computer Science, vol. 12, pp. 3039–3054, 2021.
[76] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter," pp. 2–6, 2019, https://arxiv.org/abs/1910.01108.
[77] Z. Ye, Q. Guo, Q. Gan, X. Qiu, and Z. Zhang, "BP-transformer: modelling long-range context via binary partitioning," 2019, https://arxiv.org/abs/1911.04070.
[78] I. Yamada and H. Shindo, "Neural attentive bag-of-entities model for text classification," arXiv:1909.01259, 2019.
[79] T. B. Brown, B. Mann, N. Ryder et al., "Language models are few-shot learners," in Proceedings of the Adv Neural Inf Process Syst 2020, December 2020.
[80] B. Magnini, C. Strapparava, G. Pezzulo, and A. Gliozzo, "The role of domain information in Word Sense Disambiguation," Natural Language Engineering, vol. 8, no. 4, pp. 359–373, 2002.
[81] M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg, "Adaptive name matching in information integration," IEEE Intelligent Systems, vol. 18, no. 5, pp. 16–23, 2003.
[82] S. J. Huang, W. Gao, and Z. H. Zhou, "Fast multi-instance multi-label learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 11, pp. 2614–2627, 2019.
[83] M.-L. Zhang and Z.-H. Zhou, "ML-KNN: a lazy learning approach to multi-label learning," Pattern Recognition, vol. 40, no. 7, pp. 2038–2048, 2007.
[84] J. Yan, B. Zhang, N. Liu et al., "Effective and efficient dimensionality reduction for large-scale and streaming data preprocessing," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 3, pp. 320–333, 2006.
[85] X. Xu, T. Liang, J. Zhu, D. Sun, and T. au, "Review of classical dimensionality reduction and sample selection methods for large-scale data processing," Neurocomputing, vol. 328, pp. 5–15, 2019.
[86] N. Armanfard, J. P. Reilly, and M. Komeili, "Local feature selection for data classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 6, pp. 1217–1227, 2016.
[87] S. Pölsterl, S. Conjeti, N. Navab, and A. Katouzian, "Survival analysis for high-dimensional, heterogeneous medical data: exploring feature extraction as an alternative to feature selection," Artificial Intelligence in Medicine, vol. 72, pp. 1–11, 2016.
[88] D. Mladenić and M. Grobelnik, "Feature selection on hierarchy of web documents," Decision Support Systems, vol. 35, no. 1, pp. 45–87, 2003.
[89] A.-C. Haury, P. Gestraud, and J.-P. Vert, "The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures," PLoS One, vol. 6, no. 12, Article ID e28210, 2011.
[90] J. Yang, Y. Liu, X. Zhu, Z. Zhang, Z. Liu, and X. Zhang, "A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization," Information Processing & Management, vol. 48, no. 4, pp. 741–754, 2012.
[91] I. Inza, P. Larrañaga, R. Blanco, and A. J. Cerrolaza, "Filter versus wrapper gene selection approaches in DNA microarray domains," Artificial Intelligence in Medicine, vol. 31, no. 2, pp. 91–103, 2004.
[92] A. K. Uysal and S. Gunal, "A novel probabilistic feature selection method for text classification," Knowledge-Based Systems, vol. 36, pp. 226–235, 2012.
[93] D. Roobaert, G. Karakoulas, and N. V. Chawla, "Information gain, correlation and support vector machines," Feature Extraction, vol. 470, pp. 463–470, 2008.
[94] S. Lei, "A feature selection method based on information gain and genetic algorithm," in Proceedings of the 2012 International Conference on Computer Science and Electronics Engineering, vol. 2, pp. 355–358, Hangzhou, China, March 2012.

[95] X. Jin, A. Xu, R. Bie, and P. Guo, "Machine learning techniques and chi-square feature selection for cancer classification using SAGE gene expression profiles," Lecture Notes in Computer Science, Springer, Berlin, Germany, pp. 106–115, 2006.
[96] Y.-T. Chen and M. C. Chen, "Using chi-square statistics to measure similarities for text categorization," Expert Systems with Applications, vol. 38, no. 4, pp. 3085–3090, 2011.
[97] Q. Gu, Z. Li, and J. Han, "Generalized Fisher score for feature selection," AUAI Press, Arlington, VA, USA, 2010.
[98] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, "Feature selection for SVMs," Advances in Neural Information Processing Systems, vol. 13, pp. 668–674, 2000.
[99] E. Youn, L. Koenig, M. K. Jeong, and S. H. Baek, "Support vector-based feature selection using Fisher's linear discriminant and Support Vector Machine," Expert Systems with Applications, vol. 37, no. 9, pp. 6148–6156, 2010.
[100] Y. Wang and X.-J. Wang, "A new approach to feature selection in text classification," in Proceedings of the 2005 International Conference on Machine Learning and Cybernetics, vol. 6, pp. 3814–3819, Guangzhou, China, August 2005.
[101] A. Genkin, D. D. Lewis, and D. Madigan, "Large-scale bayesian logistic regression for text categorization," Technometrics, vol. 49, no. 3, pp. 291–304, 2007.
[102] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
[103] M. Labani, P. Moradi, F. Ahmadizar, and M. Jalili, "A novel multivariate filter method for feature selection in text classification problems," Engineering Applications of Artificial Intelligence, vol. 70, pp. 25–37, 2018.
[104] K. A. Norman, S. M. Polyn, G. J. Detre, and J. V. Haxby, "Beyond mind-reading: multi-voxel pattern analysis of fMRI data," Trends in Cognitive Sciences, vol. 10, no. 9, pp. 424–430, 2006.
[105] L. Wang, Y. Lei, Y. Zeng, L. Tong, and B. Yan, "Principal feature analysis: a novel voxel selection method for fMRI data," Computational and Mathematical Methods in Medicine, vol. 2013, Article ID 645921, 7 pages, 2013.
[106] P. M. Granitto, C. Furlanello, F. Biasioli, and F. Gasperi, "Recursive feature elimination with random forest for PTR-MS analysis of agroindustrial products," Chemometrics and Intelligent Laboratory Systems, vol. 83, no. 2, pp. 83–90, 2006.
[107] K. Z. Mao, "Fast orthogonal forward selection algorithm for feature subset selection," IEEE Transactions on Neural Networks, vol. 13, no. 5, pp. 1218–1224, 2002.
[108] H. Chen, W. Jiang, C. Li, and R. Li, "A heuristic feature selection approach for text categorization by using chaos optimization and genetic algorithm," Mathematical Problems in Engineering, vol. 2013, Article ID 524017, 6 pages, 2013.
[109] R. Leardi, R. Boggia, and M. Terrile, "Genetic algorithms as a strategy for feature selection," Journal of Chemometrics, vol. 6, no. 5, pp. 267–281, 1992.
[110] I. Guyon, H.-M. Bitter, Z. Ahmed, M. Brown, and J. Heller, "Multivariate non-linear feature selection with kernel multiplicative updates and gram-schmidt relief," in Proceedings of the BISC Flint-CIBI 2003 Workshop, pp. 1–11, Berkeley, CA, 2003.
[111] R. J. Urbanowicz, M. Meeker, W. L. Cava, R. S. Olson, and J. H. Moore, "Relief-based feature selection: introduction and review," Journal of Biomedical Informatics, vol. 85, pp. 189–203, 2018.
[112] R. J. Urbanowicz, R. S. Olson, P. Schmitt, M. Meeker, and J. H. Moore, "Benchmarking relief-based feature selection methods for bioinformatics data mining," Journal of Biomedical Informatics, vol. 85, pp. 168–188, 2018.
[113] N. Mimouni and T. Y. Yeung, "Comparing performance of text pre-processing methods for predicting a binary position by LASSO: experiment with textual data of European Union public consultation," in Proceedings of the Workshop on Women in Data Science 2018, pp. 18–21, 2018.
[114] B. J. Marafino, W. J. Boscardin, and R. A. Dudley, "Efficient and sparse feature selection for biomedical text classification via the elastic net: application to ICU risk stratification from nursing notes," Journal of Biomedical Informatics, vol. 54, pp. 114–120, 2015.
[115] P. Taylor, A. E. Hoerl, and R. W. Kennard, Technometrics Ridge Regression: Biased Estimation for Nonorthogonal Problems, pp. 37–41, 2012.
[116] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society: Series B, vol. 67, no. 2, pp. 301–320, 2005.
[117] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1–47, 2002.
[118] J. C. Gomez and M. F. Moens, "PCA document reconstruction for email classification," Computational Statistics & Data Analysis, vol. 56, no. 3, pp. 741–751, 2012.
[119] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," vol. 3, pp. 993–1022, 2003.
[120] C. Ordun, S. Purushotham, and E. Raff, "Exploratory analysis of Covid-19 tweets using topic modeling, UMAP, and DiGraphs," 2020, https://arxiv.org/abs/2005.03082.
[121] L. McInnes, J. Healy, and J. Melville, "UMAP: uniform manifold approximation and projection for dimension reduction," 2020, https://arxiv.org/abs/1802.03426.
[122] H. U. Rehman, M. Shafiq, S. Baig, and U. Manzoor, "Analyzing the epidemiological outbreak of COVID-19: a visual exploratory data analysis approach," Journal of Medical Virology, vol. 92, no. 6, 2021.
[123] H. Cheng and R. Yu, Text Classification Model Enhanced by Unlabeled Data for LaTeX Formula, 2021.
[124] J. Chen, H. Huang, S. Tian, and Y. Qu, "Feature selection for text classification with Naïve Bayes," Expert Systems with Applications, vol. 36, no. 3, pp. 5432–5435, 2009.
[125] B. Trstenjak, S. Mikac, and D. Donko, "KNN with TF-IDF based framework for text categorization," Procedia Engineering, vol. 69, pp. 1356–1364, 2014.
[126] C. Lee and G. G. Lee, "Information gain and divergence-based feature selection for machine learning-based text categorization," Information Processing & Management, vol. 42, no. 1, pp. 155–165, 2006.
[127] L. Khreisat, "A machine learning approach for Arabic text classification using N-gram frequency statistics," Journal of Informetrics, vol. 3, no. 1, pp. 72–77, 2009.
[128] J. Liu, T. Jin, K. Pan, Y. Yang, Y. Wu, and X. Wang, "An improved KNN text classification algorithm based on Simhash," in Proceedings of the 2017 IEEE 16th Int Conf Cogn Informatics Cogn Comput ICCI*CC, pp. 92–95, Oxford, UK, July 2017.

[129] C. Apté and S. Weiss, "Data mining with decision trees and decision rules," Future Generation Computer Systems, vol. 13, 1997.
[130] V. N. Phu, V. T. N. Tran, V. T. N. Chau, N. D. Duy, and K. L. D. au, "A decision tree using ID3 algorithm for English semantic analysis," International Journal of Speech Technology, vol. 20, no. 3, pp. 593–613, 2017.
[131] C. N. Mahender, "Text classification and classifiers," vol. 3, pp. 85–99, 2012.
[132] Y. Yang, "An evaluation of statistical approaches to text categorization," Information Retrieval, vol. 1, pp. 69–90, 1997.
[133] B. P. Roe, H. J. Yang, J. Zhu, Y. Liu, I. Stancu, and G. McGregor, "Boosted decision trees as an alternative to artificial neural networks for particle identification," Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, vol. 543, no. 2-3, pp. 577–584, 2005.
[134] J. Ye, J. H. Chow, J. Chen, and Z. Zheng, "Stochastic gradient boosted distributed decision trees," in Proceedings of the Eighteenth ACM Conference on Information and Knowledge Management - CIKM '09, pp. 2061–2064, Hong Kong, China, November 2009.
[135] E. Wiener, J. Pedersen, and A. Weigend, "A neural network approach to topic spotting," in Proceedings of the Annu Symp Doc Anal Inf Retr, pp. 317–332, 1995.
[136] R. Johnson and T. Zhang, "Effective use of word order for text categorization with convolutional neural networks," pp. 103–112, 2014, https://arxiv.org/abs/1412.1058.
[137] Y. Zhang, Z. Zhang, D. Miao, and J. Wang, "Three-way enhanced convolutional neural networks for sentence-level sentiment classification," Information Sciences, vol. 477, pp. 55–64, 2019.
[138] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, "A convolutional neural network for modelling sentences," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 655–665, Baltimore, Maryland, June 2014.
[139] J. L. Elman, "Finding structure in time," Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.
[140] V. Makarenkov, I. Guy, N. Hazon, T. Meisels, B. Shapira, and L. Rokach, "Implicit dimension identification in user-generated text with LSTM networks," Information Processing & Management, vol. 56, no. 5, pp. 1880–1893, 2019.
[141] L. Sarmento, P. Carvalho, M. J. Silva, and E. de Oliveira, "Automatic creation of a reference corpus for political opinion mining in user-generated content," Attitude, vol. 29, 2009.
[142] N. Ruchansky, S. Seo, and Y. Liu, "CSI: a hybrid deep model for fake news detection," in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, November 2017.
[143] S. Dong and C. Liu, "Sentiment classification for financial texts based on deep learning," Computational Intelligence and Neuroscience, vol. 2021, Article ID 9524705, 9 pages, 2021.
[144] Y. Zhu, W. Zheng, and H. Tang, "Interactive dual attention network for text sentiment classification," Computational Intelligence and Neuroscience, vol. 2020, Article ID 8858717, 11 pages, 2020.
[145] J. Chung, "Gated recurrent neural networks on sequence modeling," pp. 1–9, 2014, https://arxiv.org/abs/1412.3555.
[146] D. Tang, B. Qin, and T. Liu, "Document modeling with gated recurrent neural network for sentiment classification," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1422–1432, Lisbon, Portugal, September 2015.
[147] M. Zulqarnain, R. Ghazali, M. G. Ghouse, and M. F. Mushtaq, "Efficient processing of GRU based on word embedding for text classification," JOIV: International Journal on Informatics Visualization, vol. 3, no. 4, 2019.
[148] O. Kuchaiev and B. Ginsburg, "Factorization tricks for LSTM networks," in Proceedings of the Fifth Int Conf Learn Represent ICLR 2017 - Work Track Proc, pp. 1–6, Toulon, France, April 2019.
[149] N. Shazeer, A. Mirhoseini, K. Maziarz et al., "Outrageously large neural networks: the sparsely-gated mixture-of-experts layer," 2017, https://arxiv.org/abs/1701.06538.
[150] B. C. W. Jain, "Attention is not explanation," 2019, https://arxiv.org/abs/1902.10186.
[151] S. Serrano and N. A. Smith, "Is attention interpretable?" in Proceedings of the ACL 2019 - 57th Annu Meet Assoc Comput Linguist Proc Conf, pp. 2931–2951, Florence, Italy, July 2020.
[152] S. Vashishth, S. Upadhyay, G. S. Tomar, and M. Faruqui, "Attention interpretability across NLP tasks," pp. 1–10, 2019, https://arxiv.org/abs/1909.11218.
[153] T. Munkhdalai and H. Yu, "Neural semantic encoders," in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, April 2017.
[154] T. Commissariat, "Ask me anything," Physics World, vol. 33, no. 3, pp. 53–58, 2020.
[155] A. Vaswani, N. Shazeer, N. Parmar et al., "Attention is all you need," in Proceedings of the Adv Neural Inf Process Syst 2017, pp. 5999–6009, Long Beach, CA, USA, December 2017.
[156] C. Sun, X. Qiu, Y. Xu, and X. Huang, "How to fine-tune BERT for text classification?" Lect Notes Comput Sci (Including Subser Lect Notes Artif Intell Lect Notes Bioinformatics), Springer, Berlin, Germany, pp. 194–206, 2019.
[157] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: a lite BERT for self-supervised learning of language representations," pp. 1–17, 2019, https://arxiv.org/abs/1909.11942.
[158] M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy, "SpanBERT: improving pre-training by representing and predicting spans," 2019, https://arxiv.org/abs/1907.10529.
[159] C. Holzhey, F. Larsen, and F. Wilczek, "Geometric and renormalized entropy in conformal field theory," Nuclear Physics B, vol. 424, no. 3, pp. 443–467, 1994.
[160] K. Nigam, J. Lafferty, and A. McCallum, "Using maximum entropy for text classification," Computer Science, vol. 80, 1999.
[161] J. Atkinson-Abutridy, C. Mellish, and S. Aitken, "Combining information extraction with genetic algorithms for text mining," IEEE Intelligent Systems, vol. 19, no. 3, pp. 22–30, 2004.
[162] A. Bhardwaj, Y. Narayan, Vanraj, Pawan, and M. Dutta, "Sentiment analysis for Indian stock market prediction using sensex and nifty," Procedia Computer Science, vol. 70, pp. 85–91, 2015.
[163] D. Kim, D. Seo, S. Cho, and P. Kang, "Multi-co-training for document classification using various document representations: TF-IDF, LDA, and Doc2Vec," Information Sciences, vol. 477, pp. 15–29, 2019.
[164] M. H. Steinberg, "Clinical trials in sickle cell disease: adopting the combination chemotherapy paradigm," American Journal of Hematology, vol. 83, no. 1, pp. 1–3, 2008.
[165] Z. Xu and J. Sun, "Model-driven deep-learning," National Science Review, vol. 5, no. 1, pp. 22–24, 2018.

[166] Y. Chen, B. Xiao, Z. Lin, C. Dai, Z. Li, and L. Yan, "Multi-label text classification with deep neural networks," in Proceedings of the 2018 Sixth IEEE Int Conf Netw Infrastruct Digit Content, pp. 409–413, IEEE, Guiyang, China, November 2018.
[167] R. B. Pereira, A. Plastino, B. Zadrozny, and L. H. C. Merschmann, "Categorizing feature selection methods for multi-label classification," Artificial Intelligence Review, vol. 49, no. 1, pp. 57–78, 2018.
[168] R. A. Stein, P. A. Jaques, and J. F. Valiati, "An analysis of hierarchical text classification using word embeddings," Information Sciences, vol. 471, pp. 216–232, 2019.
[169] R. A. Sinoara, C. V. Sundermann, R. M. Marcacini, M. A. Domingues, and S. R. Rezende, "Named entities as privileged information for hierarchical text clustering," in Proceedings of the Eighteenth International Database Engineering & Applications Symposium, Association for Computing Machinery, pp. 57–66, New York, NY, USA, July 2014.
