
Augmented Human Research (2020) 5:12
https://doi.org/10.1007/s41133-020-00032-0

ORIGINAL PAPER

A Comparative Analysis of Logistic Regression, Random Forest and KNN Models for the Text Classification

Kanish Shah1 · Henil Patel1 · Devanshi Sanghvi1 · Manan Shah2

Received: 27 August 2019 / Revised: 6 February 2020 / Accepted: 13 February 2020

© Springer Nature Singapore Pte Ltd. 2020

Abstract
In the current generation, a huge number of textual documents is generated, and there is an urgent need to organize them in a proper structure so that classification can be performed and categories can be properly defined. The key technology for gaining insight into text information and organizing it is known as text classification; documents are assigned to classes by determining the text type of their content. Based on the different machine learning algorithms used in the current paper, the text classification system is divided into four sections, namely text pre-treatment, text representation, implementation of the classifier and classification. In this paper, a BBC news text classification system is designed. In the classifier implementation section, the authors separately chose and compared logistic regression, random forest and K-nearest neighbour as the classification algorithms. These classifiers were then tested, analysed and compared with each other, and a conclusion was drawn. The experimental results show that the BBC news text classification model achieves satisfying results with the algorithms tested on the data set. The authors present the comparison in terms of five parameters, namely precision, accuracy, F1-score, support and confusion matrix. The classifier that scores highest on these parameters is termed the best machine learning algorithm for the BBC news data set.

Keywords Text classification · Machine learning algorithm · Logistic regression · Random forest · K-nearest neighbour · Natural language processing · Feature extraction

Corresponding author: Manan Shah, [email protected]

1 Department of Computer Engineering, Indus University, Ahmedabad, Gujarat, India
2 Department of Chemical Engineering, School of Technology, Pandit Deendayal Petroleum University, Gandhinagar, Gujarat, India

Introduction

In this digital world, technology has shown its dominance as humans have pushed their limits of thinking by binding an artificial brain with the human one [19, 35]. Through this process, a new field has come into existence, named Artificial Intelligence. AI is termed the development of systems which can perform the work that humans generally do, such as distinguishing between two different objects, recognizing different voices and many more. It comes under the Computer Science branch and has gained so much importance and popularity worldwide that it is applicable to any working domain in this world. AI is something that works on past learning and helps in predicting future values by giving the accuracy of the algorithm applied on a data set [1, 36–38]. There are many domains of AI, like machine learning, deep learning, ANN (artificial neural network) and CNN (convolutional neural network), which help in developing more advanced technology [23, 47].

Machine learning has gained much importance and value in recent years. There has been a recent resurgence in this field, as implementers are now able to provide more transparency in the algorithms. ML, being one of the domains of AI, has made many things possible which were once thought of as tedious tasks. That is why this domain is so potent and influential. ML is completely based on mathematics and statistics. It is an approach to building intelligent systems. Similar to AI, it helps in the prediction of the future by analysing past data and experience. There are many applications of ML, like object differentiation and classification, speech recognition, text classification, weather prediction, detection of destroyed crops, face recognition, medical diagnostics, etc. [14, 18, 25].

ML works purely on data; thus, Big Data has played a vital role in its evolution [37, 38, 43].

Another important concept which has played a vital role in the classification of text is natural language processing (NLP). In today's data-generating generation, data are produced in zettabytes, and approximately 80% of the data generated are in unstructured form. So, it is necessary to convert the data into structured form in order to derive their meaning. Here, NLP and text mining come into the light. Moreover, there are various human languages in this world, and each person writes text in their own language, like Persian, Turkish, Chinese, English and many more. But it is the task of the computer to perform analysis and derive the meaning of those texts. NLP helps computers to bring out the useful meaning in a smarter way, and it has gained importance in recent years. NLP is somewhat dependent on machine learning algorithms for learning the rules [44]. NLP is used in text classification, extraction and tracing of information, tagging of speech, opinion mining and a lot more [13].

Text classification is similar to mapping used in mathematics and has the mathematical form

f : X \to Y

where X is the set of texts to be classified and Y is the set of categories [32].

Text classification in recent years has seen continuous success, and the application of this technology in tasks like emotion classification from text, classification of spam and many others has become immense and boundless [11, 26, 31]. The current paper talks about text classification, which is defined as the process of labelling texts with relevant categories from a data set that is already predefined. Classifying the content into different categories helps users to search for something very easily. After grouping, the text is analysed by different models which have the task of applying tags to the content. These models are nothing but the machine learning algorithms and are also known as classifiers. These classifiers need to be trained to make predictions on the textual data set; they are trained by assigning the tags and then making associations on the pieces of text. The authors have used three classifiers in this paper for text classification, namely logistic regression, random forest and K-nearest neighbour. Logistic regression is used to measure the statistical significance of each independent variable with respect to probability. Random forest works on decision trees which are used to classify a new object from an input vector. The K-nearest neighbour classifier helps in maintaining groups by keeping similar things together. These algorithms are then compared to find out which, among the three, is the best classifier. The contribution of this manuscript is that a higher accuracy and precision were obtained for the algorithms, but there was still room for improvement. The algorithms were successful in classifying the textual data and gave much better output than expected. Therefore, these algorithms can be used with better optimization than the current one.

Related Works

Related Study on Logistic Regression

Prabhat and Khullar [39] have presented that the enormous amount of data stored and flowing online cannot be mined effectively to extract valuable information, and decisions cannot be taken based on the extracted information. Sentiment analysis is a method in which we judge people's ideas, opinions, feelings, attitudes, thoughts and beliefs about a specific concept. The authors presented sentiment classification on Big Data using a Naive Bayes classifier and logistic regression, employing supervised and unsupervised learning algorithms. The performance of the algorithms has been evaluated on the basis of different parameters like accuracy, precision and throughput. The analysis with logistic regression gives 10.1% more accuracy and 4.34% more precise results, with almost one-fifth the implementation time, for the same size of data set compared to the Naive Bayes classifier.

Yen et al. [52] have implemented a logistic regression model for Chinese text categorization. Instead of tokenizing the words, which is a common method in text categorization, the authors have presented a new method which uses an N-gram-based language model. This model takes word relations into account for Chinese text categorization without a Chinese word tokenizer. To guard against out-of-vocabulary words, they also propose a novel smoothing approach based on logistic regression to improve accuracy, using logistic regression to smooth the probability of the n-grams. They proposed a novel feature selection method which is suitable for the N-gram-based model and proved that it could improve the F-measure in most cases.

Liu et al. [30] put upfront that in recent years multi-label classification has gained much popularity, but there is still a need to tackle the ubiquitous data. So, the authors have presented a novel framework for multi-label learning which can achieve the purposes of classification learning and selection of variables simultaneously. They have used logistic regression to train the models for multi-label data classification, and they solved the convex optimization problem of logistic regression with the elastic net penalty by a quadratic approximation technique for better performance. The results were improved performance and better accuracy, and the model is also competitive among the other six models used.

Aseervatham et al. [4] have talked about how logistic regression is used for tackling text categorization problems. The authors have stated that ridge logistic regression has performance comparable to that of the support vector machine (SVM) algorithm. However, the advantage of using logistic regression is that it computes a probability value rather than a score. They have presented a new selection method which approaches the ridge solution by a sparse solution: it first computes the ridge solution and then performs the feature selection. The final result was that this method gave a solution that is a good tradeoff between the ridge and LASSO solutions.

Related Study on Random Forest

Elghazel et al. [12] have talked about ensemble multi-label text categorization using rotation forest and latent semantic indexing. The authors have proposed a method based on four ideas: (1) performing latent semantic indexing on a lower-dimensional space of concepts, (2) splitting the vocabulary randomly, (3) bootstrapping of documents and (4) using BoosTexter as a powerful multi-label base learner for text categorization to improve accuracy. Accuracy is promoted through the underlying semantic structure in the text. The combination of latent semantic indexing and rotation forest brings about improvements in average precision, ranking loss and error compared to five other state-of-the-art approaches across textual data.

Al Amrani et al. [2] put upfront that sentiment analysis has become more popular in the research area. This method allocates positive and negative polarity to items with the help of natural language tools and also helps in predicting high or low performance of the classifiers being implemented. Here, random forest and support vector machines (SVM) are implemented. The authors have focused on sentiment analysis of product reviews using original text-search techniques. The reviews are then classified into positive and negative categories based on their relation to a query. The result outperformed the classifiers that were used on the Amazon data set.

Salles et al. [42] have addressed the problem of high-dimension noisy data by using the random forest algorithm. The authors have presented a lazy version of the traditional random forest classifier, named Lazy NN_RF, specially designed for high-dimension classification tasks. The training projection is comprised of examples that resemble the examples to be classified, obtained through nearest neighbourhood training projection. The main contributions include: (1) the implementation of the random forest classifier and the nearest neighbourhood projection of the training set and (2) a detailed experimental analysis. This showed that their approach is very effective and feasible and can be considered for text classification.

Nadi and Moradi [34] put upfront that random forest is one of the most powerful ensemble methods, with high performance when it comes to high-dimension data. The authors have proposed a novel approach to increase the performance of random forest by increasing the number of trees and decreasing the number of levels of each tree in the forest. In this approach, the trees are bounded to a certain depth to allow an increase in views. Each bounded tree is considered a local view of the problem, and the more local views there are, the better the classification. The results showed that by bounding the trees, the accuracy can be improved for high-dimension problems.

Wu et al. [50] have used random forest for imbalanced text categorization. The authors have presented a new random forest ensemble method named ForesTexter. This algorithm uses a simple random sampling of features in building its decision trees. The main idea is to stratify the features into two different groups and generate term weighting for the features: one group contains positive features for the minority class, whereas the other contains negative features for the majority class. Therefore, the tree model becomes more robust for the text categorization task with an imbalanced data set. The results showed that the proposed method is competitive against standard random forest classifiers and SVM algorithms.

Related Study on KNN Algorithm

Moldagulova and Sulaiman [33] put upfront that recent years have given rise to the need for generation and structuring of textual documents. It has become one of the vital issues that demand attention in order to resolve the problems of computational resources and bring forth the optimal result. The authors have used a method for handling unstructured text data, especially document classification problems, performing document classification using the KNN algorithm with term vector space reduction. The purpose is to justify the choice of the term vector space model and enhance the KNN text classifier algorithm for the implementation of document classification tasks. It was concluded that while using the K-NN classifier, even when shortening the size of the feature space by a factor of 10, there were minimal changes in accuracy, whereas the time cost decreased rapidly, almost exponentially.

Hmeidi et al. [17] have implemented classifiers for Arabic article text categorization. These authors have done a comparative study by comparing two different classifiers/machine learning models for the categorization of Arabic text.

The data set was divided into a training set containing news articles, and likewise for the test set. Considering the full word features, they used the tf-idf Vectorizer as the weighting method for the selection of features. CHI statistics was used as a ranking metric. The result of the experiment showed that both methods used had an excellent performance on the data set, while the SVM performed well on average F1-score and prediction time.

Tan [46] has implemented the K-nearest neighbour classifier with an effective refinement strategy. KNN is widely implemented due to its simplicity and good efficiency; however, it suffers from a problem of model misfitting due to its assumptions. For this reason, the author proposed a new refinement strategy, named DragPushing, for the K-nearest neighbour classifier to solve this problem. The outcome was that the experiments showed DragPushing achieved an improvement in the performance of the K-nearest neighbour classifier.

Trstenjak et al. [48] have carried out somewhat similar work to that of Ismail Hmeidi. The authors also used the K-nearest neighbour classifier for text classification. They focused on the possibility of using the TF-IDF Vectorizer method and a framework for the classification of text with the KNN classifier. This framework enabled classification according to various parameters and analysis of the results. Speed and quality of classification were considered the judgment parameters. The outcome was that the experiment identified the good and the bad features of the algorithm.

Jiang et al. [21] also improved the K-nearest neighbour for the categorization of text. The authors proposed a new method combining a constrained one-pass clustering algorithm with the KNN classifier. The results of the experiments showed that this method reduces the text similarity computation substantially and outperforms many other models like Naïve Bayes, support vector machines (SVM) and the state-of-the-art K-nearest neighbour classifier.

Other Related Algorithms for Text Classification

Wahiba and Ahmed [49] have evaluated text documents using fuzzy decision trees. This paper talks about a heuristic approach that had not been tested till then, and the proposed method gets it tested. The learning space is developed, and the weight of each attribute is calculated using the TF-IDF Vectorizer method. Then, a fuzzy rule is applied, and the proposed method comprises a preprocessing phase, membership degree calculation, fuzzy decision tree induction and classification. The result was that the proposed model gave better classification results, but the maximum value of the F1-score did not exceed 48%, which is one of the drawbacks of fuzzy systems; nevertheless, it can be effective in classification when compared with the fuzzy ID3 algorithm.

Aydogan and Karci [5] have measured the improvement in accuracy for classification of texts in the Turkish language. They used deep neural networks as the algorithm for getting the result. The Word2Vec method was used for training word vectors on two different Turkish data sets. The methods employed were deep neural networks, convolutional neural networks, long short-term memory, gated recurrent units and bidirectional LSTM. The proposed model contains a variety of stages like word embedding, the continuous bag-of-words model and the skip-gram model. The outcome was that the values of accuracy, precision, recall and F1-score differed across models, but the highest was for the gated recurrent unit. However, none of the values exceeded 0.85 for any model.

Zhu et al. [54] had discussed a problem of class discrimination for improving the performance of text classification. The Naïve Bayes algorithm has been applied for the experiment. Two models, namely the multi-variate Bernoulli model and the multinomial model, are used. For the extraction of features, mutual information and document frequency thresholding are employed. Then comes the Gaussian divergence, which helps in measuring the dissimilarity between two classes; the main aim is to get a better and more accurate ranking order. After this step, the selection of features takes place. The result showed that discrimination-based feature selection can be effective in enhancing the discriminative power in text classification.

Raychaudhuri et al. [41] had done a comparative study on three algorithms, namely neural networks, decision trees and support vector machines, for the purpose of text classification. The data set consisted of 435 instances: 335 samples of training data, 50 samples of test data and 50 samples of validation data. In terms of efficiency between SVM and neural networks, it turned out that SVM performed better than neural networks when the value of C was 1, whereas the neural network was better when the value of C was 1000. A fully grown decision tree performed better than a smaller decision tree. Therefore, the performance of classification techniques depends on many factors.

Garla et al. [15] used Laplacian and linear SVMs to work on clinical text classification. The main target was to evaluate the actual effect of unlabelled training data on the Laplacian SVM. The data set consisted of 820 abdominal CT, MRI and ultrasound reports, and the samples used by the Laplacian SVM were around 19,845. The main appeal of semi-supervised algorithms is that they can sometimes perform better than supervised algorithms. On the training data, feature extraction takes place and the graph for the Laplacian is computed. Then, kernel parameters are used, the SVM optimization technique is applied, and regularization is used.

The final result was that the macro F1-scores were around 0.77 and the sensitivity was around 0.91.

Jiang et al. [20] have presented a technique that improves the performance of Naïve Bayes text classification. The problem of skewed distribution of training data, which was causing poor results, was solved with this technique. Short text data consist mostly of the important words required for classification, while long data contain many unnecessary words which decrease the probability. Therefore, this method transforms the estimate of the feature probability so that it can be calculated in both a global and a local view. Different graphs were constructed for the calculation of F1-score versus the alpha parameter, and different values were achieved for different parameters.

Comparative Analysis of the Past Study

From the different literature reviews above, covering algorithms like Naive Bayes, linear and Laplacian support vector machines, neural networks (convolutional neural networks, recurrent networks, etc.), decision trees and fuzzy logic algorithms, it was observed that almost all the algorithms did not perform up to the mark. The support vector machine achieved good accuracy, but it did not go above 90%; many parameters need to be set for performing the classification, and a parameter setting that gives good accuracy in one case might perform poorly in another. In neural networks, there is always a problem of overfitting, an increased load due to more hidden layers, and the empirical nature of the model. With the Naïve Bayes classifier, there is always a chance that the algorithm loses accuracy, and sometimes a zero probability is assigned to a category for which no categorical variable is present. In fuzzy logic, the results are achieved on the basis of assumptions, so they might not be acceptable everywhere, and it is not as accurate as logistic regression, random forest and many others. The problem with decision trees is that they are often unstable, as a small change in the data might result in a drastic change in the tree. Inaccuracy is a persistent problem with decision trees, as they perform better when large and poorly when small.

Methodology

There is an ample number of machine learning algorithms that are used for the classification of texts. But they do not all have the same accuracy or precision; one might have low accuracy, while another might have higher accuracy. The classification of text is done using three different machine learning algorithms which have been implemented on a data set: the logistic regression classifier, the random forest classifier and the K-nearest neighbours classifier. They have played an important role in evaluating machine learning strategies by giving a much easier and more understandable solution. Each of these three algorithms works completely differently from the others: one works on a particular formula for classifying and predicting, while another works by constructing nodes and trees (random forest). A solution has been developed for the classification of texts with each of these three algorithms.

The overview of the architecture is shown in Fig. 1. At the beginning, we import the necessary libraries, which can also be included in the code as we proceed further. Next, we load the data set on which we need to perform the classification. To evaluate the representation, we used the BBC news data set [45]. The data set contains two columns: (i) category, consisting of five different classes, and (ii) text, which contains the text lines. Then, we perform the pre-treatment or cleaning of the text.

Fig. 1 Architecture of the implementation
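To make the loading step concrete, the following minimal sketch loads the data set and draws the class-distribution bar graph described in the next section. The file name bbc-text.csv and the use of pandas/matplotlib are illustrative assumptions; the paper does not name its exact tooling.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the BBC news data set: one row per article, with a
# 'category' label and the raw 'text' (column names assumed).
df = pd.read_csv("bbc-text.csv")  # illustrative file name

# Count how many text lines fall into each of the five classes
# and plot category (x-axis) versus number of texts (y-axis).
counts = df["category"].value_counts()
counts.plot(kind="bar")
plt.xlabel("category")
plt.ylabel("number of text lines")
plt.title("Class distribution of the BBC news data set")
plt.tight_layout()
plt.show()
```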

After this step, the focus moves to the representation of the text. In the testing phase, the three classifiers are applied on the data set, which yields five parameters as output and also determines which feature has the maximum value for a particular class in the data set [40, 51]. The parameters computed are accuracy, precision, F1-score, support and confusion matrix.

Text Pre-treatment and Representation

Before this step, the authors decided to plot the five classes of the data set on a bar graph to determine how many texts belong to each class, so that one can get a clear picture of the data set without actually inspecting it. The authors obtained this graph after running the code. The graph was plotted with category on the x-axis versus number of text lines on the y-axis. For instance, the number of texts in the business category is above 500, entertainment is close to 400, politics is above 400 and so on. After demonstrating this graph, the pre-processing steps are initiated (Fig. 2).

Fig. 2 Bar graph depicting the number of text lines for the different classes

To represent a document for classification, two major aspects play an important role: one is text pre-treatment or pre-processing, and the other is term weighting [16]. We shall first discuss the steps of text pre-treatment, or text cleaning. There are some words which play no part in discriminating between classes; these can be prepositions, conjunctions and pronouns. Such words carry no context in the text because they do not contribute to classifying the classes [53]; they are known as stop words. So, it is necessary to remove stop words like 'the', 'a', 'and', 'but', 'or', etc. Therefore, the English stop-word list was downloaded. After downloading the stop-word list, we check each word against that list and filter those words from the text [24]. After that, we perform stemming on the text, which is a kind of normalization. A sentence can be written in many different forms by changing the tenses while keeping the same meaning; the stemmer helps in removing those tense variations by bringing the sentences to the same base form. The stemmer's main task is to condense the sentences. The algorithm we have used for stemming is the Porter Stemmer, which performs this condensing task. Next, we focus on the representation of the text. First, we applied a lambda function and then used the join operation on the text obtained after stemming. After that, we used the sub-operation to restrict the text to alphabetic characters. Finally, we convert all the text into lower case in order to unify it so that the classification becomes easier. After pre-processing and cleaning, we obtained the cleaned text output shown in Table 1.

In Table 1, the first column shows the names of the categories that we used for the implementation. The next column displays the text before pre-processing is performed on it. The last column displays how the text has been cleaned by removing irrelevant (not useful for text classification) words, for instance words like of, in, the, etc.

After cleaning, it is very important to represent the text in a form which the machine can easily understand. Therefore, a Vectorizer has been used, which converts a sentence or text into an array or vector of numbers.

Table 1 Text before and after cleaning

Sr. no. | Category | Text | Cleaned
1 | Tech | Tv future in the hands of viewers with home th… | Tv futur hand viewer home theatr system plasma…
2 | Business | Worldcom boss left books alone former worldc… | Worldcom boss left book alon former worldcom b…
3 | Sport | Tigers wary of farrell gamble leicester say… | Tiger wary farrel gamble leicest say rush make…
4 | Sport | Yeading face newcastle in fa cup premiership s… | Yead face newcastl fa cup premiership side new…
5 | Entertainment | Ocean s twelve raids box office ocean s twelve… | Ocean twelv raid box offic ocean twelv crime c…
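A minimal sketch of the cleaning steps just described, assuming NLTK provides the English stop-word list and the Porter stemmer, and reusing the data frame df loaded earlier:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")  # fetch the English stop-word list once

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean(text):
    # Keep only alphabetic characters (the sub-operation),
    # then unify the case by lower-casing everything.
    text = re.sub("[^a-zA-Z]", " ", text).lower()
    # Drop stop words and condense each remaining word with the
    # Porter stemmer, then join the tokens back into one string.
    words = [stemmer.stem(w) for w in text.split() if w not in stop_words]
    return " ".join(words)

# Apply the lambda/join-based cleaning to every text line.
df["cleaned"] = df["text"].apply(lambda t: clean(t))
```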


The TF-IDF (term frequency–inverse document frequency) Vectorizer is implemented for transforming the text into a representation of numbers which carries a certain meaning. The term frequency normalizes the occurrence of each word by the size of the data set, whereas the inverse document frequency down-weights words which do not contribute much to deciding the meaning of the sentence; if a term occurs in every text, its weight becomes zero. The TF-IDF technique eliminates the common words and also extracts the relevant features from the corpus [6]. The TF-IDF algorithm's main focus is on a word that has a high frequency in one text while at the same time appearing in only a small portion of the corpus; such a word has the strongest capability to distinguish different classes in the text [32].

The mathematical formula of the TF-IDF algorithm is given by:

\text{TF-IDF}(w) = \text{TF}(w) \times \text{IDF}(w)

The authors passed n-gram as an argument; an n-gram consists of adjacent letters or words in the text and helps in predicting the next item in a sequence. The n-gram captures the structure of a language, i.e. which letter or word is likely to follow the previous one, and it is used to generate word vectors based on the context of the text [32]. Moreover, we used the norm argument, which selects one of several norms depending on the value of the ord parameter; the l2 norm is used for minimizing the sum of squares of the differences between the target value and the estimated value. After performing this, all the elements are returned in the form of an array. Finally, we applied the logistic regression, random forest and K-NN classifiers separately in order to calculate accuracy, precision, F1-score and support.

True positives are values predicted correctly as positive, i.e. the value of the predicted class is yes and that of the actual class is also yes.
True negatives are negative values predicted correctly, i.e. the value of the predicted class is no and that of the actual class is also no.
False positives arise when the predicted class is yes whereas the actual class is no.
False negatives arise when the predicted class is no whereas the actual class is yes (Tables 2, 3, 4).

Table 2 Diagram depicting the four parameters

Actual class \ Predicted class | Class = Yes | Class = No
Class = Yes | True positive | False negative
Class = No | False positive | True negative

Table 3 True positive + false positive = total predicted positive

 | Negative | Positive
Negative | True negative | False positive
Positive | False negative | True positive

Table 4 True positive + false negative = actual positive

 | Negative | Positive
Negative | True negative | False positive
Positive | False negative | True positive

Accuracy is defined as the ratio of observations predicted correctly to the total number of observations. The mathematical formula is

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

Precision is defined as the ratio of observations correctly predicted positive to the total number of observations predicted positive. The mathematical formula is

\text{Precision} = \frac{TP}{TP + FP}

Recall is defined as the ratio of observations correctly predicted positive to the total number of observations in the actual class. The mathematical formula is

\text{Recall} = \frac{TP}{TP + FN}

F1-score is defined as the harmonic average of recall and precision. The mathematical formula is

F1 = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}

Support is defined as the total number of samples of the true response which lie in a class.

Confusion matrix is defined as a measure of how well a classifier predicts the correct value, i.e. the true positives in classification, or how many values are assigned to the correct class rather than another class.
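The TF-IDF representation with the n-gram and l2-norm options described above can be sketched with scikit-learn as follows; the exact ngram_range used in the paper is not stated, so (1, 2) is an assumption, and df["cleaned"] is the column created earlier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Term frequency normalizes word occurrence; inverse document
# frequency down-weights words that appear in most documents.
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),  # assumed: unigrams and bigrams
    norm="l2",           # l2 norm, as described in the text
)
X = vectorizer.fit_transform(df["cleaned"])
print(X.shape)  # (number of documents, number of n-gram features)
```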

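Given true and predicted labels for a test split, the parameters defined above can be computed with scikit-learn; the toy labels below are illustrative only:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Toy labels standing in for the true (y_test) and predicted
# (y_pred) classes of a held-out split.
y_test = ["business", "sport", "sport", "tech", "business"]
y_pred = ["business", "sport", "tech", "tech", "business"]

print(accuracy_score(y_test, y_pred))         # accuracy
print(confusion_matrix(y_test, y_pred))       # confusion matrix
print(classification_report(y_test, y_pred))  # precision/recall/F1/support
```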

Implementation of Classifier

After performing text pre-treatment and representation, the classifiers are now to be implemented. We considered three classifiers, namely logistic regression, random forest and KNN, to determine which gives the best output. In the beginning, we split the data set into a training and a testing set, the size of the testing set being 25% and the training set 75%. After splitting, we used a pipeline for implementing the classifiers. Pipelining gives a better flow to an algorithm: a pipeline chains a sequence of data transformations into a model that can be fitted, tested and evaluated to achieve an outcome. As all the classifiers implemented are supervised, the labelled data sets are provided, and one just has to apply the classification algorithms. Generally, a machine learning pipeline consists of four stages, namely pre-processing, learning, evaluation and prediction. There were many reasons behind the use of a pipeline. Firstly, pipelining improves the overall functioning of the model. Secondly, it also helps with pre-processing and enhancing the data, better handling of the overfitting caused by the data set, and better tuning of the hyper-parameters of the pipeline. While implementing the classifiers with the help of a pipeline, pickle has also been used. Pickle is generally used for serializing and de-serializing objects in Python. With the help of pickle, the model is stored on disk, so it becomes easy to work with at one's ease; it is also helpful for making new predictions without rewriting everything again (Fig. 3).

Fig. 3 Workflow of classifiers

Random Forest Algorithm

In this algorithm, a large number of decision trees are built and operate together. Decision trees act as the pillars of this algorithm. Random forest is defined as a group of decision trees whose nodes are defined at the pre-processing step [7]. After constructing multiple trees, the best feature is selected from a random subset of features [8, 22]. Generating a single decision tree is a separate concept, done using the decision tree algorithm; the random forest consists of these trees, which are used to classify a new object from an input vector. Each decision tree built is used for classification: each tree casts a vote for a class, and the random forest chooses the classification that has the most votes among all the trees in the forest. There are also some chances of error in the random forest, depending on two factors:

(i) Two trees in a forest may be correlated with each other, which leads to an increase in the error rate.
(ii) Each tree has its own strength: a tree with a lower error rate is a strong classifier and vice versa.

• Some of the features of random forest are:

(i) It handles a huge number of input variables without deletion of variables.
(ii) It suggests which variables are important in the classification.
(iii) Large databases also run efficiently.
(iv) The trees or forests generated can be saved for future use.

• Steps of the random forest algorithm (a code sketch follows at the end of this subsection):

Step 1: From the training data, pick K random data points.
Step 2: Construct a decision tree with these K data points.
Step 3: Before repeating steps 1 and 2, choose the number NTree of trees you want to construct.
Step 4: Predict the value of y by running each of the NTree trees on a new data point, and assign to the new data point the average across all predicted y values.

The mathematical formula for the importance of a node in the random forest classifier is:

ni_j = w_j C_j - w_{\text{left}(j)} C_{\text{left}(j)} - w_{\text{right}(j)} C_{\text{right}(j)}

where
ni_j = the importance of node j
w_j = the weighted number of samples reaching node j
C_j = the impurity value of node j
left(j) = the child node from the left split on node j
right(j) = the child node from the right split on node j

K-Nearest Neighbour Algorithm

This algorithm focuses on keeping similar things nearest to each other. The model works on class labels and feature vectors in a data set [9]. KNN stores all the cases and helps in classifying new cases with the help of a similarity measure. In K-nearest neighbours, a text is represented by a spatial vector denoted by S = S(T1, W1; T2, W2; … Tn, Wn). For any text, the similarity against the training texts is calculated, and the texts with the highest similarity are selected. Finally, the class is determined based on the K neighbours (Fig. 4).

Fig. 4 Flow chart for random forest classifier

• Steps for performing the KNN algorithm:

Step 1: At the beginning, choose the number K of neighbours.
Step 2: According to the Euclidean distance, take the K nearest neighbours of the new data point.
Step 3: In each category, count the number of data points among the K neighbours.
Step 4: After counting, assign the new data point to the category in which you counted the most neighbours (Fig. 5).

• The steps of KNN involved in the classification of text are (a code sketch follows at the end of this subsection):

1. Both the training texts and the incoming text are expressed as feature vectors in the vector space.
2. Then, the comparison between the feature vector of the incoming text and that of each training text is calculated with the formula

\text{sim}(d_i, d_j) = \frac{\sum_{k=1}^{M} W_{ik} W_{jk}}{\sqrt{\sum_{k=1}^{M} W_{ik}^2}\,\sqrt{\sum_{k=1}^{M} W_{jk}^2}}

where d_i and d_j are the feature vectors of the incoming and training text, respectively, M is the dimension of the feature vectors, and W_{ik} and W_{jk} are the k-th elements of vectors d_i and d_j, respectively.
3. Lastly, the K nearest neighbours of the incoming text are selected based on this similarity, and each class C_m is scored by [47]

Q(d_i, C_m) = \sum_{j=1}^{K} \text{sim}(d_i, d_j)\,\delta(d_j, C_m)

\delta(d_j, C_m) = 1 \text{ if } d_j \in C_m, \text{ and } 0 \text{ otherwise}

Fig. 5 Flow chart for K-NN classifier
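The similarity and voting formulas above amount to cosine similarity between feature vectors followed by a similarity-weighted vote over the K neighbours. A small self-contained sketch with toy vectors:

```python
import numpy as np

def cosine_sim(di, dj):
    # sim(d_i, d_j) = sum_k W_ik * W_jk / (||d_i|| * ||d_j||)
    return np.dot(di, dj) / (np.linalg.norm(di) * np.linalg.norm(dj))

def knn_predict(x, train_vecs, train_labels, k=3):
    # Score each class C_m by summing the similarities of the
    # K most similar training texts that belong to it.
    sims = np.array([cosine_sim(x, v) for v in train_vecs])
    top = sims.argsort()[::-1][:k]
    scores = {}
    for idx in top:
        label = train_labels[idx]
        scores[label] = scores.get(label, 0.0) + sims[idx]
    return max(scores, key=scores.get)

# Toy 3-dimensional feature vectors for four training texts.
train_vecs = np.array([[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.9, 0.2]])
train_labels = ["sport", "sport", "tech", "tech"]
print(knn_predict(np.array([0.8, 0.2, 0.0]), train_vecs, train_labels))
# -> "sport"
```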


Logistic Regression Algorithm

Logistic regression is a supervised classification algorithm. It has achieved importance in recent times, and its use has increased extensively. The algorithm is used to classify individuals into categories based on the logistic function [29]. There are many instances when one does not get a line that fits all the data points; for instance, we might encounter a problem like the graph in Fig. 6a.

The graph in Fig. 6a shows how the action varies with respect to age. This straight line is not at all appropriate, as it does not fit all the data points. The solution is to apply the logistic regression algorithm: when we apply this algorithm to the same set of data points, we get the graph drawn in Fig. 6b, which fits the data points very well. This curve can be properly visualized as drawn in Fig. 7.

Fig. 6 a Graph when data points do not fit properly. b Graph when logistic regression is applied and one gets a perfect curve

This is the speciality behind using logistic regression. The curve in Fig. 7 is obtained because of the use of the sigmoid function in logistic regression. The sigmoid function is a mathematical function responsible for this S-shaped curve, also known as the sigmoid curve; it is a special case of the logistic function. To understand the mathematical version of the explanation, we begin with a simple linear regression formula:

y = b_0 + b_1 x

The sigmoid function is then applied to it and is given by the formula

p = \frac{1}{1 + e^{-y}}

Substituting one formula into the other, we get our logistic regression formula as

\ln\left(\frac{p}{1 - p}\right) = b_0 + b_1 x

or

\text{logit}(S) = b_0 + b_1 M_1 + b_2 M_2 + b_3 M_3 + \cdots + b_k M_k

where S is the probability of the presence of the feature of interest, M_1, M_2, M_3, …, M_k are the predictor values, and b_0, b_1, b_2, b_3, …, b_k are the intercept and coefficients of the model.

Fig. 7 Sigmoid curve

• Assumptions of the logistic regression classifier:

1. There is no linear relationship between the dependent and independent variables in logistic regression.
2. The dependent variable cannot be divided into two parts.
3. The dependent variables must not be normally distributed; instead, they should be linearly related.

In text classification, the LR model takes a vector of input variables, evaluates the coefficients for each input variable and predicts the class of the text given in the form of a word vector.
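A worked sketch of the formulas above: the linear term is passed through the sigmoid to give a probability, and taking ln(p / (1 - p)) recovers the linear term. The coefficient values are illustrative only:

```python
import numpy as np

def sigmoid(y):
    # p = 1 / (1 + e^(-y)), the S-shaped curve of Fig. 7
    return 1.0 / (1.0 + np.exp(-y))

b0, b1 = -3.0, 0.08          # illustrative intercept and coefficient
x = 50.0                     # illustrative predictor value (e.g. age)
y = b0 + b1 * x              # linear regression term: 1.0
p = sigmoid(y)               # probability of the class of interest
print(p)                     # 0.731...
# ln(p / (1 - p)) recovers the linear term b0 + b1 * x:
print(np.log(p / (1 - p)))   # 1.0
```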


Results and Outcome

After running the code successfully, the required output was obtained. As mentioned earlier, the comparison between the algorithms is done on the basis of five parameters, namely accuracy, precision, F1-score, support and confusion matrix. Let us compare the three implemented algorithms one by one.

Logistic Regression

This model is used to measure the statistical significance of each independent variable with respect to probability. It is a powerful way of modelling a binomial outcome — for example, whether or not a person is going to suffer from cancer, taking the values 0 and 1 — with the use of one or many explanatory variables. The outcome variable in logistic regression is dichotomous. It is used to assign an observation to a discrete set of classes and, being a classification algorithm, it depends very much on probability.

On applying logistic regression to the data set, the output was obtained in this manner:

[[132 0 1 0 0]
[0 89 2 0 0]
[3 0 97 1 2]
[1 0 0 130 0]
[4 0 0 0 95]]

Output: confusion matrix

Table 5 Resultant outcome of all parameters in five different categories using the logistic regression classifier

Category | Precision | Accuracy | F1-score | Support
Business | 0.94 | 0.99 | 0.97 | 133
Entertainment | 1.00 | 0.98 | 0.99 | 91
Politics | 0.97 | 0.94 | 0.96 | 103
Sports | 0.99 | 0.99 | 0.99 | 131
Technology | 0.98 | 0.96 | 0.97 | 99
Accuracy | | 0.97 | | 557

It is observed from Table 5 that an accuracy of 97% is obtained. The table shows all the parameters calculated for each of the individual classes. For instance, the precision obtained for the business class is 94%, accuracy is 99%, F1-score is 97% and the value of support is 133. Similarly, for entertainment, the values obtained are 100% precision, 98% accuracy, 99% F1-score and 91 support. In politics, a precision of 97% is obtained, with accuracy of 94%, 96% F1-score and 103 support. For the sports class, the precision obtained is 99%, with accuracy of 99%, F1-score of 99% and support of 131. For the technology class, a precision of 98% is obtained, with accuracy of 96%, F1-score of 97% and support of 99. In the confusion matrix, each row and column corresponds to a class, so the value 132 in the first row and first column indicates that the classifier properly classifies 132 lines of text belonging to the business class, while the 1 in the first row, third column indicates that the classifier has put that line into another class. Similarly, in the second row, the value 89 means that the classifier classifies 89 lines of text that belong to the entertainment class and puts the rest into other classes. The third row states that 97 lines of text belong to politics and the classifier properly separates them, whereas the other lines are put into other classes, which the classifier gets wrong. In the fourth row, the classifier classifies 130 lines of text for sports, while it fails on 1 line and hence puts it into the business class. Lastly, the classifier classifies 95 lines of text that belong to the technology class and fails on the other 4 lines, which it puts into the business class.
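The per-class figures quoted above follow directly from the printed confusion matrix; for example, the business precision of 0.94 is 132 divided by the first-column total 132 + 0 + 3 + 1 + 4 = 140. A short sketch that recomputes precision, support and overall accuracy from the matrix:

```python
import numpy as np

# Confusion matrix of the logistic regression classifier (rows =
# actual class, columns = predicted class), as printed above.
cm = np.array([[132, 0, 1, 0, 0],
               [0, 89, 2, 0, 0],
               [3, 0, 97, 1, 2],
               [1, 0, 0, 130, 0],
               [4, 0, 0, 0, 95]])
labels = ["business", "entertainment", "politics", "sports", "technology"]

accuracy = cm.trace() / cm.sum()            # 543 / 557 ~ 0.97
precision = cm.diagonal() / cm.sum(axis=0)  # per-class precision
support = cm.sum(axis=1)                    # row sums = actual counts
for name, p, s in zip(labels, precision, support):
    print(f"{name}: precision={p:.2f}, support={s}")
print(f"overall accuracy: {accuracy:.2f}")
```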


Random Forest Algorithm

Random forest is an ensemble method which is used to build predictive models for both regression and classification problems. It consists of a number of random trees which together give the desired output. In classification problems, the decision trees vote for the most popular class, whereas in regression problems the response of the trees is an estimate of the dependent variable given the predictors.

On applying random forest to the data set, the output was obtained in this manner:

[[134 2 6 0 2]
[4 86 1 3 2]
[6 2 84 1 0]
[2 0 1 132 1]
[3 2 0 0 83]]

Output: confusion matrix

Table 6 Resultant outcome of all parameters in five different categories using the random forest classifier

Category | Precision | Accuracy | F1-score | Support
Business | 0.90 | 0.93 | 0.91 | 144
Entertainment | 0.93 | 0.90 | 0.91 | 96
Politics | 0.91 | 0.90 | 0.91 | 93
Sports | 0.97 | 0.97 | 0.97 | 136
Technology | 0.94 | 0.94 | 0.94 | 88
Accuracy | | 0.93 | | 557

It is observed from Table 6 that an accuracy of 93% is obtained. The table shows all the parameters calculated for each of the individual classes. For instance, the precision obtained for the business class is 90%, accuracy is 93%, F1-score is 91% and support is 144. Similarly, for entertainment, the values obtained are 93% precision, 90% accuracy, 91% F1-score and 96 support. In politics, a precision of 91% is obtained, with accuracy of 90%, 91% F1-score and 93 support. For the sports class, the precision obtained is 97%, with accuracy of 97%, F1-score of 97% and support of 136. For the technology class, a precision of 94% is obtained, with accuracy of 94%, F1-score of 94% and support of 88. In the confusion matrix, the value 134 in the first row and first column indicates that the classifier properly classifies 134 lines of text belonging to the business class and puts the rest into other classes. Similarly, in the second row, the value 86 means that the classifier classifies 86 lines of text that belong to the entertainment class and puts the rest into other classes. The third row states that 84 lines of text belong to politics and the classifier properly separates them, whereas the other lines are put into other classes, which the classifier gets wrong. In the fourth row, the classifier classifies 132 lines of text for sports, while it fails on the remaining lines and puts them into the business, politics and technology classes. Lastly, the classifier classifies 83 lines of text that belong to the technology class and fails on the other 5 lines, which it puts into different classes.

K-Nearest Neighbours Algorithm

Similar to random forest, KNN is also used for both classification and regression problems, but in industry it is widely known for solving classification rather than regression problems. The input consists of the k closest training examples lying in the feature space. It determines the group in which a data point lies by looking at all the data points.

On applying K-nearest neighbour to the data set, the output was obtained in this manner:

[[112 0 17 0 1]
[2 90 6 0 1]
[0 1 108 0 0]
[1 0 3 125 0]
[2 0 7 1 80]]

Output: confusion matrix

Table 7 Resultant outcome of all parameters in five different categories using the K-NN classifier

Category | Precision | Accuracy | F1-score | Support
Business | 0.96 | 0.86 | 0.91 | 130
Entertainment | 0.99 | 0.91 | 0.95 | 99
Politics | 0.77 | 0.99 | 0.86 | 109
Sports | 0.99 | 0.97 | 0.98 | 129
Technology | 0.98 | 0.89 | 0.93 | 90
Accuracy | | 0.92 | | 557

It is observed from Table 7 that an accuracy of 92% is obtained. For instance, the precision obtained for the business class is 96%, accuracy is 86%, F1-score is 91% and support is 130. Similarly, for entertainment, the values obtained are 99% precision, 91% accuracy, 95% F1-score and support of 99. In politics, a precision of 77% is obtained, with accuracy of 99%, 86% F1-score and 109 support. For the sports class, the precision obtained is 99%, with accuracy of 97%, F1-score of 98% and support of 129. For the technology class, a precision of 98% is obtained, with accuracy of 89%, F1-score of 93% and support of 90. In the confusion matrix, each row and column corresponds to one of the five classes, so the value 112 in the first row and first column indicates that the classifier properly classifies 112 lines of text belonging to the business class and puts the rest into other classes. Similarly, in the second row, the value 90 means that the classifier classifies 90 lines of text that belong to the entertainment class and puts the rest into other classes. The third row states that 108 lines of text belong to politics and the classifier properly separates them, whereas the other lines are put into other classes, which the classifier gets wrong. In the fourth row, the classifier classifies 125 lines of text for sports, while it fails on the remaining lines and puts them into the business and politics classes. Lastly, the classifier classifies 80 lines of text that belong to the technology class and fails on the other lines, which it puts into different classes.

Comparison Analysis of the Three Algorithms

After discussing the results, the authors compare all three algorithms based on the four parameters precision, accuracy, F1-score and support. These four parameters are compared on bar graphs in order to display a clear comparison. The four comparisons are shown below.

1) Precision

The graph obtained is shown in Fig. 8.

In the business section, logistic regression obtained a precision of 0.94 (94%), random forest 0.9 (90%) and KNN 0.96 (96%). In the entertainment section, logistic regression obtained a precision of 1 (100%), random forest 0.93 (93%) and KNN 0.99 (99%). In the politics class, the precision obtained for logistic regression is 0.97 (97%), random forest 0.91 (91%) and KNN 0.77 (77%). In the sports section, logistic regression obtained a precision of 0.99 (99%), random forest 0.97 (97%) and KNN 0.99 (99%). Finally, in the technology section, logistic regression obtained a precision of 0.98 (98%), random forest 0.94 (94%) and KNN 0.98 (98%) (Fig. 8).

2) Accuracy

The graph for accuracy is shown in Fig. 9.

In the business section, logistic regression obtained an accuracy of 0.99 (99%), random forest 0.93 (93%) and KNN 0.86 (86%). In the entertainment section, logistic regression obtained an accuracy of 0.98 (98%), random forest 0.90 (90%) and KNN 0.91 (91%). In the politics class, the accuracy obtained for logistic regression is 0.94 (94%), random forest 0.9 (90%) and KNN 0.99 (99%).


Fig. 8 Categories versus precision, showing variations of the data set classes with respect to change in precision
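The grouped bars of Fig. 8 can be reproduced from the precision columns of Tables 5, 6 and 7; a sketch (values transcribed from those tables):

```python
import numpy as np
import matplotlib.pyplot as plt

categories = ["Business", "Entertainment", "Politics", "Sports", "Technology"]
# Precision per class, transcribed from Tables 5-7.
precision = {
    "Logistic Regression": [0.94, 1.00, 0.97, 0.99, 0.98],
    "Random Forest":       [0.90, 0.93, 0.91, 0.97, 0.94],
    "k-NN":                [0.96, 0.99, 0.77, 0.99, 0.98],
}

x = np.arange(len(categories))
width = 0.25
for i, (name, values) in enumerate(precision.items()):
    plt.bar(x + (i - 1) * width, values, width, label=name)
plt.xticks(x, categories)
plt.ylabel("precision")
plt.legend()
plt.tight_layout()
plt.show()
```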

In the sports section, logistic regression obtained an accuracy of 0.99 (99%), random forest 0.97 (97%) and KNN 0.97 (97%). Finally, in the technology section, logistic regression obtained an accuracy of 0.96 (96%), random forest 0.94 (94%) and KNN 0.89 (89%) (Fig. 9).

3) F1-score

The bar chart obtained for the F1-score is shown in Fig. 10.

In the business section, logistic regression obtained an F1-score of 0.97 (97%), random forest 0.91 (91%) and KNN 0.91 (91%). In the entertainment graph, logistic regression obtained an F1-score of 0.99 (99%), random forest 0.91 (91%) and KNN 0.95 (95%). In the politics section, the F1-score obtained for logistic regression is 0.96 (96%), random forest 0.91 (91%) and KNN 0.86 (86%). In the sports graph, logistic regression obtained an F1-score of 0.99 (99%), random forest 0.97 (97%) and KNN 0.98 (98%). Finally, in the technology section, logistic regression obtained an F1-score of 0.97 (97%), random forest 0.94 (94%) and KNN 0.93 (93%) (Fig. 10).

4) Support

For the support parameter, the bar chart obtained is shown in Fig. 11.

In the graph obtained for the business class, logistic regression obtained a support of 133, random forest 144 and KNN 130. In the entertainment graph, logistic regression obtained a support of 91, random forest 96 and KNN 99. In the politics section, the support obtained for logistic regression is 103, random forest 93 and KNN 109.

Fig. 9 Categories versus accuracy, showing variations of the data set classes with respect to change in accuracy


Fig. 10 Categories versus F1-score, showing variations of the data set classes with respect to change in F1-score

Challenges and Future Scope

While the model has been implemented with a high accuracy and precision rate, there are certain challenges that can be looked at for the future development of this work. The data set does not provide good accuracy when the SVM (support vector machine) algorithm is used [28]. Also, the data set used here is completely statistical and text based. Random forest has shown great success in many real-world applications, yet learning from text data with class imbalance remains a problem that is still faced (Wu et al. [50]). The scope of these algorithms can extend to different data sets whose features are based on images and audio; currently existing technologies for these challenges are image recognition for image data sets and POS (part-of-speech) text recognition, and using these would provide a wide area of application for this research. Since a general analysis exists for several classification algorithms in machine learning (e.g. logistic regression, decision trees), it is possible to explain and understand the model and the decisions given by the model [3].

The current development towards automation can be much advanced by the use of text classification applications, which can directly improve ease of use by transforming the commands that we speak into direct actions by machines [10]. Computer security vulnerabilities will always exist as long as there are faulty security policies and poorly configured computer systems; intrusion detection systems are important for detecting attacks on such vulnerabilities or faulty security policies [27]. Another major problem that text classification faces is the high dimensionality of the feature space: the text domain has several tens of thousands of features, though most of them are not used for the text classification task, and some of them may even sharply reduce the classification accuracy [10].
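To make the last two challenges concrete, the sketch below shows one standard way of taming a high-dimensional TF-IDF space with chi-squared feature selection while compensating for class imbalance through class weights. This is a hedged illustration of common scikit-learn practice, not the procedure used in this paper; the texts and labels shown are placeholders.

```python
# Sketch: shrinking a high-dimensional TF-IDF space with chi-squared
# feature selection and compensating for class imbalance via class
# weights. Assumes scikit-learn; 'texts' and 'labels' are hypothetical.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

texts = ["stocks rally on strong earnings", "team wins the cup final"]  # placeholder corpus
labels = ["business", "sport"]                                          # placeholder labels

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    # Keep only the k features most associated with the class labels,
    # out of the tens of thousands the text domain typically produces.
    ("select", SelectKBest(chi2, k=2)),  # k would be in the thousands on real data
    # class_weight="balanced" reweights classes inversely to their
    # frequency, a common remedy for class-imbalanced text data.
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
pipeline.fit(texts, labels)
```

Chi-squared selection directly addresses the observation above that most text features contribute little, while the balanced class weights give minority categories more influence on the decision boundary.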

Fig. 11 Categories versus support showing variations of data set classes (business, entertainment, politics, sports, technology) with respect to change in support, as a bar chart for logistic regression, random forest and k-NN


Conclusion

The current paper constructs a BBC news text classification model based on machine learning algorithms. The paper applies the logistic regression, random forest and K-nearest neighbour algorithms and describes every aspect of the model in detail by providing the evaluation metrics. When machine learning algorithms are implemented on a particular data set, the most important parameter is accuracy. The results show that the logistic regression classifier with the TF-IDF Vectorizer feature attains the highest accuracy of 97% on the data set, and this algorithm emerged as the most stable classifier on a small data set. The second best was the random forest classifier with an accuracy of 93%. The algorithm with the least accuracy among the three was K-nearest neighbour, with an overall accuracy of 92%. The logistic regression classifier performed as expected in terms of all parameters, so the output matched expectations. The reasons for choosing these three algorithms are given in the related works section; in order to find the best-fit algorithm among them, the authors decided to write this manuscript.
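As a companion to this conclusion, the following minimal sketch reproduces the shape of the comparison described above using standard scikit-learn components. It is an illustrative outline, not the authors' original code: load_bbc_news() is a hypothetical loader for the BBC data set, and the hyper-parameters are default placeholders.

```python
# Sketch of the three-classifier comparison with a TF-IDF representation.
# Assumes scikit-learn; load_bbc_news() is a hypothetical helper that
# returns raw article texts and their category labels.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

texts, labels = load_bbc_news()  # hypothetical loader for the BBC news corpus
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.25)

vectorizer = TfidfVectorizer(stop_words="english")
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)  # reuse the fitted vocabulary

classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
}
for name, clf in classifiers.items():
    clf.fit(X_train_tfidf, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test_tfidf)))
```

Fitting the vectorizer only on the training split and reusing its vocabulary on the test split keeps the evaluation free of information leakage.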
Acknowledgements The authors are grateful to Indus University and School of Technology, Pandit Deendayal Petroleum University for the permission to publish this research.

Authors' Contribution All the authors made substantial contributions to this manuscript. KS, HP, DS and MS participated in drafting the manuscript; KS, HP and DS wrote the main manuscript; all the authors discussed the results and their implications at all stages.

Funding Not applicable.

Availability of Data and Material All relevant data and material are presented in the main paper.

Compliance with Ethical Standards

Conflict of interest The authors declare that they have no competing interests.

Consent for publication Not applicable.

Ethics approval and consent to participate Not applicable.


References

1. Ahir K, Govani K, Gajera R, Shah M (2020) Application on virtual reality for enhanced education learning, military training and sports. Augment Hum Res 5:7
2. Al Amrani Y, Lazaar M, El Kadiri KE (2018) Random forest and support vector machine based hybrid approach to sentiment analysis. Proc Comput Sci 127:511–520
3. Altınel B, Ganiz MC (2018) Semantic text classification: a survey of past and recent advances. Inf Process Manag 54(6):1129–1153
4. Aseervatham S, Antoniadis A, Gaussier E, Burlet M, Denneulin Y (2011) A sparse version of the ridge logistic regression for large-scale text categorization. Pattern Recogn Lett 32(2):101–106. https://doi.org/10.1016/j.patrec.2010.09.023
5. Aydoğan M, Karci A (2019) Improving the accuracy using pre-trained word embedding on deep neural networks for Turkish text classification. Physica A Stat Mech Appl. https://doi.org/10.1016/j.physa.2019.123288
6. Bafna P, Pramod D, Vaidya A (2016) Document clustering: TF-IDF approach. In: 2016 international conference on electrical, electronics, and optimization techniques (ICEEOT), Chennai, pp 61–66
7. Bouaziz A, Dartigues-Pallez C, da Costa Pereira C, Precioso F, Lloret P (2014) Short text classification using semantic random forest. In: Bellatreche L, Mohania MK (eds) Data warehousing and knowledge discovery. DaWaK 2014. Lecture notes in computer science, vol 8646. Springer, Cham
8. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
9. Chatzigeorgakidis G, Karagiorgou S, Athanasiou S, Skiadopoulos S (2018) FML-kNN: scalable machine learning on Big Data using k-nearest neighbor joins. J Big Data 5:4. https://doi.org/10.1186/s40537-018-0115-x
10. Chen J, Huang H, Tian S, Qu Y (2009) Feature selection for text classification with Naïve Bayes. Expert Syst Appl 36(3–1):5432–5435
11. Cheng Y, Rui K (2017) Text classification of minimal risk with three-way decisions. J Inf Optim Sci 39(4):973–987
12. Elghazel H, Aussem A, Gharroudi O, Saadaoui W (2016) Ensemble multi-label text categorization based on rotation forest and latent semantic indexing. Expert Syst Appl 57:1–11. https://doi.org/10.1016/j.eswa.2016.03.041
13. Ferrari A (2018) Natural language requirements processing: from research to practice. In: IEEE/ACM 40th international conference on software engineering: companion (ICSE-Companion), Gothenburg, pp 536–537
14. Gandhi M, Kamdar J, Shah M (2020) Preprocessing of non-symmetrical images for edge detection. Augment Hum Res 5:10. https://doi.org/10.1007/s41133-019-0030-5
15. Garla V, Taylor C, Brandt C (2013) Semi-supervised clinical text classification with Laplacian SVMs: an application to cancer case management. J Biomed Inf 46(5):869–875
16. Genkin A, Lewis DD, Madigan D (2007) Large-scale Bayesian logistic regression for text categorization. Technometrics 49(3):291–304
17. Hmeidi I, Hawashin B, El-Qawasmeh E (2008) Performance of KNN and SVM classifiers on full word Arabic articles. Adv Eng Inf 22(1):106–111
18. Jani K, Chaudhuri M, Patel H, Shah M (2019) Machine learning in films: an approach towards automation in film censoring. J Data Inf Manag. https://doi.org/10.1007/s42488-019-00016-9
19. Jha K, Doshi A, Patel P, Shah M (2019) A comprehensive review on automation in agriculture using artificial intelligence. Artif Intell Agric 2:1–12
20. Jiang Y, Lin H, Wang X, Lu D (2011) A technique for improving the performance of Naive Bayes text classification. In: Lecture notes in computer science, pp 196–203
21. Jiang S, Pang G, Wu M, Kuang L (2012) An improved K-nearest-neighbour algorithm for text categorization. Expert Syst Appl 39(1):1503–1509
22. Kabir M, Jahangir M, Xu S, Badhon B (2019) An empirical research on sentiment analysis using machine learning approaches. Int J Comput Appl. https://doi.org/10.1080/1206212x.2019.1643584
23. Kakkad V, Patel M, Shah M (2019) Biometric authentication and image encryption for image security in cloud framework. Multiscale Multidiscip Model Exp Des. https://doi.org/10.1007/s41939-019-00049-y
24. Kumar R, Kaur J (2020) Random forest-based sarcastic tweet classification using multiple feature collection. In: Tanwar S, Tyagi S, Kumar N (eds) Multimedia big data computing for IoT applications. Intelligent systems reference library, vol 163. Springer, Singapore
25. Kundalia K, Patel Y, Shah M (2020) Multi-label movie genre detection from a movie poster using knowledge transfer learning. Augment Hum Res 5:11. https://doi.org/10.1007/s41133-019-0029-y
26. Li J, Deng X, Yao Y (2013) Multistage email spam filtering based on three-way decisions. In: Lingras P, Wolski M, Cornelis C, Mitra S, Wasilewski P (eds) Rough sets and knowledge technology. RSKT 2013. Lecture notes in computer science, vol 8171. Springer, Berlin, pp 313–324
27. Liao Y, Vemuri VR (2002) Use of K-nearest neighbor classifier for intrusion detection. Comput Secur 22(5):439–448
28. Liu Y, Loh HT, Tor SB (2005) Comparison of extreme learning machine with support vector machine for text classification. In: Ali M, Esposito F (eds) Innovations in applied artificial intelligence. IEA/AIE 2005. Lecture notes in computer science, vol 3533. Springer, Berlin, pp 390–399
29. Liu YY, Yang M, Ramsay M, Li XS, Coid JW (2011) A comparison of logistic regression, classification and regression tree, and neural networks models in predicting violent re-offending. J Quant Criminol 27(4):547–553
30. Liu H, Zhang S, Wu X (2014) MLSLR: multilabel learning via sparse logistic regression. Inf Sci 281:310–320
31. Mehmood RM, Lee HJ (2015) Emotion classification of EEG brain signal using SVM and KNN. In: IEEE international conference on multimedia and expo workshops. IEEE, pp 1–5
32. Miao F, Zhang P, Jin L, Wu H (2018) Chinese news text classification based on machine learning algorithm. In: 2018 10th international conference on intelligent human–machine systems and cybernetics (IHMSC), Hangzhou, pp 48–51
33. Moldagulova A, Sulaiman RB (2018) Document classification based on KNN algorithm by term vector space reduction. In: 18th international conference on control, automation and systems (ICCAS), Daegwallyeong, pp 387–391
34. Nadi A, Moradi H (2019) Increasing the views and reducing the depth in random forest. Expert Syst Appl. https://doi.org/10.1016/j.eswa.2019.07.018
35. Pandya R, Nadiadwala S, Shah R, Shah M (2019) Buildout of methodology for meticulous diagnosis of K-complex in EEG for aiding the detection of Alzheimer's by artificial intelligence. Augment Hum Res. https://doi.org/10.1007/s41133-019-0021-6
36. Parekh V, Shah D, Shah M (2020) Fatigue detection using artificial intelligence framework. Augment Hum Res 5:5
37. Patel D, Shah Y, Thakkar N, Shah K, Shah M (2020) Implementation of artificial intelligence techniques for cancer detection. Augment Hum Res. https://doi.org/10.1007/s41133-019-0024-3
38. Patel D, Shah D, Shah M (2020) The intertwine of brain and body: a quantitative analysis on how big data influences the system of sports. Ann Data Sci. https://doi.org/10.1007/s40745-019-00239-y
39. Prabhat A, Khullar V (2017) Sentiment classification on big data using Naïve Bayes and logistic regression. In: International conference on computer communication and informatics (ICCCI), pp 1–5
40. Ranjitha KV (2018) Classification and optimization scheme for text data using machine learning Naïve Bayes classifier. In: IEEE world symposium on communication engineering (WSCE), pp 33–36
41. Raychaudhuri K, Kumar M, Bhanu S (2017) A comparative study and performance analysis of classification techniques: support vector machine, neural networks and decision trees. In: Advances in computing and data sciences, pp 13–21
42. Salles T, Gonçalves M, Rodrigues V, Rocha L (2018) Improving random forests by neighborhood projection for effective text classification. Inf Syst 77:1–21
43. Shah G, Shah A, Shah M (2019) Panacea of challenges in real-world application of big data analytics in healthcare sector. J Data Inf Manag. https://doi.org/10.1007/s42488-019-00010-1
44. Solangi YA, Solangi ZA, Aarain S, Abro A, Mallah GA, Shah A (2018) Review on natural language processing (NLP) and its toolkits for opinion mining and sentiment analysis. In: IEEE 5th international conference on engineering technologies and applied sciences (ICETAS), pp 1–4
45. Szymański J (2014) Comparative analysis of text representation methods using classification. Cybern Syst 45(2):180–199
46. Tan S (2006) An effective refinement strategy for KNN text classifier. Expert Syst Appl 30(2):290–298
47. Tan Y (2018) An improved KNN text classification algorithm based on K-medoids and rough set. In: 10th international conference on intelligent human–machine systems and cybernetics (IHMSC), pp 109–113
48. Trstenjak B, Mikac S, Donko D (2014) KNN with TF-IDF based framework for text categorization. Proc Eng 69:1356–1364
49. Wahiba BA, Ahmed BEF (2015) New fuzzy decision tree model for text classification. In: The 1st international conference on advanced intelligent system and informatics (AISI2015), November 28–30, 2015, Beni Suef, Egypt, pp 309–320. https://doi.org/10.1007/978-3-319-26690-9_28
50. Wu Q, Ye Y, Zhang H, Ng MK, Ho S (2014) ForesTexter: an efficient random forest algorithm for imbalanced text categorization. Knowl Based Syst 67:105–116
51. Yao H, Liu C, Zhang P, Wang L (2017) A feature selection method based on synonym merging in text classification system. EURASIP J Wirel Commun Netw 2017:166. https://doi.org/10.1186/s13638-017-0950-z
52. Yen SJ, Lee YS, Ying JC, Wu YC (2011) A logistic regression-based smoothing method for Chinese text categorization. Expert Syst Appl 38(9):11581–11590
53. Yuntao Z, Ling G, Yongcheng W, Yin Z (2003) An effective concept extraction method for improving text classification performance. Geo-Spatial Inf Sci 6(4):66–72
54. Zhu J, Wang H, Zhang X (2006) Discrimination-based feature selection for multinomial Naïve Bayes text classification. In: Lecture notes in computer science, pp 149–156

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
